Forum: >>> Magnum BBS <<<

How Unix Spell Ran in 64kB RAM

From Ben Collver@21:1/5 to All on Thu Jan 23 15:16:38 2025

How Unix Spell Ran in 64kB RAM
==============================

by Abhinav Upadhyay

How do you fit a 250kB dictionary in 64kB of RAM and still perform
fast lookups? For reference, even with modern compression techniques
like gzip -9, you can't compress this file below 85kB.

In the 1970s, Douglas McIlroy faced this exact challenge while
implementing the spell checker for Unix at AT&T. The constraints of
the PDP-11 computer meant the entire dictionary needed to fit in just
64kB of RAM. A seemingly impossible task.

Instead of relying on generic compression techniques, he took
advantage of the properties of the data and developed a compression
algorithm that came within 0.03 bits of the theoretical limit of
possible compression. To this day, it remains unbeaten.

The story of Unix spell is more than just historical curiosity. It's
a masterclass in engineering under constraints: how to analyze a
problem from first principles, leverage mathematical insights, and
design elegant solutions that work within strict resource limits.

If you're short on time, here's the key engineering story:

* The Unix spell started in the 1970s as an afternoon prototype by
Steve Johnson at AT&T, before Douglas McIlroy rewrote it to improve
its performance and accuracy.

* McIlroy's first innovation was a clever linguistics-based stemming
algorithm that reduced the dictionary to just 25,000 words while
improving accuracy.

* For fast lookups, he initially used a Bloom filter--perhaps one of
its first production uses. Interestingly, Dennis Ritchie provided
the implementation. They tuned it to have such a low false positive
rate that they could skip actual dictionary lookups.

* When the dictionary grew to 30,000 words, the Bloom filter approach
became impractical, leading to innovative hash compression
techniques.

* They computed that 27-bit hash codes would keep collision
probability acceptably low, but needed compression.

* McIlroy's solution was to store differences between sorted hash
codes, after discovering these differences followed a geometric
distribution.

* Using Golomb's code, a compression scheme designed for geometric
distributions, he achieved 13.60 bits per word--remarkably close to
the theoretical minimum of 13.57 bits.

* Finally, he partitioned the compressed data to speed up lookups,
trading a small memory increase (final size ~14 bits per word) for
significantly faster performance.

The rest of the article expands each of these points and gives a
detailed explanation with all the math and logic behind them.

From: <https://blog.codingconfessions.com/p/how-unix-spell-ran-in-64kb-ram>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Ben Collver on Thu Jan 23 23:54:00 2025

On Thu, 23 Jan 2025 15:16:38 -0000 (UTC), Ben Collver wrote:

Instead of relying on generic compression techniques, he took advantage
of the properties of the data and developed a compression algorithm that
came within 0.03 bits of the theoretical limit of possible compression.
To this day, it remains unbeaten.

All of which only worked for the English language (US).

What happened when they had to support other languages?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Eli the Bearded on Fri Jan 24 00:59:03 2025

On Fri, 24 Jan 2025 00:23:44 -0000 (UTC), Eli the Bearded wrote:

When it was written, Unix' main use to the owning company was producing
phone books with the runoff tools. Those didn't need spell checking of customer names or addresses, and so had very little text _to_ spell
check.

Actually, no. Remember the addition of automatic line numbers to troff was
of great interest to AT&T’s legal department, which spent a lot of time creating things like patent applications.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Eli the Bearded@21:1/5 to ldo@nz.invalid on Fri Jan 24 00:23:44 2025

In comp.misc, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Thu, 23 Jan 2025 15:16:38 -0000 (UTC), Ben Collver wrote:

Instead of relying on generic compression techniques, he took advantage
of the properties of the data and developed a compression algorithm that
came within 0.03 bits of the theoretical limit of possible compression.
To this day, it remains unbeaten.

All of which only worked for the English language (US).

What happened when they had to support other languages?

Pretty sure the answer is: The program was completely replaced.
No one uses spell anymore, ispell or something else gets used.

When it was written, Unix' main use to the owning company was producing
phone books with the runoff tools. Those didn't need spell checking of
customer names or addresses, and so had very little text _to_ spell
check.

Just because the solution doesn't make sense to use now, doesn't mean it
wasn't clever then.

Elijah
------
probably still works better than chapgpt

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Shepelev@21:1/5 to All on Fri Jan 24 13:16:24 2025

Lawrence D'Oliveiro:

Actually, no. Remember the addition of automatic line
numbers to troff was of great interest to AT&T�s legal
department.

Sounds MSWord-ish to me. troff is the typesettng assem-
bler, where high-level functionality is implemented in
macros and macro packages. Line-numbering would seem to
be implementable as an output-line trap using a counter
in a numeric register... Perhaps, you were talking not
about troff, but about one of its macro packages?

--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Shepelev@21:1/5 to All on Fri Jan 24 13:18:10 2025

I wrote:

Lawrence D'Oliveiro:

Actually, no. Remember the addition of automatic line
numbers to troff was of great interest to AT&T�s legal
department.

Sounds MSWord-ish to me. troff is the typesettng assem-
bler, where high-level functionality is implemented in
macros and macro packages. Line-numbering would seem to
be implementable as an output-line trap using a counter
in a numeric register... Perhaps, you were talking not
about troff, but about one of its macro packages?

The message above was typeset in nroff from the following
sounce:

Lawrence D'Oliveiro:
.(Q
Actually, no. Remember the addition of automatic line numbers to troff was
of great interest to AT&T�s legal department.
.)Q
Sounds MSWord-ish to me.
troff is the typesettng assembler,
where high-level functionality
is implemented in macros and macro packages.
Line-numbering would seem to be implementable
as an output-line trap using
a counter in a numeric register...
Perhaps, you were talking not about troff,
but about one of its macro packages?

--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Anton Shepelev on Sat Jan 25 00:46:43 2025

On Fri, 24 Jan 2025 13:16:24 +0300, Anton Shepelev wrote:

Lawrence D'Oliveiro:

Actually, no. Remember the addition of automatic line numbers to
troff was of great interest to AT&T’s legal department, which spent
a lot of time creating things like patent applications.

Sounds MSWord-ish to me.

There was no Microsoft at the time.

Line-numbering would seem to be implementable as an output-line trap
using a counter in a numeric register...

Testimony from those who were there <https://www.gnu.org/software/groff/manual/groff.html.node/Background.html>:

When Unix was up and running on the PDP-11, Joe [Ossanna] got wind
of the legal department having installed a commercial word
processor. He went to pitch Unix as an alternative and clinched a
trial by promising to make roff able to number lines by tomorrow
in order to fulfill a patent-office requirement that the
commercial system did not support.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Salvador Mirzo@21:1/5 to Ben Collver on Sat Jan 25 12:26:49 2025

Ben Collver <bencollver@tilde.pink> writes:

How Unix Spell Ran in 64kB RAM
==============================

by Abhinav Upadhyay

Ben Collver, I enjoy your postings a lot. Keep'em coming.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Shepelev@21:1/5 to All on Sat Jan 25 21:14:31 2025

Lawrence D'Oliveiro to Anton Shepelev:

Line-numbering would seem to be implementable as an
output-line trap using a counter in a numeric
register...

Testimony from those who were there <https://www.gnu.org/software/groff/manual/groff.html.node/Background.html>:

When Unix was up and running on the PDP-11, Joe
[Ossanna] got wind of the legal department having
installed a commercial word processor. He went to
pitch Unix as an alternative and clinched a trial by
promising to make roff able to number lines by
tomorrow in order to fulfill a patent-office
requirement that the commercial system did not
support.

From aught I know of *roff, my opinion is that Osanna
implenented line-numbering in a macro package, as I said,
rather than in the core utility itself. And it is a great
compliment to *roff.

--
() ascii ribbon campaign -- against html e-mail
/\ www.asciiribbon.org -- against proprietary attachments

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From candycanearter07@21:1/5 to Salvador Mirzo on Mon Feb 3 19:30:02 2025

Salvador Mirzo <smirzo@example.com> wrote at 15:26 this Saturday (GMT):

Ben Collver <bencollver@tilde.pink> writes:

How Unix Spell Ran in 64kB RAM
==============================

by Abhinav Upadhyay

Ben Collver, I enjoy your postings a lot. Keep'em coming.

Agreed, learning about interesting computer tidbits is great.
--
user <candycane> is generated from /dev/urandom

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Fred Blogs
  Mon Sep 15 00:03:12 2025
  from Uk via SSH
- Plume
  Sun Sep 14 09:34:52 2025
  from Uk via Raw
- Gretchiie
  Sun Sep 14 06:07:30 2025
  from Derry, Nh via Telnet
- Thlc
  Sat Sep 13 17:11:34 2025
  from Rognac, France via Telnet
- Thlc
  Sat Sep 13 17:04:03 2025
  from Rognac, France via Telnet
- Thlc
  Sat Sep 13 16:32:19 2025
  from Rognac, France via SSH
- Thlc
  Sat Sep 13 15:41:11 2025
  from Rognac, France via SSH
- Thlc
  Sat Sep 13 07:56:03 2025
  from Rognac, France via SSH

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	546
Nodes:	16 (2 / 14)
Uptime:	02:51:47
Calls:	10,386
Calls today:	1
Files:	14,057
Messages:	6,416,587

How Unix Spell Ran in 64kB RAM

Who's Online

Recent Visitors

System Info