• How Unix Spell Ran in 64kB RAM

    From Ben Collver@21:1/5 to All on Thu Jan 23 15:16:38 2025
    How Unix Spell Ran in 64kB RAM
    ==============================

    by Abhinav Upadhyay

    How do you fit a 250kB dictionary in 64kB of RAM and still perform
    fast lookups? For reference, even with modern compression techniques
    like gzip -9, you can't compress this file below 85kB.

    In the 1970s, Douglas McIlroy faced this exact challenge while
    implementing the spell checker for Unix at AT&T. The constraints of
    the PDP-11 computer meant the entire dictionary needed to fit in just
    64kB of RAM. A seemingly impossible task.

    Instead of relying on generic compression techniques, he took
    advantage of the properties of the data and developed a compression
    algorithm that came within 0.03 bits of the theoretical limit of
    possible compression. To this day, it remains unbeaten.

    The story of Unix spell is more than just historical curiosity. It's
    a masterclass in engineering under constraints: how to analyze a
    problem from first principles, leverage mathematical insights, and
    design elegant solutions that work within strict resource limits.

    If you're short on time, here's the key engineering story:

    * The Unix spell started in the 1970s as an afternoon prototype by
    Steve Johnson at AT&T, before Douglas McIlroy rewrote it to improve
    its performance and accuracy.

    * McIlroy's first innovation was a clever linguistics-based stemming
    algorithm that reduced the dictionary to just 25,000 words while
    improving accuracy.

    * For fast lookups, he initially used a Bloom filter--perhaps one of
    its first production uses. Interestingly, Dennis Ritchie provided
    the implementation. They tuned it to have such a low false positive
    rate that they could skip actual dictionary lookups.

    * When the dictionary grew to 30,000 words, the Bloom filter approach
    became impractical, leading to innovative hash compression
    techniques.

    * They computed that 27-bit hash codes would keep collision
    probability acceptably low, but needed compression.

    * McIlroy's solution was to store differences between sorted hash
    codes, after discovering these differences followed a geometric
    distribution.

    * Using Golomb's code, a compression scheme designed for geometric
    distributions, he achieved 13.60 bits per word--remarkably close to
    the theoretical minimum of 13.57 bits.

    * Finally, he partitioned the compressed data to speed up lookups,
    trading a small memory increase (final size ~14 bits per word) for
    significantly faster performance.

    The rest of the article expands each of these points and gives a
    detailed explanation with all the math and logic behind them.

    From: <https://blog.codingconfessions.com/p/how-unix-spell-ran-in-64kb-ram>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Ben Collver on Thu Jan 23 23:54:00 2025
    On Thu, 23 Jan 2025 15:16:38 -0000 (UTC), Ben Collver wrote:

    Instead of relying on generic compression techniques, he took advantage
    of the properties of the data and developed a compression algorithm that
    came within 0.03 bits of the theoretical limit of possible compression.
    To this day, it remains unbeaten.

    All of which only worked for the English language (US).

    What happened when they had to support other languages?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Eli the Bearded on Fri Jan 24 00:59:03 2025
    On Fri, 24 Jan 2025 00:23:44 -0000 (UTC), Eli the Bearded wrote:

    When it was written, Unix' main use to the owning company was producing
    phone books with the runoff tools. Those didn't need spell checking of customer names or addresses, and so had very little text _to_ spell
    check.

    Actually, no. Remember the addition of automatic line numbers to troff was
    of great interest to AT&T’s legal department, which spent a lot of time creating things like patent applications.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Eli the Bearded@21:1/5 to ldo@nz.invalid on Fri Jan 24 00:23:44 2025
    In comp.misc, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Thu, 23 Jan 2025 15:16:38 -0000 (UTC), Ben Collver wrote:
    Instead of relying on generic compression techniques, he took advantage
    of the properties of the data and developed a compression algorithm that
    came within 0.03 bits of the theoretical limit of possible compression.
    To this day, it remains unbeaten.
    All of which only worked for the English language (US).

    What happened when they had to support other languages?

    Pretty sure the answer is: The program was completely replaced.
    No one uses spell anymore, ispell or something else gets used.

    When it was written, Unix' main use to the owning company was producing
    phone books with the runoff tools. Those didn't need spell checking of
    customer names or addresses, and so had very little text _to_ spell
    check.

    Just because the solution doesn't make sense to use now, doesn't mean it
    wasn't clever then.

    Elijah
    ------
    probably still works better than chapgpt

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Shepelev@21:1/5 to All on Fri Jan 24 13:16:24 2025
    Lawrence D'Oliveiro:

    Actually, no. Remember the addition of automatic line
    numbers to troff was of great interest to AT&Tôs legal
    department.

    Sounds MSWord-ish to me. troff is the typesettng assem-
    bler, where high-level functionality is implemented in
    macros and macro packages. Line-numbering would seem to
    be implementable as an output-line trap using a counter
    in a numeric register... Perhaps, you were talking not
    about troff, but about one of its macro packages?

    --
    () ascii ribbon campaign -- against html e-mail
    /\ www.asciiribbon.org -- against proprietary attachments

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Shepelev@21:1/5 to All on Fri Jan 24 13:18:10 2025
    I wrote:

    Lawrence D'Oliveiro:

    Actually, no. Remember the addition of automatic line
    numbers to troff was of great interest to AT&Tôs legal
    department.

    Sounds MSWord-ish to me. troff is the typesettng assem-
    bler, where high-level functionality is implemented in
    macros and macro packages. Line-numbering would seem to
    be implementable as an output-line trap using a counter
    in a numeric register... Perhaps, you were talking not
    about troff, but about one of its macro packages?

    The message above was typeset in nroff from the following
    sounce:

    Lawrence D'Oliveiro:
    .(Q
    Actually, no. Remember the addition of automatic line numbers to troff was
    of great interest to AT&Tôs legal department.
    .)Q
    Sounds MSWord-ish to me.
    troff is the typesettng assembler,
    where high-level functionality
    is implemented in macros and macro packages.
    Line-numbering would seem to be implementable
    as an output-line trap using
    a counter in a numeric register...
    Perhaps, you were talking not about troff,
    but about one of its macro packages?

    --
    () ascii ribbon campaign -- against html e-mail
    /\ www.asciiribbon.org -- against proprietary attachments

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Shepelev on Sat Jan 25 00:46:43 2025
    On Fri, 24 Jan 2025 13:16:24 +0300, Anton Shepelev wrote:

    Lawrence D'Oliveiro:

    Actually, no. Remember the addition of automatic line numbers to
    troff was of great interest to AT&T’s legal department, which spent
    a lot of time creating things like patent applications.

    Sounds MSWord-ish to me.

    There was no Microsoft at the time.

    Line-numbering would seem to be implementable as an output-line trap
    using a counter in a numeric register...

    Testimony from those who were there <https://www.gnu.org/software/groff/manual/groff.html.node/Background.html>:

    When Unix was up and running on the PDP-11, Joe [Ossanna] got wind
    of the legal department having installed a commercial word
    processor. He went to pitch Unix as an alternative and clinched a
    trial by promising to make roff able to number lines by tomorrow
    in order to fulfill a patent-office requirement that the
    commercial system did not support.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Salvador Mirzo@21:1/5 to Ben Collver on Sat Jan 25 12:26:49 2025
    Ben Collver <bencollver@tilde.pink> writes:

    How Unix Spell Ran in 64kB RAM
    ==============================

    by Abhinav Upadhyay

    Ben Collver, I enjoy your postings a lot. Keep'em coming.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Shepelev@21:1/5 to All on Sat Jan 25 21:14:31 2025
    Lawrence D'Oliveiro to Anton Shepelev:

    Line-numbering would seem to be implementable as an
    output-line trap using a counter in a numeric
    register...

    Testimony from those who were there <https://www.gnu.org/software/groff/manual/groff.html.node/Background.html>:

    When Unix was up and running on the PDP-11, Joe
    [Ossanna] got wind of the legal department having
    installed a commercial word processor. He went to
    pitch Unix as an alternative and clinched a trial by
    promising to make roff able to number lines by
    tomorrow in order to fulfill a patent-office
    requirement that the commercial system did not
    support.

    From aught I know of *roff, my opinion is that Osanna
    implenented line-numbering in a macro package, as I said,
    rather than in the core utility itself. And it is a great
    compliment to *roff.

    --
    () ascii ribbon campaign -- against html e-mail
    /\ www.asciiribbon.org -- against proprietary attachments

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From candycanearter07@21:1/5 to Salvador Mirzo on Mon Feb 3 19:30:02 2025
    Salvador Mirzo <smirzo@example.com> wrote at 15:26 this Saturday (GMT):
    Ben Collver <bencollver@tilde.pink> writes:

    How Unix Spell Ran in 64kB RAM
    ==============================

    by Abhinav Upadhyay

    Ben Collver, I enjoy your postings a lot. Keep'em coming.


    Agreed, learning about interesting computer tidbits is great.
    --
    user <candycane> is generated from /dev/urandom

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)