• Final request for feedback

    From David Newall@21:1/5 to All on Sun Feb 20 14:11:37 2022
    Hi All,

    I'm about to publish my UTF-8 code. Before I do I'm asking for feedback
    and opinions for what should be the last time.

    What's different about what I'm finally intending to publish:

    1. I'm using a dictionary for the UNICODE encoding map, instead of a
    sparse array. This isn't because it's faster -- it's 3ns slower, which
    seems quite acceptable -- nor because it's smaller -- a dictionary is
    over double the size for GNU's UnifontMedium. I'm doing this because
    it's two fewer files to publish -- I don't need to publish sparseget
    and I don't need to publish an AWK script to convert Fontforge .g2n
    files into a sparse array.

    2. I've replaced utf8show with utf8decode (which generates an array of
    UNICODE values) and unicodeshow.

    3. I'm not storing the map in the font, but passing it as a parameter
    to unicodeshow because I think it's simpler. Storing it in the font
    means defining a new font (definefont).
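
    The published utf8decode and unicodeshow are not shown in this thread,
    so the following is only a minimal sketch of the interface described in
    points 2 and 3. The names utf8decode-sketch and unicodeshow-sketch are
    made up here, and the map is assumed to be a dictionary keyed by integer
    code point whose values are glyph names.

```postscript
% Hypothetical sketch, not the published code.
%: string -- array-of-codepoints
/utf8decode-sketch {
  4 dict begin
  /acc 0 def                      % code point being assembled
  /need 0 def                     % continuation bytes still expected
  [ exch
    {
      /b exch def
      b 2#11000000 and 2#10000000 eq need 0 gt and {
        % expected continuation byte: append its 6 payload bits
        /acc acc 6 bitshift b 2#00111111 and or def
        /need need 1 sub def
        need 0 eq { acc } if
      } {
        % anything else ends a pending sequence early
        need 0 gt { 16#FFFD /need 0 def } if
        b 128 lt { b } {
          b 2#11100000 and 2#11000000 eq { /acc b 2#00011111 and def /need 1 def } {
          b 2#11110000 and 2#11100000 eq { /acc b 2#00001111 and def /need 2 def } {
          b 2#11111000 and 2#11110000 eq { /acc b 2#00000111 and def /need 3 def } {
            16#FFFD               % stray continuation or invalid lead byte
          } ifelse } ifelse } ifelse
        } ifelse
      } ifelse
    } forall
    need 0 gt { 16#FFFD } if      % input ended in the middle of a sequence
  ]
  end
} def

% Assumes map is a dictionary: integer code point -> glyph name.
%: map array --
/unicodeshow-sketch {
  {                               % map code
    1 index exch                  % map map code
    2 copy known { get } { pop pop /.notdef } ifelse
    glyphshow                     % map
  } forall
  pop
} def

/Helvetica 20 selectfont
100 300 moveto
<< 72 /H 105 /i 32 /space 16#263A /smileface >>   % tiny stand-in map
(Hi \342\230\272) utf8decode-sketch
unicodeshow-sketch
showpage
```

    Like the decoder discussed later in the thread, this sketch does not
    reject overlong sequences. Passing the map on the stack keeps the font
    itself untouched, which is the simplification point 3 is about.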

    These are the alternative programs for printing a UTF-8 string.

    This is what I think I'll publish:

    %!PS
    %%IncludeResource: procset unicodeshow
    %%IncludeResource: procset utf8decode
    /Helvetica 20 selectfont
    100 300 moveto
    (Welcome to \342\200\234UTF-8\342\200\235 \342\230\272) utf8decode
    ReverseAdobeGlyphList exch unicodeshow
    showpage

    This is what I was previously intending, using a dictionary:

    %!PS
    %%IncludeResource: procset unicodefont
    %%IncludeResource: procset unicodeshow
    %%IncludeResource: procset utf8decode
    /Helvetica findfont 20 scalefont ReverseAdobeGlyphList unicodefont
    /MyFont exch definefont setfont
    100 300 moveto
    (Welcome to \342\200\234UTF-8\342\200\235 \342\230\272) utf8decode
    unicodeshow
    showpage

    There's one extra line if using a sparse array instead of a dictionary:

    %%IncludeResource: procset sparseget

    I think the first is better but am open to opposing opinions.

    Thanks,

    David

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From luser droog@21:1/5 to David Newall on Tue Feb 22 07:36:22 2022
    On Saturday, February 19, 2022 at 9:11:52 PM UTC-6, David Newall wrote:
    > Hi All,
    >
    > I'm about to publish my UTF-8 code. Before I do I'm asking for feedback
    > and opinions for what should be the last time.

    That looks really good to me. I'm a little sad that definefont is out,
    but it really doesn't appear to offer very much. It seems like
    PostScript *almost* has the pieces available to put this together
    seamlessly. But the conversion probably can't use a filtered file
    because of the need to convert from a string to an array. And packing
    the glyph selection into a composite font would be a ton of work if
    it's even possible.

  • From Carlos@21:1/5 to luser droog on Sat Feb 26 01:44:36 2022
    On Tue, 22 Feb 2022 07:36:22 -0800 (PST)
    luser droog <luser.droog@gmail.com> wrote:
    > [...] And packing
    > the glyph selection into a composite font would be a ton of work if
    > it's even possible.

    It is possible to create a tree of composite fonts, where each byte in
    a UTF-8 sequence dispatches to the next font, and the last one picks
    the glyph. The problems with this approach are 1. the complexity of
    creating and populating the font tree, and 2. the fact that
    the base fonts at the leaves can only encode 64 glyphs each (since
    that's how many values the last byte in a multibyte UTF-8 sequence can
    hold), and not even at the beginning of the /Encoding array, which is a
    waste.

    A simpler approach is to reencode the UTF-8 string to a made-up UTF-24
    encoding (3 bytes per codepoint), and then use a simple chain of 8/8
    mapping (FMapType 2) composite fonts. Here the first byte selects the Unicode
    plane (sections of 65536 codepoints; only 4 or 5 are assigned), the
    second byte the segment of 256 codepoints in that plane, and the third
    one the glyph inside that segment.
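
    The split into plane, segment and glyph index can be sketched as
    follows. This only illustrates the arithmetic; the name utf24bytes is
    made up for the example and is not part of the code posted below.

```postscript
% Hypothetical helper: split a code point into the three "UTF-24" bytes.
%: codepoint -- plane segment index
/utf24bytes {
  dup 65536 idiv exch            % plane cp
  dup 65536 mod 256 idiv exch    % plane segment cp
  256 mod                        % plane segment index
} def

16#263A utf24bytes     % U+263A: leaves 0 38 58 on the stack
pop pop pop
16#1F600 utf24bytes    % U+1F600: leaves 1 246 0 on the stack
pop pop pop
```

    With FMapType 2 each composite font consumes one byte to pick an entry
    of its FDepVector, so the first two values select fonts and only the
    third is looked up in a base font's Encoding.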

    While in theory this needs 1 comp. font to choose the plane + 256 comp.
    fonts (1 for each plane) + 256x256 base fonts = 65793 fonts, the
    majority of them are just the same empty font.

    Below is an example of this approach. You get a unicode font by calling
    "unicodize" on a font with CharStrings, and you reencode UTF-8 strings
    with the "u" operator:

    /Courier-Unicode /Courier findfont unicodize 12 scalefont setfont
    (oh là là)u show

    It uses the AdobeGlyphList for now -- maybe David will come up with
    something better.

    The code probably has some bugs. I only tested it with Emacs' "Hello"
    demo:

    %!PS

    /f /Arial findfont def
    /uf /UFont f unicodize def

    uf 14 scalefont setfont
    700
    [
    ( Europe: ¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu)
    ( Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა)
    ( Africa: ሠላም)
    ( Middle/Near East: שָׁלוֹם, السّلام عليكم)
    ( South Asia: નમસ્તે, नमस्ते, ನಮಸ್ಕಾರ, നമസ്കാരം, ଶୁଣିବେ,)
    ( ආයුබෝවන්, வணக்கம், నమస్కారం, བཀྲ་ཤིས་བདེ་ལེགས༎)
    ( South East Asia: ជំរាបសួរ, ສະບາຍດີ, မင်္ဂလာပါ, สวัสดีครับ, Chào bạn)
    ( East Asia: 你好, 早晨, こんにちは, 안녕하세요)
    ( Misc: Eĥoŝanĝo ĉiuĵaŭde, ⠓⠑⠇⠇⠕, ∀ p ∈ world • hello p □)
    ( CJK variety: GB(元气,开发), BIG5(元氣,開發), JIS(元気,開発), KSC(元氣,開發))
    ( Unicode charset: Eĥoŝanĝo ĉiuĵaŭde, Γειά σας, שלום, Здравствуйте!)
    ] {
    1 index 20 exch moveto
    u show
    30 sub
    } forall

    pop
    showpage

    Here's the code. Our old friend the iterator makes an appearance :)

    %!PS

    %% create a composite font suitable for strings with UTF-24 encoding
    %: key originalfont -- newfont
    /unicodize {
      40 dict begin
      /ofont exch def
      /key exch def
      /fname key dup length string cvs def
      /basefonts 10 dict def
      /planefonts 10 dict def
      %: string string -- name
      /newname {
        /s2 exch def /s1 exch def
        /s s1 length s2 length add 1 add string def
        s 0 s1 putinterval
        s s1 length (-) putinterval
        s s1 length 1 add s2 putinterval
        s cvn
      } def
      %: int -- string
      /tohex { 16 10 string cvrs } def
      %: array element -- newarray
      /append { /e exch def [ exch aload pop e ] } def
      %: suffix -- font
      /newbasefont {
        /suffix exch def
        /name fname suffix newname def
        ofont dup length dict copy
        dup /Encoding [ 256 { /.notdef } repeat ] put
        dup /FontName name put
        dup basefonts exch name exch put
      } def
      /emptybasefont (Base-E) newbasefont def
      %: suffix -- font
      /newplanefont {
        /suffix exch def
        /name fname suffix newname def
        << /FontType 0
           /FontMatrix [ 1 0 0 1 0 0 ]
           /FontName name
           /FMapType 2
           /Encoding [ 256 { 0 } repeat ]
           /FDepVector [ emptybasefont ]
        >>
        dup planefonts exch name exch put
      } def
      /emptyplanefont (Plane-E) newplanefont def
      /mainfont << /FontType 0
                   /FontMatrix [ 1 0 0 1 0 0 ]
                   /FontName fname
                   /FMapType 2
                   /Encoding [ 256 { 0 } repeat ]
                   /FDepVector [ emptyplanefont ]
                >> def
      %: font subfont code --
      /addsubfont {
        /c exch def /sf exch def /f exch def
        f /FDepVector 2 copy get sf append put
        f /Encoding get c f /FDepVector get length 1 sub put
      } def
      %: glyphname code --
      /putglyph {
        dup /plane exch 65536 idiv def
        dup /range exch 65536 mod 256 idiv def
        /code exch 256 mod def
        /glyph exch def
        /idx mainfont /Encoding get plane get def
        idx 0 eq {
          plane tohex newplanefont
          dup mainfont exch plane addsubfont
        } {
          mainfont /FDepVector get idx get
        } ifelse
        /planefont exch def
        /idx planefont /Encoding get range get def
        idx 0 eq {
          plane 256 mul range add tohex newbasefont
          dup planefont exch range addsubfont
        } {
          planefont /FDepVector get idx get
        } ifelse
        /basefont exch def
        basefont /Encoding get code glyph put
      } def
      %: glyphname -- code true | false
      /getcode {
        /g exch def
        AdobeGlyphList g known {
          AdobeGlyphList g get true
        } {
          /s g g length string cvs def
          s length 7 eq {
            s 0 3 getinterval (uni) eq {
              s 7 string copy dup 0 (16#) putinterval
              { cvi } stopped { pop false } { true } ifelse
            } {
              s 0 1 getinterval (u) eq {
                9 string dup 3 s 1 6 getinterval putinterval
                dup 0 (16#) putinterval
                { cvi } stopped { pop false } { true } ifelse
              } { false } ifelse
            } ifelse
          } { false } ifelse
        } ifelse
      } def
      % fill the fonts...
      ofont /CharStrings get { pop dup getcode { putglyph } { pop } ifelse } forall
      % register them...
      basefonts { definefont pop } forall
      planefonts { definefont pop } forall
      % register & return main font
      key mainfont definefont
      end
    } bind def

    %: string|array -- iterator ( -- nextchar true | false )
    /sequenceiterator {
      2 dict begin
      /s exch def
      % the position lives in a one-element array so that the generated
      % procedure can update it in place; it is the procedure's first element
      /counter [ 0 ] def
      [ counter 0 /get cvx s length /lt cvx [
          s counter 0 /get cvx /get cvx true
          counter 0 2 /copy cvx /get cvx 1 /add cvx /put cvx
        ] cvx [
          false
        ] cvx /ifelse cvx
      ] cvx
      end
    } bind def

    %% reencode UTF-8 to UTF-24
    %: string -- string
    /u {
      3 dict begin
      /src exch def
      /nextch src sequenceiterator def
      % each source byte yields at most one codepoint, so 3 bytes per
      % source byte is always enough, even with stray continuation bytes
      src length 3 mul string /dest exch def
      0 {
        % decode sequence
        nextch not { exit } if
        dup 128 lt {
          0 % 0xxxxxxx - 0 following bytes
        } {
          dup dup 2#11000000 ge exch 2#11011111 le and {
            2#00011111 and 1 % 110xxxxx - 1 following byte
          } {
            dup dup 2#11100000 ge exch 2#11101111 le and {
              2#00001111 and 2 % 1110xxxx - 2 following bytes
            } {
              dup dup 2#11110000 ge exch 2#11110111 le and {
                2#00000111 and 3 % 11110xxx - 3 following bytes
              } {
                pop 0 0 % invalid sequence
              } ifelse
            } ifelse
          } ifelse
        } ifelse
        { 6 bitshift nextch pop 2#00111111 and add } repeat
        % stack: index-to-dest, codepoint
        2 copy 65536 idiv dest 3 1 roll put
        exch 1 add exch 2 copy 65536 mod 256 idiv dest 3 1 roll put
        exch 1 add exch 2 copy 256 mod dest 3 1 roll put pop
        1 add
      } loop
      % stack: number of bytes written; return only that part of dest
      dest exch 0 exch getinterval
      end
    } bind def
    --

  • From Carlos@21:1/5 to David Newall on Sat Feb 26 01:56:15 2022
    On Sun, 20 Feb 2022 14:11:37 +1100
    David Newall <davidn@davidnewall.com> wrote:
    > [...]
    > 3. I'm not storing the map in the font, but passing it as a parameter
    > to unicodeshow because I think it's simpler. Storing it in the font
    > means defining a new font (definefont).

    I think the map problem -- how to get a good map, since the
    AdobeGlyphList is insufficient -- is the key. The interface and/or
    implementation IMO is not so important (I posted an alternative
    implementation in another message -- but it's still limited to the
    meager 4K+ glyphs in the Adobe list plus whatever extra /uniXXXX the
    font has...).

    C.
    --

  • From David Newall@21:1/5 to Carlos on Mon Feb 28 21:39:30 2022
    Hi Carlos,

    On 26/2/22 11:44, Carlos wrote:
    > A simpler approach is to reencode the UTF-8 string

    What an elegant decoder; and I like the iterator with its clever use of
    an array.

    Invalid sequences should produce U+FFFD. Add:

    /unget {
    load 0 get dup 0 get dup 0 gt
    { 1 sub 0 exch put } { pop pop } ifelse
    } def

    and then only two changes:

    pop 16#FFFD 0 % invalid sequence

    and

    6 bitshift nextch not { pop 16#FFFD exit } if
    dup 2#11000000 and 2#10000000 ne
    { /nextch unget pop pop 16#FFFD exit } if % drop byte and partial value
    2#00111111 and add

    It still accepts overlong sequences but gives output consistent with the
    input.
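
    Rejecting overlong sequences only needs the decoded value and the
    number of continuation bytes, since each sequence length has a smallest
    code point it may legitimately encode. A minimal sketch, assuming the
    follow-byte count is kept around after the repeat (the posted decoder
    consumes it); the name overlong? is made up for this illustration.

```postscript
% Hypothetical check, not part of the posted code.
%: codepoint nfollowing -- bool
/overlong? {
  [ 0 16#80 16#800 16#10000 ] exch get   % smallest value for this length
  lt
} def

0 1 overlong?          % true:  <C0 80> is an overlong encoding of U+0000
pop
16#263A 2 overlong?    % false: U+263A really needs three bytes
pop
```

    A decoder using it would substitute 16#FFFD whenever the check returns
    true, in the same way as the other invalid cases above.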

    Regards,

    David

  • From David Newall@21:1/5 to Carlos on Mon Feb 28 22:26:53 2022
    Hi Carlos,

    On 26/2/22 11:44, Carlos wrote:
    > It is possible to create a tree of composite fonts, where each byte in
    > a UTF-8 sequence dispatches to the next font, and the last one picks
    > the glyph.

    Thank you for the clearest example of composite fonts that I've ever
    seen. Unfortunately, they lose useful cshow (only the last byte of each
    character is pushed on the stack) and don't work at all with kshow.

    It's an intriguing idea but I'm not sure where to go with it.

    What I'm currently working on fails when exceeding 64K glyphs (Adobe
    PostScript array and dictionary implementation limits). A composite
    font gets past that, but not when simply transforming a standard font
    into a composite font (CharStrings limit).

    Regards,

    David
