• What does it mean to "move characters" in the lexer?

    From Roger L Costello@21:1/5 to All on Tue Jun 21 10:27:15 2022
    Hi Folks,

    Page 89 of the dragon book says:

    Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead to process an input character.

    Page 102 of "A Retargetable C Compiler: Design and Implementation" says:

    The lexical analyzer's main activity is moving characters, so minimizing the amount of character movement helps increase speed.

    And on page 103 it says:

    An important consequence of this design is that most of the input characters are accessed by *cp and many characters are never moved. Only identifiers (excluding keywords) and string literals that appear in executable code are copied out of the buffer into permanent storage.

    I don't understand what they mean by "moving characters". Do they mean copying characters? Do they mean reading characters from a file into memory? Would you explain what this "character movement" thing is all about, please?

    /Roger
    [Keeping in mind that this was written in the 1970s, they mean copying strings of characters from one place to another. On a PDP-11. With 64K bytes of memory.
    It is still true that character processing in a lexer consumes a large fraction of the time in compilers. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Roger L Costello on Tue Jun 21 10:30:58 2022
    On Tuesday, June 21, 2022 at 9:25:12 AM UTC-7, Roger L Costello wrote:

    (snip)
    Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead to process an input character.

    (snip)
    I don't understand what they mean by "moving characters". Do they mean copying
    characters? Do they mean reading characters from a file into memory? Would you
    explain what this "character movement" thing is all about, please?

    Yes it is copying, and yes it can take a lot of the time.

    On many systems, the disk controller reads the data into its own buffer,
    and then the OS copies the data from the controller buffer into its buffer.

    Then when the user does an I/O (input) request, the data is copied
    into the program's own buffer, and finally into the place where the data actually goes. So maybe four copies.

    Early in the days of TCP/IP there was trailer encapsulation.
    (I never saw it used, but some have the ability to turn it on.)

If you follow the ISO seven-layer model, or even if you don't:

    The program gives data to TCP, which divides it up into
    packets to send. Each of those gets a TCP header.
    It is then passed to IP where IP puts its header on.
    And then before sending, it gets an Ethernet header.

Since there is often something already in front of the buffer,
while the buffer might not be full and so has space at the
end, trailer encapsulation was invented. Instead of
putting the TCP and IP headers on the beginning, you
put them on the end! Less copying!

    I believe people found other ways to reduce copying, though.

    The I/O hardware for IBM S/360 copies data directly
    from the I/O device into memory. (Memory was expensive!)
    Also, it is blocked on disk the same as it is for the user,
    unlike most systems now. It would be usual, though,
    for the last copy -- from the I/O buffer to/from the actual
    data area -- to be an actual copy. IBM has locate mode
    I/O to eliminate that one. For locate mode, instead of
    copying, the program gets a pointer to the actual buffer.
    (That works in assembly and PL/I, C hadn't been invented.)

    For write, you request the address of the output buffer,
    operate on the data there, and then request it be written.

There has been much work over the years on reducing
the amount of data copied, or the operations needed to
copy it. On byte-addressed machines, for example, one can
copy data a whole word at a time (depending on alignment).

    There are also search algorithms like Boyer-Moore,
    to search strings without looking at every character.
    [Now you can usually ask operating systems to map a file into your
    process so there is no extra copying at all, the disk reads a block
    into a page frame and you address the data directly in that page
    frame. In a program called grepcidr that does grep-like searches for
    IP address strings, for large files it got somewhat faster when I
    switched from stdio to mapping the whole file in and treating it as
    one big string. This is pretty remote from compilers, though. tl;dr
    less copying is faster. -John]

  • From Christopher F Clark@21:1/5 to All on Wed Jun 22 00:44:05 2022
    While worrying about copying characters around in compilers isn't given
    much thought these days, it is very relevant to people implementing
    networking software and also those doing hardware accelerators and their
device drivers. The startup I'm working with these days spends a lot of
time worrying about zero-copy abstractions, i.e. how to avoid moving
data around. Of course, that doesn't surprise me: we are building
hardware accelerators, much of the staff has a networking background,
and our accelerators communicate with each other over network
connections or shared memory. But the less the data moves, the more
throughput we get, with less energy usage and usually less hardware too.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------

  • From Kaz Kylheku@21:1/5 to Christopher F Clark on Wed Jun 22 01:13:50 2022
    On 2022-06-21, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
    While worrying about copying characters around in compilers isn't given
    much thought these days, it is very relevant to people implementing networking software and also those doing hardware accelerators and their device drivers.

Don't write off buffering optimizations as yesterday's game just yet;
    e.g. this could still be relevant to someone needing to parse gigabytes
    of JSON or XML. Or maybe even just megabytes.

    I remember reading some article some years ago whereby some Javascript programmer discovered it was faster to read JSON from a file using
    dedicated JSON routines available in Javascript, than to declare the
    same syntax in the Javascript program as a literal and let it be
    scanned along with the program and available to it that way.

    (I realize there may be other reasons for the performance difference,
    because JSON isn't Javascript, but likely part of it was that the JS implementation didn't care about being efficient for large amounts of
    data: the "hot spot" isn't going to be the scanning stage that takes
place once, before the program has a chance to execute any sort of
    loop.)

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From Kaz Kylheku@21:1/5 to Roger L Costello on Wed Jun 22 01:05:52 2022
    On 2022-06-21, Roger L Costello <costello@mitre.org> wrote:
    Hi Folks,

    Page 89 of the dragon book says:

    Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead to process an input character.

    It's not clear what exactly this is referring to, but probably just to
    the practice of making multiple copies of the same data in the
    processing stack. If we put an imaginary tracer on a single character,
we may see it hopping among multiple buffers. In the Chapter 2 lexical
    analyzer, getchar is used to read a character; getchar fills a buffer in
    the stdio stream, and the program is sucking it out from there one
    character at a time. So to build a lexeme, it has to have its own buffer
    for the lexeme, which is another copy of the data.

    The technique described in chapter 3 allows bulk reads into a buffer, eliminating the stream library. The tokens are delimited right inside
    the buffer, reducing a copy.

    [The technique seems closely related to the "flip buffer",
    which can be found in places like TTY implementations of Unix
    kernels. Linux had one; there was once a "struct tty_flip_buffer".
At a quick glance, it looks like today there are more than two
buffers used, which are allocated and freed on the fly. There is still a <linux/tty_flip.h> which contains a modicum of "flip" terminology.]

  • From Thomas Koenig@21:1/5 to Kaz Kylheku on Wed Jun 22 11:45:22 2022
    Kaz Kylheku <480-992-1380@kylheku.com> schrieb:

    I remember reading some article some years ago whereby some Javascript programmer discovered it was faster to read JSON from a file using
    dedicated JSON routines available in Javascript, than to declare the
    same syntax in the Javascript program as a literal and let it be
    scanned along with the program and available to it that way.

    This came up on comp.arch recently.

There is an insanely fast JSON parser and UTF-8 validator based
    on SIMD to be found at https://github.com/simdjson/simdjson .
    They select a different length of vector according to
    the CPU version they find. The algorithm is described at https://arxiv.org/pdf/1902.08318.pdf. It
    heavily relies on special-casing for JSON and for the SIMD
    instructions that are available.

    A general SIMD-based parser generator is likely to be even harder
to write and will probably not outperform the package above (nor,
for that matter, a traditional character-at-a-time approach).

    Is there research on this?
