• What does it mean to "move characters" in the lexer?

    From Roger L Costello@21:1/5 to All on Tue Jun 21 10:27:15 2022
    Hi Folks,

    Page 89 of the dragon book says:

    Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead to process an input character.

    Page 102 of "A Retargetable C Compiler: Design and Implementation" says:

    The lexical analyzer's main activity is moving characters, so minimizing the amount of character movement helps increase speed.

    And on page 103 it says:

    An important consequence of this design is that most of the input characters are accessed by *cp and many characters are never moved. Only identifiers (excluding keywords) and string literals that appear in executable code are copied out of the buffer into permanent storage.

    I don't understand what they mean by "moving characters". Do they mean copying characters? Do they mean reading characters from a file into memory? Would you explain what this "character movement" thing is all about, please?

    /Roger
    [Keeping in mind that this was written in the 1970s, they mean copying strings of characters from one place to another. On a PDP-11. With 64K bytes of memory.
    It is still true that character processing in a lexer consumes a large fraction of the time in compilers. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Roger L Costello on Tue Jun 21 10:30:58 2022
    On Tuesday, June 21, 2022 at 9:25:12 AM UTC-7, Roger L Costello wrote:

    (snip)
    Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead to process an input character.

    (snip)
    I don't understand what they mean by "moving characters". Do they mean copying
    characters? Do they mean reading characters from a file into memory? Would you
    explain what this "character movement" thing is all about, please?

    Yes it is copying, and yes it can take a lot of the time.

    On many systems, the disk controller reads the data into its own buffer,
    and then the OS copies the data from the controller buffer into its buffer.

    Then when the user does an I/O (input) request, the data is copied
    into the program's own buffer, and finally into the place where the data actually goes. So maybe four copies.

    Early in the days of TCP/IP there was trailer encapsulation.
    (I never saw it used, but some have the ability to turn it on.)

If you follow the ISO seven-layer model, or even if you don't:

    The program gives data to TCP, which divides it up into
    packets to send. Each of those gets a TCP header.
    It is then passed to IP where IP puts its header on.
    And then before sending, it gets an Ethernet header.

Since there is often something already in front of the buffer,
while the buffer might not be full and so has space at the
end, trailer encapsulation was invented. Instead of
putting the TCP and IP headers on the beginning, you
put them on the end! Less copying!

    I believe people found other ways to reduce copying, though.

    The I/O hardware for IBM S/360 copies data directly
    from the I/O device into memory. (Memory was expensive!)
    Also, it is blocked on disk the same as it is for the user,
    unlike most systems now. It would be usual, though,
    for the last copy -- from the I/O buffer to/from the actual
    data area -- to be an actual copy. IBM has locate mode
    I/O to eliminate that one. For locate mode, instead of
    copying, the program gets a pointer to the actual buffer.
    (That works in assembly and PL/I, C hadn't been invented.)

    For write, you request the address of the output buffer,
    operate on the data there, and then request it be written.

There has been much work over the years on reducing
the amount of data copied, or the operations needed to
copy it. On byte-addressed machines, for example, one can
copy data a whole word at a time (depending on alignment).

    There are also search algorithms like Boyer-Moore,
    to search strings without looking at every character.
    [Now you can usually ask operating systems to map a file into your
    process so there is no extra copying at all, the disk reads a block
    into a page frame and you address the data directly in that page
    frame. In a program called grepcidr that does grep-like searches for
    IP address strings, for large files it got somewhat faster when I
    switched from stdio to mapping the whole file in and treating it as
    one big string. This is pretty remote from compilers, though. tl;dr
    less copying is faster. -John]

  • From Christopher F Clark@21:1/5 to All on Wed Jun 22 00:44:05 2022
    While worrying about copying characters around in compilers isn't given
    much thought these days, it is very relevant to people implementing
    networking software and also those doing hardware accelerators and their
device drivers. The startup I'm working with these days spends a lot of
time worrying about zero-copy abstractions, i.e. how to avoid moving
data around. Of course, that doesn't surprise me: we are building
hardware accelerators, much of the staff has a networking background,
and our accelerators communicate with each other over network
connections or shared memory. But the less the data moves, the more
throughput we get, with less energy usage and usually less hardware too.

--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------

  • From Kaz Kylheku@21:1/5 to Christopher F Clark on Wed Jun 22 01:13:50 2022
    On 2022-06-21, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
    While worrying about copying characters around in compilers isn't given
    much thought these days, it is very relevant to people implementing networking software and also those doing hardware accelerators and their device drivers.

Don't write off buffering optimizations as yesterday's game just yet;
    e.g. this could still be relevant to someone needing to parse gigabytes
    of JSON or XML. Or maybe even just megabytes.

    I remember reading some article some years ago whereby some Javascript programmer discovered it was faster to read JSON from a file using
    dedicated JSON routines available in Javascript, than to declare the
    same syntax in the Javascript program as a literal and let it be
    scanned along with the program and available to it that way.

    (I realize there may be other reasons for the performance difference,
    because JSON isn't Javascript, but likely part of it was that the JS implementation didn't care about being efficient for large amounts of
    data: the "hot spot" isn't going to be the scanning stage that takes
place once, before the program has a chance to execute any sort of
    loop.)

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From Kaz Kylheku@21:1/5 to Roger L Costello on Wed Jun 22 01:05:52 2022
    On 2022-06-21, Roger L Costello <costello@mitre.org> wrote:
    Hi Folks,

    Page 89 of the dragon book says:

    Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead to process an input character.

    It's not clear what exactly this is referring to, but probably just to
    the practice of making multiple copies of the same data in the
    processing stack. If we put an imaginary tracer on a single character,
we may see it hopping among multiple buffers. In the Chapter 2 lexical
    analyzer, getchar is used to read a character; getchar fills a buffer in
    the stdio stream, and the program is sucking it out from there one
    character at a time. So to build a lexeme, it has to have its own buffer
    for the lexeme, which is another copy of the data.

    The technique described in chapter 3 allows bulk reads into a buffer, eliminating the stream library. The tokens are delimited right inside
    the buffer, reducing a copy.

    [The technique seems closely related to the "flip buffer",
    which can be found in places like TTY implementations of Unix
    kernels. Linux had one; there was once a "struct tty_flip_buffer".
At a quick glance, it looks like today there are more than two
buffers used, which are allocated and freed on the fly. There is still a <linux/tty_flip.h> which contains a modicum of "flip" terminology.]

  • From Thomas Koenig@21:1/5 to Kaz Kylheku on Wed Jun 22 11:45:22 2022
    Kaz Kylheku <480-992-1380@kylheku.com> schrieb:

    I remember reading some article some years ago whereby some Javascript programmer discovered it was faster to read JSON from a file using
    dedicated JSON routines available in Javascript, than to declare the
    same syntax in the Javascript program as a literal and let it be
    scanned along with the program and available to it that way.

    This came up on comp.arch recently.

There is an insanely fast JSON parser and UTF-8 validator based
    on SIMD to be found at https://github.com/simdjson/simdjson .
    They select a different length of vector according to
    the CPU version they find. The algorithm is described at https://arxiv.org/pdf/1902.08318.pdf. It
    heavily relies on special-casing for JSON and for the SIMD
    instructions that are available.

    A general SIMD-based parser generator is likely to be even harder
to write and will probably not outperform the package above (nor,
for that matter, a traditional character-at-a-time approach).

    Is there research on this?
