Has there ever been a hardware architecture that managed the flow of
control via “continuations”?
That is, you do away with the hardware concept of a stack. Instead, you
have call frames that, while defined to some extent by the architecture,
can be located anywhere in memory (allocation managed by the OS, runtime
etc as part of the ABI). A call frame has a current program counter, and maybe some other context like local variables and a static link for
lexical binding. Instead of a “return from subroutine” instruction, you have a “load new call frame” instruction.
You might have a location defined in a call frame for a “pointer to parent call frame” field, in which case “return from subroutine” just loads this
pointer into the current-call-frame register. But you could just as easily have pointers to other call frames defining other kinds of
interrelationships between them. And note that transferring to a different call frame does not automatically invalidate the previous one. If it stays valid, then there is no reason why you couldn’t, at some point, come back to it and resume execution from where it left off.
The beauty of continuations is that they are a single generalized control construct that can be used to implement specific language features like regular routine calls, loops, exceptions and coroutines, all built from
the same common abstraction. One thing that is difficult to do with them
is arbitrary gotos. (I consider that a feature, not a bug.)
Very few high-level languages (outside of the Lisp family, anyway) seem to have implemented continuations as an explicit language concept. This is an integral part of Scheme, not so much it seems of non-Scheme Lisps. I implemented it in my PostScript revival language <https://bitbucket.org/ldo17/gxscript/>, and I am still trying to come up with a useful example, like for instance an implementation of coroutines, that doesn’t do my head in. ;)
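As a concrete (if simplified) picture of what such an ABI might look
like, here is a minimal C model; the trampoline loop stands in for the
hardware's current-call-frame register and "load new call frame"
instruction, and every name in it is invented for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    /* A call frame: lives anywhere in memory, and holds its own resume
       point (program counter) plus a pointer to its parent frame. */
    struct frame {
        void (*pc)(struct frame *);  /* where to resume this frame */
        struct frame *parent;        /* "pointer to parent call frame" */
        int local;                   /* an example local variable */
    };

    static struct frame *current;    /* the current-call-frame register */

    static void child(struct frame *f) {
        printf("child, local = %d\n", f->local);
        current = f->parent;         /* "return" = load the parent frame */
        free(f);                     /* freeing is a policy choice: keep
                                        the frame and it can be resumed */
    }

    static void parent_body(struct frame *f) {
        struct frame *c = malloc(sizeof *c);
        c->pc = child;
        c->parent = f;
        c->local = 42;
        f->pc = NULL;                /* parent is finished when resumed */
        current = c;                 /* "load new call frame" = the call */
    }

    int main(void) {
        struct frame top = { parent_body, NULL, 0 };
        current = &top;
        while (current && current->pc)   /* trampoline standing in for HW */
            current->pc(current);
        return 0;
    }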
In article <v6tbki$3g9rg$1@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:
The beauty of continuations is that they are a single generalized
control construct that can be used to implement specific language
features like regular routine calls, loops, exceptions and coroutines,
all built from the same common abstraction. One thing that is difficult
to do with them is arbitrary gotos. (I consider that a feature, not a
bug.)
Just occasionally, you need a goto to get out of a complicated error situation. Not often, but sometimes.
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
There is basically no real way to get dynamically allocated
stack-frames and argument copying to be performance competitive with adjusting a stack pointer and passing arguments in registers.
Function calls are also, not exactly rare.
Worse still, one is very likely going to need a garbage collector: in
the off chance anyone actually uses continuations, once a continuation
is captured via call/cc or similar, its frames can no longer be freed
automatically.
On 7/13/2024 2:50 AM, Lawrence D'Oliveiro wrote:
Has there ever been a hardware architecture that managed the flow of
control via "continuations"?
Doing an ABI based around continuations would basically wreck
performance; conventional thread-based multitasking is cheaper to
implement and gives better performance.
On Sat, 13 Jul 2024 17:42 +0100 (BST), John Dallman wrote:
In article <v6tbki$3g9rg$1@dont-email.me>, ldo@nz.invalid (Lawrence
D'Oliveiro) wrote:
The beauty of continuations is that they are a single generalized
control construct that can be used to implement specific language
features like regular routine calls, loops, exceptions and coroutines,
all built from the same common abstraction. One thing that is difficult
to do with them is arbitrary gotos. (I consider that a feature, not a
bug.)
Just occasionally, you need a goto to get out of a complicated error
situation. Not often, but sometimes.
That’s what exceptions are for. And they are handled by continuations as
well.
At its simplest, a continuation is simply "the next instruction to be
executed", so any time you take/save a code address and then jump to
it you have executed a continuation.
According to George Neuner <gneuner2@comcast.net>:
At its simplest, a continuation is simply "the next instruction to be
executed", so any time you take/save a code address and then jump to
it you have executed a continuation.
This programming style is very old. The SAGE air defense system used
it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines each of which took some input, did
something with it, perhaps scheduled other I/O operations, and then explicitly saved its state and returned to the dispatcher.
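Neuner's "simplest" form has a direct C rendering in setjmp/longjmp,
which captures a one-shot, escape-only continuation (a sketch only; a
full call/cc is considerably more than this):

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf resume_point;        /* the saved "next instruction" */

    static void worker(void) {
        printf("worker: bailing out\n");
        longjmp(resume_point, 1);       /* invoke the saved continuation */
        /* never reached */
    }

    int main(void) {
        if (setjmp(resume_point) == 0)  /* capture the continuation */
            worker();
        else
            printf("back in main via the continuation\n");
        return 0;
    }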
This programming style is very old. The SAGE air defense system used
it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines each of which took some input, did
something with it, perhaps scheduled other I/O operations, and then
explicitly saved its state and returned to the dispatcher.
I think that looks more like a multi-tasking system with
run-to-completion, non-preemptive (small) tasks, considering each
"routine" as a task and assuming that each routine started with
recovering its saved context.
The main difference to present-day multi-tasking systems is then that
the context-saving and -restoring was implemented separately and
specifically in each task, and not in a shared "task switch" function.
Could you explain a bit about the "dispatcher" and how it decided which
routine to execute next? Perhaps depending on which I/O had completed?
But I can't immediately think of any language with
continuations and without GC. They kinda go together.
On Sat, 13 Jul 2024 07:50:42 -0000 (UTC), Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
One thing that is difficult to do with them is arbitrary gotos. (I
consider that a feature, not a bug.)
It's a bug.
If a programming language wants to disallow arbitrary gotos in order to promote better quality code writing, that's great.
But the machine language of a computer has got to allow _any_ language
to be implemented on it. Including C, or old-style FORTRAN. Otherwise,
one doesn't have a general-purpose machine.
According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
Could you explain a bit about the "dispatcher" and how it decided which
routine to execute next? Perhaps depending on which I/O had completed?
As best I understand it, SABRE would receive a message from one of the terminals and dispatch it to a routine based on the first few
characters that said what kind of request it was. The routine would
then do a bunch of work consisting of computing and disk I/O. Whenever
it scheduled a disk operation, it'd explicitly save its state and
return to the dispatcher. When the disk I/O was done, it'd schedule
the next part of the task which restored the state and continued.
Repeat until the task sends a reply to the terminal and is done.
Airline reservation and credit card systems still work this way, and
have amazing transaction rates, like 1000 updates/sec to an individual database record.
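In outline, that structure is just an event loop over explicitly saved
state; a toy C sketch (all names invented, and with the "I/O" completing
instantly rather than being selected by the dispatcher among many
pending transactions):

    #include <stdio.h>

    /* One pending transaction: the explicitly saved state is simply the
       next routine to run plus whatever context it needs. */
    struct txn {
        void (*next)(struct txn *);  /* the saved resume point */
        int record;                  /* example saved context */
    };

    static void finish(struct txn *t) {
        printf("reply for record %d sent\n", t->record);
        t->next = NULL;              /* transaction complete */
    }

    static void start(struct txn *t) {
        printf("scheduling disk read of record %d\n", t->record);
        t->next = finish;            /* save state, return to dispatcher */
    }

    int main(void) {
        struct txn t = { start, 17 };
        while (t.next)               /* the dispatcher */
            t.next(&t);              /* run whichever step is ready */
        return 0;
    }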
According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
This programming style is very old. The SAGE air defense system used
it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines each of which took some input, did
something with it, perhaps scheduled other I/O operations, and then
explicitly saved its state and returned to the dispatcher.
I think that looks more like a multi-tasking system with
run-to-completion, non-preemptive (small) tasks, considering each
"routine" as a task and assuming that each routine started with
recovering its saved context.
Right, that saved context is the continuation. Like I said, the tools
these days take care of a lot of the drudge work, e.g. node.js.
The main difference to present-day multi-tasking systems is then that
the context-saving and -restoring was implemented separately and
specifically in each task, and not in a shared "task switch" function.
Not really. The context is specific to each routine. It's not like time-slicing, it's do as much work as you can and then block for I/O.
The context you save depends on what you're doing.
On Sat, 13 Jul 2024 11:16:45 -0400, EricP wrote:
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal
decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the
hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
Was that the RS/6000? That was quite a step forward at the time (and for
some years after), even among all the other new RISC-based architectures appearing around that time.
What OS/software were you trying above? Because it doesn’t sound like regular C code on a Unix-type system.
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
Airline reservation and credit card systems still work this way, and
have amazing transaction rates, like 1000 updates/sec to an individual
database record.
In a classic state machine there is a loop around a CASE statement
based on the current state number. The last thing each state case
does is assign the next state number to jump to.
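For instance, a minimal C version of that loop (states and actions
invented for illustration):

    #include <stdio.h>

    enum state { START, WORK, DONE };

    int main(void) {
        enum state s = START;
        int n = 0;
        for (;;) {
            switch (s) {               /* the CASE statement */
            case START:
                n = 3;
                s = WORK;              /* last act: pick the next state */
                break;
            case WORK:
                printf("working, %d left\n", n);
                s = (--n > 0) ? WORK : DONE;
                break;
            case DONE:
                return 0;
            }
        }
    }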
Each current local context could be allocated from a heap.
But be careful what you wish for.
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal
decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the
hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
According to EricP <ThatWouldBeTelling@thevillage.com>:
Each current local context could be allocated from a heap.
But be careful what you wish for.
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal
decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the
hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
Maybe it was an IBM thing. In the 1960s the IBM 360 Algol F compiler,
which I always assumed was written because some European institution
had it on a checklist rather than because they expected anyone to use
it, did the same thing. Every time it entered a block it did a GETMAIN
system call, and when it left the block it did a FREEMAIN, with all of
the performance you'd expect.
Someone at Princeton patched it to do more sane allocation where it
did a big GETMAIN and suballocated from that until it filled up and
then did another one. I never wrote enough Algol to tell how well it
worked.
The compiler was a marvel of passive aggressive standards
implementation. The keywords were all quoted 'BEGIN' 'END' and unless
you told it otherwise it used the 48 character set, so a semicolon was
., and assignment was .= with one dot, not two. A squared was
A'POWER'2. I/O was procedures that used dataset numbers, 0 was the
input deck, 1 was the printer, and you skipped to the next output
record by padding out the current record to 80 spaces.
'COMMENT'TOO BAD THEY DID NOT USE 'TNEMMOC'.,
Yuck.
Thomas Koenig <tkoenig@netcologne.de> writes:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
$ ulimit -s unlimited
Should handle the default stack.
Of course, it still needs to fit into whatever address space
the target processor offers, but even with a 48-bit VA,
there should be enuf VA space for several terabytes of
stack.
In any case, one can mmap an arbitrarily large region of
memory and assign it to the stack pointer in (or even before)
main().
Lawrence D'Oliveiro wrote:
On Sat, 13 Jul 2024 11:16:45 -0400, EricP wrote:
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal
decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the
hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
Was that the RS/6000? That was quite a step forward at the time (and
for some years after), even among all the other new RISC-based
architectures appearing around that time.
What OS/software were you trying above? Because it doesn’t sound like
regular C code on a Unix-type system.
Yes, an RS6000. I don't remember what language, Fortran or C, likely on
AIX.
... In the 1960s the IBM 360 Algol F compiler,
which I always assumed was written because some European institution
had it on a checklist rather than because they expected anyone to use
it, did the same thing. Every time it entered a block it did a GETMAIN
system call, and when it left the block it did a FREEMAIN, with all of
the performance you'd expect.
This could also have been a RT PC as our port project was
running around 1987 and straddled the RS6000 launch in 1990.
(so I might have got my 'R's mixed up.)
And according to Wikipedia RT's were notoriously slow.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
(Please note that the sentence above was DeepL-filtered, on
"diplomatic" setting).
IMO, for something to qualify as a continuation-passing call, the caller should select the callee -- the thing to be executed next -- but in SAGE/SABRE it is the scheduler that selects what to execute next.
I think of continuation passing as always ending a subprogram by calling
some other subprogram, passing it one or more "continuation" subprograms
as parameters, and expecting that callee to do something and then to "continue" by further calling one of those "continuations" (with further "continuation" parameters) and NOT to return to the original caller,
ever.
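A C rendering of that style (a sketch; note that without guaranteed
tail-call elimination the C stack still grows at each such call, which
is one reason CPS implementations tend to want heap frames and GC):

    #include <stdio.h>
    #include <stdlib.h>

    typedef void (*cont)(int result);   /* "what to do next" */

    /* CPS: 'add' never returns a result to its caller; it hands the
       result to the continuation k instead. */
    static void add(int a, int b, cont k) {
        k(a + b);
    }

    static void print_result(int r) {
        printf("sum = %d\n", r);
        exit(0);                        /* end of the computation chain */
    }

    int main(void) {
        add(2, 3, print_result);        /* pass the continuation explicitly */
        /* never reached */
        return 0;
    }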
The effect is that code bounces from action routine to action routine
carrying its execution context with it, waiting for something external,
evaluating the result and deciding what action to do next.
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
$ ulimit -s unlimited
Should handle the default stack.
Some operating systems put a hard limit on this, for example MacOS,
and some an absurdly low limit on per-thread stack size (512 K
or something like that).
Of course, it still needs to fit into whatever address space
the target processor offers, but even with a 48-bit VA,
there should be enuf VA space for several terabytes of
stack.
In any case, one can mmap an arbitrarily large region of
memory and assign it to the stack pointer in (or even before)
main().
Changing the standard startup sequence of programs (i.e. replacing
_start) - that would be heavy messing, and be no longer compatible
with standard startup sequences.
And for threads, this would get even messier.
On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:
IMO, for something to qualify as a continuation-passing call, the caller
should select the callee -- the thing to be executed next -- but in
SAGE/SABRE it is the scheduler that selects what to execute next.
What you have is that SAGE/SABRE has built a higher-level scheduler
abstraction on top of lower-level coroutines/continuations.
On Mon, 15 Jul 2024 14:12:32 -0400, EricP wrote:
The effect is that code bounces from action routine to action routine
carrying its execute context with it, waiting for something external,
evaluating the result and deciding what action to do next.
The thing that coroutines bring to the table is that you don’t have to break up the code that executes in this common context into separate
pieces that are each invoked via a callback and end in a reschedule
call.
Instead, you just write it as a single block, with explicit “await”
calls at the various points along the way where rescheduling happens.
The net result is the logic is much easier to follow, and there is often
less of it.
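One way to see that in C is with POSIX ucontext (obsolescent in the
standard but still widely available, e.g. in glibc): the task reads as
one straight-line block, and each await() is an explicit reschedule
point (a sketch, not production code):

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t sched, co;
    static char co_stack[64 * 1024];

    static void await(void) {            /* explicit reschedule point */
        swapcontext(&co, &sched);
    }

    static void task(void) {             /* a single block of logic... */
        printf("step 1\n");
        await();                         /* ...with explicit awaits */
        printf("step 2\n");
        await();
        printf("step 3\n");
    }                                    /* falls through to uc_link */

    int main(void) {
        getcontext(&co);
        co.uc_stack.ss_sp   = co_stack;
        co.uc_stack.ss_size = sizeof co_stack;
        co.uc_link          = &sched;    /* where control goes at the end */
        makecontext(&co, task, 0);
        for (int i = 0; i < 3; i++) {
            printf("scheduler resumes task\n");
            swapcontext(&sched, &co);    /* run until next await or end */
        }
        return 0;
    }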
On Mon, 15 Jul 2024 23:55:21 +0000, Lawrence D'Oliveiro wrote:
The thing that coroutines bring to the table is that you don’t have to
break up the code that executes in this common context into separate
pieces that are each invoked via a callback and end in a reschedule
call. Intead, you just write it as a single block, with explicit
“await” calls at the various points along the way where rescheduling
happens.
Is that still valuable if it cost 50× what a subroutine call costs ???
According to Lawrence D'Oliveiro <ldo@nz.invalid>:
On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:
IMO, for something to qualify as a continuation-passing call, the
caller should select the callee -- the thing to be executed next --
but in SAGE/SABRE it is the scheduler that selects what to execute
next.
What you have is that SAGE/SABRE has built a higher-level scheduler
abstraction on top of lower-level coroutines/continuations.
It was an event loop, not coroutines.
On Mon, 15 Jul 2024 23:56:00 -0000 (UTC), John Levine wrote:
According to Lawrence D'Oliveiro <ldo@nz.invalid>:
On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:
IMO, for something to qualify as a continuation-passing call, the
caller should select the callee -- the thing to be executed next --
but in SAGE/SABRE it is the scheduler that selects what to execute
next.
What you have is that SAGE/SABRE has built a higher-level scheduler
abstraction on top of lower-level coroutines/continuations.
It was an event loop, not coroutines.
The one does not preclude the other. So they lacked the coroutine
abstraction; that could have been added easily enough in the application
implementation language.
Just for amusement, what do you think the application implementation languages for SAGE and SABRE were?
Thomas Koenig <tkoenig@netcologne.de> writes:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
$ ulimit -s unlimited
Should handle the default stack.
Some operating systems put a hard limit on this, for example MacOS,
and some an absurdly low limit on per-thread stack size (512 K
or something like that).
MacOS has been obsolete for a decade now.
OSX has an initial
size of 8MB and a default configured hard limit of 64GB.
Of course, it still needs to fit into whatever address space
the target processor offers, but even with a 48-bit VA,
there should be enuf VA space for several terabytes of
stack.
In any case, one can mmap an arbitrarily large region of
memory and assign it to the stack pointer in (or even before)
main().
Changing the standard startup sequence of programs (i.e. replacing
_start) - that would be heavy messing, and be no longer compatible
with standard startup sequences.
That depends on the program, I guess.
A little inline assembler
in main should suffice;
An mmap'ed region will automatically
allocate on first touch, so the stack can grow as necessary.
And for threads, this would get even messier.
pthreads already get their own stack, and have standard functions
to specify stack pointer and stack size (e.g. pthread_attr_setstack(2)).
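A Linux/POSIX sketch of exactly that (error handling mostly omitted;
compile with -lpthread; the anonymous mmap region is backed lazily, on
first touch):

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define STACK_SIZE (1UL << 30)   /* 1 GiB: pick whatever you need */

    static void *worker(void *arg) {
        (void)arg;
        printf("running on a big mmap'd stack\n");
        return NULL;
    }

    int main(void) {
        void *stk = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (stk == MAP_FAILED)
            return 1;

        pthread_attr_t a;
        pthread_attr_init(&a);
        pthread_attr_setstack(&a, stk, STACK_SIZE);  /* thread uses it */

        pthread_t t;
        pthread_create(&t, &a, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }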
John Levine wrote:
In a classic state machine there is a loop around a CASE statement
based on the current state number. The last thing each state case
does is assign the next state number to jump to.
It only has to be Turing-equivalent (or Lisp-equivalent) to be a
general-purpose machine. There’s nothing in that criterion that requires
gotos.
Slightly different topic: Unfortunately, Apple decided in their
wisdom that an mmapped region cannot grow, so you have to make a
guess at the needed size. As to why, I have no idea; they probably
wanted to follow the example of Windows. Learning from Microsoft
is learning to win!
On Tue, 16 Jul 2024 02:03:38 -0000 (UTC), John Levine wrote:
Just for amusement, what do you think the application implementation
languages for SAGE and SABRE were?
Assembler with macros.
On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
It only has to be Turing-equivalent (or Lisp-equivalent) to be a
general-purpose machine. There’s nothing in that criterion that requires
gotos.
It's true that Turing-equivalence is sufficient to allow a computer
with sufficient storage available to do anything.
However, the term "general-purpose" can have other meanings. The sense
in which I was using it was: a machine that is _efficient_ at running programs written in all popular programming languages for the kinds of applications for which computers are normally used. Thus, the IBM
System/360 was general-purpose since it had string manipulation
instructions and decimal arithmetic in addition to floating-point and
binary integer arithmetic.
John Savard
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Some operating systems put a hard limit on this, for example MacOS,
and some an absurdly low limit on per-thread stack size (512 K or
something like that).
MacOS has been obsolete for a decade now.
Sorry for messing up my terminology.
On Tue, 16 Jul 2024 2:47:05 +0000, Lawrence D'Oliveiro wrote:
On Tue, 16 Jul 2024 02:03:38 -0000 (UTC), John Levine wrote:
Just for amusement, what do you think the application implementation
languages for SAGE and SABRE were?
Assembler with macros.
It is rare, today, for programmers to be aware of the power of a good
macro assembler, and how close these tend to be to a real low level
language (like K&R C) but without the limitations.
Does a continuation in a block structured language NEED everything on
the block-stack, or is it more like 2-10 variables at rather well
defined locations ?? more like a functor than a continuation.
I think that history shows that those string facilities were, at best,
overkill. None of the RISC machines had any of that and were faster
and at least as general purpose as 360.
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
On the other hand, I am advocating that some Elementary Functions be
raised to first class citizens of the ISA so that SIN() is 20 cycles
instead of 150 cycles. Why? Because today, we understand enough about
the calculations, polynomials, and error sources to build in HW things
that are just as accurate as SW can build and nearly 7×-10× faster.
On Tue, 16 Jul 2024 20:51:17 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
I think that history shows that those string facilities were, at best,
overkill. None of the RISC machines had any of that and were faster
and at least as general purpose as 360.
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
I will agree with that; the 68000 and the x86 don't have those things
either.
Even if my personal inclination has been to include them too.
On the other hand, I am advocating that some Elementary Functions be
raised to first class citizens of the ISA so that SIN() is 20 cycles
instead of 150 cycles. Why? Because today, we understand enough about
the calculations, polynomials, and error sources to build in HW things
that are just as accurate as SW can build and nearly 7×-10× faster.
I thought this was already the case; the 8087 had transcendental
functions, and so did the 486 DX and from that onwards, and this was
also the case with the 68881 and 68882.
Of course, even if they're citizens of the ISA, they could be
second-class citizens, whatever the definition of that might be.
Here, my personal inclination is to simply include the raw CORDIC
calculation as an instruction, with bounds checking, special cases,
and any fancy stuff in software. On the grounds that some things are
too complicated for a classic old-style ISA.
That, too, is of course a bad decision from a practical perspective.
John Savard
On Tue, 16 Jul 2024 16:00:59 +0000, John Savard wrote:
On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
It only has to be Turing-equivalent (or Lisp-equivalent) to be a
general-purpose machine. There’s nothing in that criterion that requires
gotos.
It's true that Turing-equivalence is sufficient to allow a computer
with sufficient storage available to do anything.
But we also want that stuff to be fast (under some metric).
However, the term "general-purpose" can have other meanings. The
sense in which I was using it was: a machine that is efficient at
running programs written in all popular programming languages for
the kinds of applications for which computers are normally used.
Thus, the IBM System/360 was general-purpose since it had string manipulation instructions and decimal arithmetic in addition to floating-point and binary integer arithmetic.
I think that history shows that those string facilities were, at best, overkill. None of the RISC machines had any of that and were faster
and at least as general purpose as 360.
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
MitchAlsup1 wrote:
On Tue, 16 Jul 2024 16:00:59 +0000, John Savard wrote:
On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
It only has to be Turing-equivalent (or Lisp-equivalent) to be a
general-purpose machine. There’s nothing in that criterion that requires
gotos.
It's true that Turing-equivalence is sufficient to allow a computer
with sufficient storage available to do anything.
But we also want that stuff to be fast (under some metric).
However, the term "general-purpose" can have other meanings. The
sense in which I was using it was: a machine that is efficient at
running programs written in all popular programming languages for
the kinds of applications for which computers are normally used.
Thus, the IBM System/360 was general-purpose since it had string
manipulation instructions and decimal arithmetic in addition to
floating-point and binary integer arithmetic.
I think that history shows that those string facilities were, at best,
overkill. None of the RISC machines had any of that and were faster
and at least as general purpose as 360.
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
Regarding EDIT, etc., I think there are three possibilities:
1. They were a bad idea from the start and never should have been put
into S/360.
2. Putting them into S/360 was the right decision at the time, but the
changing technology (i.e. they wouldn't fit into the initial CISC
microprocessors and RISC showed the functionality could be done other
ways) made putting them into newer designs a bad idea.
3. Putting them into S/360 was the right decision at the time but the
workloads changed. There was less requirement for things like actually
printing checks and general ledger reports, and programmers moved away
from COBOL, which was where EDIT was a natural fit, to languages where
there wasn't as natural a fit, so not putting EDIT into newer CPUs was
the right decision.
I suspect it was mostly number 3, but I think number 2 was a part of it.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles of FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
A single calculation took _months_, and he didn't use weak hardware.
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
"Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
MitchAlsup1 wrote:
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
Regarding EDIT, etc., I think there are three possibilities:
1. They were a bad idea from the start and never should have been put
into S/360.
2. Putting them into S/360 was the right decision at the time, but the
changing technology (i.e. they wouldn't fit into the initial CISC
microprocessors and RISC showed the functionality could be done other
ways) made putting them into newer designs a bad idea.
3. Putting them into S/360 was the right decision at the time but the
workloads changed. There was less requirement for things like actually
printing checks and general ledger reports, and programmers moved away
from COBOL, which was where EDIT was a natural fit, to languages where
there wasn't as natural a fit, so not putting EDIT into newer CPUs was
the right decision.
I suspect it was mostly number 3, but I think number 2 was a part of it.
I also suspect it was (3). Burroughs medium systems (B3500 etc) used
the edit instruction (EDT) extensively for COBOL.
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles of FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
A single calculation took months, and he didn't use weak hardware.
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
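In C the same computation is (each step feeding the next, so the chain
itself is serial; any speedup beyond that has to come from overlapping
independent iterations, e.g. with several FMAC units or a vectorized
exp from a math library):

    #include <math.h>

    /* k = A * exp(-Ea / (R*T)), matching the FMUL/FDIV/EXP/FMUL chain. */
    double arrhenius(double A, double Ea, double R, double T) {
        double t = R * T;     /* FMUL */
        t = -Ea / t;          /* FDIV */
        t = exp(t);           /* EXP  */
        return A * t;         /* FMUL */
    }

    /* Independent iterations can overlap even though each chain is
       serial, which is where parallel exp hardware would pay off. */
    void arrhenius_n(double *k, const double *A, const double *Ea,
                     double R, const double *T, int n) {
        for (int i = 0; i < n; i++)
            k[i] = A[i] * exp(-Ea[i] / (R * T[i]));
    }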
On Wed, 17 Jul 2024 17:09:44 +0000, Scott Lurndal wrote:
"Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
MitchAlsup1 wrote:
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
Regarding EDIT, etc., I think there are three possibilities:
1. They were a bad idea from the start and never should have been put
into S/360.
2. Putting them into S/360 was the right decision at the time, but the
changing technology (i.e. they wouldn't fit into the initial CISC
microprocessors and RISC showed the functionality could be done other
ways) made putting them into newer designs a bad idea.
3. Putting them into S/360 was the right decision at the time but the
workloads changed. There was less requirement for things like actually
printing checks and general ledger reports, and programmers moved away
from COBOL, which was where EDIT was a natural fit, to languages where
there wasn't as natural a fit, so not putting EDIT into newer CPUs was
the right decision.
I suspect it was mostly number 3, but I think number 2 was a part of it.
I also suspect it was (3). Burroughs medium systems (B3500 etc) used
the edit instruction (EDT) extensively for COBOL.
I suspect that as compute power grew, the kinds of things we wanted out
of EDIT changed.
negative numbers printed in red
negative numbers surrounded with parens (-123.45)
decimal point and comma LOCAL selection
NaN
INFINITY
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
A single calculation took months, and he didn't use weak hardware.
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and the
EXP instruction can be done in parallel?
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations
which included fluid flow (including turbulence modelling and diffusion) and quite a few chemical reactions together. So, he evaluated a huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
evaluating the Arrhenius equation above many, many, many, many
times.
A single calculation took months, and he didn't use weak
hardware.
A fully pipelined evaluation of, let's say, four parallel exp
and four parallel fdiv instructions would have reduced his
calculation time by orders of magnitude, and allowed him to
explore the design space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the Arrhenius equation into your ISA, I would have done so already
:-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and
the EXP instruction can be done in parallel?
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
They are all performed in the FMAC unit and here the instructions are serially dependent.
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 17 Jul 2024 17:09:44 +0000, Scott Lurndal wrote:
"Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
MitchAlsup1 wrote:
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
Regarding EDIT, etc., I think there are three possibilities:
1. They were a bad idea from the start and never should have
been put into S/360.
2. Putting them into S/360 was the right decision at the time,
but the changing technology (i.e. they wouldn't fit into the
initial CISC microprocessors and RISC showed the functionality
could be done other ways) made putting them into newer designs
a bad idea.
3. Putting them into S/360 was the right decision at the time
but the workloads changed. There was less requirement for
things like actually printing checks and general ledger
reports, and programmers moved away from COBOL, which was where
EDIT was a natural fit, to languages where there wasn't as natural
a fit, so not putting EDIT into newer CPUs was the right decision.
I suspect it was mostly number 3, but I think number 2 was a
part of it.
I also suspect it was (3). Burroughs medium systems (B3500 etc) used
the edit instruction (EDT) extensively for COBOL.
I suspect that as compute power grew, the kinds of things we wanted
out of EDIT changed.
negative numbers printed in red
negative numbers surrounded with parens (-123.45)
decimal point and comma LOCAL selection
NaN
INFINITY
EDIT/EDT instructions were designed from the start for
COBOL fixed point arithmetic. No Nan, No Infinities.
COBOL had verbs to assign the correct characters as
the decimal point and thousands separator (on the
Burroughs B3500 those characters were stored in reserved
low memory which was accessed by the EDT instruction when
processing the edit rule(s)).
Printing in various colors or print trains were not the
responsibility of the COBOL program, but rather the operator
who mounted the print chain/train and/or ribbon for
the print job.
For certain applications, an edt instruction would still
be useful, but probably wouldn't be significantly more
time efficient than a set of risc-ish instructions (it
would likely be more space efficient).
The problem is that the modern languages don't have
a modern concept of a PICture clause. An instruction
to format based on a printf-like format string could
be useful, I suspect, but the die area is probably more
useful as cache.
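For comparison, one EDIT-style rule (accounting format, parentheses
around negative amounts) comes to a few lines of library code today;
fmt_amount is an invented name, not any real API:

    #include <stdio.h>

    /* A tiny software stand-in for one EDIT rule: COBOL-ish accounting
       output, parentheses marking negative amounts. */
    static void fmt_amount(char *buf, size_t n, double v) {
        if (v < 0)
            snprintf(buf, n, "(%.2f)", -v);
        else
            snprintf(buf, n, " %.2f ", v);
    }

    int main(void) {
        char out[32];
        fmt_amount(out, sizeof out, -123.45);
        printf("%s\n", out);    /* prints (123.45) */
        return 0;
    }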
Scott Lurndal wrote:
The problem is that the modern languages don't have
a modern concept of a PICture clause. An instruction
to format based on a printf-like format string could
be useful, I suspect, but the die area is probably more
useful as cache.
There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
expressive" than what things like printf currently provide. The other
question is about whether such formatting should be implemented in
hardware.
As I have expressed before here, I believe the answer to the first part
is "yes", but not as feature rich as a full picture clause. I tend to
agree with you about the second.
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations
which included fluid flow (including turbulence modelling and
diffusion) and quite a few chemical reactions together. So, he
evaluated a huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
evaluating the Arrhenius equation above many, many, many, many
times.
A single calculation took months, and he didn't use weak
hardware.
A fully pipelined evaluation of, let's say, four parallel exp
and four parallel fdiv instructions would have reduced his
calculation time by orders of magnitude, and allowed him to
explore the design space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already
:-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and
the EXP instruction can be done in parallel?
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
They are all performed in the FMAC unit and here the instructions are
serially dependent.
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
On Wed, 17 Jul 2024 20:56:06 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations
which included fluid flow (including turbulence modelling and
diffusion) and quite a few chemical reactions together. So, he
evaluated a huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
evaluating the Arrhenius equation above many, many, many, many
times.
A single calculation took months, and he didn't use weak
hardware.
A fully pipelined evaluation of, let's say, four parallel exp
and four parallel fdiv instructions would have reduced his
calculation time by orders of magnitude, and allowed him to
explore the design space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already
:-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and
the EXP instruction can be done in parallel?
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
They are all performed in the FMAC unit and here the instructions are
serially dependent.
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
The FDIV and EXP instructions consume all the FMAC cycles, so even if
you completely unrolled the loop, you are not going to get more than
6-cycles less in performing the repeated calculations.
A really BIG implementation with 4 FMAC units per core could unroll
the loop (by reservation stations) such that each iteration would
still be 44-cycles, but you could run 4 in parallel and achieve 4
results every 44-cycles--which to most people smells like 1 result
every 11-cycles.
{Would be an interesting reservation station design, though}
There are two separate issues here. One is whether modern languages
should have syntax for esily formatting numbers that is "more
expressive" than what things like printf currently provide.
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go in in parallel?
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles of FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculaiton spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
A single calculation took months, and he didn't use weak hardware.
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and the
EXP instruction can be done in parallel?
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
They are all performed in the FMAC unit and here the instructions are serially dependent.
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
I can remember a Comp Sci professor waxing lyrical about p-adic number
systems, as an alternative to conventional p-ary ones. The key thing was
that division was no more complex than multiplication.
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
expressive" than what things like printf currently provide.
Take a look at what Fortran can do, it does quite a few things
(and is correspondingly difficult to implement :-)
What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).
On Thu, 18 Jul 2024 06:00:46 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).
All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
Pentium-M and AMD K8, do not perform double precision floating point
ops in parallel.
Michael S <already5chosen@yahoo.com> writes:
All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
Pentium-M and AMD K8, do not perform double precision floating point
ops in parallel.
All of which have been obsolete for almost two decades.
On Thu, 18 Jul 2024 13:58:37 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
Pentium-M and AMD K8, do not perform double precision floating point
ops in parallel.
All of which have been obsolete for almost two decades.
According to my understanding, Thomas Koenig was/is around for more
than three decades.
Thomas Koenig wrote:
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
expressive" than what things like printf currently provide.
Take a look at what Fortran can do, it does quite a few things
(and is correspondingly difficult to implement :-)
I used to write a fair amount of Fortran, but haven't recently. It
does have a lot of formatting functionality (most of which I never had
occasion to use), but at least back then it didn't have support for my
pet peeve: a separator character (whatever one is used) every three
digits. This really improves readability for numbers with lots of
digits. Does it now support this?
On Wed, 17 Jul 2024 20:56:06 -0000 (UTC), "Stephen Fuld"
<SFuld@alumni.cmu.edu.invalid> wrote:
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
In a way, that's the wrong question.
If you need to have exponentiations and divides running in parallel
_on a single CPU_ because the problem isn't parallel _enough_ to
dispense with them being very tightly coupled... then, yes, that's a
reasonable question.
But _otherwise_ there is no limit on the number of calculations that
can take place in parallel, just throw more CPUs at the problem.
If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
the multiplier, can the pipeline get a mix of multiple operations going
at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially
so each can only use one of the 4 pipeline slots.
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
[Arrhenius]
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
I ran a little Arrhenius benchmark on an i7-11700. Main program was
program main
implicit none
integer, parameter :: n = 1024
double precision, dimension(n) :: k, a, ea, t
integer :: i
call random_number (a)
call random_number(ea)
ea = 10000+ea*30000
call random_number(t)
t = 400 + 200*t
do i=1,1024*1024
call arrhenius(k,a,ea,t,n)
end do
end program main
and the called routine was (in a separate file, so the compiler
could not notice that the results were actually never used)
subroutine arrhenius(k, a, ea, t, n)
implicit none
integer, intent(in) :: n
double precision, dimension(n), intent(out) :: k
double precision, dimension(n), intent(in) :: a, ea, t
double precision, parameter :: r = 8.314
k = a * exp(-ea/(r*t))
end subroutine arrhenius
Timing result (wall-clock time only):
-O0: 5.343s
-O2: 4.560s
-Ofast: 2.237s
-Ofast -march=native -mtune=native: 2.154
Of course, you never know what speed your CPU is actually running
at these days, but if I assume 5GHz, that would give around 10
cycles per Arrhenius evaluation (2.154s * 5e9 / 2^30 evaluations),
which is quite fast (IMHO).
It uses an AVX2 version of exp, or so I gather from the function
name, _ZGVdN4v_exp_avx2 .
On Wed, 17 Jul 2024 03:51:50 -0000 (UTC), Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
I can remember a Comp Sci professor waxing lyrical about p-adic number
systems, as an alternative to conventional p-ary ones. The key thing was
that division was no more complex than multiplication.
And addition, I suppose, is the same complexity too. But comparison
kills you; you need to apply the Chinese Remainder Theorem every time
you want to print out a number.
John Savard
MitchAlsup1 wrote:
{Would be an interesting reservation station design, though}
In what way would the RS be interesting or different?
Michael S <already5chosen@yahoo.com> schrieb:
According to my understanding, Thomas Koenig was/is around for more
than three decades.
I have been (starting doing things on mainframes in the late 1980s),
but I was specifically asking about what Mitch had in mind for
My 66000.
If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
the multiplier, can the pipeline get a mix of multiple operations going
at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially
so each can only use one of the 4 pipeline slots.
Something I've been wondering for a while, indeed.
IOW, is there enough parallelism inside the FDIV/SQRT "microcode" to keep the FMAC fully busy (my naive understanding is that there isn't)?
If not, do current CPUs make the FMAC available for other operations
while an FDIV/SQRT is in progress? If not, how hard would it be?
Stefan
On Thu, 18 Jul 2024 07:54:23 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
[Arrhenius]
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
I ran a little Arrhenius benchmark on an i7-11700.
[Arrhenius benchmark snipped]
Does the benchmark represent a real-world use?
In particular,
1. Is there really a large number of different EA values, or only
around a dozen or a hundred?
2. Does temperature vary all the time, or are there relatively big
groups of points calculated at the same temperature?
A similar question can be asked about A, but it is of little practical
importance.
3. Is double precision really needed? According to my understanding,
empirical equations like this one have a precision of something like
2 significant digits, or 3 digits at best. So I'd expect that single
precision is sufficient, with digits to spare.
4. Does the equation work at all when the temperature is not close to
the point of equilibrium? If not, what is a sane range for ea/(r*t)?
Michael S <already5chosen@yahoo.com> schrieb:
On Thu, 18 Jul 2024 07:54:23 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
[Arrhenius]
I ran a little Arrhenius benchmark on an i7-11700.
[Arrhenius benchmark snipped]
Does the benchmark represent a real-world use?
In particular,
1. Is there really a large number of different EA values, or only
around a dozen or a hundred?
Usually, one for each chemical reaction. If you have a complex
reaction network, it can be a few hundred.
It is possible to pre-calculate Ea/R, but the division by T still
needs to be done.
2. Does temperature vary all the time, or are there relatively big
groups of points calculated at the same temperature?
It varies all the time because of the exothermic/endothermic
character of the reactions. This is not what was calculated, but you
can think of methane combustion with oxygen. There are numerous
radical, high-energy species appearing and disappearing all the time.
And if you're unlucky, you will also have to calculate radiation :-(
But for each fluid element, the energy equation is solved again
each time step.
And if you try to average the temperatures over groups of cells...
don't. You average enough already by selecting the grid, and also by
applying multigrid solvers.
A similar question can be asked about A, but it is of little
practical importance.
3. Is double precision really needed? According to my understanding,
empirical equations like this one have a precision of something like
2 significant digits, or 3 digits at best. So I'd expect that single
precision is sufficient, with digits to spare.
To calculate values, yes, but if you try to differentiate things,
things can go wrong pretty fast. I'm actually not sure what he used
in this particular case.
There is, however, a tendency with engineers to use double precision
because quite often, you do not go wrong with that, and you can
go wrong with single precision.
A lot of numerical work in general, and CFD work in particular,
consists of fighting for convergence. You usually don't want to
change anything that would endanger that.
4. Does the equation work at all when the temperature is not close
to the point of equilibrium? If not, what is a sane range for
ea/(r*t)?
It does work far away from the equilibrium (which is the point).
Michael S <already5chosen@yahoo.com> schrieb:
On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
It does work far away from the equilibrium (which is the point).
But how far?
Does it work for EA/RT outside of, say [-10:+12] ?
Ea is almost always positive; if it was negative, it would mean that
a reaction is slowed down by increasing temperature (and by this I
don't mean the reverse reaction). If you see that in data, that
means that the Arrhenius model is not applicable because there is
in fact no activation energy to overcome, that something strange
happens in thermodynamics (say, you get solid precipitation of
one component at a certain temperature), or that there are
some intermediate steps which are not described in the model.
And to make sense, -Ea/(RT) is always less than zero. Reasonable
example: Ea = 30 kJ/mol, R = 8.314, T = 600 K gives you -6.01 as the
argument of the exponent, so the term is then ~ 0.002444 .
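A quick check of that arithmetic (a minimal C sketch; the program is
mine, the numbers are from the example above):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double Ea = 30.0e3;          /* activation energy, J/mol */
    double R  = 8.314;           /* gas constant, J/(mol*K)  */
    double T  = 600.0;           /* temperature, K           */
    double arg = -Ea / (R * T);
    printf("%.3f %.6f\n", arg, exp(arg));  /* -6.014 0.002444 */
    return 0;
}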
Stefan Monnier wrote:
Something I've been wondering for a while, indeed.
IOW, is there enough parallelism inside the FDIV/SQRT "microcode" to
keep the FMAC fully busy (my naive understanding is that there isn't)?
If not, do current CPUs make the FMAC available for other operations
while an FDIV/SQRT is in progress? If not, how hard would it be?
Stefan
And if they can't mix then to what extent can the end of one op,
as it drains from the pipeline, overlap with the start of the next?
Obviously FMUL can pipeline with FMUL but can the next FMUL overlap
with the end of a prior FDIV? An EXP?
I was thinking about reservation station schedulers and wondering
what they might have to optimize.
On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
The idea I was thinking about was to pre-calculate R/Ea rather than
Ea/R, and then to find a fast algorithm for approximation of
exp(-1/x) on the range of interest.
Michael S <already5chosen@yahoo.com> schrieb:
On Thu, 18 Jul 2024 18:48:45 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Ea is almost always positive; if it was negative, it would mean
that a reaction is slowed down by increasing temperature (and
by this I don't mean the reverse reaction).
In the branch of physical chemistry that I was taught before
switching to completely different profession, that's not that rare.
For example, when temperature grows from 1500 degrees toward 1550
the transfer of phosphorus from molten steel to slag decelerates
greatly.
Do you happen to recall if that process was limited by thermodynamic equilibrium, by mass transfer or indeed by a chemical reaction?
To me, that would sound like maybe one of the first two, but
knowing nothing about this, I am now firmly speculating :-)
And to make sense, -Ea/(RT) is always less than zero. Reasonable
example: Ea = 30 kJ/mol, R=8.314, T=600 K gives you -6.01 as the
argument of the exponent, so the term is then ~ 0.002444 .
So, something like [-10:0] ?
Activation energies can also be 180 kJ/mol, but I don't really
know the range that people might want to calculate these values.
Thomas Koenig wrote:
[Arrhenius story snipped]
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined together
(all the water molecules having three atoms), his weeklong simulation
runs ran in half the time, on both PentiumPro and Alpha hardware.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
I'm guessing that you would do the exp(-E_a/(R*T)) as
exp(-E_a)-exp(R*T), since that should give the same result and now you
could interleave the two exp() calculations, and/or hoist the exp(-E_a)
term?
On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:
MitchAlsup1 wrote:
{Would be an interesting reservation station design, though}
In what way would the RS be interesting or different?
The instruction stream consists of 4 FMAC-bound instructions unrolled
as many times as will fit in register file.
Your typical reservation station can accept 1 new instruction per cycle
from the decoder. So, either the decoder has to spew the instructions
across the stations (and remember they are data dependent) or the
station has to fire more than one per cycle to the FMAC units.
So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
and maybe some kind of exotic routing.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Wed, 17 Jul 2024 18:42:44 GMT, Scott Lurndal wrote:
The problem is that the modern languages don't have a modern concept of
a PICture clause.
Perhaps one thing that killed them was they are not localizable?
They certainly were localizable, at least with respect to
punctuation and currency symbol. And COBOL is far from 'killed'.
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined together
(all the water molecules having three atoms), his weeklong simulation
runs ran in half the time, on both PentiumPro and Alpha hardware.
I, personally, have found many Newton-Raphson iterators that converge
faster using 1/SQRT(x) than using the SQRT(x) equivalent.
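For the record, the division-free iteration behind that observation
(a sketch; the function name is mine): given y ~ 1/sqrt(x),

/* One Newton-Raphson step for y ~ 1/sqrt(x).  No division needed,
   unlike the direct sqrt iteration s' = 0.5*(s + x/s). */
static inline double rsqrt_step(double x, double y)
{
    return y * (1.5 - 0.5 * x * y * y);
}

sqrt(x) then falls out as x*y, still without a divide.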
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I'm guessing that you would do the exp(-E_a/(R*T)) as
exp(-E_a)-exp(R*T), since that should give the same result and now you
could interleave the two exp() calculations, and/or hoist the exp(-E_a)
term?
I don't think that formula is correct.
On Thu, 18 Jul 2024 15:35:50 +0000, Thomas Koenig wrote:
I have been (starting doing things on mainframes in the late 1980s),
but I was specifically asking about what Mitch had in mind for
My 66000.
I have a spectrum in mind; everything from an in order 1-wide to (as of
now) 6-wide GBOoO design with 4 FMAC units.
On Fri, 19 Jul 2024 11:28:17 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Do you happen to recall if that process was limited by thermodynamic
equilibrium, by mass transfer or indeed by a chemical reaction?
I don't know, but I would think that what causes it is an approach to
equilibrium. But at 1550 degrees equilibrium is still pretty far away
(50-70 degrees away? I don't remember) and despite that the slowdown is
already very significant.
On Fri, 19 Jul 2024 14:41:52 +0000, Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I'm guessing that you would do the exp(-E_a/(R*T)) as
exp(-E_a)-exp(R*T), since that should give the same result and now you
could interleave the two exp() calculations, and/or hoist the exp(-E_a)
term?
I don't think that formula is correct.
It is not:
exp(x+y) = exp(x)×exp(y)
so an add/subtract inside the exponent can be turned into a multiply/
divide of exponentials (e.g. exp(a-b) = exp(a)/exp(b)), but a multiply
or divide inside, like -Ea/(R×T), cannot (easily).
Now, if we were dealing with Ln(), those multiplies and divides could
be simplified.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I've taken both a second and third look at InvSqrt() over the last few
years, it turns out that the Quake version is far from optimal: With the
exact same instructions, just a different set of constants, you can get
about 1.5 bits more than quake does, i.e. about 10 bits after that
single NR step (which isn't really NR since it modifies both the 1.5 and
the 0.5 factors).
Wikipedia has something on that, also literature; Moroz et al. https://arxiv.org/abs/1603.04483 seem to give the optimum quasi-NR
method to do this, with one or two steps.
I've also looked a little bit at simplifying exp(-1/x) directly, and
it seems an unpleasant function to implement; at least there is no
argument reduction which comes to mind. With exp2(x), you can split
off the integer part and do an approximation on the rest over a finite
interval, but if there is a fast approximation of the integer part
of 1/x without dividing, I am not aware of one :-)
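A minimal sketch of that exp2 split (the code is mine; the cubic is
just a truncated Taylor series for 2^f, standing in for a real minimax
polynomial, and special cases/overflow are ignored):

#include <math.h>

/* exp2(x) = 2^n * 2^f, with n = floor(x) and f = x - n in [0,1).
   ldexp() applies the 2^n scaling exactly via the exponent field. */
static double exp2_sketch(double x)
{
    double n = floor(x);
    double f = x - n;
    /* truncated Taylor series for 2^f: good to a few bits only */
    double p = 1.0 + f * (0.6931472 + f * (0.2402265 + f * 0.0555041));
    return ldexp(p, (int)n);
}

The missing piece for exp(-1/x) is exactly what is described above:
getting floor(-1/(x*ln 2)) without a division.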
Thomas Koenig wrote:
Wikipedia has something on that, also literature; Moroz et al.
https://arxiv.org/abs/1603.04483 seem to give the optimum quasi-NR
method to do this, with one or two steps.
Impressive amounts of math here, unfortunately they completely miss the point!
What they derive is the exact optimal value for the magic constant,
using zero, one or two text-book NR iterations.
However, if you are willing to take that first NR iteration
halfnumber = 0.5f*x
...
i = R-(i>>1);
...
x = x*(1.5f-halfnumber*x*x);
and then make both the 0.5f and 1.5f constants free variables, you can
in fact get 1.5 more bits than what they show in this paper.
In fact, using slightly different values for R (the magic constant)
results in very marginal changes in max rel error, while optimizing all
three constants improves that max error from the stated 1.75e-3 to about 0.6e-3!
I've also looked a little bit at simplifying exp(-1/x) directly, and
it seems an unpleasant function to implement; at least there is no
argument reduction which comes to mind. With exp2(x), you can split
off the integer part and do an approximation on the rest over a finite
interval, but if there is a fast approximation of the integer part
of 1/x without dividing, I am not aware of one :-)
:-(
Fast is probably what you get from a pure lookup table, but with no
obvious NR iteration to improve it.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.
I can well believe that.
It is interesting to see what different architectures offer for
faster reciprocals.
POWER has fre and fres (double and single version) for approximate
division, which are accurate to 1/256. These operations are quite
fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle
so obviously fully pipelined. With 1/256 accuracy, this could
actually be the original Quake algorithm (or its modification)
with a single Newton step, but this is of course much better in
hardware where exponent handling can be much simplified (and
done only once).
x86_64 has rcpss, accurate to 1/6144, with (looking at the
instruction tables) a latency of 6 for newer architectures, with a
throughput of 1/4.
So, if your business depends on calculating many inaccurate
square roots, fast, buy a POWER :-)
Other architectures I have tried don't seem to have it.
Does it make sense? Well, if you want to calculate lots of Arrhenius
equations, don't need full accuracy, and (like in Mitch's case)
exp has become as fast as division, then it could actually make a
lot of sense. It is still possible to add Newton steps afterwards,
which is what gcc does if you add -mrecip -ffast-math.
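The Newton step in question is cheap, for reference (a sketch; the
function name is mine): given an estimate x0 ~ 1/a, e.g. from fres or
rcpss,

/* One Newton-Raphson step for a reciprocal; roughly doubles the
   number of correct bits per step. */
static inline float recip_refine(float a, float x0)
{
    return x0 * (2.0f - a * x0);
}

With rcpss's roughly 12-bit estimate, one step gets close to full
single precision and two steps are plenty.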
On Fri, 19 Jul 2024 20:25:51 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
x86_64 has rcpss, accurate to 1/6144, with (looking at the
instruction tables) a latency of 6 for newer architectures, with a
throughput of 1/4.
It seems you looked at the wrong instruction table.
Here are numbers for some not-so-modern x86-64 cores:
Arch Latency Throughput (scalar/128b/256b)
Zen3 3 2/2/1
Skylake 4 1/1/1
Ice Lake 4 1/1/1
Power9 5-7 4/2/N/A
So, if your business depends on calculating many inaccurate
square roots, fast, buy a POWER :-)
If you have enough independent rsqrt to do, all four processors
have the same theoretical peak throughput, but x86 tends to have more
cores and to run at a faster clock. And lower latency makes achieving
peak throughput easier. Also, depending on target precision, the
higher initial precision of the x86 estimate means that sometimes you
can get away with 1 less NR iteration.
Also, if what you really need is sqrt rather than rsqrt, then depending
on how much inaccuracy you can accept, on modern x86 calculating an
accurate sqrt can sometimes be a better solution than calculating an
approximation. It is less likely to be the case on POWER9.
Accurate sqrt (single precision)
Zen3     14  0.20/0.200/0.200
SkyLake  12  0.33/0.333/0.167
Ice Lake 12  0.33/0.333/0.167
Power9   26  0.20/0.095/N/A
Accurate sqrt (double precision)
Zen3     20  0.111/0.111/0.111
Skylake  12  0.167/0.167/0.083
Ice Lake 12  0.167/0.167/0.083
Power9   36  0.111/0.067/N/A
Other architectures I have tried don't seem to have it.
Arm64 has it. It is called FRSQRTE.
Does it make sense? Well, if you want to calculate lots of Arrhenius
equations, don't need full accuracy, and (like in Mitch's case)
exp has become as fast as division, then it could actually make a
lot of sense. It is still possible to add Newton steps afterwards,
which is what gcc does if you add -mrecip -ffast-math.
I don't know about POWER, but on x86 I wouldn't do it.
I'd either use plain division, which on modern cores is quite fast,
or use NR to calculate a normal reciprocal. x86 provides an initial
estimate for that too (RCPSS).
Michael S <already5chosen@yahoo.com> schrieb:
x86_64 has rcpss, accurate to 1/6144, with (looking at the
instruction tables) a latency of 6 for newer architectures, with a
throughput of 1/4.
It seems you looked at the wrong instruction table.
[Note I was not writing about inverse square root, I was writing
about the inverse].
I have to admit to being almost terminally confused by Intel
generation names, so I am likely to mix up what is old and what
is new.
Here are not the very modern x86-64 cores:
Arch Latency Throughput (scalar/128b/256b)
Zen3 3 2/2/1
Skylake 4 1/1/1
Ice Lake 4 1/1/1
Power9 5-7 4/2/N/A
Power9 has it for 128-bit, but not for 256 bits (it doesn't have
those registers), and if I read the handbook correctly, that
would also be 4 operations in parallel.
Other architectures I have tried don't seem to have it.
Arm64 has it. It is called FRSQRTE.
Interesting that "gcc -O3 -ffast-math -mrecip" does not
appear to use it.
It is still possible to add Newton steps afterwards, which is what
gcc does if you add -mrecip -ffast-math.
I don't know about POWER, but on x86 I wouldn't do it.
I'd either use plain division, which on modern cores is quite fast,
or use NR to calculate a normal reciprocal. x86 provides an initial
estimate for that too (RCPSS).
Note that I was talking about the inverse in the first place.
COBOL now has CURRENCY SIGN which lets you set the currency symbol to
edit into a picture, and DECIMAL POINT IS COMMA, to switch the two.
You mean, on other architectures gcc does emit approximate rsqrt from
plain C or plain Fortran? In which situations?
Gen 2 is Sandy Bridge.
Gen 3 is Ivy Bridge, which is a very similar core microarchitecture.
Gen 4 is Haswell.
Gen 5 is Broadwell, which is similar, but has a few changes.
Gen 6, 7, 8, 9 and the majority of 10 are all the same
microarchitecture - Skylake.
11 is mostly Tiger Lake and partially Ice Lake and Rocket Lake. All
three are different from each other on the silicon process side, but
very similar in core microarchitecture.
12 is Alder Lake, which has P cores and E cores. The microarchitecture
of the P core is called Golden Cove.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:...
However, if you are willing to take that first NR iteration
halfnumber = 0.5f*x
...
i = R-(i>>1);
...
x = x*(1.5f-halfnumber*x*x);
and then make both the 0.5f and 1.5f constants free variables, you can
in fact get 1.5 more bits than what they show in this paper.
Looks like https://web.archive.org/web/20180709021629/http://rrrola.wz.cz/inv_sqrt.html
who reports 6.50196699E-4 as the maximum error (also given in the
Wikipedia article).
That's 10.5 bits of accuracy (-log2(6.5e-4) ~ 10.6), not bad at all.
However... assume you want to do another NR step. In that case,
you might be better off not loading different constants from memory,
so having the same constants might actually be an advantage
(which does not mean that they have to be the original Newton steps).
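Putting the three re-optimized constants together, the version on that
archived page looks like this in C (the constants are the page's; the
memcpy bit-cast is my choice, to avoid aliasing trouble):

#include <stdint.h>
#include <string.h>

/* InvSqrt with the magic constant and both NR coefficients
   re-optimized; max relative error ~6.5e-4, vs ~1.75e-3 for the
   Quake original. */
static float inv_sqrt(float x)
{
    uint32_t i;
    float y = x;
    memcpy(&i, &y, sizeof i);      /* bit-cast float -> int */
    i = 0x5F1FFFF9u - (i >> 1);
    memcpy(&y, &i, sizeof y);      /* bit-cast int -> float */
    return 0.703952253f * y * (2.38924456f - x * y * y);
}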
MitchAlsup1 wrote:
On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:
MitchAlsup1 wrote:
{Would be an interesting reservation station design, though}
In what way would the RS be interesting or different?
The instruction stream consists of 4 FMAC-bound instructions unrolled
as many times as will fit in register file.
You typical reservation station can accept 1 new instruction per cycle
from the decoder. So, either the decoder has to spew the instructions
across the stations (and remember they are data dependent) or the
station has to fire more than one per cycle to the FMAC units.
So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
and maybe some kind of exotic routing.
This is where I saw a benefit to using valued reservation stations vs
valueless ones - when a uArch has multiple similar FU, each with its own
bank of RS that is scheduled for that FU.
Example of horizontal scaling of similar FU each with its own RS bank. https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png
With valueless RS, each RS stores only the source register number of
its operands and each FU has to be able to read all its operands
when a uOp launches (begins execution). This means the number of
PRF read ports scales according to the total number of FU operands.
(One could do read port sharing but then you have to schedule that too
and could have contention.) Also if an FU is unused on any cycle then
all its (expensive) operand read ports are unused.
Using the above Oryon as an example, with valueless RS, to launch
all 14 FU with 3 operands all at once needs 42 read ports.
With valued RS, the operand values are stored in each RS and, if ready,
are read at Dispatch (hand-off from the front end to the RS bank) or are
received from the forwarding network if in-flight at Dispatch time.
The number of PRF read ports scales with the number of dispatched uOp
operands. Since the operand values are stored in each RS, each bank
can then schedule and launch independently.
With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,
and the read ports are potentially usable for all dispatches.
Then all 14 FU can launch at once independently.
Each FU can also have two kinds of valued RS banks:
a simple one if all the operands are ready at Dispatch, as this does
not need a wake-up matrix entry or need to receive forwarded values,
and a complex one that monitors the wake-up matrix and forwarding buses.
If all the operands are ready, the Dispatcher can choose either RS bank
for the FU, giving preference to the simpler. If all operands are not
ready, then the Dispatcher selects from the complex bank.
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
EricP <ThatWouldBeTelling@thevillage.com> writes:
This is where I saw a benefit to using valued reservation stations vs
valueless ones - when a uArch has multiple similar FU each with its own
bank of RS that is scheduled for that FU.
Example of horizontal scaling of similar FU each with its own RS bank.
https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png
With valueless RS, each RS stores only the source register number of
its operands and each FU has to be able to read all its operands
when a uOp launches (begins execution).
If the operation was waiting in the RS, at least one operand comes in
through a forwarding path (at least if the uarch has those).
This means the number of
PRF read ports scales according to the total number of FU operands.
(One could do read port sharing but then you have to schedule that too
and could have contention.)
Yes, but ARM A64 seems to have been designed with that in mind.
Also if an FU is unused on any cycle then
all its (expensive) operand read ports are unused.
Using the above Oryon as an example, with valueless RS, launching
all 14 FU with 3 operands all at once needs 42 read ports.
ARM A64's stp instruction needs up to 4 register reads. But I think
Oryon only supports two stores per cycle, so that raises the number to
44. It also has a split register file: So for the integer register
file 6*3+2*4+2*2=30 ports would be needed, and for the vector register
file, 3*4=12 ports would be needed. I think, though, that the number actually needed is much smaller, and various techniques are used to
make do with far fewer register ports.
With valued RS, dispatching 6 wide with 3 operands needs 18 read ports,
and the read ports are potentially usable for all dispatches.
Note that Oryon has 8-way rename/dispatch (I wonder how that works
with, e.g., 8 ldp instructions that each can write three registers).
- anton
On Sun, 21 Jul 2024 16:28:43 +0000, EricP wrote:
MitchAlsup1 wrote:
On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:
MitchAlsup1 wrote:
{Would be an interesting reservation station design, though}
In what way would the RS be interesting or different?
The instruction stream consists of 4 FMAC-bound instructions unrolled
as many times as will fit in the register file.
Your typical reservation station can accept 1 new instruction per cycle
from the decoder. So, either the decoder has to spew the instructions
across the stations (and remember they are data dependent) or the
station has to fire more than one per cycle to the FMAC units.
So, instead of 1-in, 1-out per cycle, you need 4-in, 4-out per cycle
and maybe some kind of exotic routing.
This is where I saw a benefit to using valued reservation stations vs
valueless ones - when a uArch has multiple similar FU each with its own
bank of RS that is scheduled for that FU.
Example of horizontal scaling of similar FU each with its own RS bank.
https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png
With valueless RS, each RS stores only the source register number of
its operands and each FU has to be able to read all its operands
when a uOp launches (begins execution). This means the number of
PRF read ports scales according to the total number of FU operands.
(One could do read port sharing but then you have to schedule that too
and could have contention.) Also if an FU is unused on any cycle then
all its (expensive) operand read ports are unused.
I always had RSs keep track of which FU was delivering the final
operand, so that these could be picked up by the forwarding logic
and not need an RF port. This gets rid of 50%-75% of the RF port
needs.
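In rough numbers (a sketch only; the capture fractions are the 50%-75%
quoted above, applied to the 42-port worst case from the example):

    # RF read ports still needed when forwarding delivers the rest.
    total_operands = 14 * 3                      # valueless worst case: 42
    for captured in (0.50, 0.75):
        print(total_operands * (1 - captured))   # -> 21.0, then 10.5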
Using the above Oryon as an example, with valueless RS, launching
all 14 FU with 3 operands all at once needs 42 read ports.
With valued RS, the operand values are stored in each RS and, if ready,
are read at Dispatch (the hand-off from the front end to the RS bank),
or are received from the forwarding network if still in flight at
Dispatch time.
Delivering the result at dispatch time.
The number of PRF read ports scales with the number of dispatched uOp
operands. Since the operand values are stored in each RS, each bank
can then schedule and launch independently.
The width of the decoder is narrower than the width of the data path.
We used to call this "catch up bandwidth".
With valued RS, dispatching 6 wide with 3 operands needs 18 read ports,
First, a 6-wide machine is not doing 6 3-operand instructions;
it is more like 3 memory ops (2-reg+displacement), one 3-op,
one general 2-op, and one 1-op (branch), so you only need 12 ports
(3*2 + 3 + 2 + 1) instead of 18 most of the time.
The penalty is that each RS entry is 5× the size of the value-free
RS designs. These work just fine when the execution window is
reasonable (say 96 instructions) but fail when the window is
larger than 150-ish.
and the read ports are potentially usable for all dispatches.
Then all 14 FU can launch at once independently.
One should also note that these machines deliver 1-2 I/c RMS
regardless of their Fetch-Decode-FU widths.
Each FU can also have two kinds of valued RS banks:
a simple one for uOps whose operands are all ready at Dispatch, as this
does not need a wake-up matrix entry or the ability to receive forwarded
values, and a complex one that monitors the wake-up matrix and
forwarding buses.
If all the operands are ready, the Dispatcher can choose either RS bank
for the FU, giving preference to the simpler one. If the operands are
not all ready, the Dispatcher selects from the complex bank.
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method for
SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
On Mon, 22 Jul 2024 11:01:15 +0000, Michael S wrote:
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x)
equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method
for SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
There are certain N-R iterations that can be expressed with both::
NR(i+1) = F( NR(i), SQRT() )
and
NR(i+1) = F'( NR(i), RSQRT() )
Typically the one with RSQRT() converges slightly faster than the
one using SQRT(). How much is slightly:: maybe ½-1 more bit per
iteration.
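For concreteness, one standard instance of each form (a sketch, with an
arbitrary seed and step count; these are the textbook iterations, not
anyone's production code):

    def rsqrt_step(r, x):
        # Newton-Raphson on f(r) = 1/r**2 - x: multiplies and adds
        # only, which maps directly onto an FMA pipeline.
        return r * (1.5 - 0.5 * x * r * r)

    def heron_step(y, x):
        # Newton-Raphson on f(y) = y**2 - x (Heron's rule): needs a
        # divide per step, which is what makes the direct form slow.
        return 0.5 * (y + x / y)

    x, r = 2.0, 0.7          # rough seed for 1/sqrt(2)
    for _ in range(4):
        r = rsqrt_step(r, x)
    print(x * r)             # ~1.41421356..., sqrt(2) via one last fmul

    y = 1.5                  # rough seed for sqrt(2)
    for _ in range(4):
        y = heron_step(y, x) # same answer, but each step divides
    print(y)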
MitchAlsup1 wrote: [...]
(Nobody bats an eye at using kilobytes for the branch predictor.)
One should also note that these machines deliver 1-2 I/c RMS
regardless of their Fetch-Decode-FU widths.
Yes, it's mostly all for naught.
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method for
SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
Michael S wrote:
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x)
equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method
for SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
sqrt(x) = x/sqrt(x) = x*rsqrt(x)
I.e. calculate rsqrt(x) to the precision you need and then do a
single fmul?
Terje
On Mon, 22 Jul 2024 15:01:10 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
There are certain N-R iterations that can be expressed with both::
NR(i+1) = F( NR(i), SQRT() )
and
NR(i+1) = F'( NR(i), RSQRT() )
Typically the one with RSQRT() converges slightly faster than the
one using SQRT(). How much is slightly:: maybe ½-1 more bit per
iteration.
As I said above, I am not aware of the NR iterations for SQRT() that
consist of fmul, fadd and nothing else.
If you know one, please tell me.
On Tue, 23 Jul 2024 14:35:20 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x)
equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method
for SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
sqrt(x) = x/sqrt(x) = x*rsqrt(x)
I.e. calculate rsqrt(x) to the precision you need and then do a
single fmul?
Terje
That much I know.
When precise rounding is not required, I even know a slightly better
"combined" method:
1. Do N-1 iterations for RSQRT delivering r0 with 2 or 3 more
significant bits than n/2
2. Calculate SQRT estimate as y0 = x*r0
3. Do last iteration using both y0 and r0 as y = y0 + (x-y0*y0)*0.5*r0.
That would give max. error of something like 0.51 ULP.
A similar combined method could be useful for software calculation of
correctly rounded quad-precision sqrt as well. In this case it serves
as a 'conditionally last step' rather than an 'absolutely last step'.
I was hoping that you'd tell me about an NR formula for SQRT itself
that can be applied recursively.
The only formula that I know is y = (y*y + x)/(y*2). Obviously, when
speed matters this formula is not useful.
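Putting the three steps above into code (a sketch; the seed and
iteration count are illustrative and nothing here handles correct
rounding):

    def sqrt_combined(x, r, rsqrt_steps=3):
        # Steps 1-2: refine r ~ 1/sqrt(x) with mul/add-only N-R
        # iterations, then form the sqrt estimate with one multiply.
        for _ in range(rsqrt_steps):
            r = r * (1.5 - 0.5 * x * r * r)
        y0 = x * r
        # Step 3: one correction using both y0 and r; the (x - y0*y0)
        # term would be a fused multiply-add in hardware.
        return y0 + (x - y0 * y0) * 0.5 * r

    print(sqrt_combined(2.0, 0.7))   # ~1.4142135623730951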
On 7/18/24 12:40 PM, MitchAlsup1 wrote:
[snip]
Over the 20 cycles the multiplier is doing Goldschmidt iterations,
there are only 3 slots where a different instruction could sneak
through.
Since initial iterations could (I think) use reduced precision, it
might also be possible to use part of the multiplier for a
different lower precision multiplication.
(ISTR, an AMD implementation used different precision for
different steps.)
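A sketch of the Goldschmidt structure under discussion (plain Python at
full precision throughout; real hardware would use a table-lookup seed
and, as noted, could widen the precision step by step):

    def goldschmidt_div(a, b, r, steps=4):
        # r is a rough seed for 1/b (a lookup table in hardware).
        n, d = a * r, b * r      # scale numerator and denominator
        for _ in range(steps):
            f = 2.0 - d          # correction factor
            n, d = n * f, d * f  # two dependent multiplies per step;
                                 # d converges to 1, n to a/b
        return n

    print(goldschmidt_div(1.0, 3.0, 0.3))   # ~0.33333333...

Because each step depends on the previous one, a pipelined multiplier
is occupied by in-flight correction multiplies with only a few free
slots, which is the effect described above.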
Stefan Monnier wrote:
If the FP multiplier is a 4-stage pipeline, and FDIV is iterating
using the multiplier, can the pipeline get a mix of multiple
operations going at once? FDIV for both Newton–Raphson and
Goldschmidt iterates serially so each can only use one of the 4
pipeline slots.
Something I've been wondering for a while, indeed.
IOW, is there enough parallelism inside the FDIV/SQRT "microcode"
to keep the FMAC fully busy (my naive understanding is that there
isn't)? If not, do current CPUs make the FMAC available for other
operations while an FDIV/SQRT is in progress? If not, how hard
would it be?
Stefan
And if they can't mix then to what extent can the end of one op,
as it drains from the pipeline, overlap with the start of the next?
Obviously FMUL can pipeline with FMUL but can the next FMUL overlap
with the end of a prior FDIV? An EXP?
I was thinking about reservation station schedulers and wondering
what they might have to optimize.