Has there ever been a hardware architecture that managed the flow of
control via “continuations”?
That is, you do away with the hardware concept of a stack. Instead, you
have call frames that, while defined to some extent by the architecture,
can be located anywhere in memory (allocation managed by the OS, runtime
etc as part of the ABI). A call frame has a current program counter, and maybe some other context like local variables and a static link for
lexical binding. Instead of a “return from subroutine” instruction, you have a “load new call frame” instruction.
You might have a location defined in a call frame for a “pointer to parent call frame” field, in which case “return from subroutine” just loads this
pointer into the current-call-frame register. But you could just as easily have pointers to other call frames defining other kinds of
interrelationships between them. And note that transferring to a different call frame does not automatically invalidate the previous one. If it stays valid, then there is no reason why you couldn’t, at some point, come back to it and resume execution from where it left off.
The beauty of continuations is that they are a single generalized control construct that can be used to implement specific language features like regular routine calls, loops, exceptions and coroutines, all built from
the same common abstraction. One thing that is difficult to do with them
is arbitrary gotos. (I consider that a feature, not a bug.)
Very few high-level languages (outside of the Lisp family, anyway) seem to have implemented continuations as an explicit language concept. This is an integral part of Scheme, not so much it seems of non-Scheme Lisps. I implemented it in my PostScript revival language <https://bitbucket.org/ldo17/gxscript/>, and I am still trying to come up with a useful example, like for instance an implementation of coroutines, that doesn’t do my head in. ;)
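As a concrete (if simplified) picture of what such an ABI might look
like, here is a minimal C model; the trampoline loop stands in for the
hardware's current-call-frame register and "load new call frame"
instruction, and every name in it is invented for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    /* A call frame: lives anywhere in memory, and holds its own resume
       point (program counter) plus a pointer to its parent frame. */
    struct frame {
        void (*pc)(struct frame *);  /* where to resume this frame */
        struct frame *parent;        /* "pointer to parent call frame" */
        int local;                   /* an example local variable */
    };

    static struct frame *current;    /* the current-call-frame register */

    static void child(struct frame *f) {
        printf("child, local = %d\n", f->local);
        current = f->parent;         /* "return" = load the parent frame */
        free(f);                     /* freeing is a policy choice: keep
                                        the frame and it can be resumed */
    }

    static void parent_body(struct frame *f) {
        struct frame *c = malloc(sizeof *c);
        c->pc = child;
        c->parent = f;
        c->local = 42;
        f->pc = NULL;                /* parent is finished when resumed */
        current = c;                 /* "load new call frame" = the call */
    }

    int main(void) {
        struct frame top = { parent_body, NULL, 0 };
        current = &top;
        while (current && current->pc)   /* trampoline standing in for HW */
            current->pc(current);
        return 0;
    }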
In article <v6tbki$3g9rg$1@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:
The beauty of continuations is that they are a single generalized
control construct that can be used to implement specific language
features like regular routine calls, loops, exceptions and coroutines,
all built from the same common abstraction. One thing that is difficult
to do with them is arbitrary gotos. (I consider that a feature, not a
bug.)
Just occasionally, you need a goto to get out of a complicated error situation. Not often, but sometimes.
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
There is basically no real way to get dynamically allocated
stack-frames and argument copying to be performance competitive with adjusting a stack pointer and passing arguments in registers.
Function calls are also, not exactly rare.
Worse still, one is very likely going to need a garbage collector: in
the off chance anyone actually uses continuations, once a continuation
is captured via call/cc or similar, its frames can no longer be freed
automatically.
On 7/13/2024 2:50 AM, Lawrence D'Oliveiro wrote:
Has there ever been a hardware architecture that managed the flow of
control via "continuations"?
Doing an ABI based around continuations would basically wreck
performance; conventional thread-based multitasking is cheaper to
implement and gives better performance.
On Sat, 13 Jul 2024 17:42 +0100 (BST), John Dallman wrote:
In article <v6tbki$3g9rg$1@dont-email.me>, ldo@nz.invalid (Lawrence
D'Oliveiro) wrote:
The beauty of continuations is that they are a single generalized
control construct that can be used to implement specific language
features like regular routine calls, loops, exceptions and coroutines,
all built from the same common abstraction. One thing that is difficult
to do with them is arbitrary gotos. (I consider that a feature, not a
bug.)
Just occasionally, you need a goto to get out of a complicated error
situation. Not often, but sometimes.
That’s what exceptions are for. And they are handled by continuations as
well.
At its simplest, a continuation is simply "the next instruction to be
executed", so any time you take/save a code address and then jump to
it you have executed a continuation.
According to George Neuner <gneuner2@comcast.net>:
At its simplest, a continuation is simply "the next instruction to be
executed", so any time you take/save a code address and then jump to
it you have executed a continuation.
This programming style is very old. The SAGE air defense system used
it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines each of which took some input, did
something with it, perhaps scheduled other I/O operations, and then explicitly saved its state and returned to the dispatcher.
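Neuner's "simplest" form has a direct C rendering in setjmp/longjmp,
which captures a one-shot, escape-only continuation (a sketch only; a
full call/cc is considerably more than this):

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf resume_point;        /* the saved "next instruction" */

    static void worker(void) {
        printf("worker: bailing out\n");
        longjmp(resume_point, 1);       /* invoke the saved continuation */
        /* never reached */
    }

    int main(void) {
        if (setjmp(resume_point) == 0)  /* capture the continuation */
            worker();
        else
            printf("back in main via the continuation\n");
        return 0;
    }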
This programming style is very old. The SAGE air defense system used
it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines each of which took some input, did
something with it, perhaps scheduled other I/O operations, and then
explicitly saved its state and returned to the dispatcher.
I think that looks more like a multi-tasking system with
run-to-completion, non-preemptive (small) tasks, considering each
"routine" as a task and assuming that each routine started with
recovering its saved context.
The main difference to present-day multi-tasking systems is then that
the context-saving and -restoring was implemented separately and
specifically in each task, and not in a shared "task switch" function.
Could you explain a bit about the "dispatcher" and how it decided which
routine to execute next? Perhaps depending on which I/O had completed?
But I can't immediately think of any language with
continuations and without GC. They kinda go together.
On Sat, 13 Jul 2024 07:50:42 -0000 (UTC), Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
One thing that is difficult to do with them is arbitrary gotos. (I
consider that a feature, not a bug.)
It's a bug.
If a programming language wants to disallow arbitrary gotos in order to promote better quality code writing, that's great.
But the machine language of a computer has got to allow _any_ language
to be implemented on it. Including C, or old-style FORTRAN. Otherwise,
one doesn't have a general-purpose machine.
According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
Could you explain a bit about the "dispatcher" and how it decided which
routine to execute next? Perhaps depending on which I/O had completed?
As best I understand it, SABRE would receive a message from one of the terminals and dispatch it to a routine based on the first few
characters that said what kind of request it was. The routine would
then do a bunch of work consisting of computing and disk I/O. Whenever
it scheduled a disk operation, it'd explicitly save its state and
return to the dispatcher. When the disk I/O was done, it'd schedule
the next part of the task which restored the state and continued.
Repeat until the task sends a reply to the terminal and is done.
Airline reservation and credit card systems still work this way, and
have amazing transaction rates, like 1000 updates/sec to an individual database record.
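In outline, that structure is just an event loop over explicitly saved
state; a toy C sketch (all names invented, and with the "I/O" completing
instantly rather than being selected by the dispatcher among many
pending transactions):

    #include <stdio.h>

    /* One pending transaction: the explicitly saved state is simply the
       next routine to run plus whatever context it needs. */
    struct txn {
        void (*next)(struct txn *);  /* the saved resume point */
        int record;                  /* example saved context */
    };

    static void finish(struct txn *t) {
        printf("reply for record %d sent\n", t->record);
        t->next = NULL;              /* transaction complete */
    }

    static void start(struct txn *t) {
        printf("scheduling disk read of record %d\n", t->record);
        t->next = finish;            /* save state, return to dispatcher */
    }

    int main(void) {
        struct txn t = { start, 17 };
        while (t.next)               /* the dispatcher */
            t.next(&t);              /* run whichever step is ready */
        return 0;
    }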
According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
This programming style is very old. The SAGE air defense system used
it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines each of which took some input, did
something with it, perhaps scheduled other I/O operations, and then
explicitly saved its state and returned to the dispatcher.
I think that looks more like a multi-tasking system with
run-to-completion, non-preemptive (small) tasks, considering each
"routine" as a task and assuming that each routine started with
recovering its saved context.
Right, that saved context is the continuation. Like I said, the tools
these days take care of a lot of the drudge work, e.g. node.js.
The main difference to present-day multi-tasking systems is then that
the context-saving and -restoring was implemented separately and
specifically in each task, and not in a shared "task switch" function.
Not really. The context is specific to each routine. It's not like time-slicing, it's do as much work as you can and then block for I/O.
The context you save depends on what you're doing.
On Sat, 13 Jul 2024 11:16:45 -0400, EricP wrote:
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal
decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the
hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
Was that the RS/6000? That was quite a step forward at the time (and for
some years after), even among all the other new RISC-based architectures appearing around that time.
What OS/software were you trying above? Because it doesn’t sound like regular C code on a Unix-type system.
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
Airline reservation and credit card systems still work this way, and
have amazing transaction rates, like 1000 updates/sec to an individual
database record.
In a classic state machine there is a loop around a CASE statement
based on the current state number. The last thing each state case
does is assign the next state number to jump to.
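For instance, a minimal C version of that loop (states and actions
invented for illustration):

    #include <stdio.h>

    enum state { START, WORK, DONE };

    int main(void) {
        enum state s = START;
        int n = 0;
        for (;;) {
            switch (s) {               /* the CASE statement */
            case START:
                n = 3;
                s = WORK;              /* last act: pick the next state */
                break;
            case WORK:
                printf("working, %d left\n", n);
                s = (--n > 0) ? WORK : DONE;
                break;
            case DONE:
                return 0;
            }
        }
    }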
Each current local context could be allocated from a heap.
But be careful what you wish for.
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal
decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the
hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
According to EricP <ThatWouldBeTelling@thevillage.com>:
Each current local context could be allocated from a heap.
But be careful what you wish for.
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal
decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the
hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
Maybe it was an IBM thing. In the 1960s the IBM 360 Algol F compiler,
which I always assumed was written because some European institution
had it on a checklist rather than because they expected anyone to use
it, did the same thing. Every time it entered a block it did a GETMAIN
system call, and when it left the block it did a FREEMAIN, with all of
the performance you'd expect.
Someone at Princeton patched it to do more sane allocation where it
did a big GETMAIN and suballocated from that until it filled up and
then did another one. I never wrote enough Algol to tell how well it
worked.
The compiler was a marvel of passive aggressive standards
implementation. The keywords were all quoted 'BEGIN' 'END' and unless
you told it otherwise it used the 48 character set, so a semicolon was
., and assignment was .= with one dot, not two. A squared was
A'POWER'2. I/O was procedures that used dataset numbers, 0 was the
input deck, 1 was the printer, and you skipped to the next output
record by padding out the current record to 80 spaces.
'COMMENT'TOO BAD THEY DID NOT USE 'TNEMMOC'.,
Yuck.
Thomas Koenig <tkoenig@netcologne.de> writes:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
$ ulimit -s unlimited
Should handle the default stack.
Of course, it still needs to fit into whatever address space
the target processor offers, but even with a 48-bit VA,
there should be enuf VA space for several terabytes of
stack.
In any case, one can mmap an arbitrarily large region of
memory and assign it to the stack pointer in (or even before)
main().
Lawrence D'Oliveiro wrote:
On Sat, 13 Jul 2024 11:16:45 -0400, EricP wrote:
We were playing with an IBM R6000 around 1990 which was supposed to
be this fast risc machine but we found it ran slower than our VAX.
We disassembled the code and found out that instead of a normal
decrement/increment stack it used a linked list of heap allocated
objects. Every call did a malloc and every return did a free. All the
hardware speed improvement had been squandered on this completely
unnecessary ABI overhead.
Was that the RS/6000? That was quite a step forward at the time (and
for some years after), even among all the other new RISC-based
architectures appearing around that time.
What OS/software were you trying above? Because it doesn’t sound like
regular C code on a Unix-type system.
Yes, an RS6000. I don't remember what language, Fortran or C, likely on
AIX.
... In the 1960s the IBM 360 Algol F compiler,
which I always assumed was written because some European institution
had it on a checklist rather than because they expected anyone to use
it, did the same thing. Every time it entered a block it did a GETMAIN
system call, and when it left the block it did a FREEMAIN, with all of
the performance you'd expect.
This could also have been a RT PC as our port project was
running around 1987 and straddled the RS6000 launch in 1990.
(so I might have got my 'R's mixed up.)
And according to Wikipedia RT's were notoriously slow.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
(Please note that the sentence above was DeepL-filtered, on
"diplomatic" setting).
IMO, for something to qualify as a continuation-passing call, the caller should select the callee -- the thing to be executed next -- but in SAGE/SABRE it is the scheduler that selects what to execute next.
I think of continuation passing as always ending a subprogram by calling
some other subprogram, passing it one or more "continuation" subprograms
as parameters, and expecting that callee to do something and then to "continue" by further calling one of those "continuations" (with further "continuation" parameters) and NOT to return to the original caller,
ever.
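A C rendering of that style (a sketch; note that without guaranteed
tail-call elimination the C stack still grows at each such call, which
is one reason CPS implementations tend to want heap frames and GC):

    #include <stdio.h>
    #include <stdlib.h>

    typedef void (*cont)(int result);   /* "what to do next" */

    /* CPS: 'add' never returns a result to its caller; it hands the
       result to the continuation k instead. */
    static void add(int a, int b, cont k) {
        k(a + b);
    }

    static void print_result(int r) {
        printf("sum = %d\n", r);
        exit(0);                        /* end of the computation chain */
    }

    int main(void) {
        add(2, 3, print_result);        /* pass the continuation explicitly */
        /* never reached */
        return 0;
    }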
The effect is that code bounces from action routine to action routine
carrying its execution context with it, waiting for something external,
evaluating the result and deciding what action to do next.
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
$ ulimit -s unlimited
Should handle the default stack.
Some operating systems put a hard limit on this, for example MacOS,
and some an absurdly low limit on per-thread stack size (512 K
or something like that).
Of course, it still needs to fit into whatever address space
the target processor offers, but even with a 48-bit VA,
there should be enuf VA space for several terabytes of
stack.
In any case, one can mmap an arbitrarily large region of
memory and assign it to the stack pointer in (or even before)
main().
Changing the standard startup sequence of programs (i.e. replacing
_start) - that would be heavy messing, and be no longer compatible
with standard startup sequences.
And for threads, this would get even messier.
On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:
IMO, for something to qualify as a continuation-passing call, the caller
should select the callee -- the thing to be executed next -- but in
SAGE/SABRE it is the scheduler that selects what to execute next.
What you have is that SAGE/SABRE has built a higher-level scheduler
abstraction on top of lower-level coroutines/continuations.
On Mon, 15 Jul 2024 14:12:32 -0400, EricP wrote:
The effect is that code bounces from action routine to action routine
carrying its execute context with it, waiting for something external,
evaluating the result and deciding what action to do next.
The thing that coroutines bring to the table is that you don’t have to break up the code that executes in this common context into separate
pieces that are each invoked via a callback and end in a reschedule
call.
Instead, you just write it as a single block, with explicit “await”
calls at the various points along the way where rescheduling happens.
The net result is the logic is much easier to follow, and there is often
less of it.
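One way to see that in C is with POSIX ucontext (obsolescent in the
standard but still widely available, e.g. in glibc): the task reads as
one straight-line block, and each await() is an explicit reschedule
point (a sketch, not production code):

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t sched, co;
    static char co_stack[64 * 1024];

    static void await(void) {            /* explicit reschedule point */
        swapcontext(&co, &sched);
    }

    static void task(void) {             /* a single block of logic... */
        printf("step 1\n");
        await();                         /* ...with explicit awaits */
        printf("step 2\n");
        await();
        printf("step 3\n");
    }                                    /* falls through to uc_link */

    int main(void) {
        getcontext(&co);
        co.uc_stack.ss_sp   = co_stack;
        co.uc_stack.ss_size = sizeof co_stack;
        co.uc_link          = &sched;    /* where control goes at the end */
        makecontext(&co, task, 0);
        for (int i = 0; i < 3; i++) {
            printf("scheduler resumes task\n");
            swapcontext(&sched, &co);    /* run until next await or end */
        }
        return 0;
    }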
On Mon, 15 Jul 2024 23:55:21 +0000, Lawrence D'Oliveiro wrote:
The thing that coroutines bring to the table is that you don’t have to
break up the code that executes in this common context into separate
pieces that are each invoked via a callback and end in a reschedule
call. Intead, you just write it as a single block, with explicit
“await” calls at the various points along the way where rescheduling
happens.
Is that still valuable if it cost 50× what a subroutine call costs ???
According to Lawrence D'Oliveiro <ldo@nz.invalid>:
On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:
IMO, for something to qualify as a continuation-passing call, the
caller should select the callee -- the thing to be executed next --
but in SAGE/SABRE it is the scheduler that selects what to execute
next.
What you have is that SAGE/SABRE has built a higher-level scheduler
abstraction on top of lower-level coroutines/continuations.
It was an event loop, not coroutines.
On Mon, 15 Jul 2024 23:56:00 -0000 (UTC), John Levine wrote:
According to Lawrence D'Oliveiro <ldo@nz.invalid>:
On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:
IMO, for something to qualify as a continuation-passing call, the
caller should select the callee -- the thing to be executed next --
but in SAGE/SABRE it is the scheduler that selects what to execute
next.
What you have is that SAGE/SABRE has built a higher-level scheduler
abstraction on top of lower-level coroutines/continuations.
It was an event loop, not coroutines.
The one does not preclude the other. So they lacked the coroutine
abstraction; that could have been added easily enough in the application
implementation language.
Just for amusement, what do you think the application implementation languages for SAGE and SABRE were?
Thomas Koenig <tkoenig@netcologne.de> writes:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Anyway, at the time I made a mental note about the relative cost of
heap allocation of stack space in an ABI.
I wonder if I might mention my dislike of arbitrary limits on
stack size that are imposed on modern 64-bit systems.
$ ulimit -s unlimited
Should handle the default stack.
Some operating systems put a hard limit on this, for example MacOS,
and some an absurdly low limit on per-thread stack size (512 K
or something like that).
MacOS has been obsolete for a decade now.
OSX has an initial
size of 8MB and a default configured hard limit of 64GB.
Of course, it still needs to fit into whatever address space
the target processor offers, but even with a 48-bit VA,
there should be enuf VA space for several terabytes of
stack.
In any case, one can mmap an arbitrarily large region of
memory and assign it to the stack pointer in (or even before)
main().
Changing the standard startup sequence of programs (i.e. replacing
_start) - that would be heavy messing, and be no longer compatible
with standard startup sequences.
That depends on the program, I guess.
A little inline assembler
in main should suffice;
An mmap'ed region will automatically
allocate on first touch, so the stack can grow as necessary.
And for threads, this would get even messier.
pthreads already get their own stack, and have standard functions
to specify stack pointer and stack size (e.g. pthread_attr_setstack(2)).
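A Linux/POSIX sketch of exactly that (error handling mostly omitted;
compile with -lpthread; the anonymous mmap region is backed lazily, on
first touch):

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define STACK_SIZE (1UL << 30)   /* 1 GiB: pick whatever you need */

    static void *worker(void *arg) {
        (void)arg;
        printf("running on a big mmap'd stack\n");
        return NULL;
    }

    int main(void) {
        void *stk = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (stk == MAP_FAILED)
            return 1;

        pthread_attr_t a;
        pthread_attr_init(&a);
        pthread_attr_setstack(&a, stk, STACK_SIZE);  /* thread uses it */

        pthread_t t;
        pthread_create(&t, &a, worker, NULL);
        pthread_join(t, NULL);
        return 0;
    }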
John Levine wrote:
In a classic state machine there is a loop around a CASE statement
based on the current state number. The last thing each state case
does is assign the next state number to jump to.
It only has to be Turing-equivalent (or Lisp-equivalent) to be a
general-purpose machine. There’s nothing in that criterion that requires
gotos.
Slightly different topic: Unfortunately, Apple decided in their
wisdom that an mmapped region cannot grow, so you have to make a
guess at the needed size. As to why, I have no idea; they probably
wanted to follow the example of Windows. Learning from Microsoft
is learning to win!
On Tue, 16 Jul 2024 02:03:38 -0000 (UTC), John Levine wrote:
Just for amusement, what do you think the application implementation
languages for SAGE and SABRE were?
Assembler with macros.
On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
It only has to be Turing-equivalent (or Lisp-equivalent) to be a
general-purpose machine. There’s nothing in that criterion that requires
gotos.
It's true that Turing-equivalence is sufficient to allow a computer
with sufficient storage available to do anything.
However, the term "general-purpose" can have other meanings. The sense
in which I was using it was: a machine that is _efficient_ at running programs written in all popular programming languages for the kinds of applications for which computers are normally used. Thus, the IBM
System/360 was general-purpose since it had string manipulation
instructions and decimal arithmetic in addition to floating-point and
binary integer arithmetic.
John Savard
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Some operating systems put a hard limit on this, for example MacOS,
and some an absurdly low limit on per-thread stack size (512 K or
something like that).
MacOS has been obsolete for a decade now.
Sorry for messing up my terminology.
On Tue, 16 Jul 2024 2:47:05 +0000, Lawrence D'Oliveiro wrote:
On Tue, 16 Jul 2024 02:03:38 -0000 (UTC), John Levine wrote:
Just for amusement, what do you think the application implementation
languages for SAGE and SABRE were?
Assembler with macros.
It is rare, today, for programmers to be aware of the power of a good
macro assembler, and how close these tend to be to a real low level
language (like K&R C) but without the limitations.
Does a continuation in a block structured language NEED everything on
the block-stack, or is it more like 2-10 variables at rather well
defined locations ?? more like a functor than a continuation.
I think that history shows that those string facilities were, at best,
overkill. None of the RISC machines had any of that and were faster
and at least as general purpose as 360.
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
On the other hand, I am advocating that some Elementary Functions be
raised to first class citizens of the ISA so that SIN() is 20 cycles
instead of 150 cycles. Why? Because today, we understand enough about
the calculations, polynomials, and error sources to build in HW things
that are just as accurate as SW can build and nearly 7×-10× faster.
On Tue, 16 Jul 2024 20:51:17 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
I think that history shows that those string facilities were, at best,
overkill. None of the RISC machines had any of that and were faster
and at least as general purpose as 360.
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
I will agree with that; the 68000 and the x86 don't have those things
either.
Even if my personal inclination has been to include them too.
On the other hand, I am advocating that some Elementary Functions be
raised to first class citizens of the ISA so that SIN() is 20 cycles
instead of 150 cycles. Why? Because today, we understand enough about
the calculations, polynomials, and error sources to build in HW things
that are just as accurate as SW can build and nearly 7×-10× faster.
I thought this was already the case; the 8087 had transcendental
functions, and so did the 486 DX and from that onwards, and this was
also the case with the 68881 and 68882.
Of course, even if they're citizens of the ISA, they could be
second-class citizens, whatever the definition of that might be.
Here, my personal inclination is to simply include the raw CORDIC
calculation as an instruction, with bounds checking, special cases,
and any fancy stuff in software. On the grounds that some things are
too complicated for a classic old-style ISA.
That, too, is of course a bad decision from a practical perspective.
John Savard
On Tue, 16 Jul 2024 16:00:59 +0000, John Savard wrote:
On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
It only has to be Turing-equivalent (or Lisp-equivalent) to be a
general-purpose machine. There’s nothing in that criterion that requires
gotos.
It's true that Turing-equivalence is sufficient to allow a computer
with sufficient storage available to do anything.
But we also want that stuff to be fast (under some metric).
However, the term "general-purpose" can have other meanings. The
sense in which I was using it was: a machine that is efficient at
running programs written in all popular programming languages for
the kinds of applications for which computers are normally used.
Thus, the IBM System/360 was general-purpose since it had string manipulation instructions and decimal arithmetic in addition to floating-point and binary integer arithmetic.
I think that history shows that those string facilities were, at best, overkill. None of the RISC machines had any of that and were faster
and at least as general purpose as 360.
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
MitchAlsup1 wrote:
On Tue, 16 Jul 2024 16:00:59 +0000, John Savard wrote:
On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
It only has to be Turing-equivalent (or Lisp-equivalent) to be a
general-purpose machine. There’s nothing in that criterion that requires
gotos.
It's true that Turing-equivalence is sufficient to allow a computer
with sufficient storage available to do anything.
But we also want that stuff to be fast (under some metric).
However, the term "general-purpose" can have other meanings. The
sense in which I was using it was: a machine that is efficient at
running programs written in all popular programming languages for
the kinds of applications for which computers are normally used.
Thus, the IBM System/360 was general-purpose since it had string
manipulation instructions and decimal arithmetic in addition to
floating-point and binary integer arithmetic.
I think that history shows that those string facilities were, at best,
overkill. None of the RISC machines had any of that and were faster
and at least as general purpose as 360.
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
Regarding EDIT, etc., I think there are three possibilities:
1. They were a bad idea from the start and never should have been put
into S/360.
2. Putting them into S/360 was the right decision at the time, but the
changing technology (i.e. they wouldn't fit into the initial CISC
microprocessors and RISC showed the functionality could be done other
ways) made putting them into newer designs a bad idea.
3. Putting them into S/360 was the right decision at the time but the
workloads changed. There was less requirement for things like actually
printing checks and general ledger reports, and programmers moved away
from COBOL, which was where EDIT was a natural fit, to languages where
there wasn't as natural a fit, so not putting EDIT into newer CPUs was
the right decision.
I suspect it was mostly number 3, but I think number 2 was a part of it.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles of FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
A single calculation took _months_, and he didn't use weak hardware.
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
"Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
MitchAlsup1 wrote:
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
Regarding EDIT, etc., I think there are three possibilities:
1. They were a bad idea from the start and never should have been put
into S/360.
2. Putting them into S/360 was the right decision at the time, but the
changing technology (i.e. they wouldn't fit into the initial CISC
microprocessors and RISC showed the functionality could be done other
ways) made putting them into newer designs a bad idea.
3. Putting them into S/360 was the right decision at the time but the
workloads changed. There was less requirement for things like actually
printing checks and general ledger reports, and programmers moved away
from COBOL, which was where EDIT was a natural fit, to languages where
there wasn't as natural a fit, so not putting EDIT into newer CPUs was
the right decision.
I suspect it was mostly number 3, but I think number 2 was a part of it.
I also suspect it was (3). Burroughs medium systems (B3500 etc) used
the edit instruction (EDT) extensively for COBOL.
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles of FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
A single calculation took months, and he didn't use weak hardware.
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
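In C the same computation is (each step feeding the next, so the chain
itself is serial; any speedup beyond that has to come from overlapping
independent iterations, e.g. with several FMAC units or a vectorized
exp from a math library):

    #include <math.h>

    /* k = A * exp(-Ea / (R*T)), matching the FMUL/FDIV/EXP/FMUL chain. */
    double arrhenius(double A, double Ea, double R, double T) {
        double t = R * T;     /* FMUL */
        t = -Ea / t;          /* FDIV */
        t = exp(t);           /* EXP  */
        return A * t;         /* FMUL */
    }

    /* Independent iterations can overlap even though each chain is
       serial, which is where parallel exp hardware would pay off. */
    void arrhenius_n(double *k, const double *A, const double *Ea,
                     double R, const double *T, int n) {
        for (int i = 0; i < n; i++)
            k[i] = A[i] * exp(-Ea[i] / (R * T[i]));
    }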
On Wed, 17 Jul 2024 17:09:44 +0000, Scott Lurndal wrote:
"Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
MitchAlsup1 wrote:
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
Regarding EDIT, etc., I think there are three possibilities:
1. They were a bad idea from the start and never should have been put
into S/360.
2. Putting them into S/360 was the right decision at the time, but the
changing technology (i.e. they wouldn't fit into the initial CISC
microprocessors and RISC showed the functionality could be done other
ways) made putting them into newer designs a bad idea.
3. Putting them into S/360 was the right decision at the time but the
workloads changed. There was less requirement for things like actually
printing checks and general ledger reports, and programmers moved away
from COBOL, which was where EDIT was a natural fit, to languages where
there wasn't as natural a fit, so not putting EDIT into newer CPUs was
the right decision.
I suspect it was mostly number 3, but I think number 2 was a part of it.
I also suspect it was (3). Burroughs medium systems (B3500 etc) used
the edit instruction (EDT) extensively for COBOL.
I suspect that as compute power grew, the kinds of things we wanted out
of EDIT changed.
negative numbers printed in red
negative numbers surrounded with parens (-123.45)
decimal point and comma LOCAL selection
NaN
INFINITY
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
A single calculation took months, and he didn't use weak hardware.
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and the
EXP instruction can be done in parallel?
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations
which included fluid flow (including turbulence modelling and diffusion) and quite a few chemical reactions together. So, he evaluated a huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
evaluating the Arrhenius equation above many, many, many, many
times.
A single calculation took months, and he didn't use weak
hardware.
A fully pipelined evaluation of, let's say, four parallel exp
and four parallel fdiv instructions would have reduced his
calculation time by orders of magnitude, and allowed him to
explore the design space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the Arrhenius equation into your ISA, I would have done so already
:-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and
the EXP instruction can be done in parallel?
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
They are all performed in the FMAC unit and here the instructions are serially dependent.
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 17 Jul 2024 17:09:44 +0000, Scott Lurndal wrote:
"Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
MitchAlsup1 wrote:
Especially the COBOL stuff like EDIT and EDIT-and-MARK.
Regarding EDIT, etc., I think there are three possibilities:
1. They were a bad idea from the start and never should have
been put into S/360.
2. Putting them into S/360 was the right decision at the time,
but the changing technology (i.e. they wouldn't fit into the
initial CISC microprocessors and RISC showed the functionality
could be done other ways) made putting them into newer designs
a bad idea.
3. Putting them into S/360 was the right decision at the time
but the workloads changed. There was less requirement for
things like actually printing checks and general ledger
reports, and programmers moved away from COBOL, which was where
EDIT was a natural fit, to languages where there wasn't as natural
a fit, so not putting EDIT into newer CPUs was the right decision.
I suspect it was mostly number 3, but I think number 2 was a
part of it.
I also suspect it was (3). Burroughs medium systems (B3500 etc) used
the edit instruction (EDT) extensively for COBOL.
I suspect that as compute power grew, the kinds of things we wanted
out of EDIT changed.
negative numbers printed in red
negative numbers surrounded with parens (-123.45)
decimal point and comma LOCAL selection
NaN
INFINITY
EDIT/EDT instructions were designed from the start for
COBOL fixed point arithmetic. No Nan, No Infinities.
COBOL had verbs to assign the correct characters as
the decimal point and thousands separator (on the
Burroughs B3500 those characters were stored in reserved
low memory which was accessed by the EDT instruction when
processing the edit rule(s)).
Printing in various colors or print trains were not the
responsibility of the COBOL program, but rather the operator
who mounted the print chain/train and/or ribbon for
the print job.
For certain applications, an edt instruction would still
be useful, but probably wouldn't be significantly more
time efficient than a set of risc-ish instructions (it
would likely be more space efficient).
The problem is that the modern languages don't have
a modern concept of a PICture clause. An instruction
to format based on a printf-like format string could
be useful, I suspect, but the die area is probably more
useful as cache.
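For comparison, one EDIT-style rule (accounting format, parentheses
around negative amounts) comes to a few lines of library code today;
fmt_amount is an invented name, not any real API:

    #include <stdio.h>

    /* A tiny software stand-in for one EDIT rule: COBOL-ish accounting
       output, parentheses marking negative amounts. */
    static void fmt_amount(char *buf, size_t n, double v) {
        if (v < 0)
            snprintf(buf, n, "(%.2f)", -v);
        else
            snprintf(buf, n, " %.2f ", v);
    }

    int main(void) {
        char out[32];
        fmt_amount(out, sizeof out, -123.45);
        printf("%s\n", out);    /* prints (123.45) */
        return 0;
    }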
Scott Lurndal wrote:
The problem is that the modern languages don't have
a modern concept of a PICture clause. An instruction
to format based on a printf-like format string could
be useful, I suspect, but the die area is probably more
useful as cache.
There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
expressive" than what things like printf currently provide. The other
question is about whether such formatting should be implemented in
hardware.
As I have expressed before here, I believe the answer to the first part
is "yes", but not as feature rich as a full picture clause. I tend to
agree with you about the second.
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations
which included fluid flow (including turbulence modelling and
diffusion) and quite a few chemical reactions together. So, he
evaluated a huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
evaluating the Arrhenius equation above many, many, many, many
times.
A single calculation took months, and he didn't use weak
hardware.
A fully pipelined evaluation of, let's say, four parallel exp
and four parallel fdiv instructions would have reduced his
calculation time by orders of magnitude, and allowed him to
explore the design space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already
:-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and
the EXP instruction can be done in parallel?
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
They are all performed in the FMAC unit and here the instructions are
serially dependent.
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
On Wed, 17 Jul 2024 20:56:06 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations
which included fluid flow (including turbulence modelling and
diffusion) and quite a few chemical reactions together. So, he
evaluated a huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
evaluating the Arrhenius equation above many, many, many, many
times.
A single calculation took months, and he didn't use weak
hardware.
A fully pipelined evaluation of, let's say, four parallel exp
and four parallel fdiv instructions would have reduced his
calculation time by orders of magnitude, and allowed him to
explore the design space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already
:-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and
the EXP instruction can be done in parallel?
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
They are all performed in the FMAC unit and here the instructions are
serially dependent.
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
The FDIV and EXP instructions consume all the FMAC cycles, so even if
you completely unrolled the loop, you are not going to get more than
6-cycles less in performing the repeated calculations.
A really BIG implementation with 4 FMAC units per core could unroll
the loop (by reservation stations) such that each iteration would
still be 44-cycles, but you could run 4 in parallel and achieve 4
results every 44-cycles--which to most people smells like 1 result
every 11-cycles.
{Would be an interesting reservation station design, though}
There are two separate issues here. One is whether modern languages
should have syntax for esily formatting numbers that is "more
expressive" than what things like printf currently provide.
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go in in parallel?
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles of FDIV, not 10× more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculaiton spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
A single calculation took months, and he didn't use weak hardware.
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
So for your GbOoO CPU, how many of the various FP operations, and the
EXP instruction can be done in parallel?
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
They are all performed in the FMAC unit and here the instructions are serially dependent.
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
I can remember a Comp Sci professor waxing lyrical about p-adic number
systems, as an alternative to conventional p-ary ones. The key thing was
that division was no more complex than multiplication.
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
expressive" than what things like printf currently provide.
Take a look at what Fortran can do, it does quite a few things
(and is correspondingly difficult to implement :-)
What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).
On Thu, 18 Jul 2024 06:00:46 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).
All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
Pentium-M and AMD K8, do not perform double precision floating point
ops in parallel.
Michael S <already5chosen@yahoo.com> writes:
All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
Pentium-M and AMD K8, do not perform double precision floating point
ops in parallel.
All of which have been obsolete for almost two decades.
On Thu, 18 Jul 2024 13:58:37 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
Pentium-M and AMD K8, do not perform double precision floating point
ops in parallel.
All of which have been obsolete for almost two decades.
According to my understanding, Thomas Koenig was/is around for more
than three decades.
Thomas Koenig wrote:
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
expressive" than what things like printf currently provide.
Take a look at what Fortran can do, it does quite a few things
(and is correspondingly difficult to implement :-)
I used to write a fair amount of Fortran, but haven't recently. It
does have a lot of formatting functionality (most of which I never had
occasion to use), but at least back then it didn't have support for my
pet peeve: a separator character (whatever one is used) every three
digits. This really improves readability for numbers with lots of
digits. Does it now support this?
On Wed, 17 Jul 2024 20:56:06 -0000 (UTC), "Stephen Fuld"
<SFuld@alumni.cmu.edu.invalid> wrote:
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
In a way, that's the wrong question.
If you need to have exponentiations and divides running in parallel
_on a single CPU_ because the problem isn't parallel _enough_ to
dispense with them being very tightly coupled... then, yes, that's a
reasonable question.
But _otherwise_ there is no limit on the number of calculations that
can take place in parallel, just throw more CPUs at the problem.
If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
the multiplier, can the pipeline get a mix of multiple operations going
at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially
so each can only use one of the 4 pipeline slots.
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
[Arrhenius]
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
I ran a little Arrhenius benchmark on an i7-11700. Main program was
program main
implicit none
integer, parameter :: n = 1024
double precision, dimension(n) :: k, a, ea, t
integer :: i
call random_number (a)
call random_number(ea)
ea = 10000+ea*30000
call random_number(t)
t = 400 + 200*t
do i=1,1024*1024
call arrhenius(k,a,ea,t,n)
end do
end program main
and the called routine was (in a separate file, so the compiler
could not notice that the results were actually never used)
subroutine arrhenius(k, a, ea, t, n)
implicit none
integer, intent(in) :: n
double precision, dimension(n), intent(out) :: k
double precision, dimension(n), intent(in) :: a, ea, t
double precision, parameter :: r = 8.314
k = a * exp(-ea/(r*t))
end subroutine arrhenius
Timing result (wall-clock time only):
-O0: 5.343s
-O2: 4.560s
-Ofast: 2.237s
-Ofast -march=native -mtune=native: 2.154
Of course, you never know what speed your CPU is actually running
at these days, but if I assume 5GHz, that would give around 10
cycles per Arrhenius evaluation (2.154s * 5e9 / 2^30 evaluations),
which is quite fast (IMHO).
It uses an AVX2 version of exp, or so I gather from the function
name, _ZGVdN4v_exp_avx2 .
On Wed, 17 Jul 2024 03:51:50 -0000 (UTC), Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
I can remember a Comp Sci professor waxing lyrical about p-adic number
systems, as an alternative to conventional p-ary ones. The key thing was
that division was no more complex than multiplication.
And addition, I suppose, is the same complexity too. But comparison
kills you; you need to apply the Chinese Remainder Theorem every time
you want to print out a number.
John Savard
MitchAlsup1 wrote:
{Would be an interesting reservation station design, though}
In what way would the RS be interesting or different?
Michael S <already5chosen@yahoo.com> schrieb:
According to my understanding, Thomas Koenig was/is around for more
than three decades.
I have been (starting doing things on mainframes in the late 1980s),
but I was specifically asking about what Mitch had in mind for
My 66000.
If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
the multiplier, can the pipeline get a mix of multiple operations going
at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially
so each can only use one of the 4 pipeline slots.
Something I've been wondering for a while, indeed.
IOW, is there enough parallelism inside the FDIV/SQRT "microcode" to keep the FMAC fully busy (my naive understanding is that there isn't)?
If not, do current CPUs make the FMAC available for other operations
while an FDIV/SQRT is in progress? If not, how hard would it be?
Stefan
On Thu, 18 Jul 2024 07:54:23 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
[Arrhenius]
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?
I ran a little Arrhenius benchmark on an i7-11700.
[Arrhenius benchmark snipped]
Does the benchmark represent a real-world use?
In particular,
1. Is there really a large number of different EA values, or only
around a dozen or a hundred?
2. Does temperature vary all the time, or are there relatively big
groups of points calculated at the same temperature?
A similar question can be asked about A, but it is of little practical
importance.
3. Is double precision really needed? According to my understanding,
empirical equations like this one have a precision of something like
2 significant digits, or 3 digits at best. So I'd expect that single
precision is sufficient, with digits to spare.
4. Does the equation work at all when the temperature is not close to
the point of equilibrium? If not, what is a sane range for ea/(r*t)?
Michael S <already5chosen@yahoo.com> schrieb:
On Thu, 18 Jul 2024 07:54:23 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
[Arrhenius]
I ran a little Arrhenius benchmark on an i7-11700.
[Arrhenius benchmark snipped]
Does the benchmark represent a real-world use?
In particular,
1. Is there really a large number of different EA values, or only
around a dozen or a hundred?
Usually, one for each chemical reaction. If you have a complex
reaction network, it can be a few hundred.
It is possible to pre-calculate Ea/R, but the division by T still
needs to be done.
2. Does temperature vary all the time, or are there relatively big
groups of points calculated at the same temperature?
It varies all the time because of the exothermic/endothermic
character of the reactions. This is not what was calculated, but you
can think of methane combustion with oxygen. There are numerous
radical, high-energy species appearing and disappearing all the time.
And if you're unlucky, you will also have to calculate radiation :-(
But for each fluid element, the energy equation is solved again
each time step.
And if you try to average the temperatures over groups of cells...
don't. You average enough already by selecting the grid, and also by
applying multigrid solvers.
A similar question can be asked about A, but it is of little
practical importance.
3. Is double precision really needed? According to my understanding,
empirical equations like this one have a precision of something like
2 significant digits, or 3 digits at best. So I'd expect that single
precision is sufficient, with digits to spare.
To calculate values, yes, but if you try to differentiate things,
things can go wrong pretty fast. I'm actually not sure what he used
in this particular case.
There is, however, a tendency with engineers to use double precision
because quite often, you do not go wrong with that, and you can
go wrong with single precision.
A lot of numerical work in general, and CFD work in particular,
consists of fighting for convergence. You usually don't want to
change anything that would endanger that.
4. Does the equation work at all when the temperature is not close
to the point of equilibrium? If not, what is a sane range for
ea/(r*t)?
It does work far away from the equilibrium (which is the point).
Michael S <already5chosen@yahoo.com> schrieb:
On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
It does work far away from the equilibrium (which is the point).
But how far?
Does it work for EA/RT outside of, say [-10:+12] ?
Ea is almost always positive; if it was negative, it would mean that
a reaction is slowed down by increasing temperature (and by this I
don't mean the reverse reaction). If you see that in data, that
means that the Arrhenius model is not applicable because there is
in fact no activation energy to overcome, that something strange
happens in thermodynamics (say, you get solid precipitation of
one component at a certain temperature), or that there are
some intermediate steps which are not described in the model.
And to make sense, -Ea/(RT) is always less than zero. Reasonable
example: Ea = 30 kJ/mol, R = 8.314, T = 600 K gives you -6.01 as the
argument of the exponent, so the term is then ~ 0.002444 .
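A quick check of that arithmetic (a minimal C sketch; the program is
mine, the numbers are from the example above):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double Ea = 30.0e3;          /* activation energy, J/mol */
    double R  = 8.314;           /* gas constant, J/(mol*K)  */
    double T  = 600.0;           /* temperature, K           */
    double arg = -Ea / (R * T);
    printf("%.3f %.6f\n", arg, exp(arg));  /* -6.014 0.002444 */
    return 0;
}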
Stefan Monnier wrote:
Something I've been wondering for a while, indeed.
IOW, is there enough parallelism inside the FDIV/SQRT "microcode" to
keep the FMAC fully busy (my naive understanding is that there isn't)?
If not, do current CPUs make the FMAC available for other operations
while an FDIV/SQRT is in progress? If not, how hard would it be?
Stefan
And if they can't mix then to what extent can the end of one op,
as it drains from the pipeline, overlap with the start of the next?
Obviously FMUL can pipeline with FMUL but can the next FMUL overlap
with the end of a prior FDIV? An EXP?
I was thinking about reservation station schedulers and wondering
what they might have to optimize.
On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
The idea I was thinking about was to pre-calculate R/Ea rather than
Ea/R, and then to find a fast algorithm for approximation of
exp(-1/x) on the range of interest.
Michael S <already5chosen@yahoo.com> schrieb:
On Thu, 18 Jul 2024 18:48:45 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Ea is almost always positive; if it was negative, it would mean
that a reaction is slowed down by increasing temperature (and
by this I don't mean the reverse reaction).
In the branch of physical chemistry that I was taught before
switching to completely different profession, that's not that rare.
For example, when temperature grows from 1500 degrees toward 1550
the transfer of phosphorus from molten steel to slag decelerates
greatly.
Do you happen to recall if that process was limited by thermodynamic equilibrium, by mass transfer or indeed by a chemical reaction?
To me, that would sound like maybe one of the first two, but
knowing nothing about this, I am now firmly speculating :-)
And to make sense, -Ea/(RT) is always less than zero. Reasonable
example: Ea = 30 kJ/mol, R=8.314, T=600 K gives you -6.01 as the
argument of the exponent, so the term is then ~ 0.002444 .
So, something like [-10:0] ?
Activation energies can also be 180 kJ/mol, but I don't really
know the range that people might want to calculate these values.
Thomas Koenig wrote:
[Arrhenius story snipped]
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined together
(all the water molecules having three atoms), his weeklong simulation
runs ran in half the time, on both PentiumPro and Alpha hardware.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
I'm guessing that you would do the exp(-E_a/(R*T)) as
exp(-E_a)-exp(R*T), since that should give the same result and now you
could interleave the two exp() calculations, and/or hoist the exp(-E_a)
term?
On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:
MitchAlsup1 wrote:
{Would be an interesting reservation station design, though}
In what way would the RS be interesting or different?
The instruction stream consists of 4 FMAC-bound instructions unrolled
as many times as will fit in register file.
Your typical reservation station can accept 1 new instruction per cycle
from the decoder. So, either the decoder has to spew the instructions
across the stations (and remember they are data dependent) or the
station has to fire more than one per cycle to the FMAC units.
So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
and maybe some kind of exotic routing.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Wed, 17 Jul 2024 18:42:44 GMT, Scott Lurndal wrote:
The problem is that the modern languages don't have a modern concept of
a PICture clause.
Perhaps one thing that killed them was they are not localizable?
They certainly were localizable, at least with respect to
punctuation and currency symbol. And COBOL is far from 'killed'.
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined together
(all the water molecules having three atoms), his weeklong simulation
runs ran in half the time, on both PentiumPro and Alpha hardware.
I, personally, have found many Newton-Raphson iterators that converge
faster using 1/SQRT(x) than using the SQRT(x) equivalent.
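For the record, the division-free iteration behind that observation
(a sketch; the function name is mine): given y ~ 1/sqrt(x),

/* One Newton-Raphson step for y ~ 1/sqrt(x).  No division needed,
   unlike the direct sqrt iteration s' = 0.5*(s + x/s). */
static inline double rsqrt_step(double x, double y)
{
    return y * (1.5 - 0.5 * x * y * y);
}

sqrt(x) then falls out as x*y, still without a divide.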
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I'm guessing that you would do the exp(-E_a/(R*T)) as
exp(-E_a)-exp(R*T), since that should give the same result and now you
could interleave the two exp() calculations, and/or hoist the exp(-E_a)
term?
I don't think that formula is correct.
On Thu, 18 Jul 2024 15:35:50 +0000, Thomas Koenig wrote:
I have been (starting doing things on mainframes in the late 1980s),
but I was specifically asking about what Mitch had in mind for
My 66000.
I have a spectrum in mind; everything from an in order 1-wide to (as of
now) 6-wide GBOoO design with 4 FMAC units.
On Fri, 19 Jul 2024 11:28:17 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Do you happen to recall if that process was limited by thermodynamic
equilibrium, by mass transfer or indeed by a chemical reaction?
I don't know, but I would think that what causes it is an approach to
equilibrium. But at 1550 degrees equilibrium is still pretty far away
(50-70 degrees away? I don't remember) and despite that the slowdown is
already very significant.
On Fri, 19 Jul 2024 14:41:52 +0000, Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I'm guessing that you would do the exp(-E_a/(R*T)) as
exp(-E_a)-exp(R*T), since that should give the same result and now you
could interleave the two exp() calculations, and/or hoist the exp(-E_a)
term?
I don't think that formula is correct.
It is not:
exp(x+y) = exp(x)×exp(y)
so an add/subtract inside the exponent can be turned into a multiply/
divide of exponentials (e.g. exp(a-b) = exp(a)/exp(b)), but a multiply
or divide inside, like -Ea/(R×T), cannot (easily).
Now, if we were dealing with Ln(), those multiplies and divides could
be simplified.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I've taken both a second and third look at InvSqrt() over the last few
years, it turns out that the Quake version is far from optimal: With the
exact same instructions, just a different set of constants, you can get
about 1.5 bits more than quake does, i.e. about 10 bits after that
single NR step (which isn't really NR since it modifies both the 1.5 and
the 0.5 factors).
Wikipedia has something on that, also literature; Moroz et al. https://arxiv.org/abs/1603.04483 seem to give the optimum quasi-NR
method to do this, with one or two steps.
I've also looked a little bit at simplifying exp(-1/x) directly, and
it seems an unpleasant function to implement; at least there is no
argument reduction which comes to mind. With exp2(x), you can split
off the integer part and do an approximation on the rest over a finite
interval, but if there is a fast approximation of the integer part
of 1/x without dividing, I am not aware of one :-)
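A minimal sketch of that exp2 split (the code is mine; the cubic is
just a truncated Taylor series for 2^f, standing in for a real minimax
polynomial, and special cases/overflow are ignored):

#include <math.h>

/* exp2(x) = 2^n * 2^f, with n = floor(x) and f = x - n in [0,1).
   ldexp() applies the 2^n scaling exactly via the exponent field. */
static double exp2_sketch(double x)
{
    double n = floor(x);
    double f = x - n;
    /* truncated Taylor series for 2^f: good to a few bits only */
    double p = 1.0 + f * (0.6931472 + f * (0.2402265 + f * 0.0555041));
    return ldexp(p, (int)n);
}

The missing piece for exp(-1/x) is exactly what is described above:
getting floor(-1/(x*ln 2)) without a division.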
Thomas Koenig wrote:
Wikipedia has something on that, also literature; Moroz et al.
https://arxiv.org/abs/1603.04483 seem to give the optimum quasi-NR
method to do this, with one or two steps.
Impressive amounts of math here, unfortunately they completely miss the point!
What they derive is the exact optimal value for the magic constant,
using zero, one or two text-book NR iterations.
However, if you are willing to take that first NR iteration
halfnumber = 0.5f*x
...
i = R-(i>>1);
...
x = x*(1.5f-halfnumber*x*x);
and then make both the 0.5f and 1.5f constants free variables, you can
in fact get 1.5 more bits than what they show in this paper.
In fact, using slightly different values for R (the magic constant)
results in very marginal changes in max rel error, while optimizing all
three constants improves that max error from the stated 1.75e-3 to about 0.6e-3!
I've also looked a little bit at simplifying exp(-1/x) directly, and
it seems an unpleasant function to implement; at least there is no
argument reduction which comes to mind. With exp2(x), you can split
off the integer part and do an approximation on the rest over a finite
interval, but if there is a fast approximation of the integer part
of 1/x without dividing, I am not aware of one :-)
:-(
Fast is probably what you get from a pure lookup table, but with no
obvious NR iteration to improve it.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.
I can well believe that.
It is interesting to see what different architectures offer for
faster reciprocals.
POWER has fre and fres (double and single version) for approximate
division, which are accurate to 1/256. These operations are quite
fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle
so obviously fully pipelined. With 1/256 accuracy, this could
actually be the original Quake algorithm (or its modification)
with a single Newton step, but this is of course much better in
hardware where exponent handling can be much simplified (and
done only once).
x86_64 has rcpss, accurate to 1/6144, with (looking at the
instruction tables) a latency of 6 for newer architectures, with a
throughput of 1/4.
So, if your business depends on calculating many inaccurate
square roots, fast, buy a POWER :-)
Other architectures I have tried don't seem to have it.
Does it make sense? Well, if you want to calculate lots of Arrhenius
equations, don't need full accuracy, and (like in Mitch's case)
exp has become as fast as division, then it could actually make a
lot of sense. It is still possible to add Newton steps afterwards,
which is what gcc does if you add -mrecip -ffast-math.
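The Newton step in question is cheap, for reference (a sketch; the
function name is mine): given an estimate x0 ~ 1/a, e.g. from fres or
rcpss,

/* One Newton-Raphson step for a reciprocal; roughly doubles the
   number of correct bits per step. */
static inline float recip_refine(float a, float x0)
{
    return x0 * (2.0f - a * x0);
}

With rcpss's roughly 12-bit estimate, one step gets close to full
single precision and two steps are plenty.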
On Fri, 19 Jul 2024 20:25:51 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
x86_64 has rcpss, accurate to 1/6144, with (looking at the
instruction tables) a latency of 6 for newer architectures, with a
throughput of 1/4.
It seems you looked at the wrong instruction table.
Here are numbers for some not-so-modern x86-64 cores:
Arch Latency Throughput (scalar/128b/256b)
Zen3 3 2/2/1
Skylake 4 1/1/1
Ice Lake 4 1/1/1
Power9 5-7 4/2/N/A
So, if your business depends on calculating many inaccurate
square roots, fast, buy a POWER :-)
If you have enough independent rsqrt to do, all four processors
have the same theoretical peak throughput, but x86 tends to have more
cores and to run at a faster clock. And lower latency makes achieving
peak throughput easier. Also, depending on target precision, the
higher initial precision of the x86 estimate means that sometimes you
can get away with 1 less NR iteration.
Also, if what you really need is sqrt rather than rsqrt, then depending
on how much inaccuracy you can accept, on modern x86 calculating an
accurate sqrt can sometimes be a better solution than calculating an
approximation. It is less likely to be the case on POWER9.
Accurate sqrt (single precision)
Zen3     14  0.20/0.200/0.200
SkyLake  12  0.33/0.333/0.167
Ice Lake 12  0.33/0.333/0.167
Power9   26  0.20/0.095/N/A
Accurate sqrt (double precision)
Zen3     20  0.111/0.111/0.111
Skylake  12  0.167/0.167/0.083
Ice Lake 12  0.167/0.167/0.083
Power9   36  0.111/0.067/N/A
Other architectures I have tried don't seem to have it.
Arm64 has it. It is called FRSQRTE.
Does it make sense? Well, if you want to calculate lots of Arrhenius
equations, don't need full accuracy, and (like in Mitch's case)
exp has become as fast as division, then it could actually make a
lot of sense. It is still possible to add Newton steps afterwards,
which is what gcc does if you add -mrecip -ffast-math.
I don't know about POWER, but on x86 I wouldn't do it.
I'd either use plain division, which on modern cores is quite fast,
or use NR to calculate a normal reciprocal. x86 provides an initial
estimate for that too (RCPSS).
Michael S <already5chosen@yahoo.com> schrieb:
x86_64 has rcpss, accurate to 1/6144, with (looking at the
instruction tables) a latency of 6 for newer architectures, with a
throughput of 1/4.
It seems you looked at the wrong instruction table.
[Note I was not writing about inverse square root, I was writing
about the inverse].
I have to admit to being almost terminally confused by Intel
generation names, so I am likely to mix up what is old and what
is new.
Here are not the very modern x86-64 cores:
Arch Latency Throughput (scalar/128b/256b)
Zen3 3 2/2/1
Skylake 4 1/1/1
Ice Lake 4 1/1/1
Power9 5-7 4/2/N/A
Power9 has it for 128-bit, but not for 256 bits (it doesn't have
those registers), and if I read the handbook correctly, that
would also be 4 operations in parallel.
Other architectures I have tried don't seem to have it.
Arm64 has it. It is called FRSQRTE.
Interesting that "gcc -O3 -ffast-math -mrecip" does not
appear to use it.
It is still possible to add Newton steps afterwards, which is what
gcc does if you add -mrecip -ffast-math.
I don't know about POWER, but on x86 I wouldn't do it.
I'd either use plain division, which on modern cores is quite fast,
or use NR to calculate a normal reciprocal. x86 provides an initial
estimate for that too (RCPSS).
Note that I was talking about the inverse in the first place.
COBOL now has CURRENCY SIGN which lets you set the currency symbol to
edit into a picture, and DECIMAL POINT IS COMMA, to switch the two.
You mean, on other architectures gcc does emit approximate rsqrt from
plain C or plain Fortran? In which situations?
Gen 2 is Sandy Bridge.
Gen 3 is Ivy Bridge, which is a very similar core microarchitecture.
Gen 4 is Haswell.
Gen 5 is Broadwell, which is similar, but has a few changes.
Gen 6, 7, 8, 9 and the majority of 10 are all the same
microarchitecture - Skylake.
11 is mostly Tiger Lake and partially Ice Lake and Rocket Lake. All
three are different from each other on the silicon process side, but
very similar in core microarchitecture.
12 is Alder Lake, which has P cores and E cores. The microarchitecture
of the P core is called Golden Cove.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:...
However, if you are willing to take that first NR iteration
halfnumber = 0.5f*x
...
i = R-(i>>1);
...
x = x*(1.5f-halfnumber*x*x);
and then make both the 0.5f and 1.5f constants free variables, you can
in fact get 1.5 more bits than what they show in this paper.
Looks like https://web.archive.org/web/20180709021629/http://rrrola.wz.cz/inv_sqrt.html
who reports 6.50196699E-4 as the maximum error (also given in the
Wikipedia article).
That's 10.5 bits of accuracy (-log2(6.5e-4) ~ 10.6), not bad at all.
However... assume you want to do another NR step. In that case,
you might be better off not loading different constants from memory,
so having the same constants might actually be an advantage
(which does not mean that they have to be the original Newton steps).
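Putting the three re-optimized constants together, the version on that
archived page looks like this in C (the constants are the page's; the
memcpy bit-cast is my choice, to avoid aliasing trouble):

#include <stdint.h>
#include <string.h>

/* InvSqrt with the magic constant and both NR coefficients
   re-optimized; max relative error ~6.5e-4, vs ~1.75e-3 for the
   Quake original. */
static float inv_sqrt(float x)
{
    uint32_t i;
    float y = x;
    memcpy(&i, &y, sizeof i);      /* bit-cast float -> int */
    i = 0x5F1FFFF9u - (i >> 1);
    memcpy(&y, &i, sizeof y);      /* bit-cast int -> float */
    return 0.703952253f * y * (2.38924456f - x * y * y);
}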
MitchAlsup1 wrote:
On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:
MitchAlsup1 wrote:
{Would be an interesting reservation station design, though}
In what way would the RS be interesting or different?
The instruction stream consists of 4 FMAC-bound instructions unrolled
as many times as will fit in register file.
You typical reservation station can accept 1 new instruction per cycle
from the decoder. So, either the decoder has to spew the instructions
across the stations (and remember they are data dependent) or the
station has to fire more than one per cycle to the FMAC units.
So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
and maybe some kind of exotic routing.
This is where I saw a benefit to using valued reservation stations vs
valueless ones - when a uArch has multiple similar FU, each with its own
bank of RS that is scheduled for that FU.
Example of horizontal scaling of similar FU each with its own RS bank. https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png
With valueless RS, each RS stores only the source register number of
its operands and each FU has to be able to read all its operands
when a uOp launches (begins execution). This means the number of
PRF read ports scales according to the total number of FU operands.
(One could do read port sharing but then you have to schedule that too
and could have contention.) Also if an FU is unused on any cycle then
all its (expensive) operand read ports are unused.
Using the above Oryon as an example, with valueless RS, to launch
all 14 FU with 3 operands all at once needs 42 read ports.
With valued RS, the operand values are stored in each RS and, if ready,
are read at Dispatch (hand-off from the front end to the RS bank) or are
received from the forwarding network if in-flight at Dispatch time.
The number of PRF read ports scales with the number of dispatched uOp
operands. Since the operand values are stored in each RS, each bank
can then schedule and launch independently.
With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,
and the read ports are potentially usable for all dispatches.
Then all 14 FU can launch at once independently.
Each FU can also have two kinds of valued RS banks:
a simple one if all the operands are ready at Dispatch, as this does
not need a wake-up matrix entry or need to receive forwarded values,
and a complex one that monitors the wake-up matrix and forwarding buses.
If all the operands are ready, the Dispatcher can choose either RS bank
for the FU, giving preference to the simpler. If all operands are not
ready, then the Dispatcher selects from the complex bank.
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
EricP <ThatWouldBeTelling@thevillage.com> writes:
This is where I saw a benefit to using valued reservation stations vs
valueless ones - when a uArch has multiple similar FU each with its own
bank of RS that is scheduled for that FU.
Example of horizontal scaling of similar FU each with its own RS bank.
https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png
With valueless RS, each RS stores only the source register number of
its operands and each FU has to be able to read all its operands
when a uOp launches (begins execution).
If the operation was waiting in the RS, at least one operand comes in
through a forwarding path (at least if the uarch has those).
This means the number of
PRF read ports scales according to the total number of FU operands.
(One could do read port sharing but then you have to schedule that too
and could have contention.)
Yes, but ARM A64 seems to have been designed with that in mind.
Also if an FU is unused on any cycle then
all its (expensive) operand read ports are unused.
Using the above Oryon as an example, with valueless RS, launching
all 14 FU with 3 operands all at once needs 42 read ports.
ARM A64's stp instruction needs up to 4 register reads. But I think
Oryon only supports two stores per cycle, so that raises the number to
44. It also has a split register file: So for the integer register
file 6*3+2*4+2*2=30 ports would be needed, and for the vector register
file, 3*4=12 ports would be needed. I think, though, that the number actually needed is much smaller, and various techniques are used to
make do with far fewer register ports.
With valued RS, dispatching 6 wide with 3 operands needs 18 read ports,
and the read ports are potentially usable for all dispatches.
Note that Oryon has 8-way rename/dispatch (I wonder how that works
with, e.g., 8 ldp instructions that each can write three registers).
- anton
On Sun, 21 Jul 2024 16:28:43 +0000, EricP wrote:
MitchAlsup1 wrote:
On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:
MitchAlsup1 wrote:
{Would be an interesting reservation station design, though}
In what way would the RS be interesting or different?
The instruction stream consists of 4 FMAC-bound instructions unrolled
as many times as will fit in the register file.
Your typical reservation station can accept 1 new instruction per cycle
from the decoder. So, either the decoder has to spew the instructions
across the stations (and remember they are data dependent) or the
station has to fire more than one per cycle to the FMAC units.
So, instead of 1-in, 1-out per cycle, you need 4-in, 4-out per cycle
and maybe some kind of exotic routing.
This is where I saw a benefit to using valued reservation stations vs
valueless ones - when a uArch has multiple similar FU each with its own
bank of RS that is scheduled for that FU.
Example of horizontal scaling of similar FU each with its own RS bank.
https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png
With valueless RS, each RS stores only the source register number of
its operands and each FU has to be able to read all its operands
when a uOp launches (begins execution). This means the number of
PRF read ports scales according to the total number of FU operands.
(One could do read port sharing but then you have to schedule that too
and could have contention.) Also if an FU is unused on any cycle then
all its (expensive) operand read ports are unused.
I always had RSs keep track of which FU was delivering the final
operand, so that these could be picked up by the forwarding logic
and not need an RF port. This gets rid of 50%-75% of the RF port
needs.
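In rough numbers (a sketch only; the capture fractions are the 50%-75%
quoted above, applied to the 42-port worst case from the example):

    # RF read ports still needed when forwarding delivers the rest.
    total_operands = 14 * 3                      # valueless worst case: 42
    for captured in (0.50, 0.75):
        print(total_operands * (1 - captured))   # -> 21.0, then 10.5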
Using the above Oryon as an example, with valueless RS, launching
all 14 FU with 3 operands all at once needs 42 read ports.
With valued RS, the operand values are stored in each RS and, if ready,
are read at Dispatch (the hand-off from the front end to the RS bank),
or are received from the forwarding network if still in flight at
Dispatch time.
Delivering the result at dispatch time.
The number of PRF read ports scales with the number of dispatched uOp
operands. Since the operand values are stored in each RS, each bank
can then schedule and launch independently.
The width of the decoder is narrower than the width of the data path.
We used to call this "catch up bandwidth".
With valued RS, dispatching 6 wide with 3 operands needs 18 read ports,
First, a 6-wide machine is not doing 6 3-operand instructions;
it is more like 3 memory ops (2-reg+displacement), one 3-op,
one general 2-op, and one 1-op (branch), so you only need 12 ports
(3*2 + 3 + 2 + 1) instead of 18 most of the time.
The penalty is that each RS entry is 5× the size of the value-free
RS designs. These work just fine when the execution window is
reasonable (say 96 instructions) but fail when the window is
larger than 150-ish.
and the read ports are potentially usable for all dispatches.
Then all 14 FU can launch at once independently.
One should also note that these machines deliver 1-2 I/c RMS
regardless of their Fetch-Decode-FU widths.
Each FU can also have two kinds of valued RS banks:
a simple one for uOps whose operands are all ready at Dispatch, as this
does not need a wake-up matrix entry or the ability to receive forwarded
values, and a complex one that monitors the wake-up matrix and
forwarding buses.
If all the operands are ready, the Dispatcher can choose either RS bank
for the FU, giving preference to the simpler one. If the operands are
not all ready, the Dispatcher selects from the complex bank.
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method for
SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
On Mon, 22 Jul 2024 11:01:15 +0000, Michael S wrote:
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x)
equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method
for SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
There are certain N-R iterations that can be expressed with both::
NR(i+1) = F( NR(i), SQRT() )
and
NR(i+1) = F'( NR(i), RSQRT() )
Typically the one with RSQRT() converges slightly faster than the
one using SQRT(). How much is slightly:: maybe ½-1 more bit per
iteration.
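For concreteness, one standard instance of each form (a sketch, with an
arbitrary seed and step count; these are the textbook iterations, not
anyone's production code):

    def rsqrt_step(r, x):
        # Newton-Raphson on f(r) = 1/r**2 - x: multiplies and adds
        # only, which maps directly onto an FMA pipeline.
        return r * (1.5 - 0.5 * x * r * r)

    def heron_step(y, x):
        # Newton-Raphson on f(y) = y**2 - x (Heron's rule): needs a
        # divide per step, which is what makes the direct form slow.
        return 0.5 * (y + x / y)

    x, r = 2.0, 0.7          # rough seed for 1/sqrt(2)
    for _ in range(4):
        r = rsqrt_step(r, x)
    print(x * r)             # ~1.41421356..., sqrt(2) via one last fmul

    y = 1.5                  # rough seed for sqrt(2)
    for _ in range(4):
        y = heron_step(y, x) # same answer, but each step divides
    print(y)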
MitchAlsup1 wrote: [...]
(Nobody bats an eye at using kilobytes for the branch predictor.)
One should also note that these machines deliver 1-2 I/c RMS
regardless of their Fetch-Decode-FU widths.
Yes, it's mostly all for naught.
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method for
SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
Michael S wrote:
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x)
equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method
for SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
sqrt(x) = x/sqrt(x) = x*rsqrt(x)
I.e. calculate rsqrt(x) to the precision you need and then do a
single fmul?
Terje
On Mon, 22 Jul 2024 15:01:10 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
There are certain N-R iterations that can be expressed with both::
NR(i+1) = F( NR(i), SQRT() )
and
NR(i+1) = F'( NR(i), RSQRT() )
Typically the one with RSQRT() converges slightly faster than the
one using SQRT(). How much is slightly:: maybe ½-1 more bit per
iteration.
As I said above, I am not aware of the NR iterations for SQRT() that
consist of fmul, fadd and nothing else.
If you know one, please tell me.
On Tue, 23 Jul 2024 14:35:20 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Fri, 19 Jul 2024 20:55:47 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup1 wrote:
On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
Back when I first looked at Invsqrt(), I did so because a Computational
Fluid Chemistry researcher from Sweden asked for help speeding up his
reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
combining the 1/x and the sqrt and doing three of them pipelined
together (all the water molecules having three atoms), his weeklong
simulation runs ran in half the time, on both PentiumPro and Alpha
hardware.
I, personally, have found many Newton-Raphson iterators that
converge faster using 1/SQRT(x) than using the SQRT(x)
equivalent.
Yeah, that was eye-opening to me as well, to the level where I
consider the invsqrt() NR iteration as a mainstay, it can be useful
for both sqrt and 1/x as well. :-)
Terje
What is this "SQRT(x) equivalent" all of you are talking about?
I am not aware of any "direct" (i.e. not via RSQRT) NR-like method
for SQRT that consists only of multiplications and additions.
If it exists, I will be very interested to know.
sqrt(x) = x/sqrt(x) = x*rsqrt(x)
I.e. calculate rsqrt(x) to the precision you need and then do a
single fmul?
Terje
That much I know.
When precise rounding is not required, I even know a slightly better
"combined" method:
1. Do N-1 iterations for RSQRT delivering r0 with 2 or 3 more
significant bits than n/2
2. Calculate SQRT estimate as y0 = x*r0
3. Do last iteration using both y0 and r0 as y = y0 + (x-y0*y0)*0.5*r0.
That would give max. error of something like 0.51 ULP.
A similar combined method could be useful for software calculation of
correctly rounded quad-precision sqrt as well. In this case it serves
as a 'conditionally last step' rather than an 'absolutely last step'.
I was hoping that you'd tell me about an NR formula for SQRT itself
that can be applied recursively.
The only formula that I know is y = (y*y + x)/(y*2). Obviously, when
speed matters this formula is not useful.
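Putting the three steps above into code (a sketch; the seed and
iteration count are illustrative and nothing here handles correct
rounding):

    def sqrt_combined(x, r, rsqrt_steps=3):
        # Steps 1-2: refine r ~ 1/sqrt(x) with mul/add-only N-R
        # iterations, then form the sqrt estimate with one multiply.
        for _ in range(rsqrt_steps):
            r = r * (1.5 - 0.5 * x * r * r)
        y0 = x * r
        # Step 3: one correction using both y0 and r; the (x - y0*y0)
        # term would be a fused multiply-add in hardware.
        return y0 + (x - y0 * y0) * 0.5 * r

    print(sqrt_combined(2.0, 0.7))   # ~1.4142135623730951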
On 7/18/24 12:40 PM, MitchAlsup1 wrote:
[snip]
Over the 20 cycles the multiplier is doing Goldschmidt iterations,
there are only 3 slots where a different instruction could sneak
through.
Since initial iterations could (I think) use reduced precision, it
might also be possible to use part of the multiplier for a
different lower precision multiplication.
(ISTR, an AMD implementation used different precision for
different steps.)
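A sketch of the Goldschmidt structure under discussion (plain Python at
full precision throughout; real hardware would use a table-lookup seed
and, as noted, could widen the precision step by step):

    def goldschmidt_div(a, b, r, steps=4):
        # r is a rough seed for 1/b (a lookup table in hardware).
        n, d = a * r, b * r      # scale numerator and denominator
        for _ in range(steps):
            f = 2.0 - d          # correction factor
            n, d = n * f, d * f  # two dependent multiplies per step;
                                 # d converges to 1, n to a/b
        return n

    print(goldschmidt_div(1.0, 3.0, 0.3))   # ~0.33333333...

Because each step depends on the previous one, a pipelined multiplier
is occupied by in-flight correction multiplies with only a few free
slots, which is the effect described above.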
Stefan Monnier wrote:
If the FP multiplier is a 4-stage pipeline, and FDIV is iterating
using the multiplier, can the pipeline get a mix of multiple
operations going at once? FDIV for both Newton–Raphson and
Goldschmidt iterates serially so each can only use one of the 4
pipeline slots.
Something I've been wondering for a while, indeed.
IOW, is there enough parallelism inside the FDIV/SQRT "microcode"
to keep the FMAC fully busy (my naive understanding is that there
isn't)? If not, do current CPUs make the FMAC available for other
operations while an FDIV/SQRT is in progress? If not, how hard
would it be?
Stefan
And if they can't mix then to what extent can the end of one op,
as it drains from the pipeline, overlap with the start of the next?
Obviously FMUL can pipeline with FMUL but can the next FMUL overlap
with the end of a prior FDIV? An EXP?
I was thinking about reservation station schedulers and wondering
what they might have to optimize.