• Continuations

    From Lawrence D'Oliveiro@21:1/5 to All on Sat Jul 13 07:50:42 2024
    Has there ever been a hardware architecture that managed the flow of
    control via “continuations”?

    That is, you do away with the hardware concept of a stack. Instead, you
    have call frames that, while defined to some extent by the architecture,
    can be located anywhere in memory (allocation managed by the OS, runtime
    etc as part of the ABI). A call frame has a current program counter, and
    maybe some other context like local variables and a static link for
    lexical binding. Instead of a “return from subroutine” instruction, you have a “load new call frame” instruction.

    You might have a location defined in a call frame for a “pointer to parent call frame” field, in which case “return from subroutine” just loads this pointer into the current-call-frame register. But you could just as easily
    have pointers to other call frames defining other kinds of
    interrelationships between them. And note that transferring to a different
    call frame does not automatically invalidate the previous one. If it stays valid, then there is no reason why you couldn’t, at some point, come back
    to it and resume execution from where it left off.

    The beauty of continuations is that they are a single generalized control construct that can be used to implement specific language features like
    regular routine calls, loops, exceptions and coroutines, all built from
    the same common abstraction. One thing that is difficult to do with them
    is arbitrary gotos. (I consider that a feature, not a bug.)

    Very few high-level languages (outside of the Lisp family, anyway) seem to
    have implemented continuations as an explicit language concept. This is an integral part of Scheme, not so much it seems of non-Scheme Lisps. I implemented it in my PostScript revival language <https://bitbucket.org/ldo17/gxscript/>, and I am still trying to come up
    with a useful example, like for instance an implementation of coroutines,
    that doesn’t do my head in. ;)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Lawrence D'Oliveiro on Sat Jul 13 11:16:45 2024
    Lawrence D'Oliveiro wrote:
    Has there ever been a hardware architecture that managed the flow of
    control via “continuations”?

    That is, you do away with the hardware concept of a stack. Instead, you
    have call frames that, while defined to some extent by the architecture,
    can be located anywhere in memory (allocation managed by the OS, runtime
    etc as part of the ABI). A call frame has a current program counter, and maybe some other context like local variables and a static link for
    lexical binding. Instead of a “return from subroutine” instruction, you have a “load new call frame” instruction.

This is not really a hardware ISA issue; it is mostly a question of what you do with the ISA you have.

    If an ISA uses a Branch-and-link style subroutine call then it does not
    have the concept of a push/pop stack embedded in it.

On x86, which does have a defined stack pointer register, the user-mode SP can point wherever you want the caller's PC saved. There are also the PUSH and POP instructions, but you can simply not use them. The only other stack concept embedded in the hardware would be the frame pointer, but FP is only manipulated by the ENTER and LEAVE instructions, and pretty much no one uses those.

    Everything else is about how memory is allocated and at what cost.

    Each current local context could be allocated from a heap.
    But be careful what you wish for.
We were playing with an IBM R6000 around 1990, which was supposed to be this fast RISC machine, but we found it ran slower than our VAX. We disassembled the code and found that, instead of a normal decrement/increment stack, it used a linked list of heap-allocated objects. Every call did a malloc and every return did a free. All the hardware speed improvement had been squandered on this completely unnecessary ABI overhead.

    You might have a location defined in a call frame for a “pointer to parent call frame” field, in which case “return from subroutine” just loads this
    pointer into the current-call-frame register. But you could just as easily have pointers to other call frames defining other kinds of
    interrelationships between them. And note that transferring to a different call frame does not automatically invalidate the previous one. If it stays valid, then there is no reason why you couldn’t, at some point, come back to it and resume execution from where it left off.

    The beauty of continuations is that they are a single generalized control construct that can be used to implement specific language features like regular routine calls, loops, exceptions and coroutines, all built from
    the same common abstraction. One thing that is difficult to do with them
    is arbitrary gotos. (I consider that a feature, not a bug.)

    Very few high-level languages (outside of the Lisp family, anyway) seem to have implemented continuations as an explicit language concept. This is an integral part of Scheme, not so much it seems of non-Scheme Lisps. I implemented it in my PostScript revival language <https://bitbucket.org/ldo17/gxscript/>, and I am still trying to come up with a useful example, like for instance an implementation of coroutines, that doesn’t do my head in. ;)

    I had not heard the term “continuation” until John Levine used it
    in a recent post. I looked it up on Wikipedia and found it is
    what I call a "callback state machine".

    https://en.wikipedia.org/wiki/Continuation

    This is the basic mechanism used by IO device drivers in RSX, VMS and WinNT.
    I have used it for async I/O and transaction processing.

Those I/O subsystems still have regular push/pop call stacks, but each I/O context, called an IRP or I/O Request Packet, also has a deferred callback stack.

Used in device drivers, the mechanism is very efficient but also somewhat difficult to program: each driver function must be broken up into a series of subroutine calls separated by wait states, leaving a trail of callback breadcrumbs behind as it travels so it can find its way home.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to D'Oliveiro on Sat Jul 13 17:42:00 2024
    In article <v6tbki$3g9rg$1@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    The beauty of continuations is that they are a single generalized
    control construct that can be used to implement specific language
    features like regular routine calls, loops, exceptions and
    coroutines, all built from the same common abstraction. One thing
    that is difficult to do with them is arbitrary gotos. (I consider
    that a feature, not a bug.)

    Just occasionally, you need a goto to get out of a complicated error
    situation. Not often, but sometimes.

    I'm also of the opinion that unless memory gets a lot faster, requiring hardware to maintain and navigate data structures in memory is a bad idea. They'll get used because they're there, but they'll be slow. One can't
    avoid page tables, but they should be simple.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Sat Jul 13 23:43:52 2024
    On Sat, 13 Jul 2024 17:42 +0100 (BST), John Dallman wrote:

    In article <v6tbki$3g9rg$1@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:

    The beauty of continuations is that they are a single generalized
    control construct that can be used to implement specific language
    features like regular routine calls, loops, exceptions and coroutines,
    all built from the same common abstraction. One thing that is difficult
    to do with them is arbitrary gotos. (I consider that a feature, not a
    bug.)

    Just occasionally, you need a goto to get out of a complicated error situation. Not often, but sometimes.

    That’s what exceptions are for. And they are handled by continuations as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to EricP on Sat Jul 13 23:43:18 2024
    On Sat, 13 Jul 2024 11:16:45 -0400, EricP wrote:

    We were playing with an IBM R6000 around 1990 which was supposed to
    be this fast risc machine but we found it ran slower than our VAX.
    We disassembled the code and found out that instead of a normal decrement/increment stack it used a linked list of heap allocated
    objects. Every call did a malloc and every return did a free. All the hardware speed improvement had been squandered on this completely
    unnecessary ABI overhead.

    Was that the RS/6000? That was quite a step forward at the time (and for
    some years after), even among all the other new RISC-based architectures appearing around that time.

    What OS/software were you trying above? Because it doesn’t sound like
    regular C code on a Unix-type system.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to BGB on Sun Jul 14 09:38:31 2024
    BGB <cr88192@gmail.com> wrote:

    There is basically no real way to get dynamically allocated
    stack-frames and argument copying to be performance competitive with adjusting a stack pointer and passing arguments in registers.
    Function calls are also, not exactly rare.

    There is a way to combine a fast implementation and allow
    continuations, and I think it's been done several times.

    Proceed as usual, using a stack-based ABI, until a continuation is
    captured. At that point, walk the stack and create the continuation by
    copying stack frame contents into dynamically-allocated continuation
    frames. Pass the continuation to whoever wants it. They may evaluate
    it, inspect it, or whatever.

    This does require precise information about the current stack frame to
    be kept at all call sites, but that doesn't have to be much more than
    the usual debuginfo.

    I believe this technique was invented by Peter Deutsch for Smalltalk,
    and it's used today for Kotlin.

    Worse still, one is very likely going to need a garbage collector
    (in the off chance anyone actually uses continuations, if the
    continuation is captured via call/cc or similar, can no longer
    automatically free it).

It doesn't really need GC, as long as it's the recipient's job to free the continuation. But I can't immediately think of any language with continuations and without GC. They kinda go together.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sun Jul 14 14:22:23 2024
    BGB <cr88192@gmail.com> writes:
    On 7/13/2024 2:50 AM, Lawrence D'Oliveiro wrote:
    Has there ever been a hardware architecture that managed the flow of
    control via "continuations"?

    RISCs usually have no architectural support for stack-based calling conventions, so they work just as well for managing the flow of
    control via continuations.

Doing an ABI based around continuations would basically wreck performance; conventional thread-based multitasking is cheaper to implement and gives better performance.

    There is basically no real way to get dynamically allocated stack-frames
    and argument copying to be performance competitive with adjusting a
    stack pointer and passing arguments in registers. Function calls are
    also, not exactly rare.

    Quite a while ago "continuation-passing style" was a hot research
    topic (see some references below; I found [appel&jim89] more inspiring
    than [appel92]), and AFAIK a number of compilers used this concept
    (and some probably still use it, especially Scheme compilers, where continuations are a language feature).

Around that time there was the argument (maybe also by Appel, but I don't remember) that, in the limit, garbage collection is cheaper than stack-based allocation (never mind malloc/free): if you delay garbage collection long enough, the cost of a copying garbage collector will be smaller than the cost of stack-deallocating the garbage. That's because a copying garbage collector only accesses and copies the non-garbage.

In practice, you don't want to provide so much memory that this point will be reached, plus stack-based allocation has better cache characteristics.

    @InProceedings{appel&jim89,
    author = "Andrew W. Appel and Trevor Jim",
    title = "Continuation-Passing, Closure-Passing style",
    booktitle = "Sixteenth Annual {ACM} Symposium on Principles of
    Programming Languages",
    year = "1989",
    pages = "293--302",
    address = "Austin, Texas",
    annote = "ML compiler based on continuation passing style.
    After the translation into CPS and optimization
    closures are made explicit and registers are
    allocated. No stack is used for the closures. Instead
    the compiler relies on garbage collection."
    }

    @Book{appel92,
    author = "Andrew W. Appel",
    title = "Compiling with Continuations",
    publisher = "Cambridge University Press",
    year = "1992"
    }

    @InProceedings{flanagan+93,
    author = "Cormac Flanagan and Amr Sabry and Bruce F. Duba and
    Matthias Felleisen",
    title = "The Essence of Compiling with Continuations",
    crossref = "sigplan93",
    pages = "237--247",
    annote = "CPS compilers convert into CPS form, optimize the
    program and convert back. Isomorphic optimizations can
    be performed on the original program, saving the
    transformation to and from CPS form."
    }

    @Proceedings{sigplan93,
    booktitle = "SIGPLAN '93 Conference on Programming Language
    Design and Implementation",
    title = "SIGPLAN '93 Conference on Programming Language
    Design and Implementation",
    year = "1993",
    key = "SIGPLAN '93",
    note = "SIGPLAN Notices 28(6)"
    }

    @InProceedings{bruggeman+96,
author = {Carl Bruggeman and Oscar Waddell and R. Kent Dybvig},
    title = {Representing Control in the Presence of One-Shot
    Continuations},
    crossref = {sigplan96},
    pages = {99--107},
    annote = {Discuss how to represent multi-shot continuations
    and one-shot continuations in a stack-based Scheme
    implementation. One-shot continuations are
    programmer-specified (with the \emph{call/1cc}
    call). They offer a small (about 13\% in the threads
    benchmarks) performance benefit over multi-shot
    continuations, if they are applicable.}
    }

    @Proceedings{sigplan96,
    booktitle = "SIGPLAN '96 Conference on Programming Language
    Design and Implementation",
    title = "SIGPLAN '96 Conference on Programming Language
    Design and Implementation",
    year = "1996",
    key = "PLDI '96"
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to ldo@nz.invalid on Sun Jul 14 13:51:14 2024
    On Sat, 13 Jul 2024 23:43:52 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Sat, 13 Jul 2024 17:42 +0100 (BST), John Dallman wrote:

    In article <v6tbki$3g9rg$1@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    The beauty of continuations is that they are a single generalized
    control construct that can be used to implement specific language
    features like regular routine calls, loops, exceptions and coroutines,
    all built from the same common abstraction. One thing that is difficult
    to do with them is arbitrary gotos. (I consider that a feature, not a
    bug.)

    Just occasionally, you need a goto to get out of a complicated error
    situation. Not often, but sometimes.

That’s what exceptions are for. And they are handled by continuations as well.

    At its simplest, a continuation is simply "the next instruction to be executed", so any time you take/save a code address and then jump to
    it you have executed a continuation.

    Many languages permit "callbacks", which are a form of continuation.

    The notion of "continuation" in Lisp has been conflated with function
    calling. Continuations which take/permit arguments are a particularly heavyweight form.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jul 14 18:14:47 2024
    According to George Neuner <gneuner2@comcast.net>:
At its simplest, a continuation is simply "the next instruction to be executed", so any time you take/save a code address and then jump to it you have executed a continuation.

    This programming style is very old. The SAGE air defense system used
    it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines, each of which took some input, did
    something with it, perhaps scheduled other I/O operations, and then
    explicitly saved its state and returned to the dispatcher. It was all
    written in assembler and was very tedious to write and debug but it
    got fantastic performance. SABRE hooked 1500 terminals to a pair of
    7090s each with 32K words of 2us core and got good performance.

    These days the syntax is a lot nicer, CPS makes it easier to say
    what the context is you're saving, but it's the same idea.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to ldo@nz.invalid on Sun Jul 14 13:41:33 2024
    On Sat, 13 Jul 2024 07:50:42 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    Has there ever been a hardware architecture that managed the flow of
    control via “continuations”?

That is, you do away with the hardware concept of a stack. Instead, you have call frames that, while defined to some extent by the architecture, can be located anywhere in memory (allocation managed by the OS, runtime etc as part of the ABI). A call frame has a current program counter, and maybe some other context like local variables and a static link for lexical binding. Instead of a “return from subroutine” instruction, you have a “load new call frame” instruction.

You might have a location defined in a call frame for a “pointer to parent call frame” field, in which case “return from subroutine” just loads this pointer into the current-call-frame register. But you could just as easily have pointers to other call frames defining other kinds of interrelationships between them. And note that transferring to a different call frame does not automatically invalidate the previous one. If it stays valid, then there is no reason why you couldn’t, at some point, come back to it and resume execution from where it left off.

The beauty of continuations is that they are a single generalized control construct that can be used to implement specific language features like regular routine calls, loops, exceptions and coroutines, all built from the same common abstraction. One thing that is difficult to do with them is arbitrary gotos. (I consider that a feature, not a bug.)

Very few high-level languages (outside of the Lisp family, anyway) seem to have implemented continuations as an explicit language concept. This is an integral part of Scheme, not so much it seems of non-Scheme Lisps. I implemented it in my PostScript revival language <https://bitbucket.org/ldo17/gxscript/>, and I am still trying to come up with a useful example, like for instance an implementation of coroutines, that doesn’t do my head in. ;)

In order for this to work well, the program needs to be compiled in Continuation Passing Style (CPS). The problem is that CPS does not play well with Static Single Assignment (SSA), which is the preferred IR form now because it supports the greatest number of potential optimizations.
[There was a rather lengthy discussion of compilation techniques in comp.lang.lisp that dealt with issues combining SSA with CPS, but it was ~20 years ago so you'd have to dig through the archives.]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to ldo@nz.invalid on Sun Jul 14 12:31:26 2024
    On Sat, 13 Jul 2024 07:50:42 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    One thing that is difficult to do with them
    is arbitrary gotos. (I consider that a feature, not a bug.)

    It's a bug.
    If a programming language wants to disallow arbitrary gotos in order
    to promote better quality code writing, that's great.
    But the machine language of a computer has got to allow _any_ language
    to be implemented on it. Including C, or old-style FORTRAN.
    Otherwise, one doesn't have a general-purpose machine.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to John Levine on Sun Jul 14 22:22:41 2024
    On 2024-07-14 19:14, John Levine wrote:
    According to George Neuner <gneuner2@comcast.net>:
    At its simplest, a continuation is simply "the next instruction to be
    executed", so any time you take/save a code address and then jump to
    it you have executed a continuation.


    That does not sound much like "continuation passing" to me.

    I think of continuation passing as always ending a subprogram by calling
    some other subprogram, passing it one or more "continuation" subprograms
    as parameters, and expecting that callee to do something and then to
    "continue" by further calling one of those "continuations" (with further "continuation" parameters) and NOT to return to the original caller, ever.

    By John's "simplest" definition, an ordinary call-and-return would be "continuation passing": the call "takes/saves" the return address
    somewhere, and the return jumps to that address.


    This programming style is very old. The SAGE air defense system used
    it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines each of which took some input, did
    something with it, perhaps scheduled other I/O operations, and then explicitly saved its state and returned to the dispatcher.


    I think that looks more like a multi-tasking system with
    run-to-completion, non-preemptive (small) tasks, considering each
    "routine" as a task and assuming that each routine started with
    recovering its saved context.

    The main difference to present-day multi-tasking systems is then that
    the context-saving and -restoring was implemented separately and
    specifically in each task, and not in a shared "task switch" function.

    There are today several real-time kernels that implement that kind of run-to-completion, non-preemptive multi-tasking.

    Could you explain a bit about the "dispatcher" and how it decided which
    routine to execute next? Perhaps depending on which I/O had completed?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jul 14 22:29:05 2024
    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    This programming style is very old. The SAGE air defense system used
    it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines each of which took some input, did
    something with it, perhaps scheduled other I/O operations, and then
    explicitly saved its state and returned to the dispatcher.

    I think that looks more like a multi-tasking system with
    run-to-completion, non-preemptive (small) tasks, considering each
    "routine" as a task and assuming that each routine started with
    recovering its saved context.

    Right, that saved context is the continuation. Like I said, the tools
    these days take care of a lot of the drudge work, e.g. node.js.

    The main difference to present-day multi-tasking systems is then that
    the context-saving and -restoring was implemented separately and
    specifically in each task, and not in a shared "task switch" function.

    Not really. The context is specific to each routine. It's not like time-slicing, it's do as much work as you can and then block for I/O.
    The context you save depends on what you're doing.

Could you explain a bit about the "dispatcher" and how it decided which routine to execute next? Perhaps depending on which I/O had completed?

    As best I understand it, SABRE would receive a message from one of the terminals and dispatch it to a routine based on the first few
    characters that said what kind of request it was. The routine would
    then do a bunch of work consisting of computing and disk I/O. Whenever
    it scheduled a disk operation, it'd explicitly save its state and
    return to the dispatcher. When the disk I/O was done, it'd schedule
    the next part of the task which restored the state and continued.
    Repeat until the task sends a reply to the terminal and is done.

    Airline reservation and credit card systems still work this way, and
    have amazing transaction rates, like 1000 updates/sec to an individual
    database record.



    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to aph on Mon Jul 15 01:58:25 2024
    On Sun, 14 Jul 2024 09:38:31 +0000, aph wrote:

    But I can't immediately think of any language with
    continuations and without GC. They kinda go together.

    Instead of “GC”, I would say “dynamic allocation”. Consider languages like
    Python and Perl, which do reference-counting as a first resort, only
    falling back to full-on garbage collection when reference-counting is no
    longer enough. That gives you the best of both worlds.

Of course, neither of them has continuations as such, or at least fully-general ones: Python does have both “generator” functions and the newer async/await-style “stackless” coroutines.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Mon Jul 15 01:59:58 2024
    On Sun, 14 Jul 2024 12:31:26 -0600, John Savard wrote:

    On Sat, 13 Jul 2024 07:50:42 -0000 (UTC), Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    One thing that is difficult to do with them is arbitrary gotos. (I
    consider that a feature, not a bug.)

    It's a bug.
    If a programming language wants to disallow arbitrary gotos in order to promote better quality code writing, that's great.
    But the machine language of a computer has got to allow _any_ language
    to be implemented on it. Including C, or old-style FORTRAN. Otherwise,
    one doesn't have a general-purpose machine.

    It only has to be Turing-equivalent (or Lisp-equivalent) to be a “general- purpose” machine. There’s nothing in that criterion that requires gotos.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From wolfgang kern@21:1/5 to Lawrence D'Oliveiro on Mon Jul 15 07:50:54 2024
    On 13/07/2024 09:50, Lawrence D'Oliveiro wrote:
    Has there ever been a hardware architecture that managed the flow of
    control via “continuations”?
    ...
    with a useful example, like for instance an implementation of coroutines, that doesn’t do my head in. ;)

More than 40 years ago I worked on 1802 MCU programs. It was an 8-bit architecture and had many single-byte instructions which immediately switched any of its 16 registers to be either IP, SP or index. It had no call/return opcodes, but it was easy to program as if it had.

So at the end of a routine it could use any prepared register to either jump or perform a call with two preset registers. IIRC I used R4 and R5 as my standard call/return, and even modified my ASM & DISASS tools to show these two as CALL and Return. But they were always just simple JUMP reg instructions.
    __
    wolfgang

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to John Levine on Mon Jul 15 11:56:32 2024
    John Levine wrote:
    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    Could you explain a bit about the "dispatcher" and how it decided which
    routine to execute next? Perhaps depending on which I/O had completed?

    As best I understand it, SABRE would receive a message from one of the terminals and dispatch it to a routine based on the first few
    characters that said what kind of request it was. The routine would
    then do a bunch of work consisting of computing and disk I/O. Whenever
    it scheduled a disk operation, it'd explicitly save its state and
    return to the dispatcher. When the disk I/O was done, it'd schedule
    the next part of the task which restored the state and continued.
    Repeat until the task sends a reply to the terminal and is done.

    Airline reservation and credit card systems still work this way, and
    have amazing transaction rates, like 1000 updates/sec to an individual database record.

    I used this approach (which I later learned was called coroutines) in my
    Dos terminal emulator: I wanted to process all characters as I received
    them, which meant that for any of the many, many escape sequences (for
    cursor positioning, font selection, foreground/background color etc) I
    needed to remember where I was.

Initially I just restored the read pointer to the beginning of the escape sequence, knowing I would be called again when one or more new characters had been received, so that I would always process the stream in complete units. However, I soon realized that it was much cheaper to make the stream parser a separate routine which would yield when it ran out of input. I only needed a tiny bit of inline assembler to do the coroutine switching (mostly just a separate stack and IP, but no register contents).

The receiving routine would yield to the parser when the serial port buffer had some new data.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to John Levine on Mon Jul 15 14:12:32 2024
    John Levine wrote:
    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    This programming style is very old. The SAGE air defense system used
    it in the 1950s, and SABRE copied it in the early 1960s. The software
was organized into routines each of which took some input, did
    something with it, perhaps scheduled other I/O operations, and then
    explicitly saved its state and returned to the dispatcher.
    I think that looks more like a multi-tasking system with
    run-to-completion, non-preemptive (small) tasks, considering each
    "routine" as a task and assuming that each routine started with
    recovering its saved context.

    Right, that saved context is the continuation. Like I said, the tools
    these days take care of a lot of the drudge work, e.g. node.js.

    The main difference to present-day multi-tasking systems is then that
    the context-saving and -restoring was implemented separately and
    specifically in each task, and not in a shared "task switch" function.

    Not really. The context is specific to each routine. It's not like time-slicing, it's do as much work as you can and then block for I/O.
    The context you save depends on what you're doing.

    Could you explain a bit about the "dispatcher" and how it decided which
    routine to execute next? Perhaps depending on which I/O had completed?

    As best I understand it, SABRE would receive a message from one of the terminals and dispatch it to a routine based on the first few
    characters that said what kind of request it was. The routine would
    then do a bunch of work consisting of computing and disk I/O. Whenever
    it scheduled a disk operation, it'd explicitly save its state and
    return to the dispatcher. When the disk I/O was done, it'd schedule
    the next part of the task which restored the state and continued.
    Repeat until the task sends a reply to the terminal and is done.

    Airline reservation and credit card systems still work this way, and
    have amazing transaction rates, like 1000 updates/sec to an individual database record.

    In a classic state machine there is a loop around a CASE statement
    based on the current state number. The last thing each state case
    does is assign the next state number to jump to.

    A variation on that idea has each state as an action subroutine
    and the state number and CASE is replaced by a function address.

    The way I did this was that each task context struct, TaskCtx, has a
    field holding the next action routine to call, NextActionRtn(*taskCtx),
    and links for a doubly linked list.

    Like the classic state machine, the last thing each action routine does is decide what to do next and assign the next action function to be called.

    The TaskCtx is in a task wait list if waiting for IO.
    Async IO completion moves the TaskCtx to the ready task list.

    The dispatcher is a loop that pops the next TaskCtx from ready list and
    calls its NextActionRtn routine field passing the *taskCtx as an arg.

    The effect is that code bounces from action routine to action routine,
    carrying its execution context with it, waiting for something
    external, evaluating the result, and deciding what action to do next.
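    The dispatcher loop described above can be sketched in a few lines
    (a hypothetical reconstruction; TaskCtx and the next-action field
    follow the post's naming, not any real codebase):

```python
from collections import deque

class TaskCtx:
    """Per-task context carried between action routines."""
    def __init__(self, next_action, **state):
        self.next_action = next_action   # the NextActionRtn field
        self.state = state

ready = deque()   # ready task list; a wait list would hold tasks blocked on I/O

def dispatcher():
    """Pop the next ready TaskCtx and call its next-action routine."""
    while ready:
        ctx = ready.popleft()
        ctx.next_action(ctx)

# Two action routines forming a tiny state machine: each one decides
# what runs next by assigning ctx.next_action before requeueing itself.
def start(ctx):
    ctx.state["log"].append("start")
    ctx.next_action = finish   # choose the next state
    ready.append(ctx)          # real code would park it on a wait list until I/O completes

def finish(ctx):
    ctx.state["log"].append("finish")
    # no next action assigned: the task is done

log = []
ready.append(TaskCtx(start, log=log))
dispatcher()
```

    The saved context plus the next-action pointer together play the role
    of the continuation being discussed in this thread.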

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Lawrence D'Oliveiro on Mon Jul 15 14:50:04 2024
    Lawrence D'Oliveiro wrote:
    On Sat, 13 Jul 2024 11:16:45 -0400, EricP wrote:

    We were playing with an IBM R6000 around 1990 which was supposed to
    be this fast risc machine but we found it ran slower than our VAX.
    We disassembled the code and found out that instead of a normal
    decrement/increment stack it used a linked list of heap allocated
    objects. Every call did a malloc and every return did a free. All the
    hardware speed improvement had been squandered on this completely
    unnecessary ABI overhead.

    Was that the RS/6000? That was quite a step forward at the time (and for
    some years after), even among all the other new RISC-based architectures appearing around that time.

    What OS/software were you trying above? Because it doesn’t sound like regular C code on a Unix-type system.

    Yes, an RS6000. I don't remember which language, Fortran or C, likely
    on AIX. I didn't run these tests; I was watching others run them, and
    remember all our jaws dropping when the disassembler showed what it
    was doing.

    We mostly worked in Fortran 77 on VAX, and a side group were porting code to VM/CMS on a 4331 so probably F77. We weren't porting code to RS6000 and
    I don't remember why IBM loaned it to us, probably just to get our feedback.

    Anyway, at the time I made a mental note about the relative cost of
    heap allocation of stack space in an ABI.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Mon Jul 15 18:57:06 2024
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Anyway, at the time I made a mental note about the relative cost of
    heap allocation of stack space in an ABI.

    I wonder if I might mention my dislike of arbitrary limits on
    stack size that are imposed on modern 64-bit systems.

    (Please note that the sentence above was DeepL-filtered, on
    "diplomatic" setting).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Jul 15 19:25:33 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Anyway, at the time I made a mental note about the relative cost of
    heap allocation of stack space in an ABI.

    I wonder if I might mention my dislike of arbitrary limits on
    stack size that are imposed on modern 64-bit systems.

    $ ulimit -s unlimited

    Should handle the default stack.

    Of course, it still needs to fit into whatever address space
    the target processor offers, but even with a 48-bit VA,
    there should be enough VA space for several terabytes of
    stack.

    In any case, one can mmap an arbitrarily large region of
    memory and assign it to the stack pointer in (or even before)
    main().

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Jul 15 19:43:34 2024
    According to EricP <ThatWouldBeTelling@thevillage.com>:
    Airline reservation and credit card systems still work this way, and
    have amazing transaction rates, like 1000 updates/sec to an individual
    database record.

    In a classic state machine there is a loop around a CASE statement
    based on the current state number. The last thing each state case
    does is assign the next state number to jump to.

    Right. This is what we'd now call an event loop, with code
    dispatched in response to external events like I/O completion.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Jul 15 20:07:10 2024
    According to EricP <ThatWouldBeTelling@thevillage.com>:
    Each current local context could be allocated from a heap.
    But be careful what you wish for.
    We were playing with an IBM R6000 around 1990 which was supposed to
    be this fast risc machine but we found it ran slower than our VAX.
    We disassembled the code and found out that instead of a normal
    decrement/increment stack it used a linked list of heap allocated
    objects. Every call did a malloc and every return did a free. All the
    hardware speed improvement had been squandered on this completely
    unnecessary ABI overhead.

    Maybe it was an IBM thing. In the 1960s the IBM 360 Algol F compiler,
    which I always assumed was written because some European institution
    had it on a checklist rather than because they expected anyone to use
    it, did the same thing. Every time it entered a block it did a GETMAIN
    system call, and when it left the block it did a FREEMAIN, with all of
    the performance you'd expect.

    Someone at Princeton patched it to do more sane allocation where it
    did a big GETMAIN and suballocated from that until it filled up and
    then did another one. I never wrote enough Algol to tell how well it
    worked.

    The compiler was a marvel of passive-aggressive standards
    implementation. The keywords were all quoted 'BEGIN' 'END', and unless
    you told it otherwise it used the 48-character set, so a semicolon was
    ., and assignment was .= with one dot, not two. A squared was
    A'POWER'2. I/O was procedures that used dataset numbers: 0 was the
    input deck, 1 was the printer, and you skipped to the next output
    record by padding out the current record to 80 spaces.

    'COMMENT'TOO BAD THEY DID NOT USE 'TNEMMOC'.,

    Yuck.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to John Levine on Mon Jul 15 23:06:13 2024
    On 2024-07-14 23:29, John Levine wrote:
    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    This programming style is very old. The SAGE air defense system used
    it in the 1950s, and SABRE copied it in the early 1960s. The software
    was organized into routines, each of which took some input, did
    something with it, perhaps scheduled other I/O operations, and then
    explicitly saved its state and returned to the dispatcher.

    I think that looks more like a multi-tasking system with
    run-to-completion, non-preemptive (small) tasks, considering each
    "routine" as a task and assuming that each routine started with
    recovering its saved context.

    Right, that saved context is the continuation. Like I said, the tools
    these days take care of a lot of the drudge work, e.g. node.js.

    The main difference to present-day multi-tasking systems is then that
    the context-saving and -restoring was implemented separately and
    specifically in each task, and not in a shared "task switch" function.

    Not really. The context is specific to each routine. It's not like time-slicing, it's do as much work as you can and then block for I/O.
    The context you save depends on what you're doing.


    Yes yes, as I said, the "tasks" (analogous to routines) are
    run-to-completion, not time-sliced, and the context save/load is
    "specifically" in each "task".

    Special systems like SAGE and SABRE can be viewed with different "eyes"
    for comparing to current concepts. You look at them with "continuation
    passing" eyes; I use my "run-to-completion, non-preemptive multitasking"
    eyes, perhaps because I am more familiar with that kind of SW design.

    For me the deciding factor in SAGE/SABRE is that the dispatcher, to
    which each routine performs a call-with-continuation (in your view), is
    /not/ an application function, but a system function, and is indeed a
    scheduler that finds some new routine to execute /which the calling
    routine did not choose/.

    IMO, for something to qualify as a continuation-passing call, the caller
    should select the callee -- the thing to be executed next -- but in
    SAGE/SABRE it is the scheduler that selects what to execute next.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Mon Jul 15 20:54:39 2024
    On Mon, 15 Jul 2024 20:07:10 +0000, John Levine wrote:

    According to EricP <ThatWouldBeTelling@thevillage.com>:
    Each current local context could be allocated from a heap.
    But be careful what you wish for.
    We were playing with an IBM R6000 around 1990 which was supposed to
    be this fast risc machine but we found it ran slower than our VAX.
    We disassembled the code and found out that instead of a normal
    decrement/increment stack it used a linked list of heap allocated
    objects. Every call did a malloc and every return did a free. All the
    hardware speed improvement had been squandered on this completely
    unnecessary ABI overhead.

    Maybe it was an IBM thing. In the 1960s the IBM 360 Algol F compiler,
    which I always assumed was written because some European institution
    had it on a checklist rather than because they expected anyone to use
    it, did the same thing. Every time it entered a block it did a GETMAIN
    system call, and when it left the block it did a FREEMAIN, with all of
    the performance you'd expect.

    The only thing I can comment on, here, is that 1108 Algol was much
    faster than 360/67 Algol.

    Someone at Princeton patched it to do more sane allocation where it
    did a big GETMAIN and suballocated from that until it filled up and
    then did another one. I never wrote enough Algol to tell how well it
    worked.

    The compiler was a marvel of passive-aggressive standards
    implementation. The keywords were all quoted 'BEGIN' 'END', and unless
    you told it otherwise it used the 48-character set, so a semicolon was
    ., and assignment was .= with one dot, not two. A squared was
    A'POWER'2. I/O was procedures that used dataset numbers: 0 was the
    input deck, 1 was the printer, and you skipped to the next output
    record by padding out the current record to 80 spaces.

    'COMMENT'TOO BAD THEY DID NOT USE 'TNEMMOC'.,

    Yuck.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Jul 15 20:59:23 2024
    On Sat, 13 Jul 2024 7:50:42 +0000, Lawrence D'Oliveiro wrote:

    Has there ever been a hardware architecture that managed the flow of
    control via “continuations”?

    I want to simply ask WHY ??
    Why do you think this would add value to an ISA ??
    Why do you think this would add value to an HLL ??
    Why do you think this would add value to a programmer ??

    That is, you do away with the hardware concept of a stack. Instead, you
    have call frames that, while defined to some extent by the architecture,
    can be located anywhere in memory (allocation managed by the OS, runtime
    etc as part of the ABI). A call frame has a current program counter, and maybe some other context like local variables and a static link for
    lexical binding. Instead of a “return from subroutine” instruction, you have a “load new call frame” instruction.

    You could learn to fly an airplane by coupling 16 or more electrodes
    to your head and practicing flying until your brain figured it out !!

    But there is a reason planes are not flown this way.....


    You might have a location defined in a call frame for a “pointer to
    parent call frame” field, in which case “return from subroutine” just
    loads this pointer into the current-call-frame register. But you could
    just as easily have pointers to other call frames defining other kinds
    of interrelationships between them. And note that transferring to a
    different call frame does not automatically invalidate the previous
    one. If it stays valid, then there is no reason why you couldn’t, at
    some point, come back to it and resume execution from where it left
    off.

    Block-structured languages allow for this; they are not "in that great
    a favor" right now.

    The beauty of continuations is that they are a single generalized
    control construct that can be used to implement specific language
    features like regular routine calls, loops, exceptions and coroutines,
    all built from the same common abstraction. One thing that is
    difficult to do with them is arbitrary gotos. (I consider that a
    feature, not a bug.)

    The ugliness is (as BGB indicates) their poor performance.
    So, whatever they bring to the party, does not seem to be
    the "guest of honor".

    Very few high-level languages (outside of the Lisp family, anyway)
    seem to have implemented continuations as an explicit language
    concept. This is an integral part of Scheme, not so much it seems of
    non-Scheme Lisps. I implemented it in my PostScript revival language
    <https://bitbucket.org/ldo17/gxscript/>, and I am still trying to come
    up with a useful example, like for instance an implementation of
    coroutines, that doesn’t do my head in. ;)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Mon Jul 15 21:23:37 2024
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Anyway, at the time I made a mental note about the relative cost of
    heap allocation of stack space in an ABI.

    I wonder if I might mention my dislike of arbitrary limits on
    stack size that are imposed on modern 64-bit systems.

    $ ulimit -s unlimited

    Should handle the default stack.

    Some operating systems put a hard limit on this, for example MacOS,
    and some an absurdly low limit on per-thread stack size (512 K
    or something like that).

    Of course, it still needs to fit into whatever address space
    the target processor offers, but even with a 48-bit VA,
    there should be enuf VA space for several terabytes of
    stack.

    In any case, one can mmap an arbitrarily large region of
    memory and assign it to the stack pointer in (or even before)
    main().

    Changing the standard startup sequence of programs (i.e. replacing
    _start) - that would be heavy messing, and be no longer compatible
    with standard startup sequences.

    And for threads, this would get even messier.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Mon Jul 15 17:26:44 2024
    EricP wrote:
    Lawrence D'Oliveiro wrote:
    On Sat, 13 Jul 2024 11:16:45 -0400, EricP wrote:

    We were playing with an IBM R6000 around 1990 which was supposed to
    be this fast risc machine but we found it ran slower than our VAX.
    We disassembled the code and found out that instead of a normal
    decrement/increment stack it used a linked list of heap allocated
    objects. Every call did a malloc and every return did a free. All the
    hardware speed improvement had been squandered on this completely
    unnecessary ABI overhead.

    Was that the RS/6000? That was quite a step forward at the time (and
    for some years after), even among all the other new RISC-based
    architectures appearing around that time.

    What OS/software were you trying above? Because it doesn’t sound like
    regular C code on a Unix-type system.

    Yes, an RS6000. I don't remember what language, Fortran or C, likely on
    AIX.

    This could also have been a RT PC as our port project was
    running around 1987 and straddled the RS6000 launch in 1990.
    (so I might have got my 'R's mixed up.)

    And according to Wikipedia RT's were notoriously slow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From moi@21:1/5 to John Levine on Mon Jul 15 22:28:37 2024
    On 15/07/2024 21:07, John Levine wrote:
    ... In the 1960s the IBM 360 Algol F compiler,
    which I always assumed was written because some European institution
    had it on a checklist rather than because they expected anyone to use
    it, did the same thing. Every time it entered a block it did a GETMAIN
    system call, and when it left the block it did a FREEMAIN, with all of
    the performance you'd expect.

    The /360 F level ALGOL compiler was notoriously bad at procedure calls,
    and now I know why. On the Ackermann function benchmark it is thought
    to have required about 820 instructions per call. Not surprising if it
    was using an OS store allocation routine to set up each stack frame.

    For comparison with near contemporary machines, and compilers:

    IBM, F Level, /360 ALGOL 60: 820 ins/call*
    NU 1108 ALGOL 60: 175 ins/call*
    Delft U, /360 ALGOL 60: 142 ins/call
    Wirth@Stanford U, /360 ALGOL W: 74 ins/call
    Edinburgh U, EMAS /360 ALGOL 60: 21 ins/call
    York U, VAX Ada: 8 ins/call

    * Estimated

    The VAX Ada compiler was 12 years later.

    Data taken from Brian Wichmann's study.

    See <http://www.findlayw.plus.com/KDF9/Documents/Benchmarking/>
    and
    <http://www.findlayw.plus.com/KDF9/The%20KDF9%20and%20Benchmarking.pdf>)

    --
    Bill F.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to EricP on Mon Jul 15 22:01:51 2024
    On Mon, 15 Jul 2024 17:26:44 -0400, EricP wrote:

    This could also have been a RT PC as our port project was
    running around 1987 and straddled the RS6000 launch in 1990.
    (so I might have got my 'R's mixed up.)

    And according to Wikipedia RT's were notoriously slow.

    That sounds much more like it, to be honest. What you were saying was
    totally at odds with the reputation for performance that IBM POWER, and
    later PowerPC, was gathering.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to All on Mon Jul 15 12:31:46 2024
    A little over a decade ago, I was asked if I could track down the
    decision to add virtual memory to all IBM 370s, and found a former
    staff member to the executive who made the decision. Basically, OS/360
    MVT storage management was so bad that region sizes had to be
    specified four times larger than used ... as a result only four
    regions could be running concurrently on a typical 1mbyte 370/165
    system, insufficient to keep the system busy and justified.

    Mapping MVT to a 16mbyte virtual address space (vs2/svs, similar to
    running MVT in a CP67 16mbyte virtual machine), allowed concurrently
    running regions increased by factor of four (limited to 15 with 4bit
    storage protect keys, keeping regions isolated). However, as systems
    increase, even fifteen weren't sufficient and they went to providing a
    separate 16mbyte virtual address space for each region
    (VS2/MVS). However, the heavy pointer-passing OS/360 API resulted in
    mapping an 8mbyte image of the MVS kernel into every virtual address
    space (so a kernel API call would be able to access API parameters)
    ... leaving 8mbytes for applications. Then, because subsystems were
    also placed in their own separate 16mbyte virtual address space ...
    for subsystems' access to callers' parameters, they mapped a 1mbyte
    "common segment area" (CSA) into every virtual address space (leaving
    7mbytes for
    applications). However CSA space requirement was proportional to
    concurrent executing applications and number of subsystems ... and
    quickly became multi-mbyte "common system area". By 3033 time-frame CSA
    was frequently 5-6mbytes (leaving 2-3mbytes for applications) and
    threatening to become 8mbytes (leaving zero).

    This was part of the mad rush to 370/XA & MVS/XA with 31-bit addressing
    and "access registers" (semi-privileged subsystem could access caller's
    address space "as secondary address space"). As temporary stop-gap, a
    subset of "access registers" were retrofitted to 3033 for MVS, as
    "dual-address space mode". However, kernel code was still required to
    swap hardware address space pointers (subsystem call, passing through
    kernel, moving caller's address space pointer to secondary, and loading subsystems address space pointer as primary ... and then restoring on
    return). Then got hardware support with a "program call" table ... entry
    for each subsystem and subsystem call could be handled all by hardware, including the address space pointer swapping (w/o overhead of kernel
    software).

    "program transfer" ... more like continuation,

    1983 370/XA Principles of Operation SA22-7085 https://bitsavers.org/pdf/ibm/370/princOps/SA22-7085-0_370-XA_Principles_of_Operation_Mar83.pdf
    pg3-5 Primary & Secondary Virtual Address
    pg10-22 "Program Call"
    pg10-28 "Program Transfer"

    "program transfer" originally could also be used to restore CPU to state
    saved by previous "Program Call" (aka return, programming notes pg10-30)


    Principles of Operation SA22-7832-13 (May 2022) https://publibfp.dhe.ibm.com/epubs/pdf/a227832d.pdf
    pg3-23 "Address Spaces"
    pg10-97 "Program Call"
    pg10-110 "Program Return"
    pg10-114 "Program Transfer"

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Mon Jul 15 18:28:15 2024
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Anyway, at the time I made a mental note about the relative cost of
    heap allocation of stack space in an ABI.

    I wonder if I might mention my dislike of arbitrary limits on
    stack size that are imposed on modern 64-bit systems.

    (Please note that the sentence above was DeepL-filtered, on
    "diplomatic" setting).

    It should be possible to do stack chaining where if a stack allocation
    exceeds the current stack bottom then a new larger discontinuous stack
    memory section is allocated and chains back to the old one
    and the SP and FP patched.

    It needs to be thought about relatively early in the design as
    the kernel needs to know how it works so it can make sure to delete
    all the stack sections when a thread terminates.
    Also it requires fields in the user mode thread header to store the
    stack limits, plus store the backward links at the top of the stack
    so access ranges can be checked cheaply.

    It also needs to work both for routine entry that overflows and for
    dynamic stack allocation by alloca that overflows later (in the latter
    case it could straddle the stacks: the FP points into the old stack
    but the SP points into the new stack).

    Also it has to work for languages like Ada that allow dynamic storage
    on the stack but recover the space when the block is exited.
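    A toy model of the chaining idea (purely illustrative; it ignores the
    SP/FP patching and kernel bookkeeping discussed above): each segment
    keeps a back link to its predecessor, and an allocation that would
    overflow the current segment chains on a new, larger one.

```python
class StackSegment:
    """One discontiguous stack section, chained back to its predecessor."""
    def __init__(self, size, prev=None):
        self.size = size
        self.used = 0
        self.prev = prev          # back link, like the one stored at the top of a real segment

class ChainedStack:
    """Grow by chaining new, larger segments instead of faulting on overflow."""
    def __init__(self, initial=1024):
        self.top = StackSegment(initial)

    def alloc(self, n):
        seg = self.top
        if seg.used + n > seg.size:
            # Overflow: chain a new segment at least twice as large.
            seg = StackSegment(max(2 * seg.size, n), prev=seg)
            self.top = seg
        seg.used += n
        return seg

    def free(self, seg, n):
        # On block exit, recover the space; drop an emptied extension
        # segment, analogous to patching SP back into the old section.
        seg.used -= n
        if seg.used == 0 and seg.prev is not None and self.top is seg:
            self.top = seg.prev

s = ChainedStack(initial=1024)
a = s.alloc(1000)   # fits in the first segment
b = s.alloc(500)    # overflows: a 2048-byte segment is chained on
```

    The alloca-straddling case EricP mentions corresponds to `a` and `b`
    living in different segments at the same time.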

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Niklas Holsti on Mon Jul 15 23:52:09 2024
    On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:

    IMO, for something to qualify as a continuation-passing call, the caller should select the callee -- the thing to be executed next -- but in SAGE/SABRE it is the scheduler that selects what to execute next.

    What you have is that SAGE/SABRE has built a higher-level scheduler
    abstraction on top of lower-level coroutines/continuations.

    Python’s asyncio module works in a similar way. Coroutines are a
    language primitive that pass control directly to each other via await
    calls.
    asyncio is a library module that wraps a coroutine in a higher-level
    “task” object. When one task awaits another, that doesn’t necessarily mean
    that control is passed directly from the first task to the second:
    instead, the scheduler regains control and decides who is to run next.
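    A minimal demonstration with the standard asyncio module (the task
    names are made up): awaiting a wrapped Task hands control back to the
    event loop, which interleaves the ready tasks before resuming the
    awaiter.

```python
import asyncio

order = []

async def worker(name):
    order.append(f"{name} start")
    await asyncio.sleep(0)        # yield to the event loop's scheduler
    order.append(f"{name} end")

async def main():
    # Wrapping coroutines in Tasks hands scheduling to the event loop:
    # both workers start before either finishes, interleaved by the loop.
    a = asyncio.create_task(worker("a"))
    b = asyncio.create_task(worker("b"))
    await a                       # suspends main; the loop runs a and b
    await b

asyncio.run(main())
```

    Had main called the coroutines directly (plain `await worker("a")`),
    control would pass straight through with no interleaving; the Task
    wrapper is what puts the scheduler in the middle.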

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Niklas Holsti on Mon Jul 15 23:47:36 2024
    On Sun, 14 Jul 2024 22:22:41 +0300, Niklas Holsti wrote:

    I think of continuation passing as always ending a subprogram by calling
    some other subprogram, passing it one or more "continuation" subprograms
    as parameters, and expecting that callee to do something and then to "continue" by further calling one of those "continuations" (with further "continuation" parameters) and NOT to return to the original caller,
    ever.

    That’s the most general case. But it makes sense to have simpler
    conventions for common cases, like subroutine call/return. So a subroutine
    call implicitly passes a continuation that, when invoked, resumes the
    caller. And the subroutine block can end with an implicit invocation of
    this continuation. And such continuation objects are so common that it
    makes sense, as an optimization, to have a special place to allocate them
    in LIFO fashion--call it a “stack”.

    You may have heard some people talk about “tree-like stacks”. This is starting to generalize from simple linear stacks to fully general continuations.
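    The reduction of call/return to continuation passing can be shown
    directly (a schematic CPS sketch, not taken from gxscript): every
    routine takes an extra continuation argument k, and “returning” is
    just invoking k.

```python
# Continuation-passing style: no routine ever returns a value to its
# caller; it "continues" by calling the continuation k with the result.
# An ordinary subroutine call is then "call f, passing a continuation
# that resumes the caller" -- the implicit convention a stack encodes.

def add_k(a, b, k):
    k(a + b)                  # "return a + b" becomes a call to k

def square_k(x, k):
    k(x * x)

results = []

# Compute (2 + 3) squared by chaining continuations explicitly:
add_k(2, 3, lambda s: square_k(s, results.append))
```

    The nesting of the lambda continuations is what a conventional call
    chain stores on its stack, which is the LIFO optimization described
    above.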

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to EricP on Mon Jul 15 23:55:21 2024
    On Mon, 15 Jul 2024 14:12:32 -0400, EricP wrote:

    The effect is that code bounces from action routine to action routine carrying its execute context with it, waiting for something external, evaluating the result and deciding what action to do next.

    The thing that coroutines bring to the table is that you don’t have to
    break up the code that executes in this common context into separate
    pieces that are each invoked via a callback and end in a reschedule
    call. Instead, you just write it as a single block, with explicit
    “await” calls at the various points along the way where rescheduling
    happens.

    The net result is the logic is much easier to follow, and there is often
    less of it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Jul 15 23:52:28 2024
    According to EricP <ThatWouldBeTelling@thevillage.com>:
    We were playing with an IBM R6000 around 1990 which was supposed to
    be this fast risc machine but we found it ran slower than our VAX.
    We disassembled the code and found out that instead of a normal
    decrement/increment stack it used a linked list of heap allocated
    objects. Every call did a malloc and every return did a free. All the
    hardware speed improvement had been squandered on this completely
    unnecessary ABI overhead. ...

    This could also have been a RT PC as our port project was
    running around 1987 and straddled the RS6000 launch in 1990.
    (so I might have got my 'R's mixed up.)

    The RT was pretty fast but AIX suffered from running atop a virtual
    machine system that was originally intended to support multiple operating systems and in retrospect did more than it really needed to.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Jul 15 23:57:38 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Anyway, at the time I made a mental note about the relative cost of
    heap allocation of stack space in an ABI.

    I wonder if I might mention my dislike of arbitrary limits on
    stack size that are imposed on modern 64-bit systems.

    $ ulimit -s unlimited

    Should handle the default stack.

    Some operating systems put a hard limit on this, for example MacOS,
    and some an absurdly low limit on per-thread stack size (512 K
    or something like that).

    MacOS has been obsolete for a decade now. OSX has an initial
    size of 8MB and a default configured hard limit of 64GB.

    Of course, it still needs to fit into whatever address space
    the target processor offers, but even with a 48-bit VA,
    there should be enough VA space for several terabytes of
    stack.

    In any case, one can mmap an arbitrarily large region of
    memory and assign it to the stack pointer in (or even before)
    main().

    Changing the standard startup sequence of programs (i.e. replacing
    _start) - that would be heavy messing, and be no longer compatible
    with standard startup sequences.

    That depends on the program, I guess. A little inline assembler
    in main should suffice; an mmap'ed region will automatically
    allocate on first touch, so the stack can grow as necessary.


    And for threads, this would get even messier.

    pthreads already get their own stack, and have standard functions
    to specify stack pointer and stack size (e.g. pthread_attr_setstack(2)).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Jul 15 23:56:00 2024
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:

    IMO, for something to qualify as a continuation-passing call, the caller
    should select the callee -- the thing to be executed next -- but in
    SAGE/SABRE it is the scheduler that selects what to execute next.

    What you have is that SAGE/SABRE has built a higher-level scheduler
    abstraction on top of lower-level coroutines/continuations.

    It was an event loop, not coroutines. Every routine
    returned to the dispatcher, not another routine. You
    could reasonably call it continuations, implemented in a
    painfully crude way with explicit save and restore of
    the context.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Jul 16 00:25:39 2024
    On Mon, 15 Jul 2024 23:55:21 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 15 Jul 2024 14:12:32 -0400, EricP wrote:

    The effect is that code bounces from action routine to action routine
    carrying its execute context with it, waiting for something external,
    evaluating the result and deciding what action to do next.

    The thing that coroutines bring to the table is that you don’t have to break up the code that executes in this common context into separate
    pieces that are each invoked via a callback and end in a reschedule
    call.
    Instead, you just write it as a single block, with explicit “await” calls at the various points along the way where rescheduling happens.

    Is that still valuable if it cost 50× what a subroutine call costs ???

    Is that still valuable if you can do it yourself at only 20× the cost
    ???

    The net result is the logic is much easier to follow, and there is often
    less of it.

    The right answer late is the wrong answer.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
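The single-block-with-explicit-await style being debated above can be sketched in user space with POSIX ucontext (obsolescent in the standard but still present on mainstream systems). A hedged illustration; `await`, `task`, and `run_task` are invented names, and the "scheduler" here is just main resuming the task:

```c
#define _XOPEN_SOURCE 700
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static int trace[4], ntrace;

/* "await": hand control back to the scheduler (here, main). */
static void await(void) { swapcontext(&coro_ctx, &main_ctx); }

/* One logical block of work with explicit suspension points,
   instead of three separate callbacks ending in reschedule calls. */
static void task(void)
{
    trace[ntrace++] = 1;
    await();
    trace[ntrace++] = 2;
    await();
    trace[ntrace++] = 3;
}

int run_task(void)
{
    static char stack[64 * 1024];
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = stack;
    coro_ctx.uc_stack.ss_size = sizeof stack;
    coro_ctx.uc_link = &main_ctx;   /* return here when task finishes */
    makecontext(&coro_ctx, task, 0);

    /* "Scheduler": resume the task until it has run to completion. */
    for (int i = 0; i < 3; i++)
        swapcontext(&main_ctx, &coro_ctx);
    return ntrace;                  /* 3: all three steps ran */
}
```

Each swapcontext pair is indeed far more expensive than a plain call/return, which is the cost trade-off Mitch raises.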
  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Jul 16 01:08:41 2024
    On Tue, 16 Jul 2024 00:25:39 +0000, MitchAlsup1 wrote:

    On Mon, 15 Jul 2024 23:55:21 +0000, Lawrence D'Oliveiro wrote:

    The thing that coroutines bring to the table is that you don’t have to
    break up the code that executes in this common context into separate
    pieces that are each invoked via a callback and end in a reschedule
    call. Instead, you just write it as a single block, with explicit
    “await” calls at the various points along the way where rescheduling
    happens.

    Is that still valuable if it cost 50× what a subroutine call costs ???

    It doesn’t.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue Jul 16 01:09:58 2024
    On Mon, 15 Jul 2024 23:56:00 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:

    IMO, for something to qualify as a continuation-passing call, the
    caller should select the callee -- the thing to be executed next --
    but in SAGE/SABRE it is the scheduler that selects what to execute
    next.

    What you have is that SAGE/SABRE has built a higher-level scheduler
    abstraction on top of lower-level coroutines/continuations.

    It was an event loop, not coroutines.

    The one does not preclude the other. So they lacked the coroutine
    abstraction; that could have been added easily enough in the application implementation language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Jul 16 02:03:38 2024
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Mon, 15 Jul 2024 23:56:00 -0000 (UTC), John Levine wrote:

    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Mon, 15 Jul 2024 23:06:13 +0300, Niklas Holsti wrote:

    IMO, for something to qualify as a continuation-passing call, the
    caller should select the callee -- the thing to be executed next --
    but in SAGE/SABRE it is the scheduler that selects what to execute
    next.

    What you have is that SAGE/SABRE has built a higher-level scheduler
    abstraction on top of lower-level coroutines/continuations.

    It was an event loop, not coroutines.

    The one does not preclude the other. So they lacked the coroutine
    abstraction; that could have been added easily enough in the application
    implementation language.

    Just for amusement, what do you think the application implementation
    languages for SAGE and SABRE were? For that matter, what computers do
    you think they ran on?

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue Jul 16 02:47:05 2024
    On Tue, 16 Jul 2024 02:03:38 -0000 (UTC), John Levine wrote:

    Just for amusement, what do you think the application implementation languages for SAGE and SABRE were?

    Assembler with macros.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jul 16 05:54:44 2024
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Anyway, at the time I made a mental note about the relative cost of
    heap allocation of stack space in an ABI.

    I wonder if I might mention my dislike of arbitrary limits on
    stack size that are imposed on modern 64-bit systems.

    $ ulimit -s unlimited

    Should handle the default stack.

    Some operating systems put a hard limit on this, for example MacOS,
    and some an absurdly low limit on per-thread stack size (512 K
    or something like that).

    MacOS has been obsolete for a decade now.

    Sorry for messing up my terminology.

    OSX has an initial
    size of 8MB and a default configured hard limit of 64GB.

    The former leads to segfaults on otherwise fine programs, and
    makes using gfortran's -fstack-arrays (which really brings speed
    improvements) unsuitable for a lot of problems because it puts
    local arrays on the stack...

    Of course, it still needs to fit into whatever address space
    the target processor offers, but even with a 48-bit VA,
    there should be enuf VA space for several terabytes of
    stack.

    In any case, one can mmap an arbitrarily large region of
    memory and assign it to the stack pointer in (or even before)
    main().

    Changing the standard startup sequence of programs (i.e. replacing
    _start) - that would be heavy messing, and be no longer compatible
    with standard startup sequences.

    That depends on the program, I guess.

    In this case, a general approach.

    A little inline assembler
    in main should suffice;

    Not possible in Fortran, for example, unless the compiler does
    so itself. And, of course, it should be platform-independent,
    especially on systems that the maintainers of the Fortran compiler
    have no access to, and no special knowledge of, plus it would
    require platform-specific assembly in the startup code.

    An mmap'ed region will automatically
    allocate on first touch, so the stack can grow as necessary.

    Slightly different topic: Unfortunately, Apple decided in their
    wisdom that an mmapped region cannot grow, so you have to make a
    guess at the needed size. As to why, I have no idea; they probably
    wanted to follow the example of Windows. Learning from Microsoft
    is learning to win!

    This makes dynamic memory management for shared memory much harder
    than it needs to be (been there, been bitten by it).

    And for threads, this would get even messier.

    pthreads already get their own stack, and have standard functions
    to specify stack pointer and stack size (e.g. pthread_attr_setstack(2)).

    Life would be so much easier if defaults were reasonable...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Tue Jul 16 15:59:19 2024
    On Mon, 15 Jul 2024 18:12:32 +0000, EricP wrote:

    John Levine wrote:

    In a classic state machine there is a loop around a CASE statement
    based on the current state number. The last thing each state case
    does is assign the next state number to jump to.

    You underestimate what we HW guys do with state machines::

    There is a loop, yes, but inside the loop are several CASE statements
    each producing its own next state, the only thing in common is a way
    to exit the loop when needed.

    For example 68000 µcode had such a CASE statement for the PC section,
    the Data section, and the Address section, another for the pins, each
    mostly independent.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
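The structure Mitch describes — one loop, several independent CASE statements each advancing its own state — can be sketched in C. A toy illustration (the state names and the two "sections" are invented for the example):

```c
/* One loop, multiple switch statements, each producing its own
   next state; the only shared piece is the loop-exit condition. */
enum { S0, S1, DONE };

int run_machines(void)
{
    int pc_state = S0, data_state = S0, steps = 0;

    while (pc_state != DONE || data_state != DONE) {
        switch (pc_state) {             /* "PC section" */
        case S0: pc_state = S1;   break;
        case S1: pc_state = DONE; break;
        }
        switch (data_state) {           /* "data section" */
        case S0: data_state = S1;   break;
        case S1: data_state = DONE; break;
        }
        steps++;
    }
    return steps;                       /* 2 iterations for this toy */
}
```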
  • From John Savard@21:1/5 to ldo@nz.invalid on Tue Jul 16 10:00:59 2024
    On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    It only has to be Turing-equivalent (or Lisp-equivalent) to be a
    general-purpose machine. There’s nothing in that criterion that requires gotos.

    It's true that Turing-equivalence is sufficient to allow a computer
    with sufficient storage available to do anything.

    However, the term "general-purpose" can have other meanings. The sense
    in which I was using it was: a machine that is _efficient_ at running
    programs written in all popular programming languages for the kinds of applications for which computers are normally used. Thus, the IBM
    System/360 was general-purpose since it had string manipulation
    instructions and decimal arithmetic in addition to floating-point and
    binary integer arithmetic.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Tue Jul 16 17:17:57 2024
    I think the underlying problem with continuations en-the-large is
    that too much is in the continuation--leading to unnecessary
    overhead.

    Does a continuation in a block structured language NEED everything
    on the block-stack, or is it more like 2-10 variables at rather
    well defined locations ?? more like a functor than a continuation.

    When I was writing real time coroutines on a PDP-11 I rarely needed
    more than 2-3 variables and the addressing modes allowed for the
    coroutine call and return to be simple PDP-11 instructions with
    no more overhead than a normal subroutine call. (1975-77)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Jul 16 20:53:22 2024
    On Tue, 16 Jul 2024 5:54:44 +0000, Thomas Koenig wrote:


    Slightly different topic: Unfortunately, Apple decided in their
    wisdom that an mmapped region cannot grow, so you have to make a
    guess at the needed size. As to why, I have no idea; they probably
    wanted to follow the example of Windows. Learning from Microsoft
    is learning to win!

    Lemmings learn from the lemming right in front of them...too

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Jul 16 20:55:01 2024
    On Tue, 16 Jul 2024 2:47:05 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 16 Jul 2024 02:03:38 -0000 (UTC), John Levine wrote:

    Just for amusement, what do you think the application implementation
    languages for SAGE and SABRE were?

    Assembler with macros.

    It is rare, today, for programmers to be aware of the power of a
    good macro assembler, and how close these tend to be to a real
    low level language (like K&R C) but without the limitations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Savard on Tue Jul 16 20:51:17 2024
    On Tue, 16 Jul 2024 16:00:59 +0000, John Savard wrote:

    On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    It only has to be Turing-equivalent (or Lisp-equivalent) to be a
    “general-purpose” machine. There’s nothing in that criterion that requires gotos.

    It's true that Turing-equivalence is sufficient to allow a computer
    with sufficient storage available to do anything.

    But we also want that stuff to be fast (under some metric).

    However, the term "general-purpose" can have other meanings. The sense
    in which I was using it was: a machine that is _efficient_ at running programs written in all popular programming languages for the kinds of applications for which computers are normally used. Thus, the IBM
    System/360 was general-purpose since it had string manipulation
    instructions and decimal arithmetic in addition to floating-point and
    binary integer arithmetic.

    I think that history shows that those string facilities were, at best, overkill. None of the RISC machines had any of that and were faster
    and at least as general purpose as 360.

    Especially the COBOL stuff like EDIT and EDIT-and-MARK.

    On the other hand, I am advocating that some Elementary Functions be
    raised to first class citizens of ISA so that SIN() is 20 cycles instead
    of 150 cycles. Why, because today, we understand enough about the
    calculations, polynomials, error sources, to build in HW things that are
    just as accurate as SW can build and nearly 7×-10× faster.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Tue Jul 16 23:51:05 2024
    On Tue, 16 Jul 2024 05:54:44 -0000 (UTC), Thomas Koenig wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Some operating systems put a hard limit on this, for example MacOS,
    and some an absurdly low limit on per-thread stack size (512 K or
    something like that).

    MacOS has been obsolete for a decade now.

    Sorry for messing up my terminology.

    Just a note that Apple now seems to be referring to OS X as “macOS” (with the lower-case m).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Jul 16 23:54:59 2024
    On Tue, 16 Jul 2024 20:55:01 +0000, MitchAlsup1 wrote:

    On Tue, 16 Jul 2024 2:47:05 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 16 Jul 2024 02:03:38 -0000 (UTC), John Levine wrote:

    Just for amusement, what do you think the application implementation
    languages for SAGE and SABRE were?

    Assembler with macros.

    It is rare, today, for programmers to be aware of the power of a good
    macro assembler, and how close these tend to be to a real low level
    language (like K&R C) but without the limitations.

    Precisely my point. So the idea that you cannot create a macro equivalent
    to “await” seems implausible to me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Jul 16 23:57:46 2024
    On Tue, 16 Jul 2024 17:17:57 +0000, MitchAlsup1 wrote:

    Does a continuation in a block structured language NEED everything on
    the block-stack, or is it more like 2-10 variables at rather well
    defined locations ?? more like a functor than a continuation.

    My answer would be “it only needs to include what it needs”. Maybe in one instance it only needs a couple of integer variables, in another it might
    need a (gasp) multi-megabyte dynamic array.

    The only register contents the hardware needs to worry about would be:

    * the program counter
    * the pointer to the call frame itself

    and perhaps that’s it. Everything else (e.g. the concept of a “return address”) could be defined by the ABI.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Tue Jul 16 19:41:12 2024
    On Tue, 16 Jul 2024 20:51:17 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    I think that history shows that those string facilities were, at best,
    overkill. None of the RISC machines had any of that and were faster
    and at least as general purpose as 360.

    Especially the COBOL stuff like EDIT and EDIT-and-MARK.

    I will agree with that; the 68000 and the x86 don't have those things
    either.

    Even if my personal inclination has been to include them too.

    On the other hand, I am advocating that some Elementary Functions be
    raised to first class citizens of ISA so that SIN() is 20 cycles instead
    of 150 cycles. Why, because today, we understand enough about the
    calculations, polynomials, error sources, to build in HW things that are
    just as accurate as SW can build and nearly 7×-10× faster.

    I thought this was already the case; the 8087 had transcendental
    functions, and so did the 486 DX and from that onwards, and this was
    also the case with the 68881 and 68882.

    Of course, even if they're citizens of the ISA, they could be
    second-class citizens, whatever the definition of that might be.

    Here, my personal inclination is to simply include the raw CORDIC
    calculation as an instruction, with bounds checking, special cases,
    and any fancy stuff in software. On the grounds that some things are
    too complicated for a classic old-style ISA.

    That, too, is of course a bad decision from a practical perspective.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Wed Jul 17 03:51:50 2024
    On Tue, 16 Jul 2024 20:51:17 +0000, MitchAlsup1 wrote:

    On the other hand, I am advocating that some Elementary Functions be
    raised to first class citizens of ISA so that SIN() is 20 cycles instead
    of 150 cycles. Why, because today, we understand enough about the
    calculations, polynomials, error sources, to build in HW things that are
    just as accurate as SW can build and nearly 7×-10× faster.

    I can remember a Comp Sci professor waxing lyrical about p-adic number
    systems, as an alternative to conventional p-ary ones. The key thing was
    that division was no more complex than multiplication.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Koenig on Wed Jul 17 09:23:00 2024
    In article <v7440p$sgca$1@dont-email.me>, tkoenig@netcologne.de (Thomas
    Koenig) wrote:

    Some operating systems put a hard limit on this, for example MacOS,
    and some an absurdly low limit on per-thread stack size (512 K
    or something like that).

    That's the default limit, but you can raise it, provided you do so before creating the threads.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Savard on Wed Jul 17 16:17:22 2024
    On Wed, 17 Jul 2024 1:41:12 +0000, John Savard wrote:

    On Tue, 16 Jul 2024 20:51:17 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    I think that history shows that those string facilities were, at best,
    overkill. None of the RISC machines had any of that and were faster
    and at least as general purpose as 360.

    Especially the COBOL stuff like EDIT and EDIT-and-MARK.

    I will agree with that; the 68000 and the x86 don't have those things
    either.

    Even if my personal inclination has been to include them too.

    On the other hand, I am advocating that some Elementary Functions be
    raised to first class citizens of ISA so that SIN() is 20 cycles instead
    of 150 cycles. Why, because today, we understand enough about the
    calculations, polynomials, error sources, to build in HW things that are
    just as accurate as SW can build and nearly 7×-10× faster.

    I thought this was already the case; the 8087 had transcendental
    functions, and so did the 486 DX and from that onwards, and this was
    also the case with the 68881 and 68882.

    Of course, even if they're citizens of the ISA, they could be
    second-class citizens, whatever the definition of that might be.

    While they were "in" the ISAs, they would take thousands of cycles
    while FP would only take hundreds of cycles.

    What I am talking about is to improve their performance until a
    sin() takes about the same number of cycles as FDIV, not 10× more.

    Here, my personal inclination is to simply include the raw CORDIC
    calculation as an instruction, with bounds checking, special cases,
    and any fancy stuff in software. On the grounds that some things are
    too complicated for a classic old-style ISA.

    This is a tradeoff in size versus cycles. I chose to stop when
    sin() was as fast as FDIV.

    That, too, is of course a bad decision from a practical perspective.

    Imagine a situation where::

    s = sin(x);
    c = cos(x);

    is faster and more accurate than:

    s = sin(x);
    c = sqrt(1-s^2);

    !!!
    {{ignoring the fact that cos() can be negative where sqrt() cannot.}}

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Wed Jul 17 16:41:37 2024
    MitchAlsup1 wrote:

    On Tue, 16 Jul 2024 16:00:59 +0000, John Savard wrote:

    On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    It only has to be Turing-equivalent (or Lisp-equivalent) to be a
    “general-purpose” machine. There’s nothing in that criterion that requires gotos.

    It's true that Turing-equivalence is sufficient to allow a computer
    with sufficient storage available to do anything.

    But we also want that stuff to be fast (under some metric).

    However, the term "general-purpose" can have other meanings. The
    sense in which I was using it was: a machine that is efficient at
    running programs written in all popular programming languages for
    the kinds of applications for which computers are normally used.
    Thus, the IBM System/360 was general-purpose since it had string manipulation instructions and decimal arithmetic in addition to floating-point and binary integer arithmetic.

    I think that history shows that those string facilities were, at best, overkill. None of the RISC machines had any of that and were faster
    and at least as general purpose as 360.

    Especially the COBOL stuff like EDIT and EDIT-and-MARK.


    Regarding EDIT, etc., I think there are three possibilities:

    1. They were a bad idea from the start and never should have been put
    into S/360.

    2. Putting them into S/360 was the right decision at the time, but the changing technology (i.e. they wouldn't fit into the initial CISC microprocessors and RISC showed the functionality could be done other
    ways) made putting them into newer designs a bad idea.

    3. Putting them into S/360 was the right decision at the time but the workloads changed. There was less requirement for things like actually
    printing checks and general ledger reports, and programmers moved away
    from COBOL, which was where EDIT was a natural fit, to languages where
    there wasn't as natural a fit, so not putting EDIT into newer CPUs was
    the right decision.

    I suspect it was mostly number 3, but I think number 2 was a part of it.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Jul 17 16:50:27 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
    sin() takes about the same number of cycles as FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
    small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took _months_, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Wed Jul 17 17:09:44 2024
    "Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
    MitchAlsup1 wrote:

    On Tue, 16 Jul 2024 16:00:59 +0000, John Savard wrote:

    On Mon, 15 Jul 2024 01:59:58 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    It only has to be Turing-equivalent (or Lisp-equivalent) to be a
    “general-purpose” machine. There’s nothing in that criterion that
    requires gotos.

    It's true that Turing-equivalence is sufficient to allow a computer
    with sufficient storage available to do anything.

    But we also want that stuff to be fast (under some metric).

    However, the term "general-purpose" can have other meanings. The
    sense in which I was using it was: a machine that is efficient at
    running programs written in all popular programming languages for
    the kinds of applications for which computers are normally used.
    Thus, the IBM System/360 was general-purpose since it had string
    manipulation instructions and decimal arithmetic in addition to
    floating-point and binary integer arithmetic.

    I think that history shows that those string facilities were, at best,
    overkill. None of the RISC machines had any of that and were faster
    and at least as general purpose as 360.

    Especially the COBOL stuff like EDIT and EDIT-and-MARK.


    Regarding EDIT, etc., I think there are three possibilities:

    1. They were a bad idea from the start and never should have been put
    into S/360.

    2. Putting them into S/360 was the right decision at the time, but the
    changing technology (i.e. they wouldn't fit into the initial CISC
    microprocessors and RISC showed the functionality could be done other
    ways) made putting them into newer designs a bad idea.

    3. Putting them into S/360 was the right decision at the time but the
    workloads changed. There was less requirement for things like actually
    printing checks and general ledger reports, and programmers moved away
    from COBOL, which was where EDIT was a natural fit, to languages where
    there wasn't as natural a fit, so not putting EDIT into newer CPUs was
    the right decision.

    I suspect it was mostly number 3, but I think number 2 was a part of it.

    I also suspect it was (3). Burroughs medium systems (B3500 etc) used
    the edit instruction (EDT) extensively for COBOL.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Thomas Koenig on Wed Jul 17 17:01:20 2024
    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)

    s/done/suggested/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Jul 17 17:41:49 2024
    On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
    sin() takes about the same number of cycles as FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
    small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took _months_, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)

    FMUL Rt,RR,RT
    FDIV Rt,-RE,Rt
    EXP Rt,Rt
    FMUL Rk,RA,Rt

    Does not look "all that bad" to me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Wed Jul 17 20:17:45 2024
    On Wed, 17 Jul 2024 16:50:27 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
    sin() takes about the same number of cycles as FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
    small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took _months_, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)

    So, where is a story?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jul 17 17:45:39 2024
    On Wed, 17 Jul 2024 17:09:44 +0000, Scott Lurndal wrote:

    "Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
    MitchAlsup1 wrote:


    Especially the COBOL stuff like EDIT and EDIT-and-MARK.


    Regarding EDIT, etc., I think there are three possibilities:

    1. They were a bad idea from the start and never should have been put
    into S/360.

    2. Putting them into S/360 was the right decision at the time, but the
    changing technology (i.e. they wouldn't fit into the initial CISC
    microprocessors and RISC showed the functionality could be done other
    ways) made putting them into newer designs a bad idea.

    3. Putting them into S/360 was the right decision at the time but the
    workloads changed. There was less requirement for things like actually
    printing checks and general ledger reports, and programmers moved away
    from COBOL, which was where EDIT was a natural fit, to languages where
    there wasn't as natural a fit, so not putting EDIT into newer CPUs was
    the right decision.

    I suspect it was mostly number 3, but I think number 2 was a part of it.

    I also suspect it was (3). Burroughs medium systems (B3500 etc) used
    the edit instruction (EDT) extensively for COBOL.

    I suspect that as compute power grew, the kinds of things we wanted out
    of EDIT changed.

    negative numbers printed in red
    negative numbers surrounded with parens (-123.45)
    decimal point and comma LOCAL selection
    NaN
    INFINITY
    ..

    making a fixed EDIT instruction fade into insignificance.

  • From Thomas Koenig@21:1/5 to Stephen Fuld on Wed Jul 17 17:57:34 2024
    Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:

    Regarding EDIT, etc., I think there are three possibilities:

    1. They were a bad idea from the start and never should have been put
    into S/360.

    Probably not, given the environment at the time.

2. Putting them into S/360 was the right decision at the time, but the
changing technology (i.e. they wouldn't fit into the initial CISC
microprocessors and RISC showed the functionality could be done other
ways) made putting them into newer designs a bad idea.

3. Putting them into S/360 was the right decision at the time but the
workloads changed. There was less requirement for things like actually
printing checks and general ledger reports, and programmers moved away
from COBOL, which was where EDIT was a natural fit, to languages where
there wasn't as natural a fit, so not putting EDIT into newer CPUs was
the right decision.

    I think it mostly has to do with the decimal machines that IBM sold
    previously, and that the /360 should do everything that the old
    machines could do. (They messed up with floating point, which
    was much worse than for the 704ff, as previously discussed).

Plus, microcode was cheaper in those days than the hideously expensive
main storage (less work for reading, because the cards for microcode
didn't have a destructive read, and no magnetic cores with wiring).

If you wanted to read a decimal number punched on a card (read in
as EBCDIC), or print out a decimal number, you didn't want to spend
many instructions of precious core memory on that task.

    Tasks changed, memory became cheaper and bigger, and decimal
    arithmetic was not so important any more.

    I suspect it was mostly number 3, but I think number 2 was a part of it.

    So, basically I agree.




  • From Stephen Fuld@21:1/5 to All on Wed Jul 17 18:30:47 2024
    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took months, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)

    FMUL Rt,RR,RT
    FDIV Rt,-RE,Rt
    EXP Rt,Rt
    FMUL Rk,RA,Rt

    Does not look "all that bad" to me.

    So for your GbOoO CPU, how many of the various FP operations, and the
    EXP instruction can be done in parallel?



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From John Levine@21:1/5 to mitchalsup@aol.com on Wed Jul 17 18:28:52 2024
    It appears that MitchAlsup1 <mitchalsup@aol.com> said:
    applications for which computers are normally used. Thus, the IBM
    System/360 was general-purpose since it had string manipulation
    instructions and decimal arithmetic in addition to floating-point and
    binary integer arithmetic.

I think that history shows that those string facilities were, at best,
overkill. None of the RISC machines had any of that and were faster
and at least as general purpose as 360.

    Especially the COBOL stuff like EDIT and EDIT-and-MARK.

    They made sense at the time when memories were small and CPUs were slow. Decimal arithmetic and pack/unpack and EDMK do exactly what every RPG
    and Cobol program did, simple arithmetic on columns of numbers, and print
    them in pretty formats. Sure, you can do all that with libraries, but
    on a 4K or 8K machine, every byte was precious, and putting the libraries
    in microcode made sense.

    These days z still has them all, but I'm sure they're all in millicode, subroutines in the hardware subset of the instruction set that implement
    the rest of it.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jul 17 18:42:44 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 17 Jul 2024 17:09:44 +0000, Scott Lurndal wrote:

    "Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
    MitchAlsup1 wrote:


    Especially the COBOL stuff like EDIT and EDIT-and-MARK.


    Regarding EDIT, etc., I think there are three possibilities:

1. They were a bad idea from the start and never should have been put
into S/360.

2. Putting them into S/360 was the right decision at the time, but the
changing technology (i.e. they wouldn't fit into the initial CISC
microprocessors and RISC showed the functionality could be done other
ways) made putting them into newer designs a bad idea.

3. Putting them into S/360 was the right decision at the time but the
workloads changed. There was less requirement for things like actually
printing checks and general ledger reports, and programmers moved away
from COBOL, which was where EDIT was a natural fit, to languages where
there wasn't as natural a fit, so not putting EDIT into newer CPUs was
the right decision.

    I suspect it was mostly number 3, but I think number 2 was a part of it.

    I also suspect it was (3). Burroughs medium systems (B3500 etc) used
    the edit instruction (EDT) extensively for COBOL.

    I suspect that as compute power grew, the kinds of things we wanted out
    of EDIT changed.

    negative numbers printed in red
    negative numbers surrounded with parens (-123.45)
    decimal point and comma LOCAL selection
    NaN
    INFINITY

    EDIT/EDT instructions were designed from the start for
COBOL fixed point arithmetic. No NaNs, no infinities.

    COBOL had verbs to assign the correct characters as
    the decimal point and thousands separator (on the
    Burroughs B3500 those characters were stored in reserved
low memory which was accessed by the EDT instruction when
    processing the edit rule(s)).

    Printing in various colors or print trains were not the
    responsibility of the COBOL program, but rather the operator
    who mounted the print chain/train and/or ribbon for
    the print job.

    For certain applications, an edt instruction would still
    be useful, but probably wouldn't be significantly more
    time efficient than a set of risc-ish instructions (it
    would likely be more space efficient).

    The problem is that the modern languages don't have
    a modern concept of a PICture clause. An instruction
    to format based on a printf-like format string could
    be useful, I suspect, but the die area is probably more
    useful as cache.
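A software stand-in for such an instruction is short enough to make the die-area argument plausible. Here is an illustrative C sketch of PICTURE-style editing (grouping separator, two decimal places, parentheses for negatives); the function and its rules are my own invention and far simpler than real EDIT/EDT:

```c
#include <stdio.h>

/* Format a fixed-point amount (in hundredths) roughly the way a COBOL
   PICTURE with grouping and trailing parentheses for negatives would:
   thousands separators, two decimal places, (...) when negative.
   Illustration only -- real EDIT/EDT edit rules are far richer. */
void edit_amount(long long cents, char *out, size_t outsz)
{
    long long v = cents < 0 ? -cents : cents;
    char digits[32];
    int n = snprintf(digits, sizeof digits, "%lld", v / 100);

    /* Insert a comma before every remaining group of three digits. */
    char grouped[48];
    int gi = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0 && (n - i) % 3 == 0)
            grouped[gi++] = ',';
        grouped[gi++] = digits[i];
    }
    grouped[gi] = '\0';

    if (cents < 0)
        snprintf(out, outsz, "(%s.%02lld)", grouped, v % 100);
    else
        snprintf(out, outsz, "%s.%02lld", grouped, v % 100);
}
```

For example, -12345 hundredths comes out as "(123.45)", matching the parenthesized style mentioned above.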

  • From John Levine@21:1/5 to All on Wed Jul 17 19:19:26 2024
    According to Scott Lurndal <slp53@pacbell.net>:
    I suspect that as compute power grew, the kinds of things we wanted out
    of EDIT changed.

    negative numbers printed in red
    negative numbers surrounded with parens (-123.45)
    decimal point and comma LOCAL selection
    NaN
    INFINITY

    EDIT/EDT instructions were designed from the start for
    COBOL fixed point arithmetic. No Nan, No Infinities.

    Yes, but don't forget RPG. It was never fashionable but
    there sure was a lot of it. Still is on whatever they
    call AS/400 aka System i these days.

    COBOL had verbs to assign the correct characters as
    the decimal point and thousands separator

    DECIMAL POINT IS COMMA.

    The problem is that the modern languages don't have
    a modern concept of a PICture clause. An instruction
    to format based on a printf-like format string could
    be useful, I suspect, but the die area is probably more
    useful as cache.

    More important, libraries are basically free now. Every program
links to megabytes of shared libraries and nobody cares.




    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Jul 17 19:22:52 2024
    On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:

    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took months, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)

    FMUL Rt,RR,RT
    FDIV Rt,-RE,Rt
    EXP Rt,Rt
    FMUL Rk,RA,Rt

    Does not look "all that bad" to me.

    So for your GbOoO CPU, how many of the various FP operations, and the
    EXP instruction can be done in parallel?

    FMUL is 4 cycles of latency fully pipelined
    FDIV is ~20 cycles of latency not pipelined
    EXP is ~16 cycles of latency not pipelined

    They are all performed in the FMAC unit and here the instructions are
    serially dependent.

    So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
    see the same latency; that is, GBOoO is not a differentiator.
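The 44-cycle figure is just the dependent chain summed, using the latencies quoted above:

```c
/* Serial dependency chain for k = A*exp(-Ea/(R*T)) on one FMAC unit:
   FMUL (R*T) -> FDIV -> EXP -> FMUL, with the latencies given above
   (FMUL 4, FDIV ~20, EXP ~16).  Every step consumes the previous
   result, so the latencies add regardless of machine width. */
int arrhenius_chain_latency(void)
{
    int fmul = 4, fdiv = 20, exp_lat = 16;
    return fmul + fdiv + exp_lat + fmul;   /* = 44 cycles */
}
```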

  • From Stephen Fuld@21:1/5 to All on Wed Jul 17 20:56:06 2024
    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:

    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10×
    more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations
    which included fluid flow (including turbulence modelling and diffusion) and quite a few chemical reactions together. So, he evaluated a huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
    evaluating the Arrhenius equation above many, many, many, many
    times.

    A single calculation took months, and he didn't use weak
    hardware.

    A fully pipelined evaluation of, let's say, four parallel exp
    and four parallel fdiv instructions would have reduced his
    calculation time by orders of magnitude, and allowed him to
    explore the design space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the Arrhenius equation into your ISA, I would have done so already
    :-)

    FMUL Rt,RR,RT
    FDIV Rt,-RE,Rt
    EXP Rt,Rt
    FMUL Rk,RA,Rt

    Does not look "all that bad" to me.

    So for your GbOoO CPU, how many of the various FP operations, and
    the EXP instruction can be done in parallel?

    FMUL is 4 cycles of latency fully pipelined
    FDIV is ~20 cycles of latency not pipelined
    EXP is ~16 cycles of latency not pipelined

    They are all performed in the FMAC unit and here the instructions are serially dependent.

    So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
    see the same latency; that is, GBOoO is not a differentiator.


    Good, I get that. But Thomas' original discussion of the problem
    indicated that it was very parallel, so the question is, in your
    design, how many of those calculations can go in in parallel?




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Stephen Fuld@21:1/5 to Scott Lurndal on Wed Jul 17 21:14:43 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 17 Jul 2024 17:09:44 +0000, Scott Lurndal wrote:

    "Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> writes:
    MitchAlsup1 wrote:


    Especially the COBOL stuff like EDIT and EDIT-and-MARK.


    Regarding EDIT, etc., I think there are three possibilities:

    1. They were a bad idea from the start and never should have
    been put into S/360.

    2. Putting them into S/360 was the right decision at the time,
    but the changing technology (i.e. they wouldn't fit into the
    initial CISC microprocessors and RISC showed the functionality
    could be done other ways) made putting them into newer designs
    a bad idea.

    3. Putting them into S/360 was the right decision at the time
    but the workloads changed. There was less requirement for
    things like actually printing checks and general ledger
    reports, and programmers moved away from COBOL, which was where
    EDIT was a natural fit, to languages where there wasn't as
natural a fit, so not putting EDIT into newer CPUs was the right decision.

    I suspect it was mostly number 3, but I think number 2 was a
    part of it.

    I also suspect it was (3). Burroughs medium systems (B3500 etc)
used the edit instruction (EDT) extensively for COBOL.

    I suspect that as compute power grew, the kinds of things we wanted
    out of EDIT changed.

    negative numbers printed in red
    negative numbers surrounded with parens (-123.45)
    decimal point and comma LOCAL selection
    NaN
    INFINITY

    EDIT/EDT instructions were designed from the start for
COBOL fixed point arithmetic. No NaNs, no infinities.

    COBOL had verbs to assign the correct characters as
    the decimal point and thousands separator (on the
    Burroughs B3500 those characters were stored in reserved
low memory which was accessed by the EDT instruction when
    processing the edit rule(s)).

    Printing in various colors or print trains were not the
    responsibility of the COBOL program, but rather the operator
    who mounted the print chain/train and/or ribbon for
    the print job.

    For certain applications, an edt instruction would still
    be useful, but probably wouldn't be significantly more
    time efficient than a set of risc-ish instructions (it
    would likely be more space efficient).

    The problem is that the modern languages don't have
    a modern concept of a PICture clause. An instruction
    to format based on a printf-like format string could
    be useful, I suspect, but the die area is probably more
    useful as cache.

    There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
    expressive" than what things like printf currently provide. The other
    question is about whether such should be implemented in hardware.

    As I have expressed before here, I believe the answer to the first part
    is "yes", but not as feature rich as a full picture clause. I tend to
    agree with you about the second.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Jul 17 23:19:18 2024
    On Wed, 17 Jul 2024 21:14:43 +0000, Stephen Fuld wrote:

    Scott Lurndal wrote:

    The problem is that the modern languages don't have
    a modern concept of a PICture clause. An instruction
    to format based on a printf-like format string could
    be useful, I suspect, but the die area is probably more
    useful as cache.

    There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
expressive" than what things like printf currently provide. The other
question is about whether such should be implemented in hardware.

    As I have expressed before here, I believe the answer to the first part
    is "yes", but not as feature rich as a full picture clause. I tend to
    agree with you about the second.

    The fact that output formatting has advanced so much during my lifetime contraindicates putting what we currently understand in HW.

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Jul 17 23:17:37 2024
    On Wed, 17 Jul 2024 20:56:06 +0000, Stephen Fuld wrote:

    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:

    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10×
    more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations
    which included fluid flow (including turbulence modelling and
    diffusion) and quite a few chemical reactions together. So, he
    evaluated a huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
    evaluating the Arrhenius equation above many, many, many, many
    times.

    A single calculation took months, and he didn't use weak
    hardware.

    A fully pipelined evaluation of, let's say, four parallel exp
    and four parallel fdiv instructions would have reduced his
    calculation time by orders of magnitude, and allowed him to
    explore the design space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already
    :-)

    FMUL Rt,RR,RT
    FDIV Rt,-RE,Rt
    EXP Rt,Rt
    FMUL Rk,RA,Rt

    Does not look "all that bad" to me.

    So for your GbOoO CPU, how many of the various FP operations, and
    the EXP instruction can be done in parallel?

    FMUL is 4 cycles of latency fully pipelined
    FDIV is ~20 cycles of latency not pipelined
    EXP is ~16 cycles of latency not pipelined

    They are all performed in the FMAC unit and here the instructions are
    serially dependent.

    So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
    see the same latency; that is, GBOoO is not a differentiator.


    Good, I get that. But Thomas' original discussion of the problem
    indicated that it was very parallel, so the question is, in your
    design, how many of those calculations can go in in parallel?

    The FDIV and EXP instructions consume all the FMAC cycles, so even if
    you completely unrolled the loop, you are not going to get more than
    6-cycles less in performing the repeated calculations.

    A really BIG implementation with 4 FMAC units per core could unroll
    the loop (by reservation stations) such that each iteration would
    still be 44-cycles, but you could run 4 in parallel and achieve 4
    results every 44-cycles--which to most people smells like 1 result
    every 11-cycles.

    {Would be an interesting reservation station design, though}




  • From EricP@21:1/5 to All on Wed Jul 17 20:48:18 2024
    MitchAlsup1 wrote:
    On Wed, 17 Jul 2024 20:56:06 +0000, Stephen Fuld wrote:

    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:

    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10×
    more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations
    which included fluid flow (including turbulence modelling and
    diffusion) and quite a few chemical reactions together. So, he
    evaluated a huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
    evaluating the Arrhenius equation above many, many, many, many
    times.

    A single calculation took months, and he didn't use weak
    hardware.

    A fully pipelined evaluation of, let's say, four parallel exp
    and four parallel fdiv instructions would have reduced his
    calculation time by orders of magnitude, and allowed him to
    explore the design space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already
    :-)

    FMUL Rt,RR,RT
    FDIV Rt,-RE,Rt
    EXP Rt,Rt
    FMUL Rk,RA,Rt

    Does not look "all that bad" to me.

    So for your GbOoO CPU, how many of the various FP operations, and
    the EXP instruction can be done in parallel?

    FMUL is 4 cycles of latency fully pipelined
    FDIV is ~20 cycles of latency not pipelined
    EXP is ~16 cycles of latency not pipelined

    They are all performed in the FMAC unit and here the instructions are
    serially dependent.

    So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
    see the same latency; that is, GBOoO is not a differentiator.


    Good, I get that. But Thomas' original discussion of the problem
    indicated that it was very parallel, so the question is, in your
    design, how many of those calculations can go in in parallel?

    The FDIV and EXP instructions consume all the FMAC cycles, so even if
    you completely unrolled the loop, you are not going to get more than
    6-cycles less in performing the repeated calculations.

    A really BIG implementation with 4 FMAC units per core could unroll
    the loop (by reservation stations) such that each iteration would
    still be 44-cycles, but you could run 4 in parallel and achieve 4
    results every 44-cycles--which to most people smells like 1 result
    every 11-cycles.

    {Would be an interesting reservation station design, though}

    In what way would the RS be interesting or different?

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Thu Jul 18 06:00:46 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:

    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took months, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)

    FMUL Rt,RR,RT
    FDIV Rt,-RE,Rt
    EXP Rt,Rt
    FMUL Rk,RA,Rt

    Does not look "all that bad" to me.

    So for your GbOoO CPU, how many of the various FP operations, and the
    EXP instruction can be done in parallel?

    FMUL is 4 cycles of latency fully pipelined
    FDIV is ~20 cycles of latency not pipelined
    EXP is ~16 cycles of latency not pipelined

    Ah, OK.


    They are all performed in the FMAC unit and here the instructions are serially dependent.

    A loop containing the calculation could be unrolled, but without
    a large effect.


    So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
    see the same latency; that is, GBOoO is not a differentiator.

What about SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
    (If not, that would be a big disadvantage for scientific/technical
    work).

  • From Thomas Koenig@21:1/5 to Stephen Fuld on Thu Jul 18 07:39:35 2024
    Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
    There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
    expressive" than what things like printf currently provide.

Take a look at what Fortran can do; it does quite a few things
    (and is correspondingly difficult to implement :-)

  • From Thomas Koenig@21:1/5 to Stephen Fuld on Thu Jul 18 07:54:23 2024
    Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:

    [Arrhenius]

    Good, I get that. But Thomas' original discussion of the problem
    indicated that it was very parallel, so the question is, in your
    design, how many of those calculations can go in in parallel?

    I ran a little Arrhenius benchmark on an i7-11700. Main program was

program main
   implicit none
   integer, parameter :: n = 1024
   double precision, dimension(n) :: k, a, ea, t
   integer :: i
   call random_number(a)
   call random_number(ea)
   ea = 10000 + ea*30000
   call random_number(t)
   t = 400 + 200*t
   do i = 1, 1024*1024
      call arrhenius(k, a, ea, t, n)
   end do
end program main

    and the called routine was (in a separate file, so the compiler
    could not notice that the results were actually never used)

subroutine arrhenius(k, a, ea, t, n)
   implicit none
   integer, intent(in) :: n
   double precision, dimension(n), intent(out) :: k
   double precision, dimension(n), intent(in) :: a, ea, t
   double precision, parameter :: r = 8.314
   k = a * exp(-ea/(r*t))
end subroutine arrhenius

    Timing result (wall-clock time only):

    -O0: 5.343s
    -O2: 4.560s
    -Ofast: 2.237s
-Ofast -march=native -mtune=native: 2.154s

Of course, you never know what speed your CPU is actually running
at these days, but if I assume 5 GHz, that would give around 10
cycles per Arrhenius evaluation, which is quite fast (IMHO).
It uses an AVX2 version of exp, or so I gather from the function
name, _ZGVdN4v_exp_avx2.

  • From EricP@21:1/5 to All on Thu Jul 18 08:10:45 2024
    MitchAlsup1 wrote:
    On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:

    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took months, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)

    FMUL Rt,RR,RT
    FDIV Rt,-RE,Rt
    EXP Rt,Rt
    FMUL Rk,RA,Rt

    Does not look "all that bad" to me.

    So for your GbOoO CPU, how many of the various FP operations, and the
    EXP instruction can be done in parallel?

    FMUL is 4 cycles of latency fully pipelined
    FDIV is ~20 cycles of latency not pipelined
    EXP is ~16 cycles of latency not pipelined

    They are all performed in the FMAC unit and here the instructions are serially dependent.

    So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
    see the same latency; that is, GBOoO is not a differentiator.
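The 44-cycle figure, and the later claim in the thread that four FMAC units give one result every 11 cycles, are straightforward arithmetic on these latencies. A quick sketch (the numbers are the latencies quoted above, not measurements):

```python
# Latencies quoted above for the My 66000 FMAC unit, in cycles.
LATENCY = {"FMUL": 4, "FDIV": 20, "EXP": 16}

# The Arrhenius sequence is serially dependent, so the chain latency
# is simply the sum of the individual operation latencies.
chain = ["FMUL", "FDIV", "EXP", "FMUL"]
latency = sum(LATENCY[op] for op in chain)   # 4 + 20 + 16 + 4 = 44

# With 4 FMAC units running 4 independent chains, throughput looks
# like one result every latency/4 cycles.
throughput = latency / 4                     # 11.0
```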

    If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
    the multiplier, can the pipeline get a mix of multiple operations going
    at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially
    so each can only use one of the 4 pipeline slots.

  • From John Savard@21:1/5 to ldo@nz.invalid on Thu Jul 18 06:45:29 2024
    On Wed, 17 Jul 2024 03:51:50 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

I can remember a Comp Sci professor waxing lyrical about p-adic number
systems, as an alternative to conventional p-ary ones. The key thing was
that division was no more complex than multiplication.

    And addition, I suppose, is the same complexity too. But comparison
    kills you; you need to apply the Chinese Remainder Theorem every time
    you want to print out a number.

    John Savard
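Assuming the residue-number-system reading of this exchange (digit-wise arithmetic over pairwise-coprime moduli), the asymmetry is easy to demonstrate: multiplication stays componentwise, while comparing or printing requires a full Chinese Remainder Theorem reconstruction. A toy sketch with hypothetical moduli:

```python
# A toy residue number system with pairwise-coprime moduli.
MODULI = (7, 11, 13)  # dynamic range 7*11*13 = 1001

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    # Multiplication is digit-wise and carry-free: cheap.
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(r):
    # Comparison or printing needs reconstruction via the Chinese
    # Remainder Theorem: expensive relative to rns_mul.
    M = 1
    for m in MODULI:
        M *= m
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return x % M

# 12 * 34 = 408, computed entirely in residues:
product = from_rns(rns_mul(to_rns(12), to_rns(34)))
```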

  • From Stephen Fuld@21:1/5 to Thomas Koenig on Thu Jul 18 13:35:33 2024
    Thomas Koenig wrote:

    Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
    There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
    expressive" than what things like printf currently provide.

    Take a look at what Fortran can do, it does quite a few things
    (and is correspondingly difficult to implement :-)


I used to write a fair amount of Fortran, but haven't recently. While
it does have a lot of functionality (most of which I never had occasion
to use), it at least didn't have support for my pet peeve: a
separator character (whatever one is used) every three digits. This
really improves readability for numbers with lots of digits. Does it
now support this?




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Michael S@21:1/5 to Thomas Koenig on Thu Jul 18 16:16:56 2024
    On Thu, 18 Jul 2024 06:00:46 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:



What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).

All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
Pentium-M and AMD K8, do not perform double precision floating point
ops in parallel. The same applies to SSE2 implementations on a few
post-Merom "small" cores, both from Intel and from AMD.

  • From Scott Lurndal@21:1/5 to Michael S on Thu Jul 18 13:58:37 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 18 Jul 2024 06:00:46 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:



What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).

All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
    Pentium-M and AMD K8, do not perform double precision floating point

    All of which have been obsolete for almost two decades.

  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Jul 18 16:40:49 2024
    On Wed, 17 Jul 2024 23:17:37 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    A really BIG implementation with 4 FMAC units per core could unroll
    the loop (by reservation stations) such that each iteration would
    still be 44-cycles, but you could run 4 in parallel and achieve 4
    results every 44-cycles--which to most people smells like 1 result
    every 11-cycles.

    {Would be an interesting reservation station design, though}




Your "really BIG" implementation sounds not so big by modern standards.
I think all x86 cores sold in the last 5 years, except those that are
called "small" or "economical", have at least 8 FMAC units. The last
mainstream x86 core with 4 FMACs was AMD Zen1.
    On the ARM side "mainstream" Cortex-A cores still feature 4 FMAC units,
    but "premium" Cortex-X cores have more. So do "performance" cores from
    Apple and Qualcomm.

  • From Michael S@21:1/5 to Scott Lurndal on Thu Jul 18 17:52:44 2024
    On Thu, 18 Jul 2024 13:58:37 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 18 Jul 2024 06:00:46 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:



What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).

All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
    Pentium-M and AMD K8, do not perform double precision floating point


    All of which have been obsolete for almost two decades.

    According to my understanding, Thomas Koenig was/is around for more
    than three decades.

  • From John Savard@21:1/5 to SFuld@alumni.cmu.edu.invalid on Thu Jul 18 09:18:14 2024
    On Wed, 17 Jul 2024 20:56:06 -0000 (UTC), "Stephen Fuld" <SFuld@alumni.cmu.edu.invalid> wrote:

    Good, I get that. But Thomas' original discussion of the problem
    indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?

In a way, that's the wrong question.

    If you need to have exponentiations and divides running in parallel
    _on a single CPU_ because the problem isn't parallel _enough_ to
    dispense with them being very tightly coupled... then, yes, that's a
    reasonable question.

    But _otherwise_ there is no limit on the number of calculations that
    can take place in parallel, just throw more CPUs at the problem.

    John Savard

  • From Thomas Koenig@21:1/5 to Michael S on Thu Jul 18 15:35:50 2024
    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 13:58:37 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 18 Jul 2024 06:00:46 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:



What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).

All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
    Pentium-M and AMD K8, do not perform double precision floating point


    All of which have been obsolete for almost two decades.

    According to my understanding, Thomas Koenig was/is around for more
    than three decades.

    I have been (starting doing things on mainframes in the late 1980s),
    but I was specifically asking about what Mitch had in mind for
    My 66000.

  • From Thomas Koenig@21:1/5 to Stephen Fuld on Thu Jul 18 15:43:07 2024
    Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
    Thomas Koenig wrote:

    Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:
    There are two separate issues here. One is whether modern languages
should have syntax for easily formatting numbers that is "more
    expressive" than what things like printf currently provide.

    Take a look at what Fortran can do, it does quite a few things
    (and is correspondingly difficult to implement :-)


I used to write a fair amount of Fortran, but haven't recently. While
it does have a lot of functionality (most of which I never had occasion
to use), it at least didn't have support for my pet peeve: a
separator character (whatever one is used) every three digits. This
really improves readability for numbers with lots of digits. Does it
now support this?

    It does not, you would have to roll your own.





  • From Thomas Koenig@21:1/5 to John Savard on Thu Jul 18 15:47:29 2024
    John Savard <quadibloc@servername.invalid> schrieb:
    On Wed, 17 Jul 2024 20:56:06 -0000 (UTC), "Stephen Fuld"
    <SFuld@alumni.cmu.edu.invalid> wrote:

    Good, I get that. But Thomas' original discussion of the problem
    indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?

In a way, that's the wrong question.

    If you need to have exponentiations and divides running in parallel
    _on a single CPU_ because the problem isn't parallel _enough_ to
    dispense with them being very tightly coupled... then, yes, that's a reasonable question.

    But _otherwise_ there is no limit on the number of calculations that
    can take place in parallel, just throw more CPUs at the problem.

That is often, if not usually, not the case; single-core performance
is still important. There are several reasons:

People usually don't have an unlimited number of CPUs at their
disposal, and the efficiency of parallel computing goes down with
an increasing number of cores (Amdahl's law). Finally, if you use
commercial software, you often only have a certain number of
licenses for parallel computation. That varies a lot by vendor,
though.

  • From Stefan Monnier@21:1/5 to All on Thu Jul 18 12:33:13 2024
    If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
    the multiplier, can the pipeline get a mix of multiple operations going
    at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially
    so each can only use one of the 4 pipeline slots.

    Something I've been wondering for a while, indeed.
IOW, is there enough parallelism inside the FDIV/SQRT "microcode" to
keep the FMAC fully busy (my naive understanding is that there isn't)?
If not, do current CPUs make the FMAC available for other operations
while an FDIV/SQRT is in progress? If not, how hard would it be?


    Stefan

  • From Michael S@21:1/5 to Thomas Koenig on Thu Jul 18 19:38:03 2024
    On Thu, 18 Jul 2024 07:54:23 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:

    [Arrhenius]

    Good, I get that. But Thomas' original discussion of the problem
    indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?

    I ran a little Arrhenius benchmark on an i7-11700. Main program was

    program main
    implicit none
    integer, parameter :: n = 1024
    double precision, dimension(n) :: k, a, ea, t
    integer :: i
    call random_number (a)
    call random_number(ea)
    ea = 10000+ea*30000
    call random_number(t)
    t = 400 + 200*t
    do i=1,1024*1024
    call arrhenius(k,a,ea,t,n)
    end do
    end program main

    and the called routine was (in a separate file, so the compiler
    could not notice that the results were actually never used)

    subroutine arrhenius(k, a, ea, t, n)
    implicit none
    integer, intent(in) :: n
    double precision, dimension(n), intent(out) :: k
    double precision, dimension(n), intent(in) :: a, ea, t
    double precision, parameter :: r = 8.314
    k = a * exp(-ea/(r*t))
    end subroutine arrhenius

    Timing result (wall-clock time only):

    -O0: 5.343s
    -O2: 4.560s
    -Ofast: 2.237s
-Ofast -march=native -mtune=native: 2.154s

Of course, you never know what speed your CPU is actually running
    at these days, but if I assume 5GHz, that would give around 10
    cycles per Arrhenius evaluation, which is quite fast (IMHO).
    It uses an AVX2 version of exp, or so I gather from the function
    name, _ZGVdN4v_exp_avx2 .
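The "around 10 cycles" estimate follows from the numbers in the post: n = 1024 elements times 1024*1024 calls is 2^30 evaluations, with the 5 GHz clock assumed rather than measured. A quick check:

```python
# Back-of-envelope check of the ~10 cycles per evaluation claim.
evaluations = 1024 * 1024 * 1024   # n=1024 elements, 1024*1024 calls
seconds = 2.154                    # -Ofast -march=native wall-clock time
clock_hz = 5e9                     # assumed clock; actual turbo varies
cycles_per_eval = seconds * clock_hz / evaluations  # roughly 10
```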

Does the benchmark represent a real-world use?
In particular,
1. Is there really a large number of different EA values or only around
a dozen or a hundred?
2. Does temperature vary all the time, or are there relatively big
groups of points calculated at the same temperature?
A similar question can be asked about A, but it is of little
practical importance.
3. Is double precision really needed? According to my understanding,
empirical equations like this one have a precision of something like 2
significant digits, or 3 digits at best. So I'd expect that single
precision is sufficient, with digits to spare.
4. Does the equation work at all when the temperature is not close to
the point of equilibrium? If not, what is a sane range for ea/(r*t)?

  • From MitchAlsup1@21:1/5 to EricP on Thu Jul 18 16:40:11 2024
    On Thu, 18 Jul 2024 12:10:45 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:

    MitchAlsup1 wrote:

    On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took months, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)

    FMUL Rt,RR,RT
    FDIV Rt,-RE,Rt
    EXP Rt,Rt
    FMUL Rk,RA,Rt

    Does not look "all that bad" to me.

    So for your GbOoO CPU, how many of the various FP operations, and the
    EXP instruction can be done in parallel?

    FMUL is 4 cycles of latency fully pipelined
    FDIV is ~20 cycles of latency not pipelined
    EXP is ~16 cycles of latency not pipelined

    They are all performed in the FMAC unit and here the instructions are
    serially dependent.

    So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
    see the same latency; that is, GBOoO is not a differentiator.

    If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
    the multiplier, can the pipeline get a mix of multiple operations going
    at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially
    so each can only use one of the 4 pipeline slots.

    Over the 20 cycles the multiplier is doing Goldschmidt iterations, there
    are only 3 slots where a different instruction could sneak through.

Note: the multiplier used in Goldschmidt iterations is busy every
cycle, first driving the denominator towards 1.0, then driving
the numerator towards the quotient.

That is, it's a 4-cycle pipeline unit from the outside, but a 2-cycle
pipeline unit from within the function unit.
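As an illustration only (not the actual hardware), the two multiplies per iteration described above can be sketched in software; the pre-scaling of the divisor into a range near 1.0, normally done via a reciprocal table lookup, is assumed here:

```python
def goldschmidt_div(n, d, iterations=5):
    """Toy Goldschmidt division, returning n/d for 0 < d < 2 (pre-scaled).

    Each iteration issues two multiplies by the same factor f:
    one drives the denominator towards 1.0, the other drives the
    numerator towards the quotient, matching the description above.
    """
    for _ in range(iterations):
        f = 2.0 - d   # correction factor
        n = n * f     # numerator -> quotient
        d = d * f     # denominator -> 1.0
    return n
```

Convergence is quadratic, so a handful of iterations suffices; real dividers start from a table-lookup reciprocal estimate, so even fewer passes through the multiplier are needed.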

  • From MitchAlsup1@21:1/5 to John Savard on Thu Jul 18 16:43:26 2024
    On Thu, 18 Jul 2024 12:45:29 +0000, John Savard wrote:

    On Wed, 17 Jul 2024 03:51:50 -0000 (UTC), Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

I can remember a Comp Sci professor waxing lyrical about p-adic number
systems, as an alternative to conventional p-ary ones. The key thing
was that division was no more complex than multiplication.

    And addition, I suppose, is the same complexity too. But comparison
    kills you; you need to apply the Chinese Remainder Theorem every time
    you want to print out a number.

    You can make FADD/FSUB/FCMP easy (Standard IEEE organization)
    OR
    You can make FMUL/FDIV easy (logarithmic organization)

    In both cases performing the "other" calculation is rather
equally hard. FADD/FSUB/FCMP <-> FMUL/FDIV

    I suggest that logarithmic organization makes FMAC very hard.

    John Savard

  • From MitchAlsup1@21:1/5 to EricP on Thu Jul 18 16:51:29 2024
    On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:

    MitchAlsup1 wrote:

    {Would be an interesting reservation station design, though}

    In what way would the RS be interesting or different?

    The instruction stream consists of 4 FMAC-bound instructions unrolled
    as many times as will fit in register file.

Your typical reservation station can accept 1 new instruction per cycle
    from the decoder. So, either the decoder has to spew the instructions
    across the stations (and remember they are data dependent) or the
    station has to fire more than one per cycle to the FMAC units.

    So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
    and maybe some kind of exotic routing.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Thu Jul 18 16:47:35 2024
    On Thu, 18 Jul 2024 15:35:50 +0000, Thomas Koenig wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 13:58:37 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 18 Jul 2024 06:00:46 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:



What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).

All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
    Pentium-M and AMD K8, do not perform double precision floating point


    All of which have been obsolete for almost two decades.

    According to my understanding, Thomas Koenig was/is around for more
    than three decades.

    I have been (starting doing things on mainframes in the late 1980s),
    but I was specifically asking about what Mitch had in mind for
    My 66000.

I have a spectrum in mind; everything from an in-order 1-wide to (as of
now) a 6-wide GBOoO design with 4 FMAC units.

  • From EricP@21:1/5 to Stefan Monnier on Thu Jul 18 12:56:46 2024
    Stefan Monnier wrote:
    If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
    the multiplier, can the pipeline get a mix of multiple operations going
at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially
so each can only use one of the 4 pipeline slots.

    Something I've been wondering for a while, indeed.
IOW, is there enough parallelism inside the FDIV/SQRT "microcode" to
keep the FMAC fully busy (my naive understanding is that there isn't)?
If not, do current CPUs make the FMAC available for other operations
while an FDIV/SQRT is in progress? If not, how hard would it be?


    Stefan

    And if they can't mix then to what extent can the end of one op,
    as it drains from the pipeline, overlap with the start of the next?
    Obviously FMUL can pipeline with FMUL but can the next FMUL overlap
    with the end of a prior FDIV? An EXP?

    I was thinking about reservation station schedulers and wondering
    what they might have to optimize.

  • From Thomas Koenig@21:1/5 to Michael S on Thu Jul 18 17:06:52 2024
    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 07:54:23 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:

    [Arrhenius]

    Good, I get that. But Thomas' original discussion of the problem
    indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?

    I ran a little Arrhenius benchmark on an i7-11700. Main program was

    program main
    implicit none
    integer, parameter :: n = 1024
    double precision, dimension(n) :: k, a, ea, t
    integer :: i
    call random_number (a)
    call random_number(ea)
    ea = 10000+ea*30000
    call random_number(t)
    t = 400 + 200*t
    do i=1,1024*1024
    call arrhenius(k,a,ea,t,n)
    end do
    end program main

    and the called routine was (in a separate file, so the compiler
    could not notice that the results were actually never used)

    subroutine arrhenius(k, a, ea, t, n)
    implicit none
    integer, intent(in) :: n
    double precision, dimension(n), intent(out) :: k
    double precision, dimension(n), intent(in) :: a, ea, t
    double precision, parameter :: r = 8.314
    k = a * exp(-ea/(r*t))
    end subroutine arrhenius

    Timing result (wall-clock time only):

    -O0: 5.343s
    -O2: 4.560s
    -Ofast: 2.237s
-Ofast -march=native -mtune=native: 2.154s

Of course, you never know what speed your CPU is actually running
    at these days, but if I assume 5GHz, that would give around 10
    cycles per Arrhenius evaluation, which is quite fast (IMHO).
    It uses an AVX2 version of exp, or so I gather from the function
    name, _ZGVdN4v_exp_avx2 .

    Does the benchmark represent a real-world use?
    In particular,
1. Is there really a large number of different EA values or only
around a dozen or a hundred?

    Usually, one for each chemical reaction. If you have a complex
    reaction network, it can be a few hundred.

    It is possible to pre-calculate Ea/R, but the division by T still
    needs to be done.

2. Does temperature vary all the time, or are there relatively big
    groups of points calculated at the same temperature?

    It varies all the time because of the exothermic/endothermic
    character of the reactions. This is not what was calculated, but you
can think of methane combustion with oxygen. There are numerous
    radical, high-energy species appearing and disappearing all the time.
    And if you're unlucky, you will also have to calculate radiation :-(

    But for each fluid element, the energy equation is solved again
    each time step.

And if you try to average the temperatures over groups of cells...
    don't. You average enough already by selecting the grid, and also by
    applying multigrid solvers.

A similar question can be asked about A, but it is of little
practical importance.
3. Is double precision really needed? According to my understanding,
empirical equations like this one have a precision of something like 2
significant digits, or 3 digits at best. So I'd expect that single
precision is sufficient, with digits to spare.

    To calculate values, yes, but if you try to differentiate things,
    things can go wrong pretty fast. I'm actually not sure what he used
    in this particular case.

    There is, however, a tendency with engineers to use double precision
    because quite often, you do not go wrong with that, and you can
    go wrong with single precision.

    A lot of numerical work in general, and CFD work in particular,
    consists of fighting for convergence. You usually don't want to
    change anything that would endanger that.

4. Does the equation work at all when the temperature is not close to
the point of equilibrium? If not, what is a sane range for ea/(r*t)?

    It does work far away from the equilibrium (which is the point).

  • From Michael S@21:1/5 to Thomas Koenig on Thu Jul 18 20:28:43 2024
    On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 07:54:23 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Stephen Fuld <SFuld@alumni.cmu.edu.invalid> schrieb:

    [Arrhenius]

    Good, I get that. But Thomas' original discussion of the problem
    indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?

    I ran a little Arrhenius benchmark on an i7-11700. Main program
    was

    program main
    implicit none
    integer, parameter :: n = 1024
    double precision, dimension(n) :: k, a, ea, t
    integer :: i
    call random_number (a)
    call random_number(ea)
    ea = 10000+ea*30000
    call random_number(t)
    t = 400 + 200*t
    do i=1,1024*1024
    call arrhenius(k,a,ea,t,n)
    end do
    end program main

    and the called routine was (in a separate file, so the compiler
    could not notice that the results were actually never used)

    subroutine arrhenius(k, a, ea, t, n)
    implicit none
    integer, intent(in) :: n
    double precision, dimension(n), intent(out) :: k
    double precision, dimension(n), intent(in) :: a, ea, t
    double precision, parameter :: r = 8.314
    k = a * exp(-ea/(r*t))
    end subroutine arrhenius

    Timing result (wall-clock time only):

    -O0: 5.343s
    -O2: 4.560s
    -Ofast: 2.237s
-Ofast -march=native -mtune=native: 2.154s

Of course, you never know what speed your CPU is actually running
    at these days, but if I assume 5GHz, that would give around 10
    cycles per Arrhenius evaluation, which is quite fast (IMHO).
    It uses an AVX2 version of exp, or so I gather from the function
    name, _ZGVdN4v_exp_avx2 .

    Does the benchmark represent a real-world use?
    In particular,
    1. Is there really a large number of different EA values or only
around a dozen or a hundred?

    Usually, one for each chemical reaction. If you have a complex
    reaction network, it can be a few hundred.

    It is possible to pre-calculate Ea/R, but the division by T still
    needs to be done.


    The idea I was thinking about was to pre-calculate R/Ea rather than
    Ea/R. And then to find fast algorithm for approximation of exp(-1/x) on
    the range of interest.
This sort of trick no longer makes sense on post-Skylake Intel
or on post-Zen2 AMD, or on a few later generations of Apple. But just
5-6 years ago it made good sense on pretty much anything.
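A minimal sketch of that idea (the names are hypothetical): fold the per-reaction constant into c = R/Ea once, so the per-point work reduces to a multiply, a reciprocal and an exp of the form exp(-1/x) with x = c*T. Mathematically the result matches the direct form to rounding error:

```python
import math

R = 8.314  # gas constant, J/(mol*K)

def make_rate(A, Ea):
    # Precompute c = R/Ea once per reaction; the per-point evaluation
    # is then k(T) = A * exp(-1/(c*T)), i.e. exp(-1/x) with x = c*T.
    c = R / Ea
    return lambda T: A * math.exp(-1.0 / (c * T))

k = make_rate(A=1.0, Ea=30000.0)
# k(600.0) agrees with the direct form A*exp(-Ea/(R*T))
```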

2. Does temperature vary all the time, or are there relatively big
    groups of points calculated at the same temperature?

    It varies all the time because of the exothermic/endothermic
    character of the reactions. This is not what was calculated, but you
can think of methane combustion with oxygen. There are numerous
    radical, high-energy species appearing and disappearing all the time.
    And if you're unlucky, you will also have to calculate radiation :-(

    But for each fluid element, the energy equation is solved again
    each time step.

And if you try to average the temperatures over groups of cells...
don't. You average enough already by selecting the grid, and also by
applying multigrid solvers.

    A similar question can be asked about A, but it is of little
    practical importance.
3. Is double precision really needed? According to my understanding,
empirical equations like this one have a precision of something like 2
significant digits, or 3 digits at best. So I'd expect that single
precision is sufficient, with digits to spare.

    To calculate values, yes, but if you try to differentiate things,
    things can go wrong pretty fast. I'm actually not sure what he used
    in this particular case.

    There is, however, a tendency with engineers to use double precision
    because quite often, you do not go wrong with that, and you can
    go wrong with single precision.

    A lot of numerical work in general, and CFD work in particular,
    consists of fighting for convergence. You usually don't want to
    change anything that would endanger that.

4. Does the equation work at all when the temperature is not close
    to the point of equilibrium ? If not, what is a sane range for
    ea/(r*t) ?

    It does work far away from the equilibrium (which is the point).

    But how far?
    Does it work for EA/RT outside of, say [-10:+12] ?

  • From Thomas Koenig@21:1/5 to Michael S on Thu Jul 18 18:48:45 2024
    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    It does work far away from the equilibrium (which is the point).

    But how far?
    Does it work for EA/RT outside of, say [-10:+12] ?

Ea is almost always positive; if it were negative, it would mean that
a reaction is slowed down by increasing temperature (and by this I
don't mean the reverse reaction). If you see that in data, that
means that the Arrhenius model is not applicable because there is
in fact no activation energy to overcome, that something strange
happens in thermodynamics (say, you get solid precipitation of
one component at a certain temperature), or that there are
some intermediate steps which are not described in the model.

And to make sense, -Ea/(RT) is always less than zero. Reasonable
    example: Ea = 30 kJ/mol, R=8.314, T=600 K gives you -6.01 as the
    argument of the exponent, so the term is then ~ 0.002444 .
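The arithmetic in this example can be checked directly (a quick sketch using the values given above):

```python
import math

# Ea = 30 kJ/mol, R = 8.314 J/(mol*K), T = 600 K, as given above.
Ea, R, T = 30000.0, 8.314, 600.0
exponent = -Ea / (R * T)        # about -6.01
term = math.exp(exponent)       # about 0.002444
```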

  • From Michael S@21:1/5 to Thomas Koenig on Thu Jul 18 22:48:53 2024
    On Thu, 18 Jul 2024 18:48:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    It does work far away from the equilibrium (which is the point).

    But how far?
    Does it work for EA/RT outside of, say [-10:+12] ?

Ea is almost always positive; if it were negative, it would mean that
a reaction is slowed down by increasing temperature (and by this I
don't mean the reverse reaction).

In the branch of physical chemistry that I was taught before switching
to a completely different profession, that's not that rare.
For example, when the temperature grows from 1500 degrees toward 1550,
the transfer of phosphorus from molten steel to slag decelerates
greatly.

    If you see that in data, it
    means that the Arrhenius model is not applicable because there is
    in fact no activation energy to overcome, that something strange
    happens in thermodynamics (say, you get solid precipitation of
    one component at a certain temperature), or that there are
    some intermediate steps which are not described in the model.

    And to make sense, -Ea/(RT) is always less than zero. A reasonable
    example: Ea = 30 kJ/mol, R = 8.314, T = 600 K gives you -6.01 as the
    argument of the exponent, so the term is then ~ 0.002444.

    So, something like [-10:0] ?

  • From MitchAlsup1@21:1/5 to EricP on Thu Jul 18 19:54:05 2024
    On Thu, 18 Jul 2024 16:56:46 +0000, EricP wrote:

    Stefan Monnier wrote:
    If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
    the multiplier, can the pipeline get a mix of multiple operations going
    at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially,
    so each can only use one of the 4 pipeline slots.

    Something I've been wondering for a while, indeed.
    IOW, is there enough parallelism inside the FDIV/SQRT "microcode" to
    keep the FMAC fully busy (my naive understanding is that there isn't)?
    If not, do current CPU make the FMAC available for other operations
    while
    an FDIV/SQRT is in progress? If not, how hard would it be?

    In an SRT FDIV unit: no, absolutely not.
    In a Goldschmidt FDIV unit: 3 slots out of 20.

    Neither of these uses microcode--sequencing: yes; microcode: no.


    Stefan

    And if they can't mix then to what extent can the end of one op,
    as it drains from the pipeline, overlap with the start of the next?

    By the pipeline depth of the function unit minus 1.

    For example: FDIV = 20 cycles unpipelined, FMUL 4 cycles pipelined;
    the multiplier tree is in cycle 2 of the 4 stages. The sequencer knows
    the last cycle on which FDIV uses the multiplier (17) and enables its
    reservation station to spit out an FMUL on cycle 15, which arrives at
    the FU on cycle 17 after forwarding, so the FMUL takes cycles 18-19-20-21
    and is done in its normal 4 cycles.

    Obviously FMUL can pipeline with FMUL but can the next FMUL overlap
    with the end of a prior FDIV? An EXP?

    The "busy" time of a FU is most often:

    busy = latency - (pipeline_depth - 1);

    I was thinking about reservation station schedulers and wondering
    what they might have to optimize.

    The interesting part is spitting out an instruction and then routing
    it to a non-busy FU of appropriate capability.
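    The busy-time formula above can be sketched as a tiny helper (a sketch; the latency and depth numbers are the ones used in this thread, not from any particular design):

    ```c
    #include <assert.h>

    /* Busy time of a function unit: how many cycles it blocks new issues.
       A fully pipelined unit (depth == latency) is busy only 1 cycle;
       an unpipelined iterative unit (depth == 1) is busy for its full
       latency. */
    static int busy_cycles(int latency, int pipeline_depth)
    {
        return latency - (pipeline_depth - 1);
    }

    int main(void)
    {
        assert(busy_cycles(4, 4) == 1);    /* FMUL: 4-cycle, 4-stage pipeline */
        assert(busy_cycles(20, 1) == 20);  /* FDIV: 20 cycles, unpipelined    */
        return 0;
    }
    ```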

  • From MitchAlsup1@21:1/5 to Michael S on Thu Jul 18 20:06:49 2024
    On Thu, 18 Jul 2024 17:28:43 +0000, Michael S wrote:

    On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    The idea I was thinking about was to pre-calculate R/Ea rather than
    Ea/R. And then to find fast algorithm for approximation of exp(-1/x) on
    the range of interest.

    exp( -x ) = 1/exp( x )

    I guess, in the past, when -x was at most 1 cycle, FDIV was 20 cycles,
    and exp() was 150 cycles, you did not worry about creating an elementary
    function to calculate 1/exp(x). With exp(x) now down in the 16-cycle
    range, it might be different, as 1/exp(x) would likely also take 16
    cycles.

    My 66000 exp(x) instruction has sign control on the operands so one can calculate exp(-x) as:

    EXP Rd,-Rx // still 16 cycles

    exp(1/x) is pretty much confined to the serial latency of FDIV and EXP.
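    Michael's precomputation idea quoted above can be sketched as hoisting the per-reaction constant B = Ea/R out of the inner loop (a sketch; the function and parameter names are made up here):

    ```c
    #include <math.h>

    /* Arrhenius rate k = A * exp(-Ea/(R*T)), with B = Ea/R precomputed
       once per reaction, leaving one divide and one exp per evaluation
       in the inner loop. */
    static double arrhenius_rate(double A, double B, double T)
    {
        return A * exp(-B / T);
    }

    int main(void)
    {
        /* Ea = 30 kJ/mol, R = 8.314, T = 600 K: k/A ~ 2.444e-3 */
        double B = 30e3 / 8.314;
        return arrhenius_rate(1.0, B, 600.0) > 0.0 ? 0 : 1;
    }
    ```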

  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Fri Jul 19 08:19:17 2024
    On Wed, 17 Jul 2024 18:42:44 GMT, Scott Lurndal wrote:

    The problem is that the modern languages don't have a modern concept of
    a PICture clause.

    Perhaps one thing that killed them was they are not localizable?

  • From Michael S@21:1/5 to Thomas Koenig on Fri Jul 19 14:45:46 2024
    On Fri, 19 Jul 2024 11:28:17 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 18:48:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    It does work far away from the equilibrium (which is the
    point).

    But how far?
    Does it work for EA/RT outside of, say [-10:+12] ?

    Ea is almost always positive; if it were negative, it would mean
    that a reaction is slowed down by increasing temperature (and
    by this I don't mean the reverse reaction).

    In the branch of physical chemistry that I was taught before
    switching to a completely different profession, that's not that rare.
    For example, when the temperature grows from 1500 degrees toward 1550,
    the transfer of phosphorus from molten steel to slag decelerates
    greatly.

    Do you happen to recall if that process was limited by thermodynamic equilibrium, by mass transfer or indeed by a chemical reaction?

    I don't know, but would think that what causes it is an approach to
    equilibrium. But at 1550 degrees the equilibrium is still pretty far away
    (50-70 degrees away? I don't remember) and despite that the slowdown is
    already very significant.

    To me, that would sound like maybe one of the first two, but
    knowing nothing about this, I am now firmly speculating :-)

    And to make sense, -Ea/(RT) is always less than zero. A reasonable
    example: Ea = 30 kJ/mol, R = 8.314, T = 600 K gives you -6.01 as the
    argument of the exponent, so the term is then ~ 0.002444.

    So, something like [-10:0] ?

    Activation energies can also be 180 kJ/mol, but I don't really
    know the range that people might want to calculate these values.

    I would think that when a reaction is slowed down by a factor of more
    than 10K, the exact rate is of lesser interest, and arithmetic with a
    precision of one octave should be OK, since the precision of the model is
    worse than that anyway.

  • From Thomas Koenig@21:1/5 to Michael S on Fri Jul 19 11:28:17 2024
    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 18:48:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    It does work far away from the equilibrium (which is the point).

    But how far?
    Does it work for EA/RT outside of, say [-10:+12] ?

    Ea is almost always positive; if it were negative, it would mean that
    a reaction is slowed down by increasing temperature (and by this I
    don't mean the reverse reaction).

    In the branch of physical chemistry that I was taught before switching
    to a completely different profession, that's not that rare.
    For example, when the temperature grows from 1500 degrees toward 1550,
    the transfer of phosphorus from molten steel to slag decelerates
    greatly.

    Do you happen to recall if that process was limited by thermodynamic equilibrium, by mass transfer or indeed by a chemical reaction?
    To me, that would sound like maybe one of the first two, but
    knowing nothing about this, I am now firmly speculating :-)

    And to make sense, -Ea/(RT) is always less than zero. A reasonable
    example: Ea = 30 kJ/mol, R = 8.314, T = 600 K gives you -6.01 as the
    argument of the exponent, so the term is then ~ 0.002444.

    So, something like [-10:0] ?

    Activation energies can also be 180 kJ/mol, but I don't really
    know the range that people might want to calculate these values.

  • From Terje Mathisen@21:1/5 to Thomas Koenig on Fri Jul 19 16:16:01 2024
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
    sin() takes about the same number of cycles of FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
    small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took _months_, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    Back when I first looked at InvSqrt(), I did so because a Computational
    Fluid Chemistry researcher from Sweden asked for help speeding up his
    reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
    combining the 1/x and the sqrt and doing three of them pipelined together
    (all the water molecules having three atoms), his weeklong simulation
    runs ran in half the time, on both PentiumPro and Alpha hardware.


    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)


    I'm guessing that you would do the exp(-E_a/(R*T)) as
    exp(-E_a)-exp(R*T), since that should give the same result, and now you
    could interleave the two exp() calculations, and/or hoist the exp(-E_a)
    term?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Thomas Koenig@21:1/5 to Terje Mathisen on Fri Jul 19 14:41:52 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
    sin() takes about the same number of cycles of FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
    small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took _months_, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    Back when I first looked at InvSqrt(), I did so because a Computational
    Fluid Chemistry researcher from Sweden asked for help speeding up his
    reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
    combining the 1/x and the sqrt and doing three of them pipelined together
    (all the water molecules having three atoms), his weeklong simulation
    runs ran in half the time, on both PentiumPro and Alpha hardware.

    Not bad.

    Using the square root approximation for 1/x has not yet been
    mentioned, but it is of course quite possible, especially since
    the benchmarks I posted, with -Ofast, do enable this option.



    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)


    I'm guessing that you would do the exp(-E_a/(R*T)) as
    exp(-E_a)-exp(R*T), since that should give the same result, and now you
    could interleave the two exp() calculations, and/or hoist the exp(-E_a)
    term?

    I don't think that formula is correct.

  • From Stephen Fuld@21:1/5 to All on Fri Jul 19 15:56:32 2024
    MitchAlsup1 wrote:

    On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:

    MitchAlsup1 wrote:

    {Would be an interesting reservation station design, though}

    In what way would the RS be interesting or different?

    The instruction stream consists of 4 FMAC-bound instructions unrolled
    as many times as will fit in register file.

    You typical reservation station can accept 1 new instruction per cycle
    from the decoder. So, either the decoder has to spew the instructions
    across the stations (and remember they are data dependent) or the
    station has to fire more than one per cycle to the FMAC units.

    So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
    and maybe some kind of exotic routing.


    Please bear in mind that I am a big fan of VVM over SIMD instructions.
    But . . . is this a potential drawback of VVM versus SIMD? I.e., that
    as the number of execution units increases, the design of the
    reservation stations gets more complex for VVM, but not for SIMD? I.e.,
    it sounds like a CPU with 8 FMAC units would be a more difficult design
    for VVM.

    Is that an issue?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Fri Jul 19 16:32:43 2024
    On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:

    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    What I am talking about is to improve their performance until a
    sin() takes about the same number of cycles of FDIV, not 10× more.

    Maybe time for a little story.

    Some unspecified time ago, a colleague did CFD calculations which
    included fluid flow (including turbulence modelling and diffusion)
    and quite a few chemical reactions together. So, he evaluated a
    huge number of Arrhenius equations,

    k = A * exp(-E_a/(R*T))

    and because some of the reactions he looked at were highly
    exothermic or endothermic, he needed tiny relaxation factors (aka
    small steps). His calculation spent most of the time evaluating
    the Arrhenius equation above many, many, many, many times.

    A single calculation took _months_, and he didn't use weak hardware.

    A fully pipelined evaluation of, let's say, four parallel exp and
    four parallel fdiv instructions would have reduced his calculation
    time by orders of magnitude, and allowed him to explore the design
    space instead of just scratching the surface.

    Back when I first looked at InvSqrt(), I did so because a Computational
    Fluid Chemistry researcher from Sweden asked for help speeding up his
    reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))). I found that by
    combining the 1/x and the sqrt and doing three of them pipelined together
    (all the water molecules having three atoms), his weeklong simulation
    runs ran in half the time, on both PentiumPro and Alpha hardware.

    I, personally, have found many Newton-Raphson iterators that converge
    faster using 1/SQRT(x) than using the SQRT(x) equivalent.


    (By the way, if I had found a reasonable way to incorporate the
    Arrhenius equation into your ISA, I would have done so already :-)


    I'm guessing that you would do the exp(-E_a/(R*T)) as
    exp(-E_a)-exp(R*T), since that should give the same result, and now you
    could interleave the two exp() calculations, and/or hoist the exp(-E_a)
    term?

    Terje

  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Fri Jul 19 16:15:47 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Wed, 17 Jul 2024 18:42:44 GMT, Scott Lurndal wrote:

    The problem is that the modern languages don't have a modern concept of
    a PICture clause.

    Perhaps one thing that killed them was they are not localizable?

    They certainly were localizable, at least with respect to
    punctuation and currency symbol. And COBOL is far from 'killed'.

  • From John Levine@21:1/5 to All on Fri Jul 19 17:46:23 2024
    According to Scott Lurndal <slp53@pacbell.net>:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Wed, 17 Jul 2024 18:42:44 GMT, Scott Lurndal wrote:

    The problem is that the modern languages don't have a modern concept of
    a PICture clause.

    Perhaps one thing that killed them was they are not localizable?

    They certainly were localizable, at least with respect to
    punctuation and currency symbol. And COBOL is far from 'killed'.

    COBOL now has CURRENCY SIGN which lets you set the currency symbol to
    edit into a picture, and DECIMAL POINT IS COMMA, to switch the two.
    That covers a lot of local currency conventions but not all. Some
    countries put the currency symbol before the amount, some after it,
    some between the units and the decimals. Some use two decimals, some
    three, many use none. Some show negative values with a minus sign,
    some with parens around the amount, some other ways.

    If you want to do this right, you need a library that can
    format all this per country, like this python library:

    https://babel.pocoo.org/en/latest/api/numbers.html#babel.numbers.format_currency

    It's certainly possible to do all that in COBOL, but it's painful.

    I think the problem with picture clauses is that while they did
    the job, in practice they are mostly used in a few stylized ways, the ones
    that the 360's ED and EDMK can handle, and couldn't handle situations
    where the subfields are reordered.

    I admit printf can't reorder the subfields, but many of its modern
    descendants like Python formats can and often do.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Terje Mathisen@21:1/5 to All on Fri Jul 19 20:55:47 2024
    MitchAlsup1 wrote:
    On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
    Back when I first looked at Invsqrt(), I did so because an Computation
    Fluid Chemistry researcher from Sweden asked for help speeding up his
    reciprocal calculations (sqrt(1/(dx^2+dy^2+dz^2))), I found that by
    combining the 1/x and the sqrt and doing three of them pipelind together
    (all the water molecules having three atoms), his weeklong simulation
    runs ran in half the time, on both PentiumPro and Alpha hardware.

    I, personally, have found many Newton-Raphson iterators that converge
    faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    Yeah, that was eye-opening to me as well, to the level where I consider
    the invsqrt() NR iteration as a mainstay, it can be useful for both sqrt
    and 1/x as well. :-)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Fri Jul 19 20:25:51 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    I, personally, have found many Newton-Raphson iterators that converge
    faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    I can well believe that.

    It is interesting to see what different architectures offer for
    faster reciprocals.

    POWER has fre and fres (double and single versions) for approximate
    division, which are accurate to 1/256. These operations are quite
    fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle,
    so obviously fully pipelined. With 1/256 accuracy, this could
    actually be the original Quake algorithm (or its modification)
    with a single Newton step, but this is of course much better in
    hardware, where exponent handling can be much simplified (and
    done only once).

    x86_64 has rcpss, accurate to 1/6144, with (looking at the
    instruction tables) a latency of 6 on newer architectures and a
    throughput of 1/4. So, if your business depends on calculating many
    inaccurate square roots, fast, buy a POWER :-)

    Other architectures I have tried don't seem to have it.

    Does it make sense? Well, if you want to calculate lots of Arrhenius
    equations, don't need full accuracy, and (like in Mitch's case)
    exp has become as fast as division, then it could actually make a
    lot of sense. It is still possible to add Newton steps afterwards,
    which is what gcc does if you add -mrecip -ffast-math.
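    The Newton step these hardware estimates are paired with can be sketched as follows (a sketch; the function name is made up here, and one step roughly doubles the number of correct bits):

    ```c
    #include <assert.h>
    #include <math.h>

    /* Refine a crude reciprocal estimate y ~ 1/x with one Newton step:
       y' = y * (2 - x*y). Starting from a fre/fres/rcpss-class estimate,
       one or two such steps recover near-full precision. */
    static double refine_recip(double x, double y)
    {
        return y * (2.0 - x * y);
    }

    int main(void)
    {
        double y0 = 0.33;                      /* crude estimate of 1/3 */
        double y1 = refine_recip(3.0, y0);
        assert(fabs(y1 - 1.0 / 3.0) < 1e-4);   /* error shrinks quadratically */
        return 0;
    }
    ```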

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jul 19 20:34:22 2024
    On Fri, 19 Jul 2024 14:41:52 +0000, Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:


    I'm guessing that you would do the exp(-E_a/(R*T)) as
    exp(-E_a)-exp(R*T), since that should give the same result, and now you
    could interleave the two exp() calculations, and/or hoist the exp(-E_a)
    term?

    I don't think that formula is correct.

    It is not::

    exp(x+y) = exp(x)×exp(y)

    so add/subtract can be simplified, but multiply and divide cannot
    (easily).
    Now, if we were dealing with Ln() those multiplies and divides can be simplified.
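    The identity above, and why the guessed rewrite cannot work for a quotient, can be checked numerically (a minimal sketch):

    ```c
    #include <assert.h>
    #include <math.h>

    int main(void)
    {
        double x = 2.0, y = 3.0;

        /* exp turns sums into products: exp(x+y) == exp(x)*exp(y) ... */
        assert(fabs(exp(x + y) - exp(x) * exp(y)) < 1e-9);

        /* ... but not quotients: exp(x/y) differs wildly from exp(x)/exp(y) */
        assert(fabs(exp(x / y) - exp(x) / exp(y)) > 1.0);

        return 0;
    }
    ```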

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sat Jul 20 08:45:13 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 18 Jul 2024 15:35:50 +0000, Thomas Koenig wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 13:58:37 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 18 Jul 2024 06:00:46 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:



    What about the SIMD width underlying the VVM implementation?
    All SIMD implementations I know of allow performing floating point
    ops in parallel. Is it planned that My 66000 can also do that?
    (If not, that would be a big disadvantage for scientific/technical
    work).

    All pre-Merom implementations of SSE2, i.e. Intel Pentium 4, Intel
    Pentium-M and AMD K8, do not perform double precision floating point


    All of which have been obsolete for almost two decades.

    According to my understanding, Thomas Koenig was/is around for more
    than three decades.

    I have been (starting doing things on mainframes in the late 1980s),
    but I was specifically asking about what Mitch had in mind for
    My 66000.

    I have a spectrum in mind; everything from an in order 1-wide to (as of
    now) 6-wide GBOoO design with 4 FMAC units.

    If you're thinking about 64-bit FMA units, that would
    be half of what Intel apparently plans to put into their
    "economy" cores with Skymont, which is four 128-bit FMA units: https://chipsandcheese.com/2024/06/15/intel-details-skymont/

  • From Thomas Koenig@21:1/5 to Michael S on Sat Jul 20 08:53:26 2024
    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 19 Jul 2024 11:28:17 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 18:48:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Thu, 18 Jul 2024 17:06:52 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    It does work far away from the equilibrium (which is the
    point).

    But how far?
    Does it work for EA/RT outside of, say [-10:+12] ?

    Ea is almost always positive; if it were negative, it would mean
    that a reaction is slowed down by increasing temperature (and
    by this I don't mean the reverse reaction).

    In the branch of physical chemistry that I was taught before
    switching to a completely different profession, that's not that rare.
    For example, when the temperature grows from 1500 degrees toward 1550,
    the transfer of phosphorus from molten steel to slag decelerates
    greatly.

    Do you happen to recall if that process was limited by thermodynamic
    equilibrium, by mass transfer or indeed by a chemical reaction?

    I don't know, but would think that what causes it is an approach to
    equilibrium. But at 1550 degrees the equilibrium is still pretty far away
    (50-70 degrees away? I don't remember) and despite that the slowdown is
    already very significant.

    That would also be covered by the Arrhenius equation - you have one
    rate for the forward reaction and one rate for the backward reaction.

  • From Terje Mathisen@21:1/5 to All on Sat Jul 20 19:04:12 2024
    MitchAlsup1 wrote:
    On Fri, 19 Jul 2024 14:41:52 +0000, Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:


    I'm guessing that you would do the exp(-E_a/(R*T)) as
    exp(-E_a)-exp(R*T), since that should give the same result, and now you
    could interleave the two exp() calculations, and/or hoist the exp(-E_a)
    term?

    I don't think that formula is correct.

    It is not::

    exp(x+y) = exp(x)×exp(y)

    so add/subtract can be simplified, but multiply and divide cannot
    (easily).
    Now, if we were dealing with Ln() those multiplies and divides can be simplified.

    Yeah, I mixed up exp() and ln() pretty badly. :-(

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Thomas Koenig on Sat Jul 20 19:01:15 2024
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    I, personally, have found many Newton-Raphson iterators that converge
    faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    I can well believe that.

    It is interesting to see what different architectures offer for
    faster reciprocals.

    POWER has fre and fres (double and single versions) for approximate
    division, which are accurate to 1/256. These operations are quite
    fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle,
    so obviously fully pipelined. With 1/256 accuracy, this could
    actually be the original Quake algorithm (or its modification)
    with a single Newton step, but this is of course much better in
    hardware, where exponent handling can be much simplified (and
    done only once).

    I've taken both a second and third look at InvSqrt() over the last few
    years, it turns out that the Quake version is far from optimal: With the
    exact same instructions, just a different set of constants, you can get
    about 1.5 bits more than quake does, i.e. about 10 bits after that
    single NR step (which isn't really NR since it modifies both the 1.5 and
    the 0.5 factors).

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to Thomas Koenig on Sat Jul 20 20:12:07 2024
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    I, personally, have found many Newton-Raphson iterators that converge
    faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    I can well believe that.

    It is interesting to see what different architectures offer for
    faster reciprocals.

    POWER has fre and fres (double and single versions) for approximate
    division, which are accurate to 1/256. These operations are quite
    fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle,
    so obviously fully pipelined. With 1/256 accuracy, this could
    actually be the original Quake algorithm (or its modification)
    with a single Newton step, but this is of course much better in
    hardware, where exponent handling can be much simplified (and
    done only once).

    I've taken both a second and third look at InvSqrt() over the last few
    years, it turns out that the Quake version is far from optimal: With the
    exact same instructions, just a different set of constants, you can get
    about 1.5 bits more than quake does, i.e. about 10 bits after that
    single NR step (which isn't really NR since it modifies both the 1.5 and
    the 0.5 factors).

    Wikipedia has something on that, also literature; Moroz et al. https://arxiv.org/abs/1603.04483 seem to give the optimum quasi-NR
    method to do this, with one or two steps.

    Impressive amounts of math here, unfortunately they completely miss the
    point!

    What they derive is the exact optimal value for the magic constant,
    using zero, one or two text-book NR iterations.

    However, if you are willing to take that first NR iteration

    halfnumber = 0.5f*x;
    ...
    i = R - (i>>1);
    ...
    x = x*(1.5f - halfnumber*x*x);

    and then make both the 0.5f and 1.5f constants free variables, you can
    in fact get 1.5 more bits than what they show in this paper.

    In fact, using slightly different values for R (the magic constant)
    results in very marginal changes in max rel error, while optimizing all
    three constants improves that max error from the stated 1.75e-3 to about 0.6e-3!
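    For reference, the textbook-constant version of the iteration being discussed looks like this (a sketch using the classic 0x5f3759df magic constant; the optimized three-constant variants described above would substitute different values for R, 1.5f and 0.5f):

    ```c
    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Quake-style inverse square root: bit-trick initial guess plus one
       NR-like step. memcpy does the float<->int reinterpretation to
       stay within defined behavior. Max relative error ~1.75e-3 with
       these constants. */
    static float invsqrt(float x)
    {
        const uint32_t R = 0x5f3759dfu;   /* classic magic constant */
        float half = 0.5f * x;
        uint32_t i;
        memcpy(&i, &x, sizeof i);
        i = R - (i >> 1);
        float y;
        memcpy(&y, &i, sizeof y);
        y = y * (1.5f - half * y * y);    /* one NR-like step */
        return y;
    }

    int main(void)
    {
        return fabsf(invsqrt(4.0f) - 0.5f) < 2e-3f ? 0 : 1;
    }
    ```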


    I've also looked a little bit at simplifying exp(-1/x) directly, and
    it seems an unpleasant function to implement; at least there is no
    argument reduction which comes to mind. With exp2(x), you can split
    off the integer part and do an approximation on the rest over a finite
    interval, but if there is a fast approximation of the integer part
    of 1/x without dividing, I am not aware of one :-)

    :-(

    Fast is probably what you get from a pure lookup table, but with no
    obvious NR iteration to improve it.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sat Jul 20 17:19:56 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Fri, 19 Jul 2024 14:41:52 +0000, Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:


    I'm guessing that you would do the exp(-E_a/(R*T)) as
    exp(-E_a)-exp(R*T), since that should give the same result and now you
    could interleave the two exp() calculations, and/or hoist the exp(-E_a)
    term?

    I don't think that formula is correct.

    It is not::

    exp(x+y) = exp(x)×exp(y)

    so add/subtract can be simplified, but multiply and divide cannot
    (easily).
    Now, if we were dealing with Ln() those multiplies and divides can be simplified.

    Well, you can sometimes use the relation exp(a)**b = exp(a*b), but it is
    not useful in this case.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Sat Jul 20 17:17:53 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    I, personally, have found many Newton-Raphson iterators that converge
    faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    I can well believe that.

    It is interesting to see what different architectures offer for
    faster reciprocals.

    POWER has fre and fres (double and single version) for approximate
    division, which are accurate to 1/256. These operations are quite
    fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle
    so obviously fully pipelined. With 1/256 accuracy, this could
    actually be the original Quake algorithm (or its modification)
    with a single Newton step, but this is of course much better in
    hardware where exponent handling can be much simplified (and
    done only once).

    I've taken both a second and third look at InvSqrt() over the last few
    years; it turns out that the Quake version is far from optimal: with the
    exact same instructions, just a different set of constants, you can get
    about 1.5 bits more than Quake does, i.e. about 10 bits after that
    single NR step (which isn't really NR since it modifies both the 1.5 and

    Wikipedia has something on that, also literature; Moroz et al.
    https://arxiv.org/abs/1603.04483 seem to give the optimum quasi-NR
    method to do this, with one or two steps.

    I've also looked a little bit at simplifying exp(-1/x) directly, and
    it seems an unpleasant function to implement; at least there is no
    argument reduction which comes to mind. With exp2(x), you can split
    off the integer part and do an approximation on the rest over a finite
    interval, but if there is a fast approximation of the integer part
    of 1/x without dividing, I am not aware of one :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Sat Jul 20 21:10:52 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Wikipedia has something on that, also literature; Moroz et al.
    https://arxiv.org/abs/1603.04483 seem to give the optimum quasi-NR
    method to do this, with one or two steps.

    Impressive amounts of math here, unfortunately they completely miss the point!

    What they derive is the exact optimal value for the magic constant,
    using zero, one or two text-book NR iterations.

    However, if you are willing to take that first NR iteration

    halfnumber = 0.5f*x;
    ...
    i = R-(i>>1);
    ...
    x = x*(1.5f-halfnumber*x*x);

    and then make both the 0.5f and 1.5f constants free variables, you can
    in fact get 1.5 more bits than what they show in this paper.

    Wikipedia also has something about that...

    In fact, using slightly different values for R (the magic constant)
    results in very marginal changes in max rel error, while optimizing all
    three constants improves that max error from the stated 1.75e-3 to about 0.6e-3!

    Looks like
    https://web.archive.org/web/20180709021629/http://rrrola.wz.cz/inv_sqrt.html
    who reports 6.50196699E-4 as the maximum error (also from the
    Wikipedia article).

    That's 10.5 bits of accuracy, not bad at all.

    However... assume you want to do another NR step. In that case,
    you might be better off not loading different constants from memory,
    so having the same constants might actually be an advantage
    (which does not mean that they have to be the original Newton steps).




    I've also looked a little bit at simplifying exp(-1/x) directly, and
    it seems an unpleasant function to implement; at least there is no
    argument reduction which comes to mind. With exp2(x), you can split
    off the integer part and do an approximation on the rest over a finite
    interval, but if there is a fast approximation of the integer part
    of 1/x without dividing, I am not aware of one :-)

    :-(

    Fast is probably what you get from a pure lookup table, but with no
    obvious NR iteration to improve it.

    I strongly suspect you're right, but it is too late in the evening
    to calculate this :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to All on Sun Jul 21 00:51:40 2024
    One table above was misformatted. Here is the correct format.
    Accurate sqrt (single precision)
    Arch Latency Throughput (scalar/128b/256b)
    Zen3 14 0.20/0.200/0.200
    SkyLake 12 0.33/0.333/0.167
    Ice Lake 12 0.33/0.333/0.167
    Power9 26 0.20/0.095/N/A

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Sun Jul 21 00:23:44 2024
    On Fri, 19 Jul 2024 20:25:51 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    I, personally, have found many Newton-Raphson iterators that
    converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    I can well believe that.

    It is interesting to see what different architectures offer for
    faster reciprocals.

    POWER has fre and fres (double and single version) for approximate
    divisin, which are accurate to 1/256. These operations are quite
    fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle
    so obviously fully pipelined. With 1/256 accuracy, this could
    actually be the original Quake algorithm (or its modification)
    with a single Newton step, but this is of course much better in
    hardware where exponent handling can be much simplified (and
    done only once).

    x86_64 has rcpss, accurate to 1/6144, with (looking at the
    instruction tables) a latency of 6 for newer architectures, and a
    throughput of 1/4.

    It seems you looked at the wrong instruction table.
    Here are some not-so-modern x86-64 cores:
    Arch Latency Throughput (scalar/128b/256b)
    Zen3 3 2/2/1
    Skylake 4 1/1/1
    Ice Lake 4 1/1/1
    Power9 5-7 4/2/N/A

    So, if your business depends on calculating many inaccurate
    square roots, fast, buy a POWER :-)


    If you have enough independent rsqrt operations to do, all four processors
    have the same theoretical peak throughput, but x86 tends to have more
    cores and to run at a faster clock. And lower latency makes achieving
    peak throughput easier. Also, depending on target precision, the higher
    initial precision of the x86 estimate means that sometimes you can get
    away with one less NR iteration.

    Also, if what you really need is sqrt rather than rsqrt, then depending
    on how much inaccuracy you can accept, sometimes on modern x86
    calculating an accurate sqrt can be a better solution than calculating
    an approximation. It is less likely to be the case on POWER9.

    Accurate sqrt (single precision)
    Zen3 14 0.20/0.200/0.200
    SkyLake 12 0.33/0.333/0.167
    Ice Lake 12 0.33/0.333/0.167
    Power9 26 0.20/0.095/N/A

    Accurate sqrt (double precision)
    Zen3 20 0.111/0.111/0.111
    Skylake 12 0.167/0.167/0.083
    Ice Lake 12 0.167/0.167/0.083
    Power9 36 0.111/0.067/N/A


    Other architectures I have tried don't seem to have it.


    Arm64 has it. It is called FRSQRTE.


    Does it make sense? Well, if you want to calculate lots of Arrhenius
    equations, you don't need full accuracy, and (like in Mitch's case)
    exp has become as fast as division, then it could actually make a
    lot of sense. It is still possible to add Newton steps afterwards,
    which is what gcc does if you add -mrecip -ffast-math.

    I don't know about POWER, but on x86 I wouldn't do it.
    I'd either use plain division, which on modern cores is quite fast,
    or use NR to calculate a normal reciprocal. x86 provides an initial
    estimate for that too (RCPSS).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Sat Jul 20 21:58:59 2024
    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 19 Jul 2024 20:25:51 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    I, personally, have found many Newton-Raphson iterators that
    converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    I can well believe that.

    It is interesting to see what different architectures offer for
    faster reciprocals.

    POWER has fre and fres (double and single version) for approximate
    divisin, which are accurate to 1/256. These operations are quite
    fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle
    so obviously fully pipelined. With 1/256 accuracy, this could
    actually be the original Quake algorithm (or its modification)
    with a single Newton step, but this is of course much better in
    hardware where exponent handling can be much simplified (and
    done only once).

    x86_64 has rcpss, accurate to 1/6144, with (looking at the
    instruction tables) a latency of 6 for newer architectures, and a
    throughput of 1/4.

    It seems you looked at the wrong instruction table.

    [Note I was not writing about inverse square root, I was writing
    about inverse].

    I have to admit to being almost terminally confused by Intel
    generation names, so I am likely to mix up what is old and what
    is new.

    Here are some not-so-modern x86-64 cores:
    Arch Latency Throughput (scalar/128b/256b)
    Zen3 3 2/2/1
    Skylake 4 1/1/1
    Ice Lake 4 1/1/1
    Power9 5-7 4/2/N/A

    Power9 has it for 128-bit, but not for 256 bits (it doesn't have
    those registers), and if I read the handbook correctly, that
    would also be 4 operations in parallel.


    So, if your business depends on calculating many inaccurate
    square roots, fast, buy a POWER :-)


    If you have enough independent rsqrt operations to do, all four processors
    have the same theoretical peak throughput, but x86 tends to have more
    cores and to run at a faster clock. And lower latency makes achieving
    peak throughput easier. Also, depending on target precision, the higher
    initial precision of the x86 estimate means that sometimes you can get
    away with one less NR iteration.

    Also, if what you really need is sqrt rather than rsqrt, then depending
    on how much inaccuracy you can accept, sometimes on modern x86
    calculating an accurate sqrt can be a better solution than calculating
    an approximation. It is less likely to be the case on POWER9.

    [table reformatted, hope I got this right]

    Accurate sqrt (single precision)
    Zen3 14 0.20/0.200/0.200
    SkyLake 12 0.33/0.333/0.167
    Ice Lake 12 0.33/0.333/0.167
    Power9 26 0.20/0.095/N/A

    Accurate sqrt (double precision)
    Zen3 20 0.111/0.111/0.111
    Skylake 12 0.167/0.167/0.083
    Ice Lake 12 0.167/0.167/0.083
    Power9 36 0.111/0.067/N/A


    Other architectures I have tried don't seem to have it.


    Arm64 has it. It is called FRSQRTE.

    Interesting that "gcc -O3 -ffast-math -mrecip" does not
    appear to use it.



    Does it make sense? Well, if you want to calculate lots of Arrhenius
    equations, you don't need full accuracy and (like in Mitch's case)
    exp has become as fast as division, then it could actually make a
    lot of sense. It is still possible to add Newton steps afterwards,
    which is what gcc does if you add -mrecip -ffast-math.

    I don't know about POWER, but on x86 I wouldn't do it.
    I'd either use plain division, which on modern cores is quite fast,
    or use NR to calculate a normal reciprocal. x86 provides an initial
    estimate for that too (RCPSS).

    Note that I was talking about the inverse in the first place.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Sun Jul 21 01:46:53 2024
    On Sat, 20 Jul 2024 21:58:59 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Fri, 19 Jul 2024 20:25:51 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    I, personally, have found many Newton-Raphson iterators that
    converge faster using 1/SQRT(x) than using the SQRT(x)
    equivalent.

    I can well believe that.

    It is interesting to see what different architectures offer for
    faster reciprocals.

    POWER has fre and fres (double and single version) for approximate
    divisin, which are accurate to 1/256. These operations are quite
    fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle
    so obviously fully pipelined. With 1/256 accuracy, this could
    actually be the original Quake algorithm (or its modification)
    with a single Newton step, but this is of course much better in
    hardware where exponent handling can be much simplified (and
    done only once).

    x86_64 has rcpss, accurate to 1/6144, with (looking at the
    instruction tables) a latency of 6 for newer architectures, and a
    throughput of 1/4.

    It seems you looked at the wrong instruction table.

    [Note I was not writing about inverse square root, I was writing
    about inverse].


    Oh, sorry. I misunderstood.

    I have to admit to being almost terminally confused by Intel
    generation names, so I am likely to mix up what is old and what
    is new.


    Gen 2 is Sandy Bridge
    Gen 3 is Ivy Bridge, which is a very similar core microarchitecture
    Gen 4 is Haswell
    Gen 5 is Broadwell, which is similar but has a few changes
    Gen 6, 7, 8, 9 and the majority of 10 are all the same
    microarchitecture - Skylake.
    11 is mostly Tiger Lake and partially Ice Lake and Rocket Lake. All
    three are different from each other on the silicon process side, but very
    similar in core microarchitecture.
    12 is Alder Lake, which has P cores and E cores. The microarchitecture
    of the P core is called Golden Cove.


    Here are some not-so-modern x86-64 cores:
    Arch Latency Throughput (scalar/128b/256b)
    Zen3 3 2/2/1
    Skylake 4 1/1/1
    Ice Lake 4 1/1/1
    Power9 5-7 4/2/N/A

    Power9 has it for 128-bit, but not for 256 bits (it doesn't have
    those registers), and if I read the handbook correctly, that
    would also be 4 operations in parallel.


    The handbook is quite clear that both xvresp and xvredp have
    a max throughput of 2.


    So, if your business depends on calculating many inaccurate
    square roots, fast, buy a POWER :-)


    That's the sentence that caused my misunderstanding.


    If you have enough independent rsqrt operations to do, all four
    processors have the same theoretical peak throughput, but x86 tends
    to have more cores and to run at a faster clock. And lower latency
    makes achieving peak throughput easier. Also, depending on target
    precision, the higher initial precision of the x86 estimate means that
    sometimes you can get away with one less NR iteration.

    Also, if what you really need is sqrt rather than rsqrt, then
    depending on how much inaccuracy you can accept, sometimes on
    modern x86 calculating an accurate sqrt can be a better solution
    than calculating an approximation. It is less likely to be the case
    on POWER9.

    [table reformatted, hope I got this right]

    Accurate sqrt (single precision)
    Zen3 14 0.20/0.200/0.200
    SkyLake 12 0.33/0.333/0.167
    Ice Lake 12 0.33/0.333/0.167
    Power9 26 0.20/0.095/N/A

    Accurate sqrt (double precision)
    Zen3 20 0.111/0.111/0.111
    Skylake 12 0.167/0.167/0.083
    Ice Lake 12 0.167/0.167/0.083
    Power9 36 0.111/0.067/N/A


    Other architectures I have tried don't seem to have it.


    Arm64 has it. It is called FRSQRTE.

    Interesting that "gcc -O3 -ffast-math -mrecip" does not
    appear to use it.


    You mean, on other architectures gcc does emit approximate rsqrt from
    plain C or plain Fortran? In which situations?



    Does it make sense? Well, if you want to calculate lots of
    Arrhenius equations, you don't need full accuracy and (like in
    Mitch's case) exp has become as fast as division, then it could
    actually make a lot of sense.

    Mitch was told more than one time that floating point division on
    modern cores (esp. Apple, but others too) is much faster than he thinks.
    But he tends to forget it quickly.

    It is still possible to add Newton
    steps afterwards, which is what gcc does if you add -mrecip
    -ffast-math.

    I don't know about POWER, but on x86 I wouldn't do it.
    I'd either use plain division, which on modern cores is quite fast,
    or use NR to calculate a normal reciprocal. x86 provides an initial
    estimate for that too (RCPSS).

    Note that I was talking about the inverse in the first place.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sun Jul 21 07:24:41 2024
    On Fri, 19 Jul 2024 17:46:23 -0000 (UTC), John Levine wrote:

    COBOL now has CURRENCY SIGN which lets you set the currency symbol to
    edit into a picture, and DECIMAL POINT IS COMMA, to switch the two.

    All that involves changes to the code, though. You need the ability to get
    such settings from the system locale, for easy (re)configuration.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Sun Jul 21 08:33:09 2024
    Michael S <already5chosen@yahoo.com> schrieb:

    You mean, on other architectures gcc does emit approximate rsqrt from
    plain C or plain Fortran? In which situations?

    AMD64: -mrecip -ffast-math, 32-bit real only
    POWER: -mrecip -ffast-math
    aarch64: -ffast-math -mlow-precision-div (it's a different option
    name, which is why I didn't find it before). There, you can also
    adjust the number of Newton iterations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Sun Jul 21 12:27:15 2024
    Michael S <already5chosen@yahoo.com> writes:
    Gen 2 is Sandy Bridge
    Gen 3 is Ivy Bridge, which is a very similar core microarchitecture
    Gen 4 is Haswell
    Gen 5 is Broadwell, which is similar but has a few changes
    Gen 6, 7, 8, 9 and the majority of 10 are all the same
    microarchitecture - Skylake.
    11 is mostly Tiger Lake and partially Ice Lake and Rocket Lake. All
    three are different from each other on the silicon process side, but very
    similar in core microarchitecture.
    12 is Alder Lake, which has P cores and E cores. The microarchitecture
    of the P core is called Golden Cove.

    Introduction year = Generation + 2009

    That's what's important to Intel's marketing, and it explains why
    there are different microarchitectures with the same generation, and
    the same microarchitecture with different generations.

    However, they gave Rocket Lake (introduced 2021) 11th-Generation
    numbers, probably in anticipation of the introduction of Alder Lake
    later that year.

    AFAIK all Ice Lakes are "10th generation". I also used to think that
    the microarchitectural differences between Ice Lake, Tiger Lake, and
    Rocket Lake are very minor, but it turns out that they are
    significant. I remember (IIRC the source was chipsncheese) one
    difference being that Rocket Lake can optimize dependent moves, but
    Tiger Lake cannot (apparently a bug that was fixed properly in Rocket
    Lake, but only by turning off that feature in Tiger Lake and Ice
    Lake). I also saw differences in measurements I made:

    http://www.complang.tuwien.ac.at/anton/tmp/intel-p-evo-eps-converted-to.pdf

    The different lines are for different benchmarks, and the y axis
    represents a speedup, of Gforth with the cib optimization over Gforth
    without it. If Rocket Lake and Tiger Lake had the same
    microarchitecture, I would expect flat lines between them (I don't
    think that these benchmarks are much influenced by the Tiger Lake's
    larger L2 cache), but we see quite a bit of difference for many
    benchmarks.

    Other views at the same data are:

    http://www.complang.tuwien.ac.at/anton/tmp/opt-ipc-uarch.eps
    http://www.complang.tuwien.ac.at/anton/tmp/unopt-ipc-uarch.eps

    These are IPC results for the same benchmarks, with the results sorted
    for each microarchitecture. opt is with the cib optimization, unopt
    without. Tiger Lake is the second-lightest blue and Rocket Lake the
    third-lightest blue, and they are close to the top in both graphs.
    But you can see that they have different IPC numbers for a number of
    benchmarks. It's more visible in the opt graph.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sun Jul 21 13:11:56 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    However, if you are willing to take that first NR iteration

    halfnumber = 0.5f*x;
    ...
    i = R-(i>>1);
    ...
    x = x*(1.5f-halfnumber*x*x);

    and then make both the 0.5f and 1.5f constants free variables, you can
    in fact get 1.5 more bits than what they show in this paper.
    ...
    Looks like
    https://web.archive.org/web/20180709021629/http://rrrola.wz.cz/inv_sqrt.html
    who reports 6.50196699E-4 as the maximum error (also from the
    Wikipedia article).

    That's 10.5 bits of accuracy, not bad at all.

    However... assume you want to do another NR step. In that case,
    you might be better off not loading different constants from memory,
    so having the same constants might actually be an advantage
    (which does not mean that they have to be the original Newton steps).

    The number of accurate digits doubles after each NR step, so starting
    with the better first "NR" iteration would result in an additional
    accuracy of 3 bits. And if you optimize the new constants for the
    second iteration, you may even get more. Or you could optimize for
    two iterations using the same constants ...

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sun Jul 21 12:28:43 2024
    MitchAlsup1 wrote:
    On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:

    MitchAlsup1 wrote:

    {Would be an interesting reservation station design, though}

    In what way would the RS be interesting or different?

    The instruction stream consists of 4 FMAC-bound instructions unrolled
    as many times as will fit in register file.

    Your typical reservation station can accept 1 new instruction per cycle
    from the decoder. So, either the decoder has to spew the instructions
    across the stations (and remember they are data dependent) or the
    station has to fire more than one per cycle to the FMAC units.

    So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
    and maybe some kind of exotic routing.

    This is where I saw a benefit to using valued reservation stations vs
    valueless ones - when a uArch has multiple similar FU each with its own
    bank of RS that is scheduled for that FU.

    Example of horizontal scaling of similar FU each with its own RS bank.
    https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png

    With valueless RS, each RS stores only the source register number of
    its operands and each FU has to be able to read all its operands
    when a uOp launches (begins execution). This means the number of
    PRF read ports scales according to the total number of FU operands.
    (One could do read port sharing but then you have to schedule that too
    and could have contention.) Also if an FU is unused on any cycle then
    all its (expensive) operand read ports are unused.

    Using the above Oryon as an example, with valueless RS, to launch
    all 14 FU with 3 operands all at once needs 42 read ports.

    With valued RS, the operand values are stored in each RS and, if ready,
    read at Dispatch (hand-off from the front end to the RS bank) or are
    received from the forwarding network if in-flight at Dispatch time.
    The number of PRF read ports scales with the number of dispatched uOp
    operands. Since the operand values are stored in each RS, each bank
    can then schedule and launch independently.

    With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,
    and the read ports are potentially usable for all dispatches.
    Then all 14 FU can launch at once independently.

    Each FU can also have two kinds of valued RS banks,
    a simple one if all the operands are ready at Dispatch as this does
    not need a wake-up matrix entry or need to receive forwarded values,
    and a complex one that monitors the wake-up matrix and forwarding buses.
    If all the operands are ready, the Dispatcher can choose either RS bank for
    the FU, giving preference to the simpler. If all operands are not ready
    then Dispatcher selects from the complex bank.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sun Jul 21 17:03:08 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    >This is where I saw a benefit to using valued reservation stations vs
    >valueless ones - when a uArch has multiple similar FU each with its own
    >bank of RS that is scheduled for that FU.

    >Example of horizontal scaling of similar FU each with its own RS bank.
    >https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png

    With valueless RS, each RS stores only the source register number of
    its operands and each FU has to be able to read all its operands
    when a uOp launches (begins execution).

    If the operation was waiting in the RS, at least one operand comes in
    through a forwarding path (at least if the uarch has those).

    This means the number of
    PRF read ports scales according to the total number of FU operands.
    (One could do read port sharing but then you have to schedule that too
    and could have contention.)

    Yes, but ARM A64 seems to have been designed with that in mind.

    Also if an FU is unused on any cycle then
    all its (expensive) operand read ports are unused.

    Using the above Oryon as an example, with valueless RS, to launch
    all 14 FU with 3 operands all at once needs 42 read ports.

    ARM A64's stp instruction needs up to 4 register reads. But I think
    Oryon only supports two stores per cycle, so that raises the number to
    44. It also has a split register file: So for the integer register
    file 6*3+2*4+2*2=30 ports would be needed, and for the vector register
    file, 3*4=12 ports would be needed. I think, though, that the number
    actually needed is much smaller, and various techniques are used to
    make do with far fewer register ports.

    With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,
    and the read ports are potentially usable for all dispatches.

    Note that Oryon has 8-way rename/dispatch (I wonder how that works
    with, e.g., 8 ldp instructions that each can write three registers).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun Jul 21 19:44:49 2024
    On Sun, 21 Jul 2024 16:28:43 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:

    MitchAlsup1 wrote:

    {Would be an interesting reservation station design, though}

    In what way would the RS be interesting or different?

    The instruction stream consists of 4 FMAC-bound instructions unrolled
    as many times as will fit in register file.

    Your typical reservation station can accept 1 new instruction per cycle
    from the decoder. So, either the decoder has to spew the instructions
    across the stations (and remember they are data dependent) or the
    station has to fire more than one per cycle to the FMAC units.

    So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
    and maybe some kind of exotic routing.

    This is where I saw a benefit to using valued reservation stations vs
    valueless ones - when a uArch has multiple similar FU each with its own
    bank of RS that is scheduled for that FU.

    Example of horizontal scaling of similar FU each with its own RS bank.
    https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png

    With valueless RS, each RS stores only the source register number of
    its operands and each FU has to be able to read all its operands
    when a uOp launches (begins execution). This means the number of
    PRF read ports scales according to the total number of FU operands.
    (One could do read port sharing but then you have to schedule that too
    and could have contention.) Also if an FU is unused on any cycle then
    all its (expensive) operand read ports are unused.

    I always had RSs keep track of which FU was delivering the final
    operand, so that these could be picked up by the forwarding logic
    and not need a RF port. This gets rid of 50%-75% of the RF port
    needs.

    Using the above Oryon as an example, with valueless RS, to launch
    all 14 FU with 3 operands all at once needs 42 read ports.

    With valued RS, the operand values are stored in each RS and, if ready,
    read at Dispatch (hand-off from the front end to the RS bank) or are
    received from the forwarding network if in-flight at Dispatch time.

    Delivering result at dispatch time.

    The number of PRF read ports scales with the number of dispatched uOp
    operands. Since the operand values are stored in each RS, each bank
    can then schedule and launch independently.

    The width of the decoder is narrower than the width of the data path.
    We used to call this "catch up bandwidth".

    With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,

    First, a 6-wide machine is not doing 6 3-operand instructions;
    it is more like 3 memory ops (2-reg+displacement), one 3-op,
    one general 2-op, and one 1-op (branch), so you only need 12 ports
    instead of 18 most of the time.
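
    As a sanity check, the port arithmetic in this exchange can be sketched in
    a few lines of Python. The counts and the 6-wide instruction mix are the
    assumptions stated in the posts above, not measurements of any real core:

```python
# Read-port arithmetic from the thread (illustrative assumptions only).

def launch_ports_valueless(num_fu, operands_per_fu=3):
    # Valueless RS: every FU may need to read all its operands at launch,
    # so ports scale with the total number of FU operands.
    return num_fu * operands_per_fu

def dispatch_ports_valued(dispatch_width, operands_per_uop=3):
    # Valued RS: ports scale with the number of dispatched uOp operands.
    return dispatch_width * operands_per_uop

def dispatch_ports_typical_mix():
    # The assumed typical 6-wide mix: 3 memory ops (2 regs each),
    # one 3-operand op, one 2-operand op, one 1-operand branch.
    return 3 * 2 + 3 + 2 + 1

print(launch_ports_valueless(14))    # 42 (Oryon example, valueless)
print(dispatch_ports_valued(6))      # 18 (6-wide dispatch, valued)
print(dispatch_ports_typical_mix())  # 12 (typical mix)
```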

    The penalty is that each RS entry is 5× the size of the value-free
    RS designs. These work just fine when the execution window is
    reasonable (say 96 instructions) but fail when the window is
    larger than 150-ish.

    and the read ports are potentially usable for all dispatches.
    Then all 14 FU can launch at once independently.

    One should also note that these machines deliver 1-2 I/c RMS
    regardless of their Fetch-Decode-FU widths.

    Each FU can also have two kinds of valued RS banks,
    a simple one if all the operands are ready at Dispatch as this does
    not need a wake-up matrix entry or need to receive forwarded values,
    and a complex one that monitors the wake-up matrix and forwarding buses.
    If all the operands are ready, the Dispatcher can choose either RS bank
    for the FU, giving preference to the simpler. If the operands are not
    all ready, then the Dispatcher selects from the complex bank.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Jul 21 19:49:49 2024
    On Sun, 21 Jul 2024 17:03:08 +0000, Anton Ertl wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    This is where I saw a benefit to using valued reservation stations vs
    valueless ones - when a uArch has multiple similar FUs, each with its own
    bank of RS that is scheduled for that FU.

    Example of horizontal scaling of similar FUs, each with its own RS bank:
    https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png

    With valueless RS, each RS stores only the source register number of
    its operands and each FU has to be able to read all its operands
    when a uOp launches (begins execution).

    If the operation was waiting in the RS, at least one operand comes in
    through a forwarding path (at least if the uarch has those).

    This means the number of
    PRF read ports scales according to the total number of FU operands.
    (One could do read port sharing but then you have to schedule that too
    and could have contention.)

    Yes, but ARM A64 seems to have been designed with that in mind.

    Also if an FU is unused on any cycle then
    all its (expensive) operand read ports are unused.

    Using the above Oryon as an example, with valueless RS, to launch
    all 14 FU with 3 operands all at once needs 42 read ports.

    ARM A64's stp instruction needs up to 4 register reads.

    2 reads for AGEN, then, after write permission is available, 2 more.
    HW breaks up STs into "get started" and "finish off"; finish off is
    often near retirement and definitely after one has seen the TLB.

    But I think
    Oryon only supports two stores per cycle, so that raises the number to
    44. It also has a split register file: So for the integer register
    file 6*3+2*4+2*2=30 ports would be needed, and for the vector register
    file, 3*4=12 ports would be needed. I think, though, that the number actually needed is much smaller,

    Typically around 1/3rd to 1/4th of the static count as most operands
    arrive via forwarding.

    and various techniques are used to
    make do with far fewer register ports.

    With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,
    and the read ports are potentially usable for all dispatches.

    Note that Oryon has 8-way rename/dispatch (I wonder how that works
    with, e.g., 8 ldp instructions that each can write three registers).

    Averaging 2 I/c saves dynamic port count by big numbers.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Mon Jul 22 14:01:15 2024
    On Fri, 19 Jul 2024 20:55:47 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup1 wrote:
    On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
    Back when I first looked at Invsqrt(), I did so because an
    Computation Fluid Chemistry researcher from Sweden asked for help
    speeding up his reciprocal calculations
    (sqrt(1/(dx^2+dy^2+dz^2))), I found that by combining the 1/x and
    the sqrt and doing three of them pipelined together (all the water
    molecules having three atoms), his weeklong simulation runs ran in
    half the time, on both PentiumPro and Alpha hardware.

    I, personally, have found many Newton-Raphson iterators that
    converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    Yeah, that was eye-opening to me as well, to the level where I
    consider the invsqrt() NR iteration as a mainstay, it can be useful
    for both sqrt and 1/x as well. :-)

    Terje


    What is this "SQRT(x) equivalent" all of you are talking about?
    I am not aware of any "direct" (i.e. not via RSQRT) NR-like method for
    SQRT that consists only of multiplications and additions.
    If it exists, I will be very interested to know.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Mon Jul 22 09:10:41 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    This is where I saw a benefit to using valued reservation stations vs
    valueless ones - when a uArch has multiple similar FUs, each with its own
    bank of RS that is scheduled for that FU.

    Example of horizontal scaling of similar FU each with its own RS bank.
    https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png

    With valueless RS, each RS stores only the source register number of
    its operands and each FU has to be able to read all its operands
    when a uOp launches (begins execution).

    If the operation was waiting in the RS, at least one operand comes in
    through a forwarding path (at least if the uarch has those).

    I see the primary purpose of a reservation station as scheduling,
    with watching for operand forwarding and wake-up as a component of that.
    If any operand is ready when the RS is allocated then that operand's
    forwarding wake-up logic is unused.

    When all the operands are ready, the RS becomes ready and bids for the FU.
    If all operands are ready when the RS entry is allocated, then the RS is
    immediately ready and immediately bids for the FU.

    It is an optimization to have a separate RS bank without forwarding
    wake-up logic for when all operands are ready at time of dispatch.
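
    The readiness rule EricP describes (capture forwarded values, bid for the
    FU once every operand is present) can be sketched as a toy model. The
    class and method names here are invented for illustration, not taken from
    any real design:

```python
# Toy reservation-station entry: operands are either captured values
# or still in flight (None), keyed by a physical-register tag.

class RSEntry:
    def __init__(self, op, operands):
        # operands: {tag: value or None}; None means awaiting forwarding.
        self.op = op
        self.operands = dict(operands)

    def forward(self, tag, value):
        # Capture a result from the forwarding bus if we are waiting on it.
        if self.operands.get(tag, 0) is None:
            self.operands[tag] = value

    def ready(self):
        # Entry bids for the FU once every operand value is present.
        return all(v is not None for v in self.operands.values())

rs = RSEntry("fadd", {"p3": 1.5, "p7": None})
assert not rs.ready()      # still waiting on p7 from the forwarding bus
rs.forward("p7", 2.5)
assert rs.ready()          # all operands captured; entry bids for the FU
```

    An entry allocated with all operands already valued is ready immediately,
    which is exactly the case the separate "simple" RS bank optimizes for.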

    This means the number of
    PRF read ports scales according to the total number of FU operands.
    (One could do read port sharing but then you have to schedule that too
    and could have contention.)

    Yes, but ARM A64 seems to have been designed with that in mind.

    Also if an FU is unused on any cycle then
    all its (expensive) operand read ports are unused.

    Using the above Oryon as an example, with valueless RS, to launch
    all 14 FU with 3 operands all at once needs 42 read ports.

    ARM A64's stp instruction needs up to 4 register reads. But I think
    Oryon only supports two stores per cycle, so that raises the number to
    44. It also has a split register file: So for the integer register
    file 6*3+2*4+2*2=30 ports would be needed, and for the vector register
    file, 3*4=12 ports would be needed. I think, though, that the number actually needed is much smaller, and various techniques are used to
    make do with far fewer register ports.

    Yes, they could be more optimal than the brute force approach.
    Unfortunately the article doesn't say how many read ports the integer
    PRF has.

    With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,
    and the read ports are potentially usable for all dispatches.

    Note that Oryon has 8-way rename/dispatch (I wonder how that works
    with, e.g., 8 ldp instructions that each can write three registers).

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Mon Jul 22 11:02:06 2024
    MitchAlsup1 wrote:
    On Sun, 21 Jul 2024 16:28:43 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:

    MitchAlsup1 wrote:

    {Would be an interesting reservation station design, though}

    In what way would the RS be interesting or different?

    The instruction stream consists of 4 FMAC-bound instructions, unrolled
    as many times as will fit in the register file.

    Your typical reservation station can accept 1 new instruction per cycle
    from the decoder. So, either the decoder has to spew the instructions
    across the stations (and remember they are data dependent) or the
    station has to fire more than one per cycle to the FMAC units.

    So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
    and maybe some kind of exotic routing.

    This is where I saw a benefit to using valued reservation stations vs
    valueless ones - when a uArch has multiple similar FUs, each with its own
    bank of RS that is scheduled for that FU.

    Example of horizontal scaling of similar FU each with its own RS bank.
    https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png


    With valueless RS, each RS stores only the source register number of
    its operands and each FU has to be able to read all its operands
    when a uOp launches (begins execution). This means the number of
    PRF read ports scales according to the total number of FU operands.
    (One could do read port sharing but then you have to schedule that too
    and could have contention.) Also if an FU is unused on any cycle then
    all its (expensive) operand read ports are unused.

    I always had RSs keep track of which FU was delivering the final
    operand, so that these could be picked up by the forwarding logic
    and not need a RF port. This gets rid of 50%-75% of the RF port
    needs.

    And this also adds another thing to consider to the central port scheduling
    and assignment logic.

    I considered tracking FUs but, at least for me, it wouldn't help.
    In my case with valued RS, each FU result has to be forwarded to all FU
    operands anyway, so it has to drive the same number of destination loads.
    This is what allows back-to-back execution.

    Also because the Dispatcher passes ready values to RS at allocation,
    if a result destination register is written in the same cycle as an OP
    is dispatched using the same register as a source, then the result value
    is forwarded from the write port to the dispatcher operand.
    This allows an OP to launch immediately on the next cycle after Dispatch
    if all its operands are ready and the FU available.

    Using the above Oryon as an example, with valueless RS, to launch
    all 14 FU with 3 operands all at once needs 42 read ports.

    With valued RS, the operand values are stored in each RS and, if ready,
    are read at Dispatch (hand-off from the front end to the RS bank) or are
    received from the forwarding network if in-flight at Dispatch time.

    Delivering result at dispatch time.

    The valued RS either receives its operand values from the Dispatcher
    when the RS is allocated, or later from the result forwarding bus.
    When RS has all operand values for its operation it bids for the FU.

    PRF read ports are only required for Dispatcher not for FU launching.
    FU launching only needs to read its own local RS operands or the
    forwarding buses.

    The number of PRF read ports scales with the number of dispatched uOp
    operands. Since the operand values are stored in each RS, each bank
    can then schedule and launch independently.

    The width of the decoder is narrower than the width of the data path.
    We used to call this "catch up bandwidth".

    Yes, this would be set by the number of result buses = PRF result write ports.

    I was also thinking that even an in-order uArch might make use of
    multiple result buses and write ports for limited simple overlapped
    execution to catch up after stalls or multi-cycle ops like multiply.

    With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,

    First, a 6-wide machine is not doing 6 3-operand instructions;
    it is more like 3 memory ops (2-reg+displacement), one 3-op,
    one general 2-op, and one 1-op (branch), so you only need 12 ports
    instead of 18 most of the time.

    Yes, the average case is much smaller than the worst case.
    This is true whether RS are valued or valueless.

    I was starting from the view that any instruction can dispatch from
    any lane any time, which implies equal, maximal resources for each
    dispatch lane but also make the routing much simpler.

    One can then move off that, saving read port resources by incurring
    increased routing costs as now operands can come from varying sources.
    The operand buses become more like cross-bars.

    For valueless RS, such optimizations also require centralized port and
    bus allocation and scheduling, to route the register number from the RS
    that is launching to the PRF read port and route the value back to the FU.

    The penalty is that each RS entry is 5× the size of the value-free
    RS designs. These work just fine when the execution window is
    reasonable (say 96 instructions) but fail when the window is
    larger than 150-ish.

    Yes, but it's just flip-flops, or even better latches, for each operand.
    (Nobody bats an eye at using kilobytes for the branch predictor.)
    And it locates the operands closer to the FU, minimizing wire delay
    at launch, though forwarding is likely still the critical path.

    and the read ports are potentially usable for all dispatches.
    Then all 14 FU can launch at once independently.

    One should also note that these machines deliver 1-2 I/c RMS
    regardless of their Fetch-Decode-FU widths.

    Yes, it's mostly all for naught.

    Each FU can also have two kinds of valued RS banks,
    a simple one if all the operands are ready at Dispatch as this does
    not need a wake-up matrix entry or need to receive forwarded values,
    and a complex one that monitors the wake-up matrix and forwarding buses.
    If all the operands are ready, the Dispatcher can choose either RS bank
    for the FU, giving preference to the simpler. If the operands are not
    all ready, then the Dispatcher selects from the complex bank.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Mon Jul 22 15:01:10 2024
    On Mon, 22 Jul 2024 11:01:15 +0000, Michael S wrote:

    On Fri, 19 Jul 2024 20:55:47 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup1 wrote:
    On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
    Back when I first looked at Invsqrt(), I did so because an
    Computation Fluid Chemistry researcher from Sweden asked for help
    speeding up his reciprocal calculations
    (sqrt(1/(dx^2+dy^2+dz^2))), I found that by combining the 1/x and
    the sqrt and doing three of them pipelined together (all the water
    molecules having three atoms), his weeklong simulation runs ran in
    half the time, on both PentiumPro and Alpha hardware.

    I, personally, have found many Newton-Raphson iterators that
    converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    Yeah, that was eye-opening to me as well, to the level where I
    consider the invsqrt() NR iteration as a mainstay, it can be useful
    for both sqrt and 1/x as well. :-)

    Terje


    What is this "SQRT(x) equivalent" all of you are talking about?
    I am not aware of any "direct" (i.e. not via RSQRT) NR-like method for
    SQRT that consists only of multiplications and additions.
    If it exists, I will be very interested to know.

    There are certain N-R iterations that can be expressed with both::

    NR+1 = F( NR, SQRT() )
    and
    NR+1 = F'(NR, RSQRT() )

    Typically the one with RSQRT() converges slightly faster than the
    one using SQRT(). How much is slightly::maybe ½-1 more bit per
    iteration.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Mon Jul 22 18:50:10 2024
    On Mon, 22 Jul 2024 15:01:10 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 22 Jul 2024 11:01:15 +0000, Michael S wrote:

    On Fri, 19 Jul 2024 20:55:47 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup1 wrote:
    On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
    Back when I first looked at Invsqrt(), I did so because an
    Computation Fluid Chemistry researcher from Sweden asked for help
    speeding up his reciprocal calculations
    (sqrt(1/(dx^2+dy^2+dz^2))), I found that by combining the 1/x and
    the sqrt and doing three of them pipelined together (all the water
    molecules having three atoms), his weeklong simulation runs ran
    in half the time, on both PentiumPro and Alpha hardware.

    I, personally, have found many Newton-Raphson iterators that
    converge faster using 1/SQRT(x) than using the SQRT(x)
    equivalent.

    Yeah, that was eye-opening to me as well, to the level where I
    consider the invsqrt() NR iteration as a mainstay, it can be useful
    for both sqrt and 1/x as well. :-)

    Terje


    What is this "SQRT(x) equivalent" all of you are talking about?
    I am not aware of any "direct" (i.e. not via RSQRT) NR-like method
    for SQRT that consists only of multiplications and additions.
    If it exists, I will be very interested to know.

    There are certain N-R iterations that can be expressed with both::

    NR+1 = F( NR, SQRT() )
    and
    NR+1 = F'(NR, RSQRT() )

    Typically the one with RSQRT() converges slightly faster than the
    one using SQRT(). How much is slightly::maybe ½-1 more bit per
    iteration.

    As I said above, I am not aware of the NR iterations for SQRT() that
    consist of fmul, fadd and nothing else.
    If you know one, please tell me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Mon Jul 22 17:39:10 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    [...]
    (Nobody bats an eye at using kilobytes for the branch predictor.)

    IIRC it's in the hundreds of KB on the larger cores. But it pays
    off.

    One should also note that these machines deliver 1-2 I/c RMS
    regardless of their Fetch-Decode-FU widths.

    Yes, it's mostly all for naught.

    Evidence? If it was all for naught, people would not build these big
    cores.

    Counterevidence:

    http://www.complang.tuwien.ac.at/anton/tmp/opt-ipc-uarch.eps

    Goldencove has a median IPC of almost 4 for these benchmarks (a bunch
    of Forth benchmarks running on Gforth). You also see a steady advance
    between generations, from Sandybridge to Goldencove, from Silvermont
    to Gracemont, from K8 to Zen3, and also the expected differences
    between A72, A76, and Firestorm. And the in-order cores are the
    dashed ones at the bottom. Particularly remarkable is Gracemont,
    which usually has better IPC than Skylake and comparable IPC to
    Firestorm.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Tue Jul 23 14:35:20 2024
    Michael S wrote:
    On Fri, 19 Jul 2024 20:55:47 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup1 wrote:
    On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
    Back when I first looked at Invsqrt(), I did so because an
    Computation Fluid Chemistry researcher from Sweden asked for help
    speeding up his reciprocal calculations
    (sqrt(1/(dx^2+dy^2+dz^2))), I found that by combining the 1/x and
    the sqrt and doing three of them pipelined together (all the water
    molecules having three atoms), his weeklong simulation runs ran in
    half the time, on both PentiumPro and Alpha hardware.

    I, personally, have found many Newton-Raphson iterators that
    converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.

    Yeah, that was eye-opening to me as well, to the level where I
    consider the invsqrt() NR iteration as a mainstay, it can be useful
    for both sqrt and 1/x as well. :-)

    Terje


    What is this "SQRT(x) equivalent" all of you are talking about?
    I am not aware of any "direct" (i.e. not via RSQRT) NR-like method for
    SQRT that consists only of multiplications and additions.
    If it exists, I will be very interested to know.

    sqrt(x) <= x/sqrt(x) <= x*rsqrt(x)

    I.e. calculate rsqrt(x) to the precision you need and then do a single fmul?
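
    Terje's identity can be sketched in Python. The seed value and iteration
    count below are illustrative assumptions, standing in for the table or
    bit-trick estimate a real implementation would use:

```python
import math

def rsqrt_nr(x, r, iters=3):
    # Classic Newton-Raphson step for 1/sqrt(x): r' = r*(1.5 - 0.5*x*r*r).
    # Uses only multiplies and adds (no division); each step roughly
    # doubles the number of correct bits.
    for _ in range(iters):
        r = r * (1.5 - 0.5 * x * r * r)
    return r

r = rsqrt_nr(2.0, 0.7)   # refine a crude estimate of 1/sqrt(2)
s = 2.0 * r              # sqrt(x) = x * rsqrt(x): one final fmul
print(abs(s - math.sqrt(2.0)) < 1e-9)
```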

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Tue Jul 23 16:26:21 2024
    On Tue, 23 Jul 2024 14:35:20 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Fri, 19 Jul 2024 20:55:47 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup1 wrote:
    On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
    Back when I first looked at Invsqrt(), I did so because an
    Computation Fluid Chemistry researcher from Sweden asked for help
    speeding up his reciprocal calculations
    (sqrt(1/(dx^2+dy^2+dz^2))), I found that by combining the 1/x and
    the sqrt and doing three of them pipelined together (all the water
    molecules having three atoms), his weeklong simulation runs ran
    in half the time, on both PentiumPro and Alpha hardware.

    I, personally, have found many Newton-Raphson iterators that
    converge faster using 1/SQRT(x) than using the SQRT(x)
    equivalent.

    Yeah, that was eye-opening to me as well, to the level where I
    consider the invsqrt() NR iteration as a mainstay, it can be useful
    for both sqrt and 1/x as well. :-)

    Terje


    What is this "SQRT(x) equivalent" all of you are talking about?
    I am not aware of any "direct" (i.e. not via RSQRT) NR-like method
    for SQRT that consists only of multiplications and additions.
    If it exists, I will be very interested to know.

    sqrt(x) <= x/sqrt(x) <= x*rsqrt(x)

    I.e. calculate rsqrt(x) to the precision you need and then do a
    single fmul?

    Terje


    That much I know.

    When precise rounding is not required, I even know a slightly better
    "combined" method:
    1. Do N-1 iterations for RSQRT, delivering r0 with 2 or 3 more
    significant bits than n/2
    2. Calculate the SQRT estimate as y0 = x*r0
    3. Do the last iteration using both y0 and r0 as y = y0 + (x-y0*y0)*0.5*r0.
    That would give a max error of something like 0.51 ULP.
    A similar combined method could be useful for sw calculation of
    correctly rounded quad-precision sqrt as well. In this case it serves
    as 'conditionally last step' rather than 'absolutely last step'.
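
    The three steps above can be sketched in Python (double precision; the
    seed and iteration count are assumptions, and a plain NR rsqrt loop
    stands in for the "N-1 iterations for RSQRT"):

```python
import math

def sqrt_combined(x, r0, rsqrt_iters=3):
    # Step 1: refine r toward 1/sqrt(x) (stand-in for N-1 RSQRT iterations).
    r = r0
    for _ in range(rsqrt_iters):
        r = r * (1.5 - 0.5 * x * r * r)
    # Step 2: sqrt estimate via one multiply.
    y0 = x * r
    # Step 3: combined last iteration using both y0 and r.
    return y0 + (x - y0 * y0) * 0.5 * r

y = sqrt_combined(2.0, 0.7)
print(abs(y - math.sqrt(2.0)))
```

    The final step squares the residual error of y0 away, which is why it can
    deliver near-correctly-rounded results without a division.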

    I was hoping that you'd tell me about an NR formula for SQRT itself that
    can be applied recursively.
    The only formula that I know is y = (y*y + x)/(y*2). Obviously, when
    speed matters this formula is not useful.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Jul 23 15:14:44 2024
    On Mon, 22 Jul 2024 15:50:10 +0000, Michael S wrote:

    On Mon, 22 Jul 2024 15:01:10 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    There are certain N-R iterations that can be expressed with both::

    NR+1 = F( NR, SQRT() )
    and
    NR+1 = F'(NR, RSQRT() )

    Typically the one with RSQRT() converges slightly faster than the
    one using SQRT(). How much is slightly::maybe ½-1 more bit per
    iteration.

    As I said above, I am not aware of the NR iterations for SQRT() that
    consist of fmul, fadd and nothing else.
    If you know one, please tell me.

    I am NOT talking about an NR iteration for SQRT(); these can be done
    with bit twiddling and multiplication.

    I am talking about an NR iteration that uses SQRT() in the calculation
    of some other <harder> elementary function.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Tue Jul 23 18:22:39 2024
    Michael S wrote:
    On Tue, 23 Jul 2024 14:35:20 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Fri, 19 Jul 2024 20:55:47 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup1 wrote:
    On Fri, 19 Jul 2024 14:16:01 +0000, Terje Mathisen wrote:
    Back when I first looked at Invsqrt(), I did so because an
    Computation Fluid Chemistry researcher from Sweden asked for help
    speeding up his reciprocal calculations
    (sqrt(1/(dx^2+dy^2+dz^2))), I found that by combining the 1/x and
    the sqrt and doing three of them pipelined together (all the water
    molecules having three atoms), his weeklong simulation runs ran
    in half the time, on both PentiumPro and Alpha hardware.

    I, personally, have found many Newton-Raphson iterators that
    converge faster using 1/SQRT(x) than using the SQRT(x)
    equivalent.

    Yeah, that was eye-opening to me as well, to the level where I
    consider the invsqrt() NR iteration as a mainstay, it can be useful
    for both sqrt and 1/x as well. :-)

    Terje


    What is this "SQRT(x) equivalent" all of you are talking about?
    I am not aware of any "direct" (i.e. not via RSQRT) NR-like method
    for SQRT that consists only of multiplications and additions.
    If it exists, I will be very interested to know.

    sqrt(x) <= x/sqrt(x) <= x*rsqrt(x)

    I.e. calculate rsqrt(x) to the precision you need and then do a
    single fmul?

    Terje


    That much I know.

    I knew that! :-)

    When precise rounding is not required, I even know slightly better
    "combined" method:
    1. Do N-1 iterations for RSQRT delivering r0 with 2 or 3 more
    significant bits than n/2
    2. Calculate SQRT estimate as y0 = x*r0
    3. Do last iteration using both y0 and r0 as y = y0 + (x-y0*y0)*0.5*r0.
    That would give max. error of something like 0.51 ULP.
    A similar combined method could be useful for sw calculation of
    correctly rounded quad-precision sqrt as well. In this case it serves
    as 'conditionally last step' rather than 'absolutely last step'.

    I was hoping that you'll tell me about NR formula for SQRT itself that
    can be applied recursively.
    The only formula that I know is y = (y*y + x)/(y*2). Obviously, when
    speed matters this formula is not useful.

    Sorry, no, I have not found anything approaching the rsqrt() NR for sqrt().

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Paul A. Clayton on Sun Jul 28 11:39:04 2024
    On Sat, 27 Jul 2024 20:06:11 -0400
    "Paul A. Clayton" <paaronclayton@gmail.com> wrote:

    On 7/18/24 12:40 PM, MitchAlsup1 wrote:
    [snip]
    Over the 20 cycles the multiplier is doing Goldschmidt iterations,
    there
    are only 3 slots where a different instruction could sneak through.


    Since initial iterations could (I think) use reduced precision, it
    might also be possible to use part of the multiplier for a
    different lower precision multiplication.

    (ISTR, an AMD implementation used different precision for
    different steps.)

    I never looked at the Goldschmidt division algorithm before (partly
    because it does not look practical for software implementations and
    partly for no particular reason), so it's possible that I am missing
    something obvious.
    At first glance, it appears that cutting the precision of the F(1) and
    F(2) factors is not a matter of reducing computational complexity; it is
    a necessity for correctness.
    If the number of significant bits in F(1) and F(2) is not cut up front,
    we will have to calculate the result products N(i) with impractically
    high precision.
    With F(i) properly cut, the required precision for N(i) is still high:
    something like 67 bits for the first iteration, 95 bits for the 2nd,
    and 151 bits for the 3rd and hopefully last iteration.
    I still didn't figure out whether the same sequence applies to D(i) or
    it is possible to cut corners there.

    BTW, are we sure that modern processors really use Goldschmidt division?
    Intel and AMD both have FMA latency=4 and worst-case DIVSD latency=14,
    so doing it with highly optimized Goldschmidt appears [border-line]
    possible. The same applies to high-end Arm Inc. cores (worst-case
    FDIV latency=15). But on Apple Firestorm the latency of FDIV.FP64 = 10,
    which sounds too fast for Goldschmidt.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to EricP on Sun Jul 28 13:08:19 2024
    On Thu, 18 Jul 2024 12:56:46 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Stefan Monnier wrote:
    If the FP multiplier is a 4-stage pipeline, and FDIV is iterating
    using the multiplier, can the pipeline get a mix of multiple
    operations going at once? FDIV for both Newton–Raphson and
    Goldschmidt iterates serially so each can only use one of the 4
    pipeline slots.

    Something I've been wondering for a while, indeed.
    IOW, is there enough parallelism inside the FDIV/SQRT "microcode"
    to keep the FMAC fully busy (my naive understanding is that there
    isn't)? If not, do current CPU make the FMAC available for other
    operations while an FDIV/SQRT is in progress? If not, how hard
    would it be?


    Stefan

    And if they can't mix then to what extent can the end of one op,
    as it drains from the pipeline, overlap with the start of the next?
    Obviously FMUL can pipeline with FMUL but can the next FMUL overlap
    with the end of a prior FDIV? An EXP?

    I was thinking about reservation station schedulers and wondering
    what they might have to optimize.



    On most (all?) modern GBOoO cores FDIV can overlap with the next
    independent FDIV, and sometimes with more than one. In that light
    it would be hard to believe that FDIV cannot overlap with FMUL or FMAC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)