• Short Vectors Versus Long Vectors

    From Lawrence D'Oliveiro@21:1/5 to All on Tue Apr 23 00:29:32 2024
    Adding the typical kind of vector-processing instructions to an
    instruction set inevitably leads to a combinatorial explosion in the
    number of opcodes. This kind of thing makes a mockery of the “R” in “RISC”.

    Interesting to see that the RISC-V folks are staying off this path;
    instead, they are reviving an old idea from Seymour Cray’s original
    machines that bear his name: a vector pipeline. Instead of being limited
    to processing 4 or 8 operands at a time, the Cray machines could operate (sequentially, but rapidly) on variable-length vectors of up to 64
    elements with a single setup sequence. RISC-V seems to make the limit on
    vector length an implementation choice, with a value of 32 being mentioned
    in the spec.

    The way it avoids having separate instructions for each combination of
    operand types is to have operand-type registers as part of the vector
    unit. This way, only a small number of instructions is required to set up
    all the combinations of operand/result types. You then give it a kick in
    the guts and off it goes.

    Detailed spec here: <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc>.
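    To make that concrete, here is a rough C model of the strip-mining idea. The helper configure_vector_unit() is purely hypothetical and only stands in for something like the spec's vsetvli: the element type is recorded as vector-unit state rather than in the opcode, and the hardware answers with how many elements it will handle this trip, so one generic add loop serves every type/length combination.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for vsetvli: record the element type as state
     * and report how many elements the implementation handles per trip
     * (a toy limit of 8 here, i.e. VLEN/SEW = 8 in this model). */
    static size_t configure_vector_unit(size_t remaining, unsigned element_bits)
    {
        (void)element_bits;        /* type is configuration, not opcode encoding */
        size_t vlmax = 8;
        return remaining < vlmax ? remaining : vlmax;
    }

    void vec_add_i32(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
    {
        while (n > 0) {
            size_t vl = configure_vector_unit(n, 32);
            for (size_t i = 0; i < vl; i++)    /* models one generic vector add */
                c[i] = a[i] + b[i];
            a += vl; b += vl; c += vl; n -= vl;
        }
    }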

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 02:14:32 2024
    Lawrence D'Oliveiro wrote:

    Adding the typical kind of vector-processing instructions to an
    instruction set inevitably leads to a combinatorial explosion in the
    number of opcodes. This kind of thing makes a mockery of the “R” in “RISC”.

    It does indeed make a mockery of the R in RISC.

    Interesting to see that the RISC-V folks are staying off this path;
    instead, they are reviving an old idea from Seymour Cray’s original machines that bear his name: a vector pipeline. Instead of being limited
    to processing 4 or 8 operands at a time, the Cray machines could operate (sequentially, but rapidly) on variable-length vectors of up to 64
    elements with a single setup sequence. RISC-V seems to make the limit on vector length an implementation choice, with a value of 32 being mentioned
    in the spec.

    CRAY machines stayed "in style" as long as memory latency remained smaller
    than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.

    I wish them well, but I expect it will not work out as they desire.....

    The way it avoids having separate instructions for each combination of operand types is to have operand-type registers as part of the vector
    unit. This way, only a small number of instructions is required to set up
    all the combinations of operand/result types. You then give it a kick in
    the guts and off it goes.

    Detailed spec here: <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc>.

    On the other hand, My 66000 has support for both SIMD and CRAY-like vectors
    and the ISA contains only 6-bits of state supporting vectorization and
    exactly 2 instructions--one that gives HW a register it can use in the
    "loop" and the LOOP instruction that performs the ADD-CMP-BC functionality. {{Not 2 for every kind of vectorized instruction, 2 total instructions}}

    There is no 4KB of register file (context switch overhead),
    there is no need for Gather/Scatter, stride memory references,
    there is no masking register,
    the OS can use vectorization for small fast loops without overhead,
    the compiler does not have to solve memory address aliasing,
    cache activities are modified to suit vector workloads,
    exotic HW can execute across multiple lanes (as desired),
    simple HW can "do it all" in a 1-wide pipeline,
    the debugger presents scalar code to coder,
    and exceptions remain precise (for those that care),
    and the exception handler(s) sees only scalar code.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Apr 23 03:11:41 2024
    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

    CRAY machines stayed "in style" as long as memory latency remained
    smaller than the length of a vector (64 cycles) and fell out of favor
    when the cores got fast enough that memory could no longer keep up.

    So why would conventional short vectors work better, then? Surely the
    latency discrepancy would be even worse for them.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 06:22:38 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

    CRAY machines stayed "in style" as long as memory latency remained
    smaller than the length of a vector (64 cycles) and fell out of favor
    when the cores got fast enough that memory could no longer keep up.

    Mitch Alsup repeatedly makes this claim without giving any
    justification. Your question may shed some light on that.

    So why would conventional short vectors work better, then? Surely the
    latency discrepancy would be even worse for them.

    Thinking about it, they probably don't work better. They just don't
    work worse, so why spend area on 4096-bit vector registers like the
    Cray-1 did when 128-512-bit vector registers work just as well? Plus,
    they have 200 or so of these registers, so 4096-bit registers would be
    really expensive. How many vector registers does the Cray-1 (and its successors) have?

    On modern machines OoO machinery bridges the latency gap between the
    L2 cache, maybe even the L3 cache and the core for data-parallel code.
    For the latency gap to main memory there are the hardware prefetchers,
    and they use the L1 or L2 cache as intermediate buffer, while the
    Cray-1 and followons use vector registers.

    So what's the benefit of using vector/SIMD instructions at all rather
    than doing it with scalar code? A SIMD instruction that replaces n
    scalar instructions consumes fewer resources for instruction fetching, decoding, register renaming, administering the instruction in the OoO
    engine, and in retiring the instruction.

    So why not use SIMD instructions with longer vector registers? The
    progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
    suggests that this is happening, but with every doubling the cost in
    area doubles but the returns are diminishing thanks to Amdahl's law.
    So at some point you stop. Intel introduced AVX-512 for Larrabee (a special-purpose machine), and now is backpedaling with desktop, laptop
    and small-server CPUs (even though only the Golden/Raptor Cove cores
    are enabled on the small-server CPUs) only supporting AVX, and with
    AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
    vector registers are already too costly for the benefit they give in general-purpose computing.

    Back to old-style vector processors. There have been machines that
    supported longer vector registers and AFAIK also memory-to-memory
    machines. The question is why have they not been the answer of the vector-processor community to the problem of covering the latency? Or
    maybe they have? AFAIK NEC SX has been available in some form even in
    recent years, maybe still.

    Anyway, after thinking about this, the reason behind Mitch Alsup's
    statement is that in a

    doall(load process store)

    computation (like what SIMD is good at), the loads precede the
    corresponding processing by the load latency (i.e., memory latency on
    the Cray machines). If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration
    of the doall loop before the processing step of the first iteration
    has finished with the register. You can do a bit of software
    pipelining and software register renaming by transforming this into

    load1 doall(load2 process1 store1 load1 process2 store2)

    but at some point you run out of vector registers.

    One thing that comes to mind is tracking individual parts of the
    vector registers, which allows starting the next iteration as soon
    as the first part of the vector register no longer has any readers.
    However, it's probably not that far off in complexity to tracking
    shorter vector registers in an OoO engine. And if you support
    exceptions (the Crays probably don't), this becomes messy, while with
    short vector registers it's easier to implement the (ISA)
    architecture.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Apr 23 08:31:21 2024
    On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:

    So what's the benefit of using vector/SIMD instructions at all rather
    than doing it with scalar code?

    On the original Cray machines, I read somewhere the benefit of using the
    vector versions over the scalar ones was a net positive for a vector
    length as low as 2.

    If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration of
    the doall loop before the processing step of the first iteration has
    finished with the register.

    How would out-of-order execution help, anyway, given all the operations on
    the vector elements are supposed to be identical? Unless it’s just greater parallelism.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 12:40:07 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:
    If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration of
    the doall loop before the processing step of the first iteration has
    finished with the register.

    How would out-of-order execution help, anyway, given all the operations on the vector elements are supposed to be identical?

    OoO does in hardware what software pipelining does in software: If I
    have a loop

    for (i=0; i<n; i++)
    c[i] = a[i]+b[i];

    the straightforward way to code this is:

    L0:
    load tmp1 = a[i]
    load tmp2 = b[i]
    add tmp3 = tmp1,tmp2
    store c[i] = tmp3
    add i = i+1
    branch L0 if i<n

    On an in-order CPU you then do things like loop unrolling, modulo
    scheduling and modulo variable renaming to get a steady state like:

    L0:
    store c[i], tmp1
    add tmp3 = tmp3,tmp4
    load tmp9 = a[i+4]
    load tmp10 = b[i+4]
    store c[i+1], tmp3
    add tmp5 = tmp5,tmp6
    load tmp1 = a[i+5]
    load tmp2 = b[i+5]
    store c[i+2], tmp5
    add tmp7 = tmp7,tmp8
    load tmp3 = a[i+6]
    load tmp4 = b[i+6]
    store c[i+3], tmp7
    add tmp9 = tmp9,tmp10
    load tmp1 = a[i+7]
    load tmp2 = b[i+7]
    store c[i+4], tmp9
    add tmp1 = tmp1,tmp2
    load tmp3 = a[i+8]
    load tmp4 = b[i+8]
    add i=i+5
    branch L0 if i<n-4

    And that's just to cover a load latency of 4 cycles, assuming that the
    machine can perform 2 loads and one store per cycle. And you have to
    generate the ramp-up and ramp-down code, and for more complicated
    loops it becomes more complicated.

    By contrast, on an OoO machine the straightforward code just works
    efficiently, and the hardware does the reordering and register
    renaming (and the Golden Cove with its 0-cycle constant additions
    eliminates even a part of the reason for loop unrolling). It creates
    the ramp-up automatically, and, if the loop exit is predicted
    correctly, even the ramp-down, and it overlaps the ramp-up (and
    possibly the ramp-down) with adjacent code.

    Back to the Crays: While the SIMD/vector semantics means that a
    straightforward loop will process 64 elements rather than one before
    the first load of the second iteration has to wait for the add of
    the first iteration to finish, you still have to do some software
    pipelining to get an overlap between that add and that load; the
    longer the latency, the more software pipelining and (for register
    renaming) the more registers you need.

    In OoO the corresponding condition is when the OoO engine has consumed
    all instances of one resource and has to wait for instructions to
    finish to free these resources; ideally the hardware prefetcher avoids
    that scenario, but in memory-bandwidth-limited situations it will
    occur.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 17:34:12 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

    CRAY machines stayed "in style" as long as memory latency remained
    smaller than the length of a vector (64 cycles) and fell out of favor
    when the cores got fast enough that memory could no longer keep up.

    So why would conventional short vectors work better, then? Surely the
    latency discrepancy would be even worse for them.

    Yes, the later NEC long vector machines grew their VRF up to 256 entries
    per register.

    As to why RISC-V went shorter I can only imagine they think vector codes
    can be compiled properly for a quicker memory hierarchy (i.e., hit in
    L1 or L2 caches.).

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 17:51:44 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:

    So what's the benefit of using vector/SIMD instructions at all rather
    than doing it with scalar code?

    On the original Cray machines, I read somewhere the benefit of using the vector versions over the scalar ones was a net positive for a vector
    length as low as 2.

    Somewhere in the neighborhood of 4-5 length vectors. There was a 3 cycle
    decode delay as pipeline scheduling slots were reserved for the vector writebacks.

    If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration of
    the doall loop before the processing step of the first iteration has
    finished with the register.

    How would out-of-order execution help, anyway, given all the operations on the vector elements are supposed to be identical? Unless it’s just greater parallelism.

    Out of order makes it easier to "run into" undiscovered dynamic dependency
    free operations.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Apr 23 17:49:25 2024
    Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

    CRAY machines stayed "in style" as long as memory latency remained
    smaller than the length of a vector (64 cycles) and fell out of favor
    when the cores got fast enough that memory could no longer keep up.

    Mitch Alsup repeatedly makes this claim without giving any
    justification. Your question may shed some light on that.

    Consider a CRAY-like vector machine with 128-cycle main memory
    and 64-entry VRF registers. If it only takes 64 cycles to send
    out all the addresses, but takes 128 cycles to return, there is
    no "chain slot"--chain slot only works when the memory latency
    is shorter than vector length.

    And without chain slot, vectors are not higher performing (by
    much) compared to scalar operation. Vectors were a way of
    appearing to perform one beat of work per cycle per active
    function unit.

    So why would conventional short vectors work better, then? Surely the latency discrepancy would be even worse for them.

    Context switch latency...

    Thinking about it, they probably don't work better. They just don't
    work worse, so why spend area on 4096-bit vector registers like the
    Cray-1 did when 128-512-bit vector registers work just as well?

    But do they work as well ??

    Plus,
    they have 200 or so of these registers, so 4096-bit registers would be
    really expensive. How many vector registers does the Cray-1 (and its successors) have?

    On modern machines OoO machinery bridges the latency gap between the
    L2 cache, maybe even the L3 cache and the core for data-parallel code.

    Mc 88120 would run MATRIX 300 at just under 6 I/C with massive cache
    misses (~33%).

    For the latency gap to main memory there are the hardware prefetchers,
    and they use the L1 or L2 cache as intermediate buffer, while the
    Cray-1 and followons use vector registers.

    Opening yourself up to Spectre-like attacks.

    So what's the benefit of using vector/SIMD instructions at all rather
    than doing it with scalar code? A SIMD instruction that replaces n
    scalar instructions consumes fewer resources for instruction fetching, decoding, register renaming, administering the instruction in the OoO
    engine, and in retiring the instruction.

    I can argue that SIMD is "just a waste of ISA encoding space".

    So why not use SIMD instructions with longer vector registers? The progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
    suggests that this is happening, but with every doubling the cost in
    area doubles but the returns are diminishing thanks to Amdahl's law.

    Not to mention that the 512 version can only run a few SIMD instructions
    at that width before thermally throttling itself.

    So at some point you stop. Intel introduced AVX-512 for Larrabee (a special-purpose machine), and now is backpedaling with desktop, laptop
    and small-server CPUs (even though only the Golden/Raptor Cove cores
    are enabled on the small-server CPUs) only supporting AVX, and with
    AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
    vector registers are already too costly for the benefit they give in general-purpose computing.

    Back to old-style vector processors. There have been machines that
    supported longer vector registers and AFAIK also memory-to-memory
    machines. The question is why have they not been the answer of the vector-processor community to the problem of covering the latency? Or
    maybe they have? AFAIK NEC SX has been available in some form even in
    recent years, maybe still.

    Anyway, after thinking about this, the reason behind Mitch Alsup's
    statement is that in a

    doall(load process store)

    computation (like what SIMD is good at), the loads precede the
    corresponding processing by the load latency (i.e., memory latency on
    the Cray machines). If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration
    of the doall loop before the processing step of the first iteration
    has finished with the register.

    Unless the compiler can solve the memory aliasing problem.

    You can do a bit of software
    pipelining and software register renaming by transforming this into

    load1 doall(load2 process1 store1 load1 process2 store2)

    but at some point you run out of vector registers.

    One thing that comes to mind is tracking individual parts of the
    vector registers, which allows starting the next iteration as soon
    as the first part of the vector register no longer has any readers.

    A vector scoreboard anyone ??

    However, it's probably not that far off in complexity to tracking
    shorter vector registers in an OoO engine. And if you support
    exceptions (the Crays probably don't), this becomes messy, while with
    short vector registers it's easier to implement the (ISA)
    architecture.

    All of which is solved with VVM. Consider::

    for( int64_t i = 0; i < max; i++ )
    a[i] = a[max-i];

    This can be vectorized under VVM, the parts far from i = ½×max run
    at vector speeds, those near i = ½×max run at scalar speeds, from
    the same instruction sequence !! .....
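    To spell out why that works, here is a conceptual C sketch (an illustration only, not My 66000's actual mechanism): before each group of iterations, compare the read and write windows; where they are disjoint the whole chunk can be moved at "vector" speed, and near i = ½×max it falls back to one element at a time. As in the example above, a[] must hold max+1 elements, since a[max] is read at i = 0.

    #include <stdint.h>

    #define CHUNK 8   /* arbitrary illustrative chunk width */

    void reverse_overwrite(int64_t *a, int64_t max)
    {
        int64_t i = 0;
        while (i < max) {
            int64_t w = (max - i < CHUNK) ? max - i : CHUNK;
            int64_t wlo = i,         whi = i + w - 1;   /* write window */
            int64_t rlo = max - whi, rhi = max - i;     /* read window  */
            if (rlo > whi || rhi < wlo) {
                /* windows disjoint: the chunk runs at "vector" speed */
                for (int64_t k = 0; k < w; k++)
                    a[i + k] = a[max - (i + k)];
                i += w;
            } else {
                /* near i = max/2 the windows overlap: "scalar" speed */
                a[i] = a[max - i];
                i += 1;
            }
        }
    }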

    - anton

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Apr 23 21:58:50 2024
    On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

    As to why RISC-V went shorter ...

    They didn’t fix a length.

  • From MitchAlsup1@21:1/5 to BGB on Tue Apr 23 22:39:44 2024
    BGB wrote:

    On 4/23/2024 1:22 AM, Anton Ertl wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
    <big snip>


    As can be noted, SIMD is easy to implement.

    ADD/SUB is, MUL and DIV and SHIFTs and CMPs are not; especially when
    MUL does 2n = n × n and DIV does 2n / n -> n (quotient) + n (remainder)

    Main obvious drawback is the potential for combinatorial explosions of instructions. One needs to keep a fairly careful watch over this.

    Like, if one is faced with an NxN or NxM grid of possibilities, naive strategy is to be like "I will define an instruction for every
    possibility in the grid.", but this is bad. More reasonable to devise a minimal set of instructions that will allow the operation to be done
    within a reasonable number of instructions.

    But, then again, I can also note that I axed things like packed-byte operations and saturating arithmetic, which are pretty much de-facto in packed-integer SIMD.

    MANY SIMD algorithms need saturating arithmetic because they cannot do
    b + b -> h and avoid the overflow. And they cannot do B + b -> h because
    that would consume vast amounts of encoding space.

    Likewise, a lot of the gaps are filled in with specialized converter and helper ops. Even here, some conversion chains will require multiple instructions.

    Well, and if there is no practical difference between a scalar and SIMD version of an instruction, may as well just use the SIMD version for scalar.

    ....


    - anton

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 22:40:25 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

    As to why RISC-V went shorter ...

    They didn’t fix a length.

    Nor do they want to have to save a page of VRF at context switch.

  • From Lawrence D'Oliveiro@21:1/5 to All on Wed Apr 24 00:25:59 2024
    On Tue, 23 Apr 2024 22:40:25 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

    As to why RISC-V went shorter ...

    They didn’t fix a length.

    Nor do they want to have to save a page of VRF at context switch.

    But then, you don’t need a whole array of registers, do you: you just need address registers for the operands (one for each operand) and the destination, plus a counter.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed Apr 24 00:34:03 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 22:40:25 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

    As to why RISC-V went shorter ...

    They didn’t fix a length.

    Nor do they want to have to save a page of VRF at context switch.

    But then, you don’t need a whole array of registers, do you: you just need address registers for the operands (one for each operand) and the destination, plus a counter.

    If by 'you' you mean My 66000's VVM::
    a) yes I avoid any SW visible register file
    b) and I use the miss buffers as the VRF register file pool
    c) they vanish on an interrupt or exception
    d) the counter is the loop variable.

  • From MitchAlsup1@21:1/5 to BGB on Wed Apr 24 00:37:11 2024
    BGB wrote:

    On 4/23/2024 5:39 PM, MitchAlsup1 wrote:
    BGB wrote:


    MANY SIMD algorithms need saturating arithmetic because they cannot do
    b + b -> h and avoid the overflow. And they cannot do B + b -> h because
    that would consume vast amounts of encoding space.


    There are ways to fake it.

    Though, granted, most end up involving extra instructions and 1 bit of dynamic range.

    1-bit for ADD and SUB, but MUL and shifts require more than 1-bit.



  • From Lawrence D'Oliveiro@21:1/5 to BGB on Wed Apr 24 00:24:25 2024
    On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:

    DIV:
    Didn't bother with this.
    Typically faked using multiply-by-reciprocal and taking the high result.

    Another Cray-ism! ;)

  • From John Savard@21:1/5 to All on Tue Apr 23 19:25:22 2024
    On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    CRAY machines stayed "in style" as long as memory latency remained smaller than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.

    I wish them well, but I expect it will not work out as they desire.....

    I know that you've said this about Cray-style vectors.

    I had thought the cause was much simpler. As soon as chips like the
    486 DX and then the Pentium II became available, a Cray-style machine
    would have had to be implemented from smaller-scale integrated
    circuits, so it would have been wildly uneconomic for the performance
    it provided; it made much more sense to use off-the-shelf
    microprocessors. Despite their shortcomings theoretically in
    architectural terms compared to a Cray-style machine, they offered
    vastly more FLOPS for the dollar.

    After all, the reason the Cray I succeeded where the STAR-100 failed
    was that it had those big vector registers - so it did calculations on
    a register-to-register basis, rather than on a memory-to-memory basis.

    That doesn't make it immune to considerations of memory bandwidth, but
    that does mean that it was designed correctly for the circumstance
    where memory bandwidth is an issue. So if you have the kind of
    calculation to perform that is suited to a vector machine, wouldn't it
    still be better to use a vector machine than a whole bunch of scalar
    cores with no provision for vectors?

    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    There are ways to increase memory bandwidth. Use HBM. Use static RAM.
    Use graphics DRAM. The vector CPU of the last gasp of the Cray-style architecture, the NEC SX-Aurora TSUBASA, is even packaged like a GPU.

    Also, the original Cray I did useful work with a memory no larger than
    many L3 caches these days. So a vector machine today wouldn't be as
    fast as it would be if it could have, say, a 1024-bit wide data bus to
    a terabyte of DRAM. That doesn't necessarily mean that such a CPU,
    even when throttled by memory bandwidth, isn't an improvement over an
    ordinary CPU.

    Of course, though, the question is, is it an improvement enough? If
    most problems anyone would want to use a vector CPU for today do
    involve a large amount of memory, used in a random fashion, so as to
    fit poorly in cache, then it might well be that memory bandwidth would
    mean that even with a vector architecture well suited to doing a lot
    of work, the net result would be only a slight improvement over what
    an ordinary CPU could do with the same memory bandwidth.

    I would think that a chip is still useful if it can only provide an
    improvement for some problems, and that there are ways to increase
    memory bandwidth from what ordinary CPUs offer, making it seem likely
    that Cray-style vectors are worth doing as a way to improve what a CPU
    can do.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Wed Apr 24 02:00:10 2024
    John Savard wrote:

    On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    CRAY machines stayed "in style" as long as memory latency remained smaller than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.

    I wish them well, but I expect it will not work out as they desire.....

    I know that you've said this about Cray-style vectors.

    I had thought the cause was much simpler. As soon as chips like the
    486 DX and then the Pentium II became available, a Cray-style machine
    would have had to be implemented from smaller-scale integrated
    circuits, so it would have been wildly uneconomic for the performance
    it provided; it made much more sense to use off-the-shelf
    microprocessors. Despite their shortcomings theoretically in
    architectural terms compared to a Cray-style machine, they offered
    vastly more FLOPS for the dollar.

    CRAY-XMP was done in MECL 10K gate arrays, offering 10K gates per chip.

    After all, the reason the Cray I succeeded where the STAR-100 failed
    was that it had those big vector registers - so it did calculations on
    a register-to-register basis, rather than on a memory-to-memory basis.

    The CRAY-1 had much shorter setup sequences than the STAR.
    Amdahl's law strikes again.

    That doesn't make it immune to considerations of memory bandwidth, but
    that does mean that it was designed correctly for the circumstance
    where memory bandwidth is an issue. So if you have the kind of
    calculation to perform that is suited to a vector machine, wouldn't it
    still be better to use a vector machine than a whole bunch of scalar
    cores with no provision for vectors?

    Let us face facts:: in the large, vector machines are DMA devices
    that happen to mangle the data on the way through.

    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    a) It is not pure BW but BW at a latency less than K. CRAY-1 was
    about 16 cycles (DRAM), CRAY-1S was about 10 cycles (SRAM), XMP
    was about 22 cycles, and YMP was about 32 cycles. CRAY-1 and -1S
    had 1 port to memory, XMP and YMP had 2Rd and 1W to memory.

    b) GPUs use threading to absorb the latency to memory (roughly 400
    cycles), along with HW rasterizer, interpolator, texture access,
    and an HW OS that can clean up a thread and launch a new thread in
    about 8 cycles. That is: GPUs absorb latency by waiting in a way
    that does not prevent others from making forward progress.

    There are ways to increase memory bandwidth. Use HBM. Use static RAM.
    Use graphics DRAM. The vector CPU of the last gasp of the Cray-style architecture, the NEC SX-Aurora TSUBASA, is even packaged like a GPU.

    Even HBM has a latency of standard DRAM (with smaller command cycle
    overheads), so a 5-GHz core using 20ns DRAM with infinite BW between
    core and DRAM will still have the core see 100 cycles of latency.
    Bandwidth alone does not solve latency bound problems, latency alone
    does not solve BW bound problems.

    Also, the original Cray I did useful work with a memory no larger than
    many L3 caches these days. So a vector machine today wouldn't be as
    fast as it would be if it could have, say, a 1024-bit wide data bus to
    a terabyte of DRAM. That doesn't necessarily mean that such a CPU,
    even when throttled by memory bandwidth, isn't an improvement over an ordinary CPU.

    The 128K DW memory was used for number crunching, but CRAY-1 had an
    I/O system that could consume as much BW as a core, so one could
    write out the last chunk and read in the next chunk while the
    current chunk was processing. And it was this I/O system that made
    a CRAY-1 faster than its equivalent NEC machine (excepting on certain benchmarks).

    Of course, though, the question is, is it an improvement enough? If
    most problems anyone would want to use a vector CPU for today do
    involve a large amount of memory, used in a random fashion, so as to
    fit poorly in cache, then it might well be that memory bandwidth would
    mean that even with a vector architecture well suited to doing a lot
    of work, the net result would be only a slight improvement over what
    an ordinary CPU could do with the same memory bandwidth.

    In essence, if you can teach the compiler to block the numeric algorithm
    to fit through (through, not in) the cache(s), you can use a vector-style
    CPU architecture.

    I would think that a chip is still useful if it can only provide an improvement for some problems, and that there are ways to increase
    memory bandwidth from what ordinary CPUs offer, making it seem likely
    that Cray-style vectors are worth doing as a way to improve what a CPU
    can do.

    Everyone has to have hope on something.

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Wed Apr 24 02:38:02 2024
    On Tue, 23 Apr 2024 20:50:31 -0500, BGB wrote:

    There is an instruction to calculate an approximate reciprocal (say,
    for
    dividing two FP-SIMD vectors), at which a person can use Newton-Raphson
    to either get a more accurate version, or use it directly (possibly
    using N-R to fix up the result of the division).

    Cray had that: an approximate-reciprocal instruction, use it twice to get
    the full-accuracy result.

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed Apr 24 02:47:59 2024
    On Tue, 23 Apr 2024 19:25:22 -0600, John Savard wrote:

    After all, the reason the Cray I succeeded where the STAR-100 failed was
    that it had those big vector registers ...

    Looking at an old Cray-1 manual, it mentions, among other things, sixty
    four 64-bit intermediate scalar “T” registers, and eight 64-element vector “V” registers of 64 bits per element. That’s a lot of registers.

    RISC-V has nothing like this, as far as I can tell. Right at the top of
    the spec I linked earlier, it says:

    The vector extension adds 32 architectural vector registers,
    v0-v31 to the base scalar RISC-V ISA.

    Each vector register has a fixed VLEN bits of state.

    So, no “big vector registers” that I can see? It says that VLEN must be a power of two no bigger than 2**16, which does sound like a lot, but then
    the example they give only has VLEN = 128.

  • From Thomas Koenig@21:1/5 to John Savard on Wed Apr 24 05:47:54 2024
    John Savard <quadibloc@servername.invalid> schrieb:
    On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    CRAY machines stayed "in style" as long as memory latency remained smaller than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.

    I wish them well, but I expect it will not work out as they desire.....

    I know that you've said this about Cray-style vectors.

    I had thought the cause was much simpler. As soon as chips like the
    486 DX and then the Pentium II became available,

    The 486 came out in 1989.

    a Cray-style machine
    would have had to be implemented from smaller-scale integrated
    circuits, so it would have been wildly uneconomic for the performance
    it provided;

    The Cray C90 came out in 1991. That was still considered economic
    by the people who bought it :-)

    The (low-level) competition for scientific computing at the time
    was workstations.

  • From Anton Ertl@21:1/5 to John Savard on Wed Apr 24 06:16:58 2024
    John Savard <quadibloc@servername.invalid> writes:
    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    The claim by Mitch Alsup is that latency makes the Crays impractical,
    because of chaining issues. Do GPUs have chaining? My understanding
    is that GPUs deal with latency in the barrel processor way: use
    another data-parallel thread while waiting for memory. Tera also
    pursued this idea, but the GPUs succeeded with it.

    If
    most problems anyone would want to use a vector CPU for today do
    involve a large amount of memory, used in a random fashion, so as to
    fit poorly in cache

    When the working set is larger than the cache, it does not fit even
    when accessed regularly. Prefetchers can reduce the latency, but they
    will not increase the bandwidth.

    So if you have a problem that walks through a lot of memory and
    performs only a few operations per data item, that's where CPUs will
    wait for memory a lot, due to limited bandwidth (and you won't benefit
    from SIMD/vector instructions on these kinds of problems). For that
    kind of stuff you better use GPUs, which have memory systems with more bandwidth.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Apr 24 06:32:26 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 19:25:22 -0600, John Savard wrote:
    Looking at an old Cray-1 manual, it mentions, among other things, sixty
    four 64-bit intermediate scalar “T” registers, and eight 64-element vector “V” registers of 64 bits per element. That’s a lot of registers.

    RISC-V has nothing like this, as far as I can tell. Right at the top of
    the spec I linked earlier, it says:

    The vector extension adds 32 architectural vector registers,
    v0-v31 to the base scalar RISC-V ISA.

    Each vector register has a fixed VLEN bits of state.

    So, no “big vector registers” that I can see? It says that VLEN must be a power of two no bigger than 2**16, which does sound like a lot, but then
    the example they give only has VLEN = 128.

    It's an example. If you think you can make and profitably sell a CPU
    with VLEN=4096 (the number of bits in one of Cray-1's vector
    registers), that would be compliant with the spec, and would run
    programs written for RISC-V with the vector extension. Or you can
    make one with VLEN=65536 and claim that you have the longest one:-).

    This leaves you free to decide VLEN based on the costs and benefits in
    the context of the other design decisions you have made and on the
    programs you expect to run.

    Note that the Fujitsu A64FX (which implements the similar ARM Scalable
    Vector Extension and was designed for supercomputing) chooses a
    512-bit vector implementation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Savard@21:1/5 to All on Wed Apr 24 00:57:07 2024
    On Wed, 24 Apr 2024 02:00:10 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Everyone has to have hope on something.

    But false hopes are a waste of time.

    The reason for my interest in long vectors is primarily because I
    imagine that, if the Cray I was an improvement on the IBM System/360
    Model 195, then, apparently, today a chip like the Cray I would be
    the next logical step after the Pentium II (OoO plus cache, just like
    a Model 195).

    And that's a very naïve way of looking at the issue, so of course it
    can be wrong.

    I can, however, believe that latency, not bandwidth as such, is the
    killer. That's true for regular CPU compute, and so of course it would
    be a limiting factor for vector machines.

    What do vector machines do?

    Well, apparently they do things like multiply 2048 by 2048 matrices.
    Which is why they need stride. And since modern DRAMs like to give you
    16 consecutive values at a time... oh, well, you can multiply 16 rows
    of the matrix at once. Each matrix would take 32 megabytes of storage,
    so that does fit in cache, at least.

    But they've managed to get GPUs to multiply matrices - and they're
    quite good at it, which is why we're having all this amazing progress
    in AI recently. So it's quite possible that long vector machines have
    too narrow a niche, between plain CPUs (more flexible) and GPUs (less flexible).

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed Apr 24 07:08:01 2024
    On Wed, 24 Apr 2024 00:57:07 -0600, John Savard wrote:

    But they've managed to get GPUs to multiply matrices - and they're quite
    good at it, which is why we're having all this amazing progress in AI recently.

    Worth noting that this AI stuff requires very low-precision floats: 16-
    bit, even 8-bit. And they sacrifice mantissa bits in favour of exponents--
    down to something like maybe only a couple of mantissa bits in the 8-bit format.
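    For concreteness, the usual bit budgets (sign/exponent/mantissa) of these formats; the little C table is only a mnemonic, the layouts themselves are the standard ones:

    #include <stdio.h>

    struct fmt { const char *name; int sign, exp, mant; };

    int main(void)
    {
        /* sign / exponent / mantissa bits of common low-precision formats */
        struct fmt fmts[] = {
            { "FP32 (binary32)", 1, 8, 23 },
            { "FP16 (binary16)", 1, 5, 10 },
            { "bfloat16",        1, 8,  7 },   /* FP32 exponent range, short mantissa */
            { "FP8 E5M2",        1, 5,  2 },
            { "FP8 E4M3",        1, 4,  3 },   /* "a couple of mantissa bits" */
        };
        for (int i = 0; i < (int)(sizeof fmts / sizeof fmts[0]); i++)
            printf("%-16s %d/%d/%d\n", fmts[i].name,
                   fmts[i].sign, fmts[i].exp, fmts[i].mant);
        return 0;
    }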

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Apr 24 06:48:54 2024
    On Wed, 24 Apr 2024 06:16:58 GMT, Anton Ertl wrote:

    For that kind of stuff you better use GPUs, which have memory systems
    with more bandwidth.

    But with more limited memory, which is typically not upgradeable.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Apr 24 09:28:06 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Wed, 24 Apr 2024 06:16:58 GMT, Anton Ertl wrote:

    For that kind of stuff you better use GPUs, which have memory systems
    with more bandwidth.

    But with more limited memory, which is typically not upgradeable.

    And yet, supercomputers these days often have lots of GPUs. The
    software crisis still is not yet there in supercomputing, so they
    manage to do with explicit moving of data between the high-bandwidth
    GPU memory and the lower-bandwidth bigger memory, just like in the
    days of the Cray-1 (or was it the CDC-6600?), which has a fast memory
    and a bigger slow memory.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to John Savard on Wed Apr 24 09:18:56 2024
    John Savard <quadibloc@servername.invalid> writes:
    On Wed, 24 Apr 2024 02:00:10 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Everyone has to have hope on something.

    But false hopes are a waste of time.

    The reason for my interest in long vectors is primarily because I
    imagine that, if the Cray I was an improvement on the IBM System/360
    Model 195, then, apparently, today a chip like the Cray I would be
    the next logical step after the Pentium II (OoO plus cache, just like
    a Model 195).

    But the Cray-1 is not an improvement on the Model 195. It has no
    cache. Neither the Cray-1 nor the Model 195 have OoO as the term is
    commonly understood today: OoO execution, in-order completion,
    allowing register renaming, speculative execution, and precise
    exceptions. One may consider the Model 91/195 a predecessor of
    today's OoO, because it supports register renaming, and you "just"
    need to add a reorder buffer to get in-order completion and
    speculative execution.

    Well, apparently they do things like multiply 2048 by 2048 matrices.
    Which is why they need stride.

    You can multiply dense matrices of any size efficiently with stride 1.
    And caches help a lot for matrix multiply; in HPC circles, (dense)
    matrix multiply is known as a cache-friendly problem.
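    A sketch of the cache blocking alluded to here (BLK is an arbitrary illustrative block size; real code tunes it so the three working blocks fit in cache together). The inner loop runs at stride 1 over both C and B, and each block gets reused many times while it is resident:

    #include <stddef.h>

    #define BLK 64

    /* Cache-blocked dense matrix multiply, C += A*B, all n x n, row-major. */
    void matmul_blocked(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BLK)
            for (size_t kk = 0; kk < n; kk += BLK)
                for (size_t jj = 0; jj < n; jj += BLK)
                    for (size_t i = ii; i < ii + BLK && i < n; i++)
                        for (size_t k = kk; k < kk + BLK && k < n; k++) {
                            double aik = A[i * n + k];
                            for (size_t j = jj; j < jj + BLK && j < n; j++)
                                C[i * n + j] += aik * B[k * n + j];  /* stride 1 in j */
                        }
    }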

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From David Schultz@21:1/5 to John Savard on Wed Apr 24 10:12:02 2024
    On 4/24/24 1:57 AM, John Savard wrote:
    What do vector machines do?

    They keep a pipeline full.

    So you can do something in 64+7 clock cycles instead of 64*7.

    If the pipeline gets shorter the benefit decreases of course. And if you
    have some other way to keep that pipeline full, you don't need vectors.
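    A toy calculation of that point, assuming the usual fill model (n + depth - 1 cycles for a depth-stage unit fed one element per cycle):

    #include <stdio.h>

    int main(void)
    {
        int depth = 7, n = 64;
        printf("pipelined:   %d cycles\n", n + depth - 1);  /* ~64+7 */
        printf("unpipelined: %d cycles\n", n * depth);      /*  64*7 */
        return 0;
    }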

    --
    http://davesrocketworks.com
    David Schultz

  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Apr 24 19:50:43 2024
    Anton Ertl wrote:

    John Savard <quadibloc@servername.invalid> writes:
    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    The claim by Mitch Alsup is that latency makes the Crays impractical,
    because of chaining issues. Do GPUs have chaining? My understanding
    is that GPUs deal with latency in the barrel processor way: use
    another data-parallel thread while waiting for memory. Tera also
    pursued this idea, but the GPUs succeeded with it.

    - anton

    Consider:: an 8 deep CRAY-like vector calculation with 8 cycle latency
    memory and 6 cycle latency FMAC::

    |LD|LD|LD|LD|LD|LD|LD|LD|
    |FM|FM|FM|FM|FM|FM|FM|FM|
    |ST|ST|ST|ST|ST|ST|ST|ST|

    Not much parallelism. Now consider the same machine above with longer
    vectors::

    |LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|
    |FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|
    |ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|

    Now we have considerable parallelism with no change in latencies.

    Later consider the top execution profile augmented with a bit of OoO
    and a second memory port::

    |LD|LD|LD|LD|LD|LD|LD|LD|
    |FM|FM|FM|FM|FM|FM|FM|FM|
    |SA|SA|SA|SA|SA|SA|SA|SA| |Sd|Sd|Sd|Sd|Sd|Sd|Sd|Sd|


    Finally consider a GBOoO implementation::

    |LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|...
    |Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|...
    |FM|FM|FM|FM|FM|FM|FM|FM|...
    |SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|...
    |Sd|Sd|Sd|Sd|Sd|Sd|Sd|Sd|...

    Here it takes an execution window 18 deep to reach pipeline saturation,
    but once you do, the core runs at 3 instructions and arguably 4 units
    of work per cycle {without including loop overheads}. In order to
    achieve such performance one needs to issue the whole loop in 1 cycle.

    You have to have the requisite bandwidths {AGEN, bank access, address
    routing bandwidth, result return bandwidth, FMAC bandwidth}, but you
    also have to have the requisite latencies (and execution window width)
    that enable the vector chaining to work, or it falls apart.

  • From Terje Mathisen@21:1/5 to All on Wed Apr 24 23:58:34 2024
    MitchAlsup1 wrote:
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:

    DIV:
    Didn't bother with this.
    Typically faked using multiply-by-reciprocal and taking the high result.

    Another Cray-ism! ;)

    Not IEEE 754 legal.

    Well, it _is_ legal if you carry enough bits in your reciprocal...but at
    that point you would instead use a better algorithm to get the correct
    result both faster and using less power.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Apr 24 22:33:17 2024
    On Wed, 24 Apr 2024 09:28:06 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    On Wed, 24 Apr 2024 06:16:58 GMT, Anton Ertl wrote:

    For that kind of stuff you better use GPUs, which have memory systems
    with more bandwidth.

    But with more limited memory, which is typically not upgradeable.

    And yet, supercomputers these days often have lots of GPUs.

    Some do, some don’t. I’m not sure that GPUs are accepted as de rigueur in supercomputer design yet. I think this is just another instance of Ivan Sutherland’s “wheel of reincarnation” <http://www.cap-lore.com/Hardware/Wheel.html>.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Apr 24 22:29:36 2024
    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world, when
    it came out. Cruising speed of something over 80 megaflops, hitting bursts
    of about 120.

    IBM did try to compete in the “supercomputer” field for a while longer,
    but I think by about ten years later, it had given up.

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Thu Apr 25 00:09:21 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:

    DIV:
    Didn't bother with this.
    Typically faked using multiply-by-reciprocal and taking the high result.
    Another Cray-ism! ;)

    Not IEEE 754 legal.

    Well, it _is_ legal if you carry enough bits in your reciprocal...

    Maybe--at best. There are certain pairs of numerator::denominator that require over 120 reciprocal bits* in order to deliver a properly rounded result using an intermediate reciprocation.

    (*) the reciprocal fraction bits--wider than long double.

    but at
    that point you would instead use a better algorithm to get the correct
    result both faster and using less power.

    Terje

  • From John Savard@21:1/5 to ldo@nz.invalid on Wed Apr 24 23:10:47 2024
    On Wed, 24 Apr 2024 22:33:17 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:
    On Wed, 24 Apr 2024 09:28:06 GMT, Anton Ertl wrote:

    And yet, supercomputers these days often have lots of GPUs.

    Some do, some don’t. I’m not sure that GPUs are accepted as de rigueur in supercomputer design yet. I think this is just another instance of Ivan Sutherland’s “wheel of reincarnation” <http://www.cap-lore.com/Hardware/Wheel.html>.

    What do GPUs do, when they're included in supercomputers?

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    The problem with GPUs is that since their computational capabilities
    are built on what the shader part does, their flexibility is limited.
    This is what has made me think there could be a place for Cray-style
    vectors. So some supercomputers don't have GPU accelerators, because
    they're intended to work on problems for which GPU accelerators
    wouldn't provide much help.

    Since when GPUs _can_ be used, they save lots of electricity, I doubt
    strongly that they're just a passing fad.

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Thu Apr 25 05:39:55 2024
    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs,
    no.

  • From John Levine@21:1/5 to All on Thu Apr 25 11:57:47 2024
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world, when
    it came out. Cruising speed of something over 80 megaflops, hitting bursts
    of about 120.

    Its main practical improvement was that you could get two Crays for the price of one 360/195. (Not exactly, but close enough.)

    IBM did try to compete in the “supercomputer” field for a while longer, but I think by about ten years later, it had given up.

    IBM had tried to make computers very fast by making them very
    complicated. STRETCH was fantastically complex for something built out
    of individual transistors. The /91 and /195 had instruction queues and reservation stations and loop mode. Cray went in the opposite
    direction, making a much simpler computer where each individual bit,
    down to the chips and the wires, was as fast as possible.

    In many ways it was a preview of RISC.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Thu Apr 25 14:57:53 2024
    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs
    are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes,
    GPUs, no.

    If you never heard about *that*, I can only imagine what else you
    didn't hear about supercomputers.

    Back when Fugaku was new, it was highly praised for being a GPU-less
    design that matched and slightly exceeded the efficiency of
    GPU-based (and other vector-accelerator-based) supercomputers. But that
    was possible only because NVidia had an unusually long pause between
    successive generations of Tesla, and at the same moment AMD and
    Intel GPGPUs were not yet considered fit for serious supercomputing.

    That was in November 2019. Never before or since.
    Right now the best GPU-less entry on the Green500 list is #48 (still the
    same A64FX CPU as Fugaku, but a smaller configuration) and it delivers 4x
    less sustained FLOPS/Watt than the top spot, which is based on the NVIDIA H100.

  • From John Levine@21:1/5 to All on Thu Apr 25 12:06:58 2024
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs,
    no.

    NVIDIA says their new Blackwell GPU takes 2000 watts, and is between
    7x and 25x more power efficient than the current H100, but that's
    still a heck of a lot of power. Data centers have had to come up with
    higher capacity power and cooling when each rack can use 40 to 50KW.

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Thu Apr 25 14:27:46 2024
    On Wed, 24 Apr 2024 22:29:36 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world,
    when it came out. Cruising speed of something over 80 megaflops,
    hitting bursts of about 120.

    IBM did try to compete in the “supercomputer” field for a while
    longer, but I think by about ten years later, it had given up.

    In the late 80s IBM joined forces with the "attack of the killer micros".
    Their first POWER CPU was released in 1990 and did 82 MFLOPS (peak).

    A single processor of contemporary Cray Y-MP was 4 times faster.
    A single processor of older Cray-2 was almost 6 times faster, but by
    1990 it was discontinued.
    Wikipedia says that power consumption of Cray-2 was 150–200 kW,
    probably for 4 processors with 2 GB of memory and peripherals.
    I can't find data about power consumption of IBM Power processor. My
    guess would be ~40 W for CPU and 1000-1500 W for a whole RS/6000
    Model 550 with 1 GB of memory.

    BTW, in the latest Top500 list you can see IBM at the #7 spot.
    Things that carry the name of Cray are listed at #2 and #5. They are,
    respectively, Intel Inside and AMD Inside.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to ldo@nz.invalid on Thu Apr 25 07:46:35 2024
    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs,
    no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Thu Apr 25 15:52:36 2024
    John Levine wrote:

    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world, when it came out. Cruising speed of something over 80 megaflops, hitting bursts of about 120.

    Its main practical improvement was that you could get two Crays for the price of one 360/195. (Not exactly, but close enough.)

    IBM did try to compete in the “supercomputer” field for a while longer, but I think by about ten years later, it had given up.

    IBM had tried to make computers very fast by making them very
    complicated. STRETCH was fantastically complex for something built out
    of individual transistors. The /91 and /195 had instruction queues and reservation stations and loop mode. Cray went in the opposite
    direction, making a much simpler computer where every individual bit,
    down to the chips and the wires, was as fast as possible.

    In many ways it was a preview of RISC.

    Seymour only did fast and simple, starting before the CDC 6600.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Apr 25 19:10:19 2024
    On Thu, 25 Apr 2024 15:52:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Seymour only did fast and simple, starting before the CDC 6600.....

    Do you attribute not exactly simple 6600 Scoreboard to Thornton?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Thu Apr 25 17:34:32 2024
    Michael S wrote:

    On Thu, 25 Apr 2024 15:52:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Seymour only did fast and simple, starting before the CDC 6600.....

    Do you attribute not exactly simple 6600 Scoreboard to Thornton?

    If you measure simplicity by gate count--the scoreboard was considerably simpler than the reservation station design of Tomasulo.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Savard on Thu Apr 25 17:52:35 2024
    John Savard <quadibloc@servername.invalid> schrieb:
    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html

    Compared to the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Thu Apr 25 17:49:11 2024
    John Levine <johnl@taugh.com> schrieb:
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs, no.

    NVIDIA says their new Blackwell GPU takes 2000 watts, and is between
    7x and 25x more power efficient than the current H100, but that's
    still a heck of a lot of power. Data centers have had to come up with
    higher capacity power and cooling when each rack can use 40 to 50KW.

    GPUs are very energy efficient per theoretical peak performance of
    calculations per second. Said peak performance is extremely high,
    hence the huge power requirements...

    But programming for GPUs is _much_ harder than programming for
    vector computers used to be. Getting to 10% of theoretical peak
    performance is quite impressive. Getting above 50% requires
    the right problem, good knowledge of the GPU internals (which NVIDIA
    does not tend to share - don't they want people to get good
    performance on their cards?) and lots of thought and _very_ clever
    algorithms.

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Apr 25 20:45:32 2024
    On Thu, 25 Apr 2024 17:34:32 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Michael S wrote:

    On Thu, 25 Apr 2024 15:52:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Seymour only did fast and simple, starting before the CDC
    6600.....

    Do you attribute not exactly simple 6600 Scoreboard to Thornton?

    If you measure simplicity by gate count--the scoreboard was
    considerably simpler than the reservation station design of Tomasulo.

    Both were far from simple by the standards of the day.

    BTW, wasn't the low gate count of the Scoreboard mostly due to creative
    use of what was later named wired-logic connections, i.e. something that
    stopped working in high-speed VLSI around 1985-1990?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Apr 25 19:17:04 2024
    According to Thomas Koenig <tkoenig@netcologne.de>:
    https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html

    Compared to the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Well, compared to what?

    In 1960 the total power generated in the US was about 750 TWh. In
    recent years it's over 4000 TWh.

    I see global data center power use in recent years of about 250 TWh,
    and about the same again in data transmission, but I don't know how
    much of that to attribute to the US.



    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Apr 25 22:29:30 2024
    Lawrence D'Oliveiro wrote:

    On Thu, 25 Apr 2024 15:52:36 +0000, MitchAlsup1 wrote:

    [Seymour] only did fast and simple, starting before the CDC 6600.....

    And he didn’t seem to have much truck with “memory management” and “operating systems”, did he? He probably saw them as just getting in the way of sheer speed.

    Base and bounds was good enough for numerical programs.

    On the other hand, NOS did things no other OS did.....

    And he didn’t care for some of the niceties of floating-point arithmetic either, for the same reason.

    Heck, FP arithmetic is only approximate anyway--it is just more
    approximate on my machines.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Thu Apr 25 22:23:49 2024
    On Thu, 25 Apr 2024 15:52:36 +0000, MitchAlsup1 wrote:

    [Seymour] only did fast and simple, starting before the CDC 6600.....

    And he didn’t seem to have much truck with “memory management” and “operating systems”, did he? He probably saw them as just getting in the way of sheer speed.

    And he didn’t care for some of the niceties of floating-point arithmetic either, for the same reason.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Fri Apr 26 00:26:58 2024
    On Thu, 25 Apr 2024 22:29:30 +0000, MitchAlsup1 wrote:

    On the other hand, NOS did things no other OS did.....

    Like what? I thought the original Cray OS was just a batch OS.

    Then they added this Unix-like “UNICOS” thing, but that seemed to me like an interactive front-end to the batch OS.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Apr 26 00:33:50 2024
    On Thu, 25 Apr 2024 14:57:53 +0300, Michael S wrote:

    Back when Fugaku was new, it was highly praised for being GPU-less
    design that matches and slightly exceeds an efficiency of GPU-based (and other vector accelerator based) supercomputers. But that was possible
    only because NVidia had an unusually long pause between successive
    generations of Tesla and at the same moment AMD and Intel GPGPUs were
    not yet considered fit for serious supercomputing.

    That was in November 2019. Never before or since.

    Fugaku is still at number 4 on the Top500, though--even after all these
    years. And don’t forget the Chinese systems, using their home-grown CPUs without access to Nvidia GPUs. There’s one at number 11.

    Should we be looking at the Green500 list instead?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Fri Apr 26 00:53:02 2024
    Lawrence D'Oliveiro wrote:

    On Thu, 25 Apr 2024 22:29:30 +0000, MitchAlsup1 wrote:

    On the other hand, NOS did things no other OS did.....

    Like what? I thought the original Cray OS was just a batch OS.


    One afternoon in 1978, I was in the typing room at NCR Cambridge typing
    in my 8085 ASM code; there were another 6 of us in there. NCR rented
    time on a CDC 7600 in San Diego.

    Suddenly there was a long pause where the silent 700's made no noise;
    and after 20 or so seconds, the pause ended and we proceeded along
    with our work. I discovered later that the San Diego machine had taken
    a hard crash and all our jobs had been picked up by the PPs and shipped
    en masse to a CDC 7600 in Chicago (including the files those jobs were
    using.)

    Then they added this Unix-like “UNICOS” thing, but that seemed to me like an interactive front-end to the batch OS.

    It was, and it was written in interpreted BASIC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Fri Apr 26 04:20:58 2024
    On Fri, 26 Apr 2024 00:33:50 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Thu, 25 Apr 2024 14:57:53 +0300, Michael S wrote:

    Back when Fugaku was new, it was highly praised for being GPU-less
    design that matches and slightly exceeds an efficiency of GPU-based
    (and other vector accelerator based) supercomputers. But that was
    possible only because NVidia had an unusually long pause between
    successive generations of Tesla and at the same moment AMD and
    Intel GPGPUs were not yet considered fit for serious supercomputing.

    That was in November 2019. Never before or since.

    Fugaku is still at number 4 on the Top500, though--even after all
    these years. And don’t forget the Chinese systems, using their
    home-grown CPUs without access to Nvidia GPUs. There’s one at number
    11.


    From the very little info I found about it, Sunway TaihuLight
    processors are likely more similar to Intel KNC (a.k.a. Xeon Phi
    co-processor) than to Fujitsu A64FX. I.e. simple in-order cores, likely
    2-way superscalar, with single-issue wide VPUs. In other words,
    decisively non-general-purpose.

    Should we be looking at the Green500 list instead?

    Of course we should be looking at Green500 when discussing power
    efficiency.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Apr 26 23:14:14 2024
    On Thu, 25 Apr 2024 14:27:46 +0300, Michael S wrote:

    On Wed, 24 Apr 2024 22:29:36 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world,
    when it came out. Cruising speed of something over 80 megaflops,
    hitting bursts of about 120.

    IBM did try to compete in the “supercomputer” field for a while longer, >> but I think by about ten years later, it had given up.

    BTW, in the latest Top500 list you can see IBM at the #7 spot.

    Those are POWER machines, an entirely different architecture from the
    System 360-and-successors line (which I think was meant by “Model 195”). And one which still has a bit of oomph left in it, obviously.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to mitchalsup@aol.com on Sat Apr 27 08:23:38 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    Let us face facts:: in the large; vector machines are DMA devices
    that happen to mangle the data on the way through.

    John Savard wrote:

    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    a) It is not pure BW but BW at a latency less than K. CRAY-1 was
    about 16-cycles (DRAM)

    DRAM for CRAY-1 doesn't sound right. Intel made 1024-bit DRAM in 1970,
    but it was pretty flaky and not very fast. I think the CRAY-1 used
    Fairchild 10K ECL 10ns SRAM.

    Andrew,

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to BGB on Sat Apr 27 11:30:29 2024
    BGB <cr88192@gmail.com> schrieb:

    Say, seemingly no one built an 8/16 bit mainframe,

    The IBM 360/30 and 360/40 actually had 8- and 16-bit
    microarchitectures, respectively. Of course, they hid it cleverly
    behind the user-visible architecture which was 32 bits.

    But then, the Nova was a 4-bit system cleverly disguising itself
    as a 16-bit system, and the Z80 had a 4-bit ALU, as well.

    or say using 24-bit
    floats (Say: S.E7.F16) rather than bigger formats, ...

    Konrad Zuse used 22-bit floats.

    Like, seemingly, the smallest point of computers was seemingly things
    like the 6502 and similar...

    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Sat Apr 27 11:48:03 2024
    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html

    Compared to the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Well, compared to what?

    Absolute figures, or relative :-)


    In 1960 the total power generated in the US was about 750 TWh. In
    recent years it's over 4000 TWh.

    My point was: Computers have become vastly more energy-efficient
    and powerful. I think one of the "What If 2" chapters is about
    building an iPhone out of vacuum tubes, which would end badly.

    This has led to _much_ more widespread adoption of computers plus
    derivatives such as smartphones or tablets, which means that
    their overall energy consumption has increased by many orders
    of magnitude over the 1950s, when just a few vacuum-tube based
    computers were in operation.

    If people make the claim that GPUs are more power-efficient than CPUs,
    yes, they are for equal performance (if they can be programmed
    efficiently enough for the application at hand). In practice, this
    will not be used for energy savings, but for doing more calculations.

    Same thing happened with steam engines - Watt's engines were a huge
    improvement in fuel efficiency over the previous Newcomen models,
    which led to many more steam engines being built.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to aph@littlepinkcloud.invalid on Sat Apr 27 15:13:03 2024
    aph@littlepinkcloud.invalid wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    Let us face facts:: in the large; vector machines are DMA devices
    that happen to mangle the data on the way through.

    John Savard wrote:

    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    a) It is not pure BW but BW at a latency less than K. CRAY-1 was
    about 16-cycles (DRAM)

    DRAM for CRAY-1 doesn't sound right. Intel made 1024-bit DRAM in 1970,
    but it was pretty flaky and not very fast. I think the CRAY-1 used
    Fairchild 10K ECL 10ns SRAM.

    That was the CRAY-1S; the S stands for SRAM.

    Also note 16 cycles at 12.5ns (200ns) is plenty of time for even
    early RAS/CAS DRAM.

    Andrew,

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Apr 27 16:41:19 2024
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Like, seemingly, the smallest point of computers was seemingly things
    like the 6502 and similar...

    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    I can believe it. The PDP-8 was a simple architecture and the S stood
    for bit Serial, and Stupendously Slow.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sat Apr 27 23:10:56 2024
    On Sat, 27 Apr 2024 16:41:19 -0000 (UTC), John Levine wrote:

    According to Thomas Koenig <tkoenig@netcologne.de>:

    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    I can believe it.

    You can probably find detailed schematics, on Bitsavers or elsewhere, to confirm it. DEC published that sort of thing as a matter of course, back
    in those days.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Sat Apr 27 23:08:21 2024
    On Sat, 27 Apr 2024 11:48:03 -0000 (UTC), Thomas Koenig wrote:

    If people make the claim that GPUs are more power-efficient than CPUs,
    yes, they are for equal performance (if they can be programmed
    efficiently enough for the application at hand). In practice, this will
    not be used for energy savings, but for doing more calculations.

    “Rebound effect”, I think it’s called.

    Remember all those science-fiction predictions from the earlier part of
    the 20th century, about cities on the Moon, personal flying transportation
    and all the rest of it? All that was predicated on having large sources of power--i.e. atomic power.

    Instead of having atomic-scale sources of power at our disposal, we got information processing (computers) instead, and almost nobody saw how big
    a revolution that would be. Meanwhile, the atomic-energy industry seemed
    to take a wrong turn, putting more effort into power production systems
    that would also aid the production of atomic weapons, instead of
    concentrating on predominantly peaceful technologies.

    Now the information processing power is reaching the limits of the
    available physical power. The only way to make significant further
    progress is to start boosting that physical power generation again.

    Same thing happened with steam engines - Watt's engines were a huge improvement in fuel efficiency over the previous Newcomen models, which
    led to many more steam engines being built.

    Watt’s engine (like Newcomen’s one before it) was an “atmospheric” engine:
    the pressure to drive it came from the atmosphere, not from the steam.

    True high-pressure “steam” engines were developed by Trevithick and
    others, after Watt’s patent had expired and he could no longer stop them.

    And that is what kicked off the Industrial Revolution.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Apr 28 16:19:24 2024
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    I can believe it.

    You can probably find detailed schematics, on Bitsavers or elsewhere, to confirm it. DEC published that sort of thing as a matter of course, back
    in those days.

    The logic diagrams are in the back of the maintenance manual which
    Bitsavers does have, but at the moment I don't feel like going through
    and counting the gates.



    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sun Apr 28 14:06:24 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    BGB <cr88192@gmail.com> schrieb:

    Say, seemingly no one built an 8/16 bit mainframe,

    The IBM 360/30 and 360/40 actually had a 8 and 16-bit
    microarchitecture, respectively. Of course, they hid it cleverly
    behind the user-visible architecture which was 32 bits.

    But then, the Nova was a 4-bit system cleverly disguising itself
    as a 16-bit system, and the Z80 had a 4-bit ALU, as well.

    or say using 24-bit
    floats (Say: S.E7.F16) rather than bigger formats, ...

    Konrad Zuse used 22-bit floats.

    Like, seemingly, the smallest point of computers was seemingly things
    like the 6502 and similar...

    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    The LGP-30 had 113 tubes and 1450 diodes. The transistorized
    successor, the LGP-31, had about 460 transistors and about
    375 diodes (all per the wikipedia article on the LGP-30).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sun Apr 28 14:18:51 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    John Savard <quadibloc@servername.invalid> schrieb:

    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include
    GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs,
    yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/
    gpus-boost-energy-efficiency-in-supercomputers.html

    Compared the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Total energy consumption by computers in the 1950s was lower
    than today by at least a factor of 10. It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tkoenig@netcologne.de on Mon Apr 29 00:48:45 2024
    On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    John Levine <johnl@taugh.com> schrieb:

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    Possibly John meant to say "4Kwh", which actually would be a bit on
    the high side for the *average* home in the US.

    If he really meant 4Kw continuous ... wow!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to George Neuner on Mon Apr 29 08:13:47 2024
    George Neuner wrote:
    On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    John Levine <johnl@taugh.com> schrieb:

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    Possibly John meant to say "4Kwh", which actually would be a bit on
    the high side for the *average* home in the US.

    If he really meant 4Kw continuous ... wow!

    Here in Norway we abuse our hydro power as our primary house heating
    source, in our previous home we used about 60K KWh per year, which
    corresponds to 60K/(24*365.24) = 6.84 KW average, day & night.

    This was in fact while having a heat pump to handle the main part of the heating needs.

    The new house, which is from the same era (1962 vs 1963), uses
    significantly less, but probably still 30-40K /year.

    Electric power used to cost just under 1 NOK (about 9 cents at current
    exchange rates), including both primary power cost and transmission
    cost, but then we started exporting too much to Denmark/Sweden/Germany
    which means that we also imported their sometimes much higher power prices.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Apr 29 16:53:42 2024
    According to George Neuner <gneuner2@comcast.net>:
    On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    John Levine <johnl@taugh.com> schrieb:

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    Possibly John meant to say "4Kwh", which actually would be a bit on
    the high side for the *average* home in the US.

    Last month we used 92 KWh/day, which is 3.8KW. The ground source heat
    pump is how we heat the house and it was a fairly cool month. We also
    have a separate heat pump for hot water (tying it to the main system
    was absurdly expensive) and an induction stove which can use up to
    10KW.

    During the summer we use a lot less power. On the other hand, our
    bills for gas, propane, and fuel oil are zero.

    FWIW we pay about 12c/kwh which is fairly low for the U.S., with a
    complicated remote net metering discount in which we pretend that part
    of a solar farm in a nearby town is on our roof.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to terje.mathisen@tmsw.no on Mon Apr 29 21:39:55 2024
    On Mon, 29 Apr 2024 08:13:47 +0200, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    George Neuner wrote:
    On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    John Levine <johnl@taugh.com> schrieb:

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    Possibly John meant to say "4Kwh", which actually would a be a bit on
    the high side for the *average* home in the US.

    If he really meant 4Kw continuous ... wow!

    Here in Norway we abuse our hydro power as our primary house heating
    source, in our previous home we used about 60K KWh per year, which corresponds to 60K/(24*365.24) = 6.84 KW average, day & night.

    This was in fact while having a heat pump to handle the main part of the heating needs.

    The new house, which is from the same era (1962 vs 1963), uses
    significantly less, but probably still 30-40K /year.

    Electric power used to cost just under 1 NOK (about 9 cents at current exchange rates), including both primary power cost and transmission
    cost, but then we started exporting too much to Denmark/Sweden/Germany
    which means that we also imported their sometimes much higher power prices.

    Terje

    In the US, the majority of homes are heated with oil or gas (LNG or in
    rural areas it might be propane). Electric heat mainly is found in
    the south where overall need is low. Electric cooling is far more
    widespread. The majority of ovens are electric, but ~ 2/3 of cooktops
    are gas.

    Where I am, the per Kwh rates *currently* are
    0.17216 - generation
    0.09434 - distribution
    0.04052 - transmission
    0.00037 - transition (from? to?)
    0.00006
    0.00800
    0.00050
    0.02334 - efficiency (of what?)
    ------
    0.33929

    It's little wonder the current administration wants to force everyone
    to use only electricity ... it will bankrupt consumers trying to pay
    for energy, and bankrupt utilities trying to deliver it. Estimates
    are that the grid needs trillions of dollars in upgrades to handle the anticipated load [that the administration wants to force on it within
    5 years].

    I'd have a nuclear reactor in my basement if I could.

    YMMV.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Apr 30 14:54:52 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Adding the typical kind of vector-processing instructions to an
    instruction set inevitably leads to a combinatorial explosion in the
    number of opcodes.

    Why is that a problem that needs solving?

    This kind of thing makes a mockery of the R in RISC.

    So what?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Apr 30 16:26:35 2024
    Scott Lurndal wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Adding the typical kind of vector-processing instructions to an
    instruction set inevitably leads to a combinatorial explosion in the
    number of opcodes.

    Why is that a problem that needs solving?

    When your OpCode encoding space runs out of bits in the instruction.

    This kind of thing makes a mockery of the R in RISC.

    So what?

    Design + verification cost, time to market, Size of test vector set,
    and Compiler complexity.

    So, pretty close to the difference between binary floating point
    and decimal floating point.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Tue Apr 30 17:56:36 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    John Savard <quadibloc@servername.invalid> schrieb:

    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include
    GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs,
    yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/
    gpus-boost-energy-efficiency-in-supercomputers.html

    Compared the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Total energy consumption by computers in the 1950s was lower
    than today by at least a factor of 10.

    Undoubtedly true, but I think you're missing quite a few
    orders of magnitude there.

    It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    https://smithsonianeducation.org/scitech/carbons/1960.html states
    that, in 1954, there were 15 computers in the US. That seems low
    (did they only count IBM 701 machines?), but it reportedly went up to
    17000 in 1964.

    Even if you put the number of computers at 100 for the mid-1950s, at
    100 kW each, you only get 10 MW of power when they ran (which they often
    didn't; due to maintenance, these early computers seem to have been
    day shift only).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Apr 30 19:38:54 2024
    According to Thomas Koenig <tkoenig@netcologne.de>:
    It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    https://smithsonianeducation.org/scitech/carbons/1960.html states
    that, in 1954, there were 15 computers in the US. That seems low
    (did they only count IBM 701 machines?), but it reportedly went up to
    17000 in 1964.

    Wikipedia lists 18 UNIVACs shipped by 1954 so that's certainly low.
    With the 702, the ERA machines and the one-offs like JOHNNIAC I'd
    guess the number was more like 50, but soon increased with multiple
    IBM 704 and 650 machines starting in 1954.

    Even if you put the number of computers at 100 for the mid-1950s, at
    100 kW each, you only get 10 MW of power when they ran (which they often didn't; due to maintenance, these early computers seem to have been
    day shift only).

    The 650s at least ran all night. Alan Perlis told me some amusing
    stories of tripping in the dark over sleeping grad student wives who
    were holding their husbands' place in line for the 650 in the middle
    of the night. They soon made the scheduling more humane.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue Apr 30 20:25:48 2024
    On Tue, 30 Apr 2024 19:38:54 -0000 (UTC), John Levine wrote:

    The 650s at least ran all night. Alan Perlis told me some amusing stories
    of tripping in the dark over sleeping grad student wives who were
    holding their husbands' place in line for the 650 in the middle of the
    night. They soon made the scheduling more humane.

    How did they do that, though? Other than by hiring operators to work those shifts, so the users could submit their jobs in a queue and go home?

    Those early computers were expensive, hence the need for 24-hour batch operation to keep them as busy as possible, to earn their keep.

    That batch mentality is still characteristic of (what’s left of) IBM mainframes today.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Apr 30 20:31:17 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Adding the typical kind of vector-processing instructions to an instruction set inevitably leads to a combinatorial explosion in the number of opcodes.

    Why is that a problem that needs solving?

    When your OpCode encoding space runs out of bits in the instruction.

    And has that been a real problem yet? Pretty much every
    instruction set can be easily extended (viz. 8086),
    particularly with variable length encodings; nothing prevents
    one from adding a special 32-bit encoding that extends the
    instruction to 64 bits even in a fixed-size encoding scheme.
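
    As a rough sketch of what such an escape looks like (hypothetical
    opcode values and field layout, not any shipping ISA), the decoder
    simply treats one major opcode as "fetch another word":

        #include <stdint.h>

        /* Hypothetical: low 6 bits are the major opcode, and the value
           0x3F means "extended" -- the next 32-bit word carries the
           extra type/operand fields.                                   */
        #define OP_EXTENDED 0x3Fu

        typedef struct {
            uint32_t word0;   /* always present           */
            uint32_t word1;   /* valid only when extended */
            int      nbytes;  /* 4 or 8                   */
        } decoded_insn;

        decoded_insn decode(const uint32_t *stream)
        {
            decoded_insn d = { stream[0], 0, 4 };
            if ((d.word0 & 0x3Fu) == OP_EXTENDED) {
                d.word1  = stream[1];   /* second word of the long form */
                d.nbytes = 8;
            }
            return d;
        }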


    This kind of thing makes a mockery of the R in RISC.

    So what?

    Design + verification cost, time to market, Size of test vector set,
    and Compiler complexity.

    As contrasted with usability. ARM doesn't add features just
    for the sake of adding features, nor does Intel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Apr 30 21:12:04 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Adding the typical kind of vector-processing instructions to an instruction set inevitably leads to a combinatorial explosion in the number of opcodes.

    Why is that a problem that needs solving?

    When your OpCode encoding space runs out of bits in the instruction.

    And has that been a real problem yet? Pretty much every
    instruction set can be easily extended (viz. 8086),
    particularly with variable length encodings, nothing prevents
    one from adding a special 32-bit encoding that extends the
    instruction to 64-bits even in a fixed size encoding scheme.

    I suspect that as long as RISC-V maintains its 32-bit-only ISA,
    RISC-V will hit that wall first.


    This kind of thing makes a mockery of the R in RISC.

    So what?

    Design + verification cost, time to market, Size of test vector set,
    and Compiler complexity.

    As contrasted with usability. ARM doesn't add features just
    for the sake of adding features, nor does Intel.

    Are you sure ?? Take SSE-512 (or whatever Intel calls it) !!

    When I was at AMD (99-06) every 6 months or so, we (AMD) got Intel's
    latest instruction additions, and they got ours. Most of these
    additions end up at the 0.01% level of the dynamic instructions
    executed (over a wide range of programs (more than 40,000 traces)),
    and all cores had to have all of the instructions.

    Is this a burden on Intel:: not so much since they already have
    extensive (exhaustive??) tests and implementation libraries....

    Is this a burden on AMD:: yes, absolutely; the smaller design staff
    they can afford based on their revenue stream increases the burden significantly.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Tue Apr 30 23:08:54 2024
    On Tue, 30 Apr 2024 20:31:17 GMT, Scott Lurndal wrote:

    ARM doesn't add features just for the sake of adding features, nor does Intel.

    There is such a thing as painting yourself into a corner, where every new feature added to the SIMD instruction set involves adding combinations of instructions, not just for the new types, but also for every single old
    type as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 30 23:56:03 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 30 Apr 2024 20:31:17 GMT, Scott Lurndal wrote:

    ARM doesn't add features just for the sake of adding features, nor does
    Intel.

    There is such a thing as painting yourself into a corner, where every new feature added to the SIMD instruction set involves adding combinations of instructions, not just for the new types, but also for every single old
    type as well.

    That is the combinatorial explosion mentioned above.
    {Although I would term it the Cartesian Product of types and OPs}

    Then contemplate for an instant that one would want SIMD instructions for Complex numbers and Hamiltonian Quaternions......
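
    To put a number on that Cartesian product, a throwaway sketch (the
    type/op/width lists here are made up, not any particular vendor's):

        #include <stdio.h>

        int main(void)
        {
            /* Invented lists, just to show how the opcode count multiplies out. */
            const char *types[]  = { "i8", "i16", "i32", "i64",
                                     "f16", "f32", "f64" };
            const char *ops[]    = { "add", "sub", "mul", "min", "max", "cmp" };
            const int   widths[] = { 128, 256, 512 };

            int nt = sizeof types  / sizeof types[0];
            int no = sizeof ops    / sizeof ops[0];
            int nw = sizeof widths / sizeof widths[0];

            /* 7 x 6 x 3 = 126 opcodes, before masking, saturation and
               rounding modes get multiplied in as well.                */
            printf("%d distinct SIMD opcodes\n", nt * no * nw);
            return 0;
        }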

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Wed May 1 00:18:29 2024
    On Tue, 30 Apr 2024 23:56:03 +0000, MitchAlsup1 wrote:

    Then contemplate for an instant that one would want SIMD instructions
    for Complex numbers and Hamiltonian Quater[n]ions......

    Quaternions yeah! Interesting that they actually predated vector algebra <https://www.youtube.com/watch?v=M12CJIuX8D4> (from the wonderful “Kathy Loves Physics & History” channel), and then the mathematicians realized
    that it was a bit simpler to separate out the components and deal with
    them separately, rather than carry them around all the time. Some of the
    “old guard” resisted this move ...

    And now they’ve made a comeback in computer graphics, for representing rotations, particularly of armature “bones” used in posing and animating characters.

    I’m not sure you really need SIMD instructions for quaternions, though. Consider that the typical use of such instructions is to process millions
    or even billions of data items (e.g. pixels, maybe even geometry
    coordinates for complex models), whereas the number of bones in an
    armature is maybe a few thousand at most.
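
    For reference, the Hamilton product itself is only 16 multiplies and 12
    adds per pair of quaternions -- a plain scalar sketch (the struct layout
    is my own choice):

        typedef struct { double w, x, y, z; } quat;   /* w + xi + yj + zk */

        /* Hamilton product p*q; order matters, quaternions don't commute. */
        quat qmul(quat p, quat q)
        {
            quat r;
            r.w = p.w*q.w - p.x*q.x - p.y*q.y - p.z*q.z;
            r.x = p.w*q.x + p.x*q.w + p.y*q.z - p.z*q.y;
            r.y = p.w*q.y - p.x*q.z + p.y*q.w + p.z*q.x;
            r.z = p.w*q.z + p.x*q.y - p.y*q.x + p.z*q.w;
            return r;
        }

    Looping over a few thousand bones with that is cheap with or without
    SIMD, which is rather the point.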

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to ldo@nz.invalid on Wed May 1 01:22:21 2024
    It appears that Lawrence D'Oliveiro <ldo@nz.invalid> said:
    On Tue, 30 Apr 2024 19:38:54 -0000 (UTC), John Levine wrote:

    The 650s at least ran all night. Alan Perlis told me some amusing stories
    of tripping in the dark over sleeping grad student wives who were
    holding their husbands' place in line for the 650 in the middle of the
    night. They soon made the scheduling more humane.

    How did they do that, though? Other than by hiring operators to work those shifts, so the users could submit their jobs in a queue and go home?

    Rather than just queueing up, they arranged it so the student could
    sign up ahead of time, and then show up whenever to do his work, and
    the wives could get some sleep.

    I also think he tried to round up some money to get another computer.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Wed May 1 03:06:04 2024
    John Levine wrote:

    It appears that Lawrence D'Oliveiro <ldo@nz.invalid> said:
    On Tue, 30 Apr 2024 19:38:54 -0000 (UTC), John Levine wrote:

    The 650s at least ran all night. Alan Perlis told me some amusing stories of tripping in the dark over sleeping grad student wives who were
    holding their husbands' place in line for the 650 in the middle of the
    night. They soon made the scheduling more humane.

    How did they do that, though? Other than by hiring operators to work those shifts, so the users could submit their jobs in a queue and go home?

    Rather than just queueing up, they arranged it so the student could
    sign up ahead of time, and then show up whenever to do his work, and
    the wives could get some sleep.

    I also think he tried to round up some money to get another computer.

    I remember getting up at 3:00 AM to get exclusive access to the IBM 360/67
    to run various student programs with much better response time than when
    30 other people were trying to do the same. {Everybody else, except the
    system operator, had left by then::at least statistically.}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Tue Apr 30 23:58:16 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    John Savard <quadibloc@servername.invalid> schrieb:

    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include
    GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs,
    yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/
    gpus-boost-energy-efficiency-in-supercomputers.html

    Compared the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Total energy consumption by computers in the 1950s was lower
    than today by at least a factor of 10.

    Undoubtedly true, but I think you're missing quite a few
    orders of magnitude there.

    Probably not as many as you think. :)

    It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    https://smithsonianeducation.org/scitech/carbons/1960.html states
    that, in 1954, there were 15 computers in the US. That seems low
    (did they only count IBM 701 machines?), but it reportedly went up to
    17000 in 1964.

    Even if you put the number of computers at 100 for the mid-1950s, at
    100 kW each, you only get 10 MW of power when they ran (wich they often didn't; due to maintenance, these early computers seem to have been
    day shift only).

    Oh boy, numbers.

    First your question asked about the late 1950s, not the mid 1950s.

    I estimated between 10,000 and 20,000 computers by the end of
    the 1950s, and chose 5 KW as an average consumption. In those
    days computers were big. Probably the estimate for number of
    machines is a bit on the high side, and the average consumption
    is a bit on the low side. I'm only estimating.

    The most popular computer in the 1950s was the IBM 650. 2,000
    units sold (or in some cases given away).

    In contrast, the LGP-30 turned out only 500 units, at a mere
    1500 W each.

    Towards the end of the 1950s both the IBM 1620 and the IBM 1401
    came out. Of course neither of these was delivered
    until the 1960s, but the IBM 1401 delivered 10,000 units all
    on its own.

    I looked up a few other IBM models, didn't get any unit numbers on
    any of them. I didn't even try to look up models or numbers of
    units from other manufacturers (not counting the LGP-30, since I
    happened to have a wikipedia page open already for that). But
    based on just the number of different IBM models, and knowing that
    the 650 produced 2,000 units, and keeping in mind the number of
    different computer manufacturers at that time, suggests that 10,000
    systems overall is a plausible guess.

    Also there is a noteworthy computer system developed in the 1950s
    that is often overlooked. Only 24 units were installed. Each
    installation occupied 22,000 square feet, weighed 250 tons, had
    60,000 tubes, and used 3 MW. So that's 72 MW all by itself (to be
    fair some parts were turned off at times for maintenance, but at
    least half of each installation was up at all times).

    I did a very different kind of calculation to estimate how much
    power is used in today's computers. The result was more than
    ten times as much, but less than 100 times as much. Remember,
    I'm just estimating. But I had enough confidence in the estimates
    to say at least a factor of 10, which seems more than adequate to
    answer the question asked (and that's all I was doing).

    What's the largest computer ever built? The AN/FSQ-7. Only 24
    installed, for an aggregate weight of 6,000 tons.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Wed May 1 08:56:47 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    John Savard <quadibloc@servername.invalid> schrieb:

    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include
    GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs,
    yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/
    gpus-boost-energy-efficiency-in-supercomputers.html

    Compared the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Total energy consumption by computers in the 1950s was lower
    than today by at least a factor of 10.

    Undoubtedly true, but I think you're missing quite a few
    orders of magnitude there.

    Probably not as many as you think. :)

    It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    https://smithsonianeducation.org/scitech/carbons/1960.html states
    that, in 1954, there were 15 computers in the US. That seems low
    (did they only count IBM 701 machines?), but it reportedly went up to
    17000 in 1964.

    Even if you put the number of computers at 100 for the mid-1950s, at
    100 kW each, you only get 10 MW of power when they ran (wich they often
    didn't; due to maintenance, these early computers seem to have been
    day shift only).

    Oh boy, numbers.

    First your question asked about the late 1950s, not the mid 1950s.

    I estimated between 10,000 and 20,000 computers by the end of
    the 1950s, and chose 5 KW as an average consumption. In those
    days computers were big. Probably the estimate for number of
    machines is a bit on the high side, and the average consumption
    is a bit on the low side. I'm only estimating.

    The number of computers is probably high, the power maybe somewhat
    low, but let us take it as a basis - 2*10^4 computers with 5*10^3
    Watt, total power if they are all on at the same time 10^8 Watt.
    Let's assume an operating time of 4000 hours, so total energy
    consumption would be around 1.44*10^15 J or 4*10^8 kWh, or
    0.4 Terawatt-hours.
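
    The same estimate as a few lines of C, just so the units stay explicit
    (the 2*10^4 machines, 5 kW and 4000 h figures are the assumptions
    above, nothing more):

        #include <stdio.h>

        int main(void)
        {
            double machines  = 2e4;     /* installed computers (upper estimate) */
            double avg_power = 5e3;     /* Watt per machine                     */
            double hours     = 4000.0;  /* operating hours per year             */

            double watts = machines * avg_power;   /* 1e8 W            */
            double kwh   = watts * hours / 1e3;    /* 4e8 kWh per year */
            double twh   = kwh / 1e9;              /* 0.4 TWh per year */

            printf("%.0e W, %.0e kWh/year, %.1f TWh/year\n", watts, kwh, twh);
            return 0;
        }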

    For today, we don't need to make an estimate
    ourselves, we can use other people's. Looking at https://frontiergroup.org/resources/fact-file-computing-is-using-more-energy-than-ever/
    one finds that data centers alone use around 240-340 Terawatt-hours,
    so we have a factor of a bit less than 1000 already. The total
    sector, according to the same source, and also according to https://researchbriefings.files.parliament.uk/documents/POST-PN-0677/POST-PN-0677.pdf
    is around three times that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed May 1 08:20:58 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Thu May 2 10:13:33 2024
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for
    Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?
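
    Right -- for one double-precision complex value per 128-bit register
    ({re,im}, real part in the low lane) the usual SSE3 idiom is roughly
    the following (a sketch, not tuned code; needs SSE3 enabled):

        #include <pmmintrin.h>   /* SSE3, for _mm_addsub_pd */

        /* x = {a,b}, y = {c,d}; returns {a*c - b*d, b*c + a*d}. */
        __m128d cmul(__m128d x, __m128d y)
        {
            __m128d yre = _mm_unpacklo_pd(y, y);    /* {c, c}     */
            __m128d yim = _mm_unpackhi_pd(y, y);    /* {d, d}     */
            __m128d xsw = _mm_shuffle_pd(x, x, 1);  /* {b, a}     */
            __m128d t1  = _mm_mul_pd(x,   yre);     /* {a*c, b*c} */
            __m128d t2  = _mm_mul_pd(xsw, yim);     /* {b*d, a*d} */
            return _mm_addsub_pd(t1, t2);
        }

    i.e. the unpacks/shuffle are exactly the flipped copies described
    above, and they cost extra instructions on every single product.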

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Thu May 2 10:58:12 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    I'm actually not sure how that would help. Could you elaborate?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Thu May 2 20:10:35 2024
    Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for
    Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    I'm actually not sure how that would help. Could you elaborate?


    VVM synthesizes SIMD (lanes) and strip-mining (Cray-like vectors) while processing SCALAR code. So, as long as the compiler knows which operands
    are participating, almost any amount of <strange> Complexity drops out
    for free -- including things like Quaternions.
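
    As an illustration of how little the source has to contain, here is the
    Hamilton product written as plain scalar C (just a sketch, not My 66000
    code); a loop applying it over arrays is all a VVM-style implementation
    would need to see:

    typedef struct { double w, x, y, z; } quat;

    /* Hamilton product r = a*b, written as ordinary scalar code. */
    static quat qmul(quat a, quat b)
    {
        quat r;
        r.w = a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z;
        r.x = a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y;
        r.y = a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x;
        r.z = a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w;
        return r;
    }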

    Physicists like quaternions because it means they don't have to worry
    about whether to add or subtract; the {i,j,k} does it for them. Complex
    is OK for flat spaces, but when one is dealing with non-Cartesian
    coordinates (like within the radius of the proton) other effects make
    quaternions a better path.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu May 2 20:14:09 2024
    BGB wrote:

    On 4/30/2024 8:22 PM, John Levine wrote:


    Sometimes it seems odd that people manage to find wives at all, with
    as many difficulties and prerequisites as there seem to be in being
    seen as "worthy of attention", etc...


    Then again, it seems that there is a split:
    Many people seem to marry off between their early to mid 20s;
    Like, somehow, they find someone where there is mutual interest.

    More than ½ of whom end up divorced within 7 years.

    Others, not so quickly, if at all.

    You mean the lucky ones ?!?

    On the female side, it seems there are several subgroups:
    Those who are waiting for "the perfect romance".
    Those who want someone with at least a "6 figure income", etc.
    Then there are the asexual females.
    And also lesbians.

    If you don't know what you are looking for, how do you know when
    you find it ?!!!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Fri May 3 03:46:22 2024
    On Thu, 2 May 2024 20:14:09 +0000, MitchAlsup1 wrote:

    If you don't know what you are looking for, how do you know when you
    find it ?!!!

    Maybe the procedure for determining that you’ve found it is recursively enumerable, but that for doing the search is not? ;)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Fri May 3 10:23:43 2024
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for
    Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    I'm actually not sure how that would help. Could you elaborate?

    Just that all his code is scalar, but when you have a bunch of these
    complex mul/mac operations in a loop, his hw will figure out the
    recurrences and run them as fast as possible, with all the (Re,Im) SIMD
    flips becoming NOPs.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Fri May 3 09:40:33 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for
    Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    I'm actually not sure how that would help. Could you elaborate?

    Just that all his code is scalar, but when you have a bunch of these
    complex mul/mac operations in a loop, his hw will figure out the
    recurrences and run them as fast as possible, with all the (Re,Im) SIMD
    flips becoming NOPs.

    Sure.

    This would then be something like (in the loop)

    vec r6,{}                 // mark the start of the vectorized loop
    ldd r7,[r1,r5,0]          // r7  = Re(x[i]) = a
    ldd r8,[r1,r5,8]          // r8  = Im(x[i]) = b
    ldd r9,[r2,r5,0]          // r9  = Re(y[i]) = c
    ldd r10,[r2,r5,8]         // r10 = Im(y[i]) = d
    fmul r11,r9,r8            // r11 = c*b
    fmac r11,r10,r7,r11       // r11 = a*d + b*c  (imaginary part)
    fmul r8,r10,r8            // r8  = b*d
    fmac r7,r9,r7,-r8         // r7  = a*c - b*d  (real part)
    std r7,[r3,r5,0]          // store real part
    std r11,[r3,r5,8]         // store imaginary part
    loop1 lt,r5,r4,#16        // r5 += 16; repeat while r5 < r4

    but it would not help in a case where previous results were already
    in registers.
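
    Read back as C (assuming r1 and r2 point at interleaved (re,im) source
    arrays, r3 at the destination, and r4 holds a byte count; the names are
    only for illustration), the loop computes:

    #include <math.h>
    #include <stddef.h>

    static void cmul_array(const double *x, const double *y,
                           double *z, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes / 16; i++) {   /* 16 bytes per complex */
            double a = x[2*i], b = x[2*i + 1];       /* ldd r7, ldd r8  */
            double c = y[2*i], d = y[2*i + 1];       /* ldd r9, ldd r10 */
            z[2*i]     = fma(c, a, -(b * d));        /* a*c - b*d */
            z[2*i + 1] = fma(d, a,   b * c);         /* a*d + b*c */
        }
    }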

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat May 4 21:19:32 2024
    BGB wrote:

    On 5/2/2024 10:46 PM, Lawrence D'Oliveiro wrote:
    On Thu, 2 May 2024 20:14:09 +0000, MitchAlsup1 wrote:

    If you don't know what you are looking for, how do you know when you
    find it ?!!!

    Maybe the procedure for determining that you’ve found it is recursively
    enumerable, but that for doing the search is not? ;)

    I think it is a case of determining if someone responds in favorable
    ways to interactions, does not respond in unfavorable ways, and does not
    present any obvious "deal breakers".

    Presumably other people are doing something similar, but with different metrics.

    Different definitions !!

    Though, granted, the whole process tends to be horribly inefficient.

    If you are anything like a normal male, you are compatible with about
    1% of women of your age group. Likewise, any given woman in your age
    group has about a 1% chance of being compatible with you ±.

    So, you (and her) will have to pass over 10,000 of the others to end
    up with a compatible partner.

    There generally doesn't exist any good way to determine who exists in a
    given area, or to get a general idea for who may or may not be worth the
    time/effort of interacting with them.

    Women, by and large, do the picking:: men, by and large, do the
    acquiescing.

    Many dating sites (and people on them) seem to operate under the
    assumption of "will post pictures, good enough".

    Dating sites are for losers. P E R I O D

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sat May 4 22:34:24 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    BGB wrote:

    If you are anything like a normal male, you are compatible with about
    1% of women of your age group. Likewise, any given woman in your age
    group has about a 1% chance of being compatible with you ±.

    So, you (and her) will have to pass over 10,000 of the others to end
    up with a compatible partner.

    Even assuming that the numbers are true (far too low, IMHO), the
    calculation assumes that both quantities are uncorrelated.

    If it were really true, humans would long since have died out
    (unless "compatible" means something else :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sun May 5 00:07:57 2024
    BGB wrote:

    On 5/4/2024 4:19 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 5/2/2024 10:46 PM, Lawrence D'Oliveiro wrote:
    On Thu, 2 May 2024 20:14:09 +0000, MitchAlsup1 wrote:

    If you don't know what you are looking for, how do you know when you
    find it ?!!!

    Maybe the procedure for determining that you’ve found it is recursively
    enumerable, but that for doing the search is not? ;)

    I think it is a case of determining if someone responds in favorable
    ways to interactions, does not respond in unfavorable ways, and does
    not present any obvious "deal breakers".

    Presumably other people are doing something similar, but with
    different metrics.

    Different definitions !!


    Not sure what you mean by this, exactly.

    The things that make a man attractive to a woman are completely different
    from the things that make a woman attractive to a man.

    Though, granted, the whole process tends to be horribly inefficient.

    If you are anything like a normal male, you are compatible with about
    1% of women of your age group. Likewise, any given woman in your age
    group has about a 1% chance of being compatible with you ±.

    So, you (and her) will have to pass over 10,000 of the others to end
    up with a compatible partner.

    At this point, looks almost like it could be closer to 0.

    Or, at least, most of the ones I might be interested in talking to,
    aren't in the same geographic area.

    Or you are not frequenting the areas that those who might be compatible
    with you frequent.

    There generally doesn't exist any good way to determine who exists in
    a given area, or to get a general idea for who may or may not be worth
    the time/effort of interacting with them.

    Women, by and large, do the picking:: men, by and large, do the
    acquiescing.


    Not much point in trying to interact with them though if there is no
    reason to think it might be worth the effort of doing so.



    Many dating sites (and people on them) seem to operate under the
    assumption of "will post pictures, good enough".

    Dating sites are for losers. P E R I O D


    Somehow, the actual sites still manage to be more dignified than the
    Facebook groups or phone apps, which lean much more heavily into the pointless aspects...

    Apps are no different than dating sites:: see above.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sun May 5 18:38:19 2024
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    BGB wrote:

    If you are anything like a normal male, you are compatible with about
    1% of women of your age group. Likewise, any given woman in your age
    group has about a 1% chance of being compatible with you ±.

    So, you (and her) will have to pass over 10,000 of the others to end
    up with a compatible partner.

    Even assuming that the numbers are true (far too low, IMHO), the
    calculation assumes that both quantities are uncorrelated.

    If it were really true, humans would long since have died out
    (unless "compatible" means something else :-)

    The 1% number is for me. {smart enough, pretty enough, frugal enough,
    sane enough, low maintenance.} I know of more typical males/females
    whose number is closer to 20%.

    1% may be a "little high" for BGB and whoever might be mutually acceptable.

    There is one thing worse than being alone--and that is being with someone
    you seriously dislike.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)