• Short Vectors Versus Long Vectors

    From Lawrence D'Oliveiro@21:1/5 to All on Tue Apr 23 00:29:32 2024
    Adding the typical kind of vector-processing instructions to an
    instruction set inevitably leads to a combinatorial explosion in the
    number of opcodes. This kind of thing makes a mockery of the “R” in “RISC”.

    Interesting to see that the RISC-V folks are staying off this path;
    instead, they are reviving an old idea from Seymour Cray’s original
    machines that bear his name: a vector pipeline. Instead of being limited
    to processing 4 or 8 operands at a time, the Cray machines could operate (sequentially, but rapidly) on variable-length vectors of up to 64
    elements with a single setup sequence. RISC-V seems to make the limit on
    vector length an implementation choice, with a value of 32 being mentioned
    in the spec.

    The way it avoids having separate instructions for each combination of
    operand types is to have operand-type registers as part of the vector
    unit. This way, only a small number of instructions is required to set up
    all the combinations of operand/result types. You then give it a kick in
    the guts and off it goes.

    Detailed spec here: <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc>.
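    To make that concrete, here is a rough C model of the strip-mining idea. The helper configure_vector_unit() is purely hypothetical and only stands in for something like the spec's vsetvli: the element type is recorded as vector-unit state rather than in the opcode, and the hardware answers with how many elements it will handle this trip, so one generic add loop serves every type/length combination.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for vsetvli: record the element type as state
     * and report how many elements the implementation handles per trip
     * (a toy limit of 8 here, i.e. VLEN/SEW = 8 in this model). */
    static size_t configure_vector_unit(size_t remaining, unsigned element_bits)
    {
        (void)element_bits;        /* type is configuration, not opcode encoding */
        size_t vlmax = 8;
        return remaining < vlmax ? remaining : vlmax;
    }

    void vec_add_i32(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
    {
        while (n > 0) {
            size_t vl = configure_vector_unit(n, 32);
            for (size_t i = 0; i < vl; i++)    /* models one generic vector add */
                c[i] = a[i] + b[i];
            a += vl; b += vl; c += vl; n -= vl;
        }
    }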

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 02:14:32 2024
    Lawrence D'Oliveiro wrote:

    Adding the typical kind of vector-processing instructions to an
    instruction set inevitably leads to a combinatorial explosion in the
    number of opcodes. This kind of thing makes a mockery of the “R” in “RISC”.

    It does indeed make a mockery of the R in RISC.

    Interesting to see that the RISC-V folks are staying off this path;
    instead, they are reviving an old idea from Seymour Cray’s original machines that bear his name: a vector pipeline. Instead of being limited
    to processing 4 or 8 operands at a time, the Cray machines could operate (sequentially, but rapidly) on variable-length vectors of up to 64
    elements with a single setup sequence. RISC-V seems to make the limit on vector length an implementation choice, with a value of 32 being mentioned
    in the spec.

    CRAY machines stayed "in style" as long as memory latency remained smaller
    than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.

    I wish them well, but I expect it will not work out as they desire.....

    The way it avoids having separate instructions for each combination of operand types is to have operand-type registers as part of the vector
    unit. This way, only a small number of instructions is required to set up
    all the combinations of operand/result types. You then give it a kick in
    the guts and off it goes.

    Detailed spec here: <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc>.

    On the other hand, My 66000 has support for both SIMD and CRAY-like vectors
    and the ISA contains only 6-bits of state supporting vectorization and
    exactly 2 instructions--one that gives HW a register it can use in the
    "loop" and the LOOP instruction that performs the ADD-CMP-BC functionality. {{Not 2 for every kind of vectorized instruction, 2 total instructions}}

    There is no 4KB of register file (context switch overhead),
    there is no need for Gather/Scatter, stride memory references,
    there is no masking register,
    the OS can use vectorization for small fast loops without overhead,
    the compiler does not have to solve memory address aliasing,
    cache activities are modified to suit vector workloads,
    exotic HW can execute across multiple lanes (as desired),
    simple HW can "do it all" in a 1-wide pipeline,
    the debugger presents scalar code to coder,
    and exceptions remain precise (for those that care),
    and the exception handler(s) sees only scalar code.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Apr 23 03:11:41 2024
    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

    CRAY machines stayed "in style" as long as memory latency remained
    smaller than the length of a vector (64 cycles) and fell out of favor
    when the cores got fast enough that memory could no longer keep up.

    So why would conventional short vectors work better, then? Surely the
    latency discrepancy would be even worse for them.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 06:22:38 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

    CRAY machines stayed "in style" as long as memory latency remained
    smaller than the length of a vector (64 cycles) and fell out of favor
    when the cores got fast enough that memory could no longer keep up.

    Mitch Alsup repeatedly makes this claim without giving any
    justification. Your question may shed some light on that.

    So why would conventional short vectors work better, then? Surely the
    latency discrepancy would be even worse for them.

    Thinking about it, they probably don't work better. They just don't
    work worse, so why spend area on 4096-bit vector registers like the
    Cray-1 did when 128-512-bit vector registers work just as well? Plus,
    they have 200 or so of these registers, so 4096-bit registers would be
    really expensive. How many vector registers does the Cray-1 (and its successors) have?

    On modern machines OoO machinery bridges the latency gap between the
    L2 cache, maybe even the L3 cache and the core for data-parallel code.
    For the latency gap to main memory there are the hardware prefetchers,
    and they use the L1 or L2 cache as intermediate buffer, while the
    Cray-1 and followons use vector registers.

    So what's the benefit of using vector/SIMD instructions at all rather
    than doing it with scalar code? A SIMD instruction that replaces n
    scalar instructions consumes fewer resources for instruction fetching, decoding, register renaming, administering the instruction in the OoO
    engine, and in retiring the instruction.

    So why not use SIMD instructions with longer vector registers? The
    progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
    suggests that this is happening, but with every doubling the cost in
    area doubles but the returns are diminishing thanks to Amdahl's law.
    So at some point you stop. Intel introduced AVX-512 for Larrabee (a special-purpose machine), and now is backpedaling with desktop, laptop
    and small-server CPUs (even though only the Golden/Raptor Cove cores
    are enabled on the small-server CPUs) only supporting AVX, and with
    AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
    vector registers are already too costly for the benefit they give in general-purpose computing.

    Back to old-style vector processors. There have been machines that
    supported longer vector registers and AFAIK also memory-to-memory
    machines. The question is why have they not been the answer of the vector-processor community to the problem of covering the latency? Or
    maybe they have? AFAIK NEC SX has been available in some form even in
    recent years, maybe still.

    Anyway, after thinking about this, the reason behind Mitch Alsup's
    statement is that in a

    doall(load process store)

    computation (like what SIMD is good at), the loads precede the
    corresponding processing by the load latency (i.e., memory latency on
    the Cray machines). If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration
    of the doall loop before the processing step of the first iteration
    has finished with the register. You can do a bit of software
    pipelining and software register renaming by transforming this into

    load1 doall(load2 process1 store1 load1 process2 store2)

    but at some point you run out of vector registers.

    One thing that comes to mind is tracking individual parts of the
    vector registers, which allows starting the next iteration as soon
    as the first part of the vector register no longer has any readers.
    However, it's probably not that far off in complexity to tracking
    shorter vector registers in an OoO engine. And if you support
    exceptions (the Crays probably don't), this becomes messy, while with
    short vector registers it's easier to implement the (ISA)
    architecture.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Apr 23 08:31:21 2024
    On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:

    So what's the benefit of using vector/SIMD instructions at all rather
    than doing it with scalar code?

    On the original Cray machines, I read somewhere the benefit of using the
    vector versions over the scalar ones was a net positive for a vector
    length as low as 2.

    If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration of
    the doall loop before the processing step of the first iteration has
    finished with the register.

    How would out-of-order execution help, anyway, given all the operations on
    the vector elements are supposed to be identical? Unless it’s just greater parallelism.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 12:40:07 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:
    If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration of
    the doall loop before the processing step of the first iteration has
    finished with the register.

    How would out-of-order execution help, anyway, given all the operations on the vector elements are supposed to be identical?

    OoO does in hardware what software pipelining does in software: If I
    have a loop

    for (i=0; i<n; i++)
    c[i] = a[i]+b[i];

    the straightforward way to code this is:

    L0:
    load tmp1 = a[i]
    load tmp2 = b[i]
    add tmp3 = tmp1,tmp2
    store c[i] = tmp3
    add i = i+1
    branch L0 if i<n

    On an in-order CPU you then do things like loop unrolling, modulo
    scheduling and modulo variable renaming to get a steady state like:

    L0:
    store c[i], tmp1
    add tmp3 = tmp3,tmp4
    load tmp9 = a[i+4]
    load tmp10 = b[i+4]
    store c[i+1], tmp3
    add tmp5 = tmp5,tmp6
    load tmp1 = a[i+5]
    load tmp2 = b[i+5]
    store c[i+2], tmp5
    add tmp7 = tmp7,tmp8
    load tmp3 = a[i+6]
    load tmp4 = b[i+6]
    store c[i+3], tmp7
    add tmp9 = tmp9,tmp10
    load tmp1 = a[i+7]
    load tmp2 = b[i+7]
    store c[i+4], tmp9
    add tmp1 = tmp1,tmp2
    load tmp3 = a[i+8]
    load tmp4 = b[i+8]
    add i=i+5
    branch L0 if i<n-4

    And that's just to cover a load latency of 4 cycles, assuming that the
    machine can perform 2 loads and one store per cycle. And you have to
    generate the ramp-up and ramp-down code, and for more complicated
    loops it becomes more complicated.

    By contrast, on an OoO machine the straightforward code just works
    efficiently, and the hardware does the reordering and register
    renaming (and the Golden Cove with its 0-cycle constant additions
    eliminates even a part of the reason for loop unrolling). It creates
    the ramp-up automatically, and, if the loop exit is predicted
    correctly, even the ramp-down, and it overlaps the ramp-up (and
    possibly the ramp-down) with adjacent code.

    Back to the Crays: While the SIMD/vector semantics means that a
    straightforward loop will process 64 elements rather than one before
    the first load of the second iteration has to wait for the add of
    the first iteration to finish, you still have to do some software
    pipelining to get an overlap between that add and that load; the
    longer the latency, the more software pipelining and (for register
    renaming) the more registers you need.

    In OoO the corresponding condition is when the OoO engine has consumed
    all instances of one resource and has to wait for instructions to
    finish to free these resources; ideally the hardware prefetcher avoids
    that scenario, but in memory-bandwidth-limited situations it will
    occur.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 17:34:12 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

    CRAY machines stayed "in style" as long as memory latency remained
    smaller than the length of a vector (64 cycles) and fell out of favor
    when the cores got fast enough that memory could no longer keep up.

    So why would conventional short vectors work better, then? Surely the
    latency discrepancy would be even worse for them.

    Yes, the later NEC long vector machines grew their VRF up to 256 entries
    per register.

    As to why RISC-V went shorter I can only imagine they think vector codes
    can be compiled properly for a quicker memory hierarchy (i.e., hit in
    L1 or L2 caches.).

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 17:51:44 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:

    So what's the benefit of using vector/SIMD instructions at all rather
    than doing it with scalar code?

    On the original Cray machines, I read somewhere the benefit of using the vector versions over the scalar ones was a net positive for a vector
    length as low as 2.

    Somewhere in the neighborhood of 4-5 length vectors. There was a 3 cycle
    decode delay as pipeline scheduling slots were reserved for the vector writebacks.

    If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration of
    the doall loop before the processing step of the first iteration has
    finished with the register.

    How would out-of-order execution help, anyway, given all the operations on the vector elements are supposed to be identical? Unless it’s just greater parallelism.

    Out of order makes it easier to "run into" undiscovered dynamic dependency
    free operations.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Apr 23 17:49:25 2024
    Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:

    CRAY machines stayed "in style" as long as memory latency remained
    smaller than the length of a vector (64 cycles) and fell out of favor
    when the cores got fast enough that memory could no longer keep up.

    Mitch Alsup repeatedly makes this claim without giving any
    justification. Your question may shed some light on that.

    Consider a CRAY-like vector machine with 128-cycle main memory
    and 64-entry VRF registers. If it only takes 64 cycles to send
    out all the addresses, but takes 128 cycles to return, there is
    no "chain slot"--chain slot only works when the memory latency
    is shorter than vector length.

    And without chain slot, vectors are not higher performing (by
    much) compared to scalar operation. Vectors were a way of
    appearing to perform one beat of work per cycle per active
    function unit.

    So why would conventional short vectors work better, then? Surely the latency discrepancy would be even worse for them.

    Context switch latency...

    Thinking about it, they probably don't work better. They just don't
    work worse, so why spend area on 4096-bit vector registers like the
    Cray-1 did when 128-512-bit vector registers work just as well?

    But do they work as well ??

    Plus,
    they have 200 or so of these registers, so 4096-bit registers would be
    really expensive. How many vector registers does the Cray-1 (and its successors) have?

    On modern machines OoO machinery bridges the latency gap between the
    L2 cache, maybe even the L3 cache and the core for data-parallel code.

    Mc 88120 would run MATRIX 300 at just under 6 I/C with massive cache
    misses (~33%).

    For the latency gap to main memory there are the hardware prefetchers,
    and they use the L1 or L2 cache as intermediate buffer, while the
    Cray-1 and followons use vector registers.

    Opening yourself up to Spectre-like attacks.

    So what's the benefit of using vector/SIMD instructions at all rather
    than doing it with scalar code? A SIMD instruction that replaces n
    scalar instructions consumes fewer resources for instruction fetching, decoding, register renaming, administering the instruction in the OoO
    engine, and in retiring the instruction.

    I can argue that SIMD is "just a waste of ISA encoding space".

    So why not use SIMD instructions with longer vector registers? The progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
    suggests that this is happening, but with every doubling the cost in
    area doubles but the returns are diminishing thanks to Amdahl's law.

    Not to mention that the 512 version can only run a few SIMD instructions
    at that width before thermally throttling itself.

    So at some point you stop. Intel introduced AVX-512 for Larrabee (a special-purpose machine), and now is backpedaling with desktop, laptop
    and small-server CPUs (even though only the Golden/Raptor Cove cores
    are enabled on the small-server CPUs) only supporting AVX, and with
    AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
    vector registers are already too costly for the benefit they give in general-purpose computing.

    Back to old-style vector processors. There have been machines that
    supported longer vector registers and AFAIK also memory-to-memory
    machines. The question is why have they not been the answer of the vector-processor community to the problem of covering the latency? Or
    maybe they have? AFAIK NEC SX has been available in some form even in
    recent years, maybe still.

    Anyway, after thinking about this, the reason behind Mitch Alsup's
    statement is that in a

    doall(load process store)

    computation (like what SIMD is good at), the loads precede the
    corresponding processing by the load latency (i.e., memory latency on
    the Cray machines). If your OoO capabilities are limited (and I think
    they are on the Cray machines), you cannot start the second iteration
    of the doall loop before the processing step of the first iteration
    has finished with the register.

    Unless the compiler can solve the memory aliasing problem.

    You can do a bit of software
    pipelining and software register renaming by transforming this into

    load1 doall(load2 process1 store1 load1 process2 store2)

    but at some point you run out of vector registers.

    One thing that comes to mind is tracking individual parts of the
    vector registers, which allows starting the next iteration as soon
    as the first part of the vector register no longer has any readers.

    A vector scoreboard anyone ??

    However, it's probably not that far off in complexity to tracking
    shorter vector registers in an OoO engine. And if you support
    exceptions (the Crays probably don't), this becomes messy, while with
    short vector registers it's easier to implement the (ISA)
    architecture.

    All of which is solved with VVM. Consider::

    for( int64_t i = 0; i < max; i++ )
    a[i] = a[max-i];

    This can be vectorized under VVM, the parts far from i = ½×max run
    at vector speeds, those near i = ½×max run at scalar speeds, from
    the same instruction sequence !! .....
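    To spell out why that works, here is a conceptual C sketch (an illustration only, not My 66000's actual mechanism): before each group of iterations, compare the read and write windows; where they are disjoint the whole chunk can be moved at "vector" speed, and near i = ½×max it falls back to one element at a time. As in the example above, a[] must hold max+1 elements, since a[max] is read at i = 0.

    #include <stdint.h>

    #define CHUNK 8   /* arbitrary illustrative chunk width */

    void reverse_overwrite(int64_t *a, int64_t max)
    {
        int64_t i = 0;
        while (i < max) {
            int64_t w = (max - i < CHUNK) ? max - i : CHUNK;
            int64_t wlo = i,         whi = i + w - 1;   /* write window */
            int64_t rlo = max - whi, rhi = max - i;     /* read window  */
            if (rlo > whi || rhi < wlo) {
                /* windows disjoint: the chunk runs at "vector" speed */
                for (int64_t k = 0; k < w; k++)
                    a[i + k] = a[max - (i + k)];
                i += w;
            } else {
                /* near i = max/2 the windows overlap: "scalar" speed */
                a[i] = a[max - i];
                i += 1;
            }
        }
    }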

    - anton

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Apr 23 21:58:50 2024
    On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

    As to why RISC-V went shorter ...

    They didn’t fix a length.

  • From MitchAlsup1@21:1/5 to BGB on Tue Apr 23 22:39:44 2024
    BGB wrote:

    On 4/23/2024 1:22 AM, Anton Ertl wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
    <big snip>


    As can be noted, SIMD is easy to implement.

    ADD/SUB is, MUL and DIV and SHIFTs and CMPs are not; especially when
    MUL does 2n = n × n and DIV does 2n / n -> n (quotient) + n (remainder)

    Main obvious drawback is the potential for combinatorial explosions of instructions. One needs to keep a fairly careful watch over this.

    Like, if one is faced with an NxN or NxM grid of possibilities, naive strategy is to be like "I will define an instruction for every
    possibility in the grid.", but this is bad. More reasonable to devise a minimal set of instructions that will allow the operation to be done
    within a reasonable number of instructions.

    But, then again, I can also note that I axed things like packed-byte operations and saturating arithmetic, which are pretty much de-facto in packed-integer SIMD.

    MANY SIMD algorithms need saturating arithmetic because they cannot do
    b + b -> h and avoid the overflow. And they cannot do B + b -> h because
    that would consume vast amounts of encoding space.

    Likewise, a lot of the gaps are filled in with specialized converter and helper ops. Even here, some conversion chains will require multiple instructions.

    Well, and if there is no practical difference between a scalar and SIMD version of an instruction, may as well just use the SIMD version for scalar.

    ....


    - anton

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 23 22:40:25 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

    As to why RISC-V went shorter ...

    They didn’t fix a length.

    Nor do they want to have to save a page of VRF at context switch.

  • From Lawrence D'Oliveiro@21:1/5 to All on Wed Apr 24 00:25:59 2024
    On Tue, 23 Apr 2024 22:40:25 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

    As to why RISC-V went shorter ...

    They didn’t fix a length.

    Nor do they want to have to save a page of VRF at context switch.

    But then, you don’t need a whole array of registers, do you: you just need address registers for the operands (one for each operand) and the destination, plus a counter.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed Apr 24 00:34:03 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 22:40:25 +0000, MitchAlsup1 wrote:

    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:

    As to why RISC-V went shorter ...

    They didn’t fix a length.

    Nor do they want to have to save a page of VRF at context switch.

    But then, you don’t need a whole array of registers, do you: you just need address registers for the operands (one for each operand) and the destination, plus a counter.

    If by 'you' you mean My 66000's VVM::
    a) yes I avoid any SW visible register file
    b) and I use the miss buffers as the VRF register file pool
    c) they vanish on an interrupt or exception
    d) the counter is the loop variable.

  • From MitchAlsup1@21:1/5 to BGB on Wed Apr 24 00:37:11 2024
    BGB wrote:

    On 4/23/2024 5:39 PM, MitchAlsup1 wrote:
    BGB wrote:


    MANY SIMD algorithms need saturating arithmetic because they cannot do
    b + b -> h and avoid the overflow. And they cannot do B + b -> h because
    that would consume vast amounts of encoding space.


    There are ways to fake it.

    Though, granted, most end up involving extra instructions and 1 bit of dynamic range.

    1-bit for ADD and SUB, but MUL and shifts require more than 1-bit.



  • From Lawrence D'Oliveiro@21:1/5 to BGB on Wed Apr 24 00:24:25 2024
    On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:

    DIV:
    Didn't bother with this.
    Typically faked using multiply-by-reciprocal and taking the high result.

    Another Cray-ism! ;)

  • From John Savard@21:1/5 to All on Tue Apr 23 19:25:22 2024
    On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    CRAY machines stayed "in style" as long as memory latency remained smaller than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.

    I wish them well, but I expect it will not work out as they desire.....

    I know that you've said this about Cray-style vectors.

    I had thought the cause was much simpler. As soon as chips like the
    486 DX and then the Pentium II became available, a Cray-style machine
    would have had to be implemented from smaller-scale integrated
    circuits, so it would have been wildly uneconomic for the performance
    it provided; it made much more sense to use off-the-shelf
    microprocessors. Despite their shortcomings theoretically in
    architectural terms compared to a Cray-style machine, they offered
    vastly more FLOPS for the dollar.

    After all, the reason the Cray I succeeded where the STAR-100 failed
    was that it had those big vector registers - so it did calculations on
    a register-to-register basis, rather than on a memory-to-memory basis.

    That doesn't make it immune to considerations of memory bandwidth, but
    that does mean that it was designed correctly for the circumstance
    where memory bandwidth is an issue. So if you have the kind of
    calculation to perform that is suited to a vector machine, wouldn't it
    still be better to use a vector machine than a whole bunch of scalar
    cores with no provision for vectors?

    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    There are ways to increase memory bandwidth. Use HBM. Use static RAM.
    Use graphics DRAM. The vector CPU of the last gasp of the Cray-style architecture, the NEC SX-Aurora TSUBASA, is even packaged like a GPU.

    Also, the original Cray I did useful work with a memory no larger than
    many L3 caches these days. So a vector machine today wouldn't be as
    fast as it would be if it could have, say, a 1024-bit wide data bus to
    a terabyte of DRAM. That doesn't necessarily mean that such a CPU,
    even when throttled by memory bandwidth, isn't an improvement over an
    ordinary CPU.

    Of course, though, the question is, is it an improvement enough? If
    most problems anyone would want to use a vector CPU for today do
    involve a large amount of memory, used in a random fashion, so as to
    fit poorly in cache, then it might well be that memory bandwidth would
    mean that even with a vector architecture well suited to doing a lot
    of work, the net result would be only a slight improvement over what
    an ordinary CPU could do with the same memory bandwidth.

    I would think that a chip is still useful if it can only provide an
    improvement for some problems, and that there are ways to increase
    memory bandwidth from what ordinary CPUs offer, making it seem likely
    that Cray-style vectors are worth doing as a way to improve what a CPU
    can do.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Wed Apr 24 02:00:10 2024
    John Savard wrote:

    On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    CRAY machines stayed "in style" as long as memory latency remained smaller than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.

    I wish them well, but I expect it will not work out as they desire.....

    I know that you've said this about Cray-style vectors.

    I had thought the cause was much simpler. As soon as chips like the
    486 DX and then the Pentium II became available, a Cray-style machine
    would have had to be implemented from smaller-scale integrated
    circuits, so it would have been wildly uneconomic for the performance
    it provided; it made much more sense to use off-the-shelf
    microprocessors. Despite their shortcomings theoretically in
    architectural terms compared to a Cray-style machine, they offered
    vastly more FLOPS for the dollar.

    CRAY-XMP was done in MECL 10K gate arrays, offering 10K gates per chip.

    After all, the reason the Cray I succeeded where the STAR-100 failed
    was that it had those big vector registers - so it did calculations on
    a register-to-register basis, rather than on a memory-to-memory basis.

    The CRAY-1 had much shorter setup sequences than the STAR.
    Amdahl's law strikes again.

    That doesn't make it immune to considerations of memory bandwidth, but
    that does mean that it was designed correctly for the circumstance
    where memory bandwidth is an issue. So if you have the kind of
    calculation to perform that is suited to a vector machine, wouldn't it
    still be better to use a vector machine than a whole bunch of scalar
    cores with no provision for vectors?

    Let us face facts:: in the large, vector machines are DMA devices
    that happen to mangle the data on the way through.

    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    a) It is not pure BW but BW at a latency less than K. CRAY-1 was
    about 16 cycles (DRAM), CRAY-1S was about 10 cycles (SRAM), XMP
    was about 22 cycles, and YMP was about 32 cycles. CRAY-1 and -1S
    had 1 port to memory, XMP and YMP had 2Rd and 1W to memory.

    b) GPUs use threading to absorb the latency to memory (roughly 400
    cycles), along with HW rasterizer, interpolator, texture access,
    and an HW OS that can clean up a thread and launch a new thread in
    about 8 cycles. That is: GPUs absorb latency by waiting in a way
    that does not prevent others from making forward progress.

    There are ways to increase memory bandwidth. Use HBM. Use static RAM.
    Use graphics DRAM. The vector CPU of the last gasp of the Cray-style architecture, the NEC SX-Aurora TSUBASA, is even packaged like a GPU.

    Even HBM has a latency of standard DRAM (with smaller command cycle
    overheads), so a 5-GHz core using 20ns DRAM with infinite BW between
    core and DRAM will still have the core see 100 cycles of latency.
    Bandwidth alone does not solve latency bound problems, latency alone
    does not solve BW bound problems.

    Also, the original Cray I did useful work with a memory no larger than
    many L3 caches these days. So a vector machine today wouldn't be as
    fast as it would be if it could have, say, a 1024-bit wide data bus to
    a terabyte of DRAM. That doesn't necessarily mean that such a CPU,
    even when throttled by memory bandwidth, isn't an improvement over an ordinary CPU.

    The 128K DW memory was used for number crunching, but CRAY-1 had an
    I/O system that could consume as much BW as a core, so one could
    write out the last chunk and read in the next chunk while the
    current chunk was processing. And it was this I/O system that made
    a CRAY-1 faster than its equivalent NEC machine (excepting on certain benchmarks).

    Of course, though, the question is, is it an improvement enough? If
    most problems anyone would want to use a vector CPU for today do
    involve a large amount of memory, used in a random fashion, so as to
    fit poorly in cache, then it might well be that memory bandwidth would
    mean that even with a vector architecture well suited to doing a lot
    of work, the net result would be only a slight improvement over what
    an ordinary CPU could do with the same memory bandwidth.

    In essence, if you can teach the compiler to block the numeric algorithm
    to fit through (through, not in) the cache(s), you can use a vector-style
    CPU architecture.

    I would think that a chip is still useful if it can only provide an improvement for some problems, and that there are ways to increase
    memory bandwidth from what ordinary CPUs offer, making it seem likely
    that Cray-style vectors are worth doing as a way to improve what a CPU
    can do.

    Everyone has to have hope on something.

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to BGB on Wed Apr 24 02:38:02 2024
    On Tue, 23 Apr 2024 20:50:31 -0500, BGB wrote:

    There is an instruction to calculate an approximate reciprocal (say,
    for
    dividing two FP-SIMD vectors), at which a person can use Newton-Raphson
    to either get a more accurate version, or use it directly (possibly
    using N-R to fix up the result of the division).

    Cray had that: an approximate-reciprocal instruction, use it twice to get
    the full-accuracy result.

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed Apr 24 02:47:59 2024
    On Tue, 23 Apr 2024 19:25:22 -0600, John Savard wrote:

    After all, the reason the Cray I succeeded where the STAR-100 failed was
    that it had those big vector registers ...

    Looking at an old Cray-1 manual, it mentions, among other things, sixty
    four 64-bit intermediate scalar “T” registers, and eight 64-element vector “V” registers of 64 bits per element. That’s a lot of registers.

    RISC-V has nothing like this, as far as I can tell. Right at the top of
    the spec I linked earlier, it says:

    The vector extension adds 32 architectural vector registers,
    v0-v31 to the base scalar RISC-V ISA.

    Each vector register has a fixed VLEN bits of state.

    So, no “big vector registers” that I can see? It says that VLEN must be a power of two no bigger than 2**16, which does sound like a lot, but then
    the example they give only has VLEN = 128.

  • From Thomas Koenig@21:1/5 to John Savard on Wed Apr 24 05:47:54 2024
    John Savard <quadibloc@servername.invalid> schrieb:
    On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    CRAY machines stayed "in style" as long as memory latency remained smaller than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.

    I wish them well, but I expect it will not work out as they desire.....

    I know that you've said this about Cray-style vectors.

    I had thought the cause was much simpler. As soon as chips like the
    486 DX and then the Pentium II became available,

    The 486 came out in 1989.

    a Cray-style machine
    would have had to be implemented from smaller-scale integrated
    circuits, so it would have been wildly uneconomic for the performance
    it provided;

    The Cray C90 came out in 1991. That was still considered economic
    by the people who bought it :-)

    The (low-level) competition for scientific computing at the time
    was workstations.

  • From Anton Ertl@21:1/5 to John Savard on Wed Apr 24 06:16:58 2024
    John Savard <quadibloc@servername.invalid> writes:
    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    The claim by Mitch Alsup is that latency makes the Crays impractical,
    because of chaining issues. Do GPUs have chaining? My understanding
    is that GPUs deal with latency in the barrel processor way: use
    another data-parallel thread while waiting for memory. Tera also
    pursued this idea, but the GPUs succeeded with it.

    If
    most problems anyone would want to use a vector CPU for today do
    involve a large amount of memory, used in a random fashion, so as to
    fit poorly in cache

    When the working set is larger than the cache, it does not fit even
    when accessed regularly. Prefetchers can reduce the latency, but they
    will not increase the bandwidth.

    So if you have a problem that walks through a lot of memory and
    performs only a few operations per data item, that's where CPUs will
    wait for memory a lot, due to limited bandwidth (and you won't benefit
    from SIMD/vector instructions on these kinds of problems). For that
    kind of stuff you better use GPUs, which have memory systems with more bandwidth.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Apr 24 06:32:26 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 23 Apr 2024 19:25:22 -0600, John Savard wrote:
    Looking at an old Cray-1 manual, it mentions, among other things, sixty
    four 64-bit intermediate scalar “T” registers, and eight 64-element vector “V” registers of 64 bits per element. That’s a lot of registers.

    RISC-V has nothing like this, as far as I can tell. Right at the top of
    the spec I linked earlier, it says:

    The vector extension adds 32 architectural vector registers,
    v0-v31 to the base scalar RISC-V ISA.

    Each vector register has a fixed VLEN bits of state.

    So, no “big vector registers” that I can see? It says that VLEN must be a power of two no bigger than 2**16, which does sound like a lot, but then
    the example they give only has VLEN = 128.

    It's an example. If you think you can make and profitably sell a CPU
    with VLEN=4096 (the number of bits in one of Cray-1's vector
    registers), that would be compliant with the spec, and would run
    programs written for RISC-V with the vector extension. Or you can
    make one with VLEN=65536 and claim that you have the longest one:-).

    This leaves you free to decide VLEN based on the costs and benefits in
    the context of the other design decisions you have made and on the
    programs you expect to run.

    Note that the Fujitsu A64FX (which implements the similar ARM Scalable
    Vector Extension and was designed for supercomputing) chooses a
    512-bit vector implementation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Savard@21:1/5 to All on Wed Apr 24 00:57:07 2024
    On Wed, 24 Apr 2024 02:00:10 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Everyone has to have hope on something.

    But false hopes are a waste of time.

    The reason for my interest in long vectors is primarily because I
    imagine that, if the Cray I was an improvement on the IBM System/360
    Model 195, then, apparently, today a chip like the Cray I would be
    the next logical step after the Pentium II (OoO plus cache, just like
    a Model 195).

    And that's a very naïve way of looking at the issue, so of course it
    can be wrong.

    I can, however, believe that latency, not bandwidth as such, is the
    killer. That's true for regular CPU compute, and so of course it would
    be a limiting factor for vector machines.

    What do vector machines do?

    Well, apparently they do things like multiply 2048 by 2048 matrices.
    Which is why they need stride. And since modern DRAMs like to give you
    16 consecutive values at a time... oh, well, you can multiply 16 rows
    of the matrix at once. Each matrix would take 32 megabytes of storage,
    so that does fit in cache, at least.

    But they've managed to get GPUs to multiply matrices - and they're
    quite good at it, which is why we're having all this amazing progress
    in AI recently. So it's quite possible that long vector machines have
    too narrow a niche, between plain CPUs (more flexible) and GPUs (less flexible).

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Wed Apr 24 07:08:01 2024
    On Wed, 24 Apr 2024 00:57:07 -0600, John Savard wrote:

    But they've managed to get GPUs to multiply matrices - and they're quite
    good at it, which is why we're having all this amazing progress in AI recently.

    Worth noting that this AI stuff requires very low-precision floats: 16-
    bit, even 8-bit. And they sacrifice mantissa bits in favour of exponents--
    down to something like maybe only a couple of mantissa bits in the 8-bit format.
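    For concreteness, the usual bit budgets (sign/exponent/mantissa) of these formats; the little C table is only a mnemonic, the layouts themselves are the standard ones:

    #include <stdio.h>

    struct fmt { const char *name; int sign, exp, mant; };

    int main(void)
    {
        /* sign / exponent / mantissa bits of common low-precision formats */
        struct fmt fmts[] = {
            { "FP32 (binary32)", 1, 8, 23 },
            { "FP16 (binary16)", 1, 5, 10 },
            { "bfloat16",        1, 8,  7 },   /* FP32 exponent range, short mantissa */
            { "FP8 E5M2",        1, 5,  2 },
            { "FP8 E4M3",        1, 4,  3 },   /* "a couple of mantissa bits" */
        };
        for (int i = 0; i < (int)(sizeof fmts / sizeof fmts[0]); i++)
            printf("%-16s %d/%d/%d\n", fmts[i].name,
                   fmts[i].sign, fmts[i].exp, fmts[i].mant);
        return 0;
    }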

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Apr 24 06:48:54 2024
    On Wed, 24 Apr 2024 06:16:58 GMT, Anton Ertl wrote:

    For that kind of stuff you better use GPUs, which have memory systems
    with more bandwidth.

    But with more limited memory, which is typically not upgradeable.

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Apr 24 09:28:06 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Wed, 24 Apr 2024 06:16:58 GMT, Anton Ertl wrote:

    For that kind of stuff you better use GPUs, which have memory systems
    with more bandwidth.

    But with more limited memory, which is typically not upgradeable.

    And yet, supercomputers these days often have lots of GPUs. The
    software crisis still is not yet there in supercomputing, so they
    manage to do with explicit moving of data between the high-bandwidth
    GPU memory and the lower-bandwidth bigger memory, just like in the
    days of the Cray-1 (or was it the CDC-6600?), which has a fast memory
    and a bigger slow memory.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to John Savard on Wed Apr 24 09:18:56 2024
    John Savard <quadibloc@servername.invalid> writes:
    On Wed, 24 Apr 2024 02:00:10 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    Everyone has to have hope on something.

    But false hopes are a waste of time.

    The reason for my interest in long vectors is primarily because I
    imagine that, if the Cray I was an improvement on the IBM System/360
    Model 195, then, apparently, today a chip like the Cray I would be
    the next logical step after the Pentium II (OoO plus cache, just like
    a Model 195).

    But the Cray-1 is not an improvement on the Model 195. It has no
    cache. Neither the Cray-1 nor the Model 195 have OoO as the term is
    commonly understood today: OoO execution, in-order completion,
    allowing register renaming, speculative execution, and precise
    exceptions. One may consider the Model 91/195 a predecessor of
    today's OoO, because it supports register renaming, and you "just"
    need to add a reorder buffer to get in-order completion and
    speculative execution.

    Well, apparently they do things like multiply 2048 by 2048 matrices.
    Which is why they need stride.

    You can multiply dense matrices of any size efficiently with stride 1.
    And caches help a lot for matrix multiply; in HPC circles, (dense)
    matrix multiply is known as a cache-friendly problem.
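    A sketch of the cache blocking alluded to here (BLK is an arbitrary illustrative block size; real code tunes it so the three working blocks fit in cache together). The inner loop runs at stride 1 over both C and B, and each block gets reused many times while it is resident:

    #include <stddef.h>

    #define BLK 64

    /* Cache-blocked dense matrix multiply, C += A*B, all n x n, row-major. */
    void matmul_blocked(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BLK)
            for (size_t kk = 0; kk < n; kk += BLK)
                for (size_t jj = 0; jj < n; jj += BLK)
                    for (size_t i = ii; i < ii + BLK && i < n; i++)
                        for (size_t k = kk; k < kk + BLK && k < n; k++) {
                            double aik = A[i * n + k];
                            for (size_t j = jj; j < jj + BLK && j < n; j++)
                                C[i * n + j] += aik * B[k * n + j];  /* stride 1 in j */
                        }
    }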

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From David Schultz@21:1/5 to John Savard on Wed Apr 24 10:12:02 2024
    On 4/24/24 1:57 AM, John Savard wrote:
    What do vector machines do?

    They keep a pipeline full.

    So you can do something in 64+7 clock cycles instead of 64*7.

    If the pipeline gets shorter the benefit decreases of course. And if you
    have some other way to keep that pipeline full, you don't need vectors.
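    A toy calculation of that point, assuming the usual fill model (n + depth - 1 cycles for a depth-stage unit fed one element per cycle):

    #include <stdio.h>

    int main(void)
    {
        int depth = 7, n = 64;
        printf("pipelined:   %d cycles\n", n + depth - 1);  /* ~64+7 */
        printf("unpipelined: %d cycles\n", n * depth);      /*  64*7 */
        return 0;
    }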

    --
    http://davesrocketworks.com
    David Schultz

  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Apr 24 19:50:43 2024
    Anton Ertl wrote:

    John Savard <quadibloc@servername.invalid> writes:
    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    The claim by Mitch Alsup is that latency makes the Crays impractical,
    because of chaining issues. Do GPUs have chaining? My understanding
    is that GPUs deal with latency in the barrel processor way: use
    another data-parallel thread while waiting for memory. Tera also
    pursued this idea, but the GPUs succeeded with it.

    - anton

    Consider:: an 8 deep CRAY-like vector calculation with 8 cycle latency
    memory and 6 cycle latency FMAC::

    |LD|LD|LD|LD|LD|LD|LD|LD|
    |FM|FM|FM|FM|FM|FM|FM|FM|
    |ST|ST|ST|ST|ST|ST|ST|ST|

    Not much parallelism. Now consider the same machine above with longer
    vectors::

    |LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|
    |FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|FM|
    |ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|ST|

    Now we have considerable parallelism with no change in latencies.

    Later consider the top execution profile augmented with a bit of OoO
    and a second memory port::

    |LD|LD|LD|LD|LD|LD|LD|LD|
    |FM|FM|FM|FM|FM|FM|FM|FM|
    |SA|SA|SA|SA|SA|SA|SA|SA| |Sd|Sd|Sd|Sd|Sd|Sd|Sd|Sd|


    Finally consider a GBOoO implementation::

    |LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|LD|...
    |Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|Fq|...
    |FM|FM|FM|FM|FM|FM|FM|FM|...
    |SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|SA|...
    |Sd|Sd|Sd|Sd|Sd|Sd|Sd|Sd|...

    Here it takes an execution window 18 deep to reach pipeline saturation,
    but once you do, the core runs at 3 instructions and arguably 4 units
    of work per cycle {without including loop overheads}. In order to
    achieve such performance one needs to issue the whole loop in 1 cycle.

    You have to have the requisite bandwidths {AGEN, bank access, address
    routing bandwidth, result return bandwidth, FMAC bandwidth}, but you
    also have to have the requisite latencies (and execution window width)
    that enable the vector chaining to work, or it falls apart.

  • From Terje Mathisen@21:1/5 to All on Wed Apr 24 23:58:34 2024
    MitchAlsup1 wrote:
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:

    DIV:
    Didn't bother with this.
    Typically faked using multiply-by-reciprocal and taking the high result.

    Another Cray-ism! ;)

    Not IEEE 754 legal.

    Well, it _is_ legal if you carry enough bits in your reciprocal...but at
    that point you would instead use a better algorithm to get the correct
    result both faster and using less power.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Apr 24 22:33:17 2024
    On Wed, 24 Apr 2024 09:28:06 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    On Wed, 24 Apr 2024 06:16:58 GMT, Anton Ertl wrote:

    For that kind of stuff you better use GPUs, which have memory systems
    with more bandwidth.

    But with more limited memory, which is typically not upgradeable.

    And yet, supercomputers these days often have lots of GPUs.

    Some do, some don’t. I’m not sure that GPUs are accepted as de rigueur in supercomputer design yet. I think this is just another instance of Ivan Sutherland’s “wheel of reincarnation” <http://www.cap-lore.com/Hardware/Wheel.html>.

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Apr 24 22:29:36 2024
    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world, when
    it came out. Cruising speed of something over 80 megaflops, hitting bursts
    of about 120.

    IBM did try to compete in the “supercomputer” field for a while longer,
    but I think by about ten years later, it had given up.

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Thu Apr 25 00:09:21 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    Lawrence D'Oliveiro wrote:

    On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:

    DIV:
    Didn't bother with this.
    Typically faked using multiply-by-reciprocal and taking the high result.
    Another Cray-ism! ;)

    Not IEEE 754 legal.

    Well, it _is_ legal if you carry enough bits in your reciprocal...

    Maybe--at best. There are certain pairs of numerator::denominator that require over 120 reciprocal bits* in order to deliver a properly rounded result using an intermediate reciprocation.

    (*) the reciprocal fraction bits--wider than long double.

    but at
    that point you would instead use a better algorithm to get the correct
    result both faster and using less power.

    Terje

  • From John Savard@21:1/5 to ldo@nz.invalid on Wed Apr 24 23:10:47 2024
    On Wed, 24 Apr 2024 22:33:17 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:
    On Wed, 24 Apr 2024 09:28:06 GMT, Anton Ertl wrote:

    And yet, supercomputers these days often have lots of GPUs.

    Some do, some don’t. I’m not sure that GPUs are accepted as de rigueur in supercomputer design yet. I think this is just another instance of Ivan Sutherland’s “wheel of reincarnation” <http://www.cap-lore.com/Hardware/Wheel.html>.

    What do GPUs do, when they're included in supercomputers?

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    The problem with GPUs is that since their computational capabilities
    are built on what the shader part does, their flexibility is limited.
    This is what has made me think there could be a place for Cray-style
    vectors. So some supercomputers don't have GPU accelerators, because
    they're intended to work on problems for which GPU accelerators
    wouldn't provide much help.

    Since when GPUs _can_ be used, they save lots of electricity, I doubt
    strongly that they're just a passing fad.

    John Savard

  • From Lawrence D'Oliveiro@21:1/5 to John Savard on Thu Apr 25 05:39:55 2024
    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs,
    no.

  • From John Levine@21:1/5 to All on Thu Apr 25 11:57:47 2024
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world, when
    it came out. Cruising speed of something over 80 megaflops, hitting bursts
    of about 120.

    Its main practical improvement was that you could get two Crays for the price of one 360/195. (Not exactly, but close enough.)

    IBM did try to compete in the “supercomputer” field for a while longer, but I think by about ten years later, it had given up.

    IBM had tried to make computers very fast by making them very
    complicated. STRETCH was fantastically complex for something built out
    of individual transistors. The /91 and /195 had instruction queues and reservation stations and loop mode. Cray went in the opposite
    direction, making a much simpler computer where each individual bit,
    down to the chips and the wires, was as fast as possible.

    In many ways it was a preview of RISC.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Thu Apr 25 14:57:53 2024
    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs
    are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes,
    GPUs, no.

    If you never heard about *that*, I can only imagine what else you
    didn't hear about supercomputers.

    Back when Fugaku was new, it was highly praised for being a GPU-less
    design that matched and slightly exceeded the efficiency of
    GPU-based (and other vector-accelerator-based) supercomputers. But that
    was possible only because NVidia had an unusually long pause between
    successive generations of Tesla, and at the same moment AMD and
    Intel GPGPUs were not yet considered fit for serious supercomputing.

    That was in November 2019. Never before or since.
    Right now the best GPU-less entry on the Green500 list is #48 (still the
    same A64FX CPU as Fugaku, but a smaller configuration) and it delivers 4x
    less sustained FLOPS/Watt than the top spot, which is based on the NVIDIA H100.

  • From John Levine@21:1/5 to All on Thu Apr 25 12:06:58 2024
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs,
    no.

    NVIDIA says their new Blackwell GPU takes 2000 watts, and is between
    7x and 25x more power efficient than the current H100, but that's
    still a heck of a lot of power. Data centers have had to come up with
    higher capacity power and cooling when each rack can use 40 to 50KW.

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Thu Apr 25 14:27:46 2024
    On Wed, 24 Apr 2024 22:29:36 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world,
    when it came out. Cruising speed of something over 80 megaflops,
    hitting bursts of about 120.

    IBM did try to compete in the “supercomputer” field for a while
    longer, but I think by about ten years later, it had given up.

    In the late 80s IBM joined forces with the "attack of the killer micros".
    Their first POWER CPU was released in 1990 and did 82 MFLOPS (peak).

    A single processor of contemporary Cray Y-MP was 4 times faster.
    A single processor of older Cray-2 was almost 6 times faster, but by
    1990 it was discontinued.
    Wikipedia says that power consumption of Cray-2 was 150–200 kW,
    probably for 4 processors with 2 GB of memory and peripherals.
    I can't find data about power consumption of IBM Power processor. My
    guess would be ~40 W for CPU and 1000-1500 W for a whole RS/6000
    Model 550 with 1 GB of memory.

    BTW, in the latest Top500 list you can see IBM at the #7 spot.
    Things that carry the name of Cray are listed at #2 and #5. They are,
    respectively, Intel Inside and AMD Inside.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to ldo@nz.invalid on Thu Apr 25 07:46:35 2024
    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs,
    no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Thu Apr 25 15:52:36 2024
    John Levine wrote:

    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world, when it came out. Cruising speed of something over 80 megaflops, hitting bursts of about 120.

    Its main practical improvement was that you could get two Crays for the price of one 360/195. (Not exactly, but close enough.)

    IBM did try to compete in the “supercomputer” field for a while longer, but I think by about ten years later, it had given up.

    IBM had tried to make computers very fast by making them very
    complicated. STRETCH was fantastically complex for something built out
    of individual transistors. The /91 and /195 had instruction queues and reservation stations and loop mode. Cray went in the opposite
    direction, making a much simpler computer where every individual bit,
    down to the chips and the wires, was as fast as possible.

    In many ways it was a preview of RISC.

    Seymour only did fast and simple, starting before the CDC 6600.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Apr 25 19:10:19 2024
    On Thu, 25 Apr 2024 15:52:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Seymour only did fast and simple, starting before the CDC 6600.....

    Do you attribute not exactly simple 6600 Scoreboard to Thornton?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Thu Apr 25 17:34:32 2024
    Michael S wrote:

    On Thu, 25 Apr 2024 15:52:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Seymour only did fast and simple, starting before the CDC 6600.....

    Do you attribute not exactly simple 6600 Scoreboard to Thornton?

    If you measure simplicity by gate count--the scoreboard was considerably simpler than the reservation station design of Tomasulo.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Savard on Thu Apr 25 17:52:35 2024
    John Savard <quadibloc@servername.invalid> schrieb:
    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html

    Compared to the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Thu Apr 25 17:49:11 2024
    John Levine <johnl@taugh.com> schrieb:
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include GPUs are
    praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs, no.

    NVIDIA says their new Blackwell GPU takes 2000 watts, and is between
    7x and 25x more power efficient than the current H100, but that's
    still a heck of a lot of power. Data centers have had to come up with
    higher capacity power and cooling when each rack can use 40 to 50KW.

    GPUs are very energy efficient per theoretical peak performance of
    calculations per second. Said peak performance is extremely high,
    hence the huge power requirements...

    But programming for GPUs is _much_ harder than programming for
    vector computers used to be. Getting to 10% of theoretical peak
    performance is quite impressive. Getting above 50% requires
    the right problem, good knowledge of the GPU internals (which NVIDIA
    does not tend to share - don't they want people to get good
    performance on their cards?) and lots of thought and _very_ clever
    algorithms.

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Apr 25 20:45:32 2024
    On Thu, 25 Apr 2024 17:34:32 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Michael S wrote:

    On Thu, 25 Apr 2024 15:52:36 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Seymour only did fast and simple, starting before the CDC
    6600.....

    Do you attribute not exactly simple 6600 Scoreboard to Thornton?

    If you measure simplicity by gate count--the scoreboard was
    considerably simpler than the reservation station design of Tomasulo.

    Both were far from simple by the standards of the day.

    BTW, wasn't the low gate count of the Scoreboard mostly due to creative
    use of what was later named wired-logic connections, i.e. something that
    stopped working in high-speed VLSI around 1985-1990?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Apr 25 19:17:04 2024
    According to Thomas Koenig <tkoenig@netcologne.de>:
    https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html

    Compared to the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Well, compared to what?

    In 1960 the total power generated in the US was about 750 TWh. In
    recent years it's over 4000 TWh.

    I see global data center power use in recent years of about 250 TWh,
    and about the same again in data transmission, but I don't know how
    much of that to attribute to the US.



    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Apr 25 22:29:30 2024
    Lawrence D'Oliveiro wrote:

    On Thu, 25 Apr 2024 15:52:36 +0000, MitchAlsup1 wrote:

    [Seymour] only did fast and simple, starting before the CDC 6600.....

    And he didn’t seem to have much truck with “memory management” and “operating systems”, did he? He probably saw them as just getting in the way of sheer speed.

    Base and bounds was good enough for numerical programs.

    On the other hand, NOS did things no other OS did.....

    And he didn’t care for some of the niceties of floating-point arithmetic either, for the same reason.

    Heck, FP arithmetic is only approximate anyway--it is just more
    approximate on my machines.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Thu Apr 25 22:23:49 2024
    On Thu, 25 Apr 2024 15:52:36 +0000, MitchAlsup1 wrote:

    [Seymour] only did fast and simple, starting before the CDC 6600.....

    And he didn’t seem to have much truck with “memory management” and “operating systems”, did he? He probably saw them as just getting in the way of sheer speed.

    And he didn’t care for some of the niceties of floating-point arithmetic either, for the same reason.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Fri Apr 26 00:26:58 2024
    On Thu, 25 Apr 2024 22:29:30 +0000, MitchAlsup1 wrote:

    On the other hand, NOS did things no other OS did.....

    Like what? I thought the original Cray OS was just a batch OS.

    Then they added this Unix-like “UNICOS” thing, but that seemed to me like an interactive front-end to the batch OS.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Apr 26 00:33:50 2024
    On Thu, 25 Apr 2024 14:57:53 +0300, Michael S wrote:

    Back when Fugaku was new, it was highly praised for being GPU-less
    design that matches and slightly exceeds an efficiency of GPU-based (and other vector accelerator based) supercomputers. But that was possible
    only because NVidia had an unusually long pause between successive
    generations of Tesla and at the same moment AMD and Intel GPGPUs were
    not yet considered fit for serious supercomputing.

    That was in November 2019. Never before or since.

    Fugaku is still at number 4 on the Top500, though--even after all these
    years. And don’t forget the Chinese systems, using their home-grown CPUs without access to Nvidia GPUs. There’s one at number 11.

    Should we be looking at the Green500 list instead?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Fri Apr 26 00:53:02 2024
    Lawrence D'Oliveiro wrote:

    On Thu, 25 Apr 2024 22:29:30 +0000, MitchAlsup1 wrote:

    On the other hand, NOS did things no other OS did.....

    Like what? I thought the original Cray OS was just a batch OS.


    One afternoon in 1978, I was in the typing room at NCR Cambridge typing
    in my 8085 ASM code; there were another 6 of us in there. NCR rented
    time on a CDC 7600 in San Diego.

    Suddenly there was a long pause where the silent 700's made no noise;
    and after 20 or so seconds, the pause ended and we proceeded along
    with our work. I discovered later that the San Diego machine had taken
    a hard crash and all our jobs had been picked up by the PPs and shipped
    en masse to a CDC 7600 in Chicago (including the files those jobs were
    using.)

    Then they added this Unix-like “UNICOS” thing, but that seemed to me like an interactive front-end to the batch OS.

    It was, and it was written in interpreted BASIC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Fri Apr 26 04:20:58 2024
    On Fri, 26 Apr 2024 00:33:50 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Thu, 25 Apr 2024 14:57:53 +0300, Michael S wrote:

    Back when Fugaku was new, it was highly praised for being GPU-less
    design that matches and slightly exceeds an efficiency of GPU-based
    (and other vector accelerator based) supercomputers. But that was
    possible only because NVidia had an unusually long pause between
    successive generations of Tesla and at the same moment AMD and
    Intel GPGPUs were not yet considered fit for serious supercomputing.

    That was in November 2019. Never before or since.

    Fugaku is still at number 4 on the Top500, though--even after all
    these years. And don’t forget the Chinese systems, using their
    home-grown CPUs without access to Nvidia GPUs. There’s one at number
    11.


    From the very little info I found about it, Sunway TaihuLight
    processors are likely more similar to Intel KNC (a.k.a. Xeon Phi
    co-processor) than to Fujitsu A64FX. I.e. simple in-order cores, likely
    2-way superscalar, with single-issue wide VPUs. In other words,
    decisively non-general-purpose.

    Should we be looking at the Green500 list instead?

    Of course we should be looking at Green500 when discussing power
    efficiency.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Apr 26 23:14:14 2024
    On Thu, 25 Apr 2024 14:27:46 +0300, Michael S wrote:

    On Wed, 24 Apr 2024 22:29:36 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:

    But the Cray-1 is not an improvement on the Model 195.

    The Cray-1 was widely regarded as the fastest computer in the world,
    when it came out. Cruising speed of something over 80 megaflops,
    hitting bursts of about 120.

    IBM did try to compete in the “supercomputer” field for a while longer, >> but I think by about ten years later, it had given up.

    BTW, in the latest Top500 list you can see IBM at the #7 spot.

    Those are POWER machines, an entirely different architecture from the
    System 360-and-successors line (which I think was meant by “Model 195”). And one which still has a bit of oomph left in it, obviously.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to mitchalsup@aol.com on Sat Apr 27 08:23:38 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    Let us face facts:: in the large; vector machines are DMA devices
    that happen to mangle the data on the way through.

    John Savard wrote:

    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    a) It is not pure BW but BW at a latency less than K. CRAY-1 was
    about 16-cycles (DRAM)

    DRAM for CRAY-1 doesn't sound right. Intel made 1024-bit DRAM in 1970,
    but it was pretty flaky and not very fast. I think the CRAY-1 used
    Fairchild 10K ECL 10ns SRAM.

    Andrew,

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to BGB on Sat Apr 27 11:30:29 2024
    BGB <cr88192@gmail.com> schrieb:

    Say, seemingly no one built an 8/16 bit mainframe,

    The IBM 360/30 and 360/40 actually had 8- and 16-bit
    microarchitectures, respectively. Of course, they hid it cleverly
    behind the user-visible architecture which was 32 bits.

    But then, the Nova was a 4-bit system cleverly disguising itself
    as a 16-bit system, and the Z80 had a 4-bit ALU, as well.

    or say using 24-bit
    floats (Say: S.E7.F16) rather than bigger formats, ...

    Konrad Zuse used 22-bit floats.

    Like, seemingly, the smallest point of computers was seemingly things
    like the 6502 and similar...

    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Sat Apr 27 11:48:03 2024
    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html

    Compared to the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Well, compared to what?

    Absolute figures, or relative :-)


    In 1960 the total power generated in the US was about 750 TWh. In
    recent years it's over 4000 TWh.

    My point was: Computers have become vastly more energy-efficient
    and powerful. I think one of the "What If 2" chapters is about
    building an iPhone out of vacuum tubes, which would end badly.

    This has led to _much_ more widespread adoption of computers plus
    derivatives such as smartphones or tablets, which means that
    their overall energy consumption has increased by many orders
    of magnitude over the 1950s, when just a few vacuum-tube based
    computers were in operation.

    If people make the claim that GPUs are more power-efficient than CPUs,
    yes, they are for equal performance (if they can be programmed
    efficiently enough for the application at hand). In practice, this
    will not be used for energy savings, but for doing more calculations.

    Same thing happened with steam engines - Watt's engines were a huge
    improvement in fuel efficiency over the previous Newcomen models,
    which led to many more steam engines being built.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to aph@littlepinkcloud.invalid on Sat Apr 27 15:13:03 2024
    aph@littlepinkcloud.invalid wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    Let us face facts:: in the large; vector machines are DMA devices
    that happen to mangle the data on the way through.

    John Savard wrote:

    And if memory bandwidth issues make Cray-style vector machines
    impractical, then wouldn't it be even worse for GPUs?

    a) It is not pure BW but BW at a latency less than K. CRAY-1 was
    about 16-cycles (DRAM)

    DRAM for CRAY-1 doesn't sound right. Intel made 1024-bit DRAM in 1970,
    but it was pretty flaky and not very fast. I think the CRAY-1 used
    Fairchild 10K ECL 10ns SRAM.

    That was the CRAY-1S; the S stands for SRAM.

    Also note 16 cycles at 12.5ns (200ns) is plenty of time for even
    early RAS/CAS DRAM.

    Andrew,

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Apr 27 16:41:19 2024
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Like, seemingly, the smallest point of computers was seemingly things
    like the 6502 and similar...

    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    I can believe it. The PDP-8 was a simple architecture and the S stood
    for bit Serial, and Stupendously Slow.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sat Apr 27 23:10:56 2024
    On Sat, 27 Apr 2024 16:41:19 -0000 (UTC), John Levine wrote:

    According to Thomas Koenig <tkoenig@netcologne.de>:

    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    I can believe it.

    You can probably find detailed schematics, on Bitsavers or elsewhere, to confirm it. DEC published that sort of thing as a matter of course, back
    in those days.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Sat Apr 27 23:08:21 2024
    On Sat, 27 Apr 2024 11:48:03 -0000 (UTC), Thomas Koenig wrote:

    If people make the claim that GPUs are more power-efficient than CPUs,
    yes, they are for equal performance (if they can be programmed
    efficiently enough for the application at hand). In practice, this will
    not be used for energy savings, but for doing more calculations.

    “Rebound effect”, I think it’s called.

    Remember all those science-fiction predictions from the earlier part of
    the 20th century, about cities on the Moon, personal flying transportation
    and all the rest of it? All that was predicated on having large sources of power--i.e. atomic power.

    Instead of having atomic-scale sources of power at our disposal, we got information processing (computers) instead, and almost nobody saw how big
    a revolution that would be. Meanwhile, the atomic-energy industry seemed
    to take a wrong turn, putting more effort into power production systems
    that would also aid the production of atomic weapons, instead of
    concentrating on predominantly peaceful technologies.

    Now the information processing power is reaching the limits of the
    available physical power. The only way to make significant further
    progress is to start boosting that physical power generation again.

    Same thing happened with steam engines - Watt's engines were a huge improvement in fuel efficiency over the previous Newcomen models, which
    led to many more steam engines being built.

    Watt’s engine (like Newcomen’s one before it) was an “atmospheric” engine:
    the pressure to drive it came from the atmosphere, not from the steam.

    True high-pressure “steam” engines were developed by Trevithick and
    others, after Watt’s patent had expired and he could no longer stop them.

    And that is what kicked off the Industrial Revolution.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Apr 28 16:19:24 2024
    According to Lawrence D'Oliveiro <ldo@nz.invalid>:
    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    I can believe it.

    You can probably find detailed schematics, on Bitsavers or elsewhere, to confirm it. DEC published that sort of thing as a matter of course, back
    in those days.

    The logic diagrams are in the back of the maintenance manual which
    Bitsavers does have, but at the moment I don't feel like going through
    and counting the gates.



    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sun Apr 28 14:06:24 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    BGB <cr88192@gmail.com> schrieb:

    Say, seemingly no one built an 8/16 bit mainframe,

    The IBM 360/30 and 360/40 actually had a 8 and 16-bit
    microarchitecture, respectively. Of course, they hid it cleverly
    behind the user-visible architecture which was 32 bits.

    But then, the Nova was a 4-bit system cleverly disguising itself
    as a 16-bit system, and the Z80 had a 4-bit ALU, as well.

    or say using 24-bit
    floats (Say: S.E7.F16) rather than bigger formats, ...

    Konrad Zuse used 22-bit floats.

    Like, seemingly, the smallest point of computers was seemingly things
    like the 6502 and similar...

    That was probably the PDP 8/S, which had (if Wikipedia is to be
    believed) around 519 logic gates. The 6502 had more.

    The LGP-30 had 113 tubes and 1450 diodes. The transistorized
    successor, the LGP-31, had about 460 transistors and about
    375 diodes (all per the wikipedia article on the LGP-30).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sun Apr 28 14:18:51 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    John Savard <quadibloc@servername.invalid> schrieb:

    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include
    GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs,
    yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/
    gpus-boost-energy-efficiency-in-supercomputers.html

    Compared the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Total energy consumption by computers in the 1950s was lower
    than today by at least a factor of 10. It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tkoenig@netcologne.de on Mon Apr 29 00:48:45 2024
    On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    John Levine <johnl@taugh.com> schrieb:

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    Possibly John meant to say "4Kwh", which actually would be a bit on
    the high side for the *average* home in the US.

    If he really meant 4Kw continuous ... wow!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to George Neuner on Mon Apr 29 08:13:47 2024
    George Neuner wrote:
    On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    John Levine <johnl@taugh.com> schrieb:

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    Possibly John meant to say "4Kwh", which actually would be a bit on
    the high side for the *average* home in the US.

    If he really meant 4Kw continuous ... wow!

    Here in Norway we abuse our hydro power as our primary house heating
    source, in our previous home we used about 60K KWh per year, which
    corresponds to 60K/(24*365.24) = 6.84 KW average, day & night.

    This was in fact while having a heat pump to handle the main part of the heating needs.

    The new house, which is from the same era (1962 vs 1963), uses
    significantly less, but probably still 30-40K /year.

    Electric power used to cost just under 1 NOK (about 9 cents at current
    exchange rates), including both primary power cost and transmission
    cost, but then we started exporting too much to Denmark/Sweden/Germany
    which means that we also imported their sometimes much higher power prices.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Apr 29 16:53:42 2024
    According to George Neuner <gneuner2@comcast.net>:
    On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    John Levine <johnl@taugh.com> schrieb:

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    Possibly John meant to say "4Kwh", which actually would be a bit on
    the high side for the *average* home in the US.

    Last month we used 92 KWh/day, which is 3.8KW. The ground source heat
    pump is how we heat the house and it was a fairly cool month. We also
    have a separate heat pump for hot water (tying it to the main system
    was absurdly expensive) and an induction stove which can use up to
    10KW.

    During the summer we use a lot less power. On the other hand, our
    bills for gas, propane, and fuel oil are zero.

    FWIW we pay about 12c/kwh which is fairly low for the U.S., with a
    complicated remote net metering discount in which we pretend that part
    of a solar farm in a nearby town is on our roof.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to terje.mathisen@tmsw.no on Mon Apr 29 21:39:55 2024
    On Mon, 29 Apr 2024 08:13:47 +0200, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    George Neuner wrote:
    On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    John Levine <johnl@taugh.com> schrieb:

    I mean, my entire house is wired for 24KW and usually runs at more
    like 4KW including a heat pump that heats the house.

    Good thing you're not living in Germany, your electricity bill
    would be enormous...

    Possibly John meant to say "4Kwh", which actually would a be a bit on
    the high side for the *average* home in the US.

    If he really meant 4Kw continuous ... wow!

    Here in Norway we abuse our hydro power as our primary house heating
    source, in our previous home we used about 60K KWh per year, which corresponds to 60K/(24*365.24) = 6.84 KW average, day & night.

    This was in fact while having a heat pump to handle the main part of the heating needs.

    The new house, which is from the same era (1962 vs 1963), uses
    significantly less, but probably still 30-40K /year.

    Electric power used to cost just under 1 NOK (about 9 cents at current exchange rates), including both primary power cost and transmission
    cost, but then we started exporting too much to Denmark/Sweden/Germany
    which means that we also imported their sometimes much higher power prices.

    Terje

    In the US, the majority of homes are heated with oil or gas (LNG or in
    rural areas it might be propane). Electric heat mainly is found in
    the south where overall need is low. Electric cooling is far more
    widespread. The majority of ovens are electric, but ~ 2/3 of cooktops
    are gas.

    Where I am, the per Kwh rates *currently* are
    0.17216 - generation
    0.09434 - distribution
    0.04052 - transmission
    0.00037 - transition (from? to?)
    0.00006
    0.00800
    0.00050
    0.02334 - efficiency (of what?)
    ------
    0.33929

    It's little wonder the current administration wants to force everyone
    to use only electricity ... it will bankrupt consumers trying to pay
    for energy, and bankrupt utilities trying to deliver it. Estimates
    are that the grid needs trillions of dollars in upgrades to handle the anticipated load [that the administration wants to force on it within
    5 years].

    I'd have a nuclear reactor in my basement if I could.

    YMMV.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Apr 30 14:54:52 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Adding the typical kind of vector-processing instructions to an
    instruction set inevitably leads to a combinatorial explosion in the
    number of opcodes.

    Why is that a problem that needs solving?

    This kind of thing makes a mockery of the R in RISC.

    So what?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Apr 30 16:26:35 2024
    Scott Lurndal wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Adding the typical kind of vector-processing instructions to an
    instruction set inevitably leads to a combinatorial explosion in the
    number of opcodes.

    Why is that a problem that needs solving?

    When your OpCode encoding space runs out of bits in the instruction.

    This kind of thing makes a mockery of the R in RISC.

    So what?

    Design + verification cost, time to market, Size of test vector set,
    and Compiler complexity.

    So, pretty close to the difference between binary floating point
    and decimal floating point.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Tue Apr 30 17:56:36 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    John Savard <quadibloc@servername.invalid> schrieb:

    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include
    GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs,
    yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/
    gpus-boost-energy-efficiency-in-supercomputers.html

    Compared the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Total energy consumption by computers in the 1950s was lower
    than today by at least a factor of 10.

    Undoubtedly true, but I think you're missing quite a few
    orders of magnitude there.

    It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    https://smithsonianeducation.org/scitech/carbons/1960.html states
    that, in 1954, there were 15 computers in the US. That seems low
    (did they only count IBM 701 machines?), but it reportedly went up to
    17000 in 1964.

    Even if you put the number of computers at 100 for the mid-1950s, at
    100 kW each, you only get 10 MW of power when they ran (which they often
    didn't; due to maintenance, these early computers seem to have been
    day shift only).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Apr 30 19:38:54 2024
    According to Thomas Koenig <tkoenig@netcologne.de>:
    It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    https://smithsonianeducation.org/scitech/carbons/1960.html states
    that, in 1954, there were 15 computers in the US. That seems low
    (did they only count IBM 701 machines?), but it reportedly went up to
    17000 in 1964.

    Wikipedia lists 18 UNIVACs shipped by 1954 so that's certainly low.
    With the 702, the ERA machines and the one-offs like JOHNNIAC I'd
    guess the number was more like 50, but soon increased with multiple
    IBM 704 and 650 machines starting in 1954.

    Even if you put the number of computers at 100 for the mid-1950s, at
    100 kW each, you only get 10 MW of power when they ran (which they often didn't; due to maintenance, these early computers seem to have been
    day shift only).

    The 650s at least ran all night. Alan Perlis told me some amusing
    stories of tripping in the dark over sleeping grad student wives who
    were holding their husbands' place in line for the 650 in the middle
    of the night. They soon made the scheduling more humane.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue Apr 30 20:25:48 2024
    On Tue, 30 Apr 2024 19:38:54 -0000 (UTC), John Levine wrote:

    The 650s at least ran all night. Alan Perlis told me some amusing stories
    of tripping in the dark over sleeping grad student wives who were
    holding their husbands' place in line for the 650 in the middle of the
    night. They soon made the scheduling more humane.

    How did they do that, though? Other than by hiring operators to work those shifts, so the users could submit their jobs in a queue and go home?

    Those early computers were expensive, hence the need for 24-hour batch operation to keep them as busy as possible, to earn their keep.

    That batch mentality is still characteristic of (what’s left of) IBM mainframes today.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Apr 30 20:31:17 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Adding the typical kind of vector-processing instructions to an instruction set inevitably leads to a combinatorial explosion in the number of opcodes.

    Why is that a problem that needs solving?

    When your OpCode encoding space runs out of bits in the instruction.

    And has that been a real problem yet? Pretty much every
    instruction set can be easily extended (viz. 8086),
    particularly with variable length encodings; nothing prevents
    one from adding a special 32-bit encoding that extends the
    instruction to 64 bits even in a fixed-size encoding scheme.
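
    As a rough sketch of what such an escape looks like (hypothetical
    opcode values and field layout, not any shipping ISA), the decoder
    simply treats one major opcode as "fetch another word":

        #include <stdint.h>

        /* Hypothetical: low 6 bits are the major opcode, and the value
           0x3F means "extended" -- the next 32-bit word carries the
           extra type/operand fields.                                   */
        #define OP_EXTENDED 0x3Fu

        typedef struct {
            uint32_t word0;   /* always present           */
            uint32_t word1;   /* valid only when extended */
            int      nbytes;  /* 4 or 8                   */
        } decoded_insn;

        decoded_insn decode(const uint32_t *stream)
        {
            decoded_insn d = { stream[0], 0, 4 };
            if ((d.word0 & 0x3Fu) == OP_EXTENDED) {
                d.word1  = stream[1];   /* second word of the long form */
                d.nbytes = 8;
            }
            return d;
        }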


    This kind of thing makes a mockery of the R in RISC.

    So what?

    Design + verification cost, time to market, Size of test vector set,
    and Compiler complexity.

    As contrasted with usability. ARM doesn't add features just
    for the sake of adding features, nor does Intel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Apr 30 21:12:04 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Adding the typical kind of vector-processing instructions to an instruction set inevitably leads to a combinatorial explosion in the number of opcodes.

    Why is that a problem that needs solving?

    When your OpCode encoding space runs out of bits in the instruction.

    And has that been a real problem yet? Pretty much every
    instruction set can be easily extended (viz. 8086),
    particularly with variable length encodings, nothing prevents
    one from adding a special 32-bit encoding that extends the
    instruction to 64-bits even in a fixed size encoding scheme.

    I suspect that as long as RISC-V maintains its 32-bit-only ISA,
    RISC-V will hit that wall first.


    This kind of thing makes a mockery of the R in RISC.

    So what?

    Design + verification cost, time to market, Size of test vector set,
    and Compiler complexity.

    As contrasted with usability. ARM doesn't add features just
    for the sake of adding features, nor does Intel.

    Are you sure ?? Take SSE-512 (or whatever Intel calls it) !!

    When I was at AMD (99-06) every 6 months or so, we (AMD) got Intel's
    latest instruction additions, and they got ours. Most of these
    additions end up at the 0.01% level of the dynamic instructions
    executed (over a wide range of programs (more than 40,000 traces)),
    and all cores had to have all of the instructions.

    Is this a burden on Intel:: not so much since they already have
    extensive (exhaustive??) tests and implementation libraries....

    Is this a burden on AMD:: yes, absolutely; the smaller design staff
    they can afford based on their revenue stream increases the burden significantly.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Scott Lurndal on Tue Apr 30 23:08:54 2024
    On Tue, 30 Apr 2024 20:31:17 GMT, Scott Lurndal wrote:

    ARM doesn't add features just for the sake of adding features, nor does Intel.

    There is such a thing as painting yourself into a corner, where every new feature added to the SIMD instruction set involves adding combinations of instructions, not just for the new types, but also for every single old
    type as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Apr 30 23:56:03 2024
    Lawrence D'Oliveiro wrote:

    On Tue, 30 Apr 2024 20:31:17 GMT, Scott Lurndal wrote:

    ARM doesn't add features just for the sake of adding features, nor does
    Intel.

    There is such a thing as painting yourself into a corner, where every new feature added to the SIMD instruction set involves adding combinations of instructions, not just for the new types, but also for every single old
    type as well.

    That is the combinatorial explosion mentioned above.
    {Although I would term it the Cartesian Product of types and OPs}

    Then contemplate for an instant that one would want SIMD instructions for Complex numbers and Hamiltonian Quaternions......
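
    To put a number on that Cartesian product, a throwaway sketch (the
    type/op/width lists here are made up, not any particular vendor's):

        #include <stdio.h>

        int main(void)
        {
            /* Invented lists, just to show how the opcode count multiplies out. */
            const char *types[]  = { "i8", "i16", "i32", "i64",
                                     "f16", "f32", "f64" };
            const char *ops[]    = { "add", "sub", "mul", "min", "max", "cmp" };
            const int   widths[] = { 128, 256, 512 };

            int nt = sizeof types  / sizeof types[0];
            int no = sizeof ops    / sizeof ops[0];
            int nw = sizeof widths / sizeof widths[0];

            /* 7 x 6 x 3 = 126 opcodes, before masking, saturation and
               rounding modes get multiplied in as well.                */
            printf("%d distinct SIMD opcodes\n", nt * no * nw);
            return 0;
        }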

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Wed May 1 00:18:29 2024
    On Tue, 30 Apr 2024 23:56:03 +0000, MitchAlsup1 wrote:

    Then contemplate for an instant that one would want SIMD instructions
    for Complex numbers and Hamiltonian Quater[n]ions......

    Quaternions yeah! Interesting that they actually predated vector algebra <https://www.youtube.com/watch?v=M12CJIuX8D4> (from the wonderful “Kathy Loves Physics & History” channel), and then the mathematicians realized
    that it was a bit simpler to separate out the components and deal with
    them separately, rather than carry them around all the time. Some of the
    “old guard” resisted this move ...

    And now they’ve made a comeback in computer graphics, for representing rotations, particularly of armature “bones” used in posing and animating characters.

    I’m not sure you really need SIMD instructions for quaternions, though. Consider that the typical use of such instructions is to process millions
    or even billions of data items (e.g. pixels, maybe even geometry
    coordinates for complex models), whereas the number of bones in an
    armature is maybe a few thousand at most.
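
    For reference, the Hamilton product itself is only 16 multiplies and 12
    adds per pair of quaternions -- a plain scalar sketch (the struct layout
    is my own choice):

        typedef struct { double w, x, y, z; } quat;   /* w + xi + yj + zk */

        /* Hamilton product p*q; order matters, quaternions don't commute. */
        quat qmul(quat p, quat q)
        {
            quat r;
            r.w = p.w*q.w - p.x*q.x - p.y*q.y - p.z*q.z;
            r.x = p.w*q.x + p.x*q.w + p.y*q.z - p.z*q.y;
            r.y = p.w*q.y - p.x*q.z + p.y*q.w + p.z*q.x;
            r.z = p.w*q.z + p.x*q.y - p.y*q.x + p.z*q.w;
            return r;
        }

    Looping over a few thousand bones with that is cheap with or without
    SIMD, which is rather the point.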

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to ldo@nz.invalid on Wed May 1 01:22:21 2024
    It appears that Lawrence D'Oliveiro <ldo@nz.invalid> said:
    On Tue, 30 Apr 2024 19:38:54 -0000 (UTC), John Levine wrote:

    The 650s at least ran all night. Alan Perlis told me some amusing stories
    of tripping in the dark over sleeping grad student wives who were
    holding their husbands' place in line for the 650 in the middle of the
    night. They soon made the scheduling more humane.

    How did they do that, though? Other than by hiring operators to work those shifts, so the users could submit their jobs in a queue and go home?

    Rather than just queueing up, they arranged it so the student could
    sign up ahead of time, and then show up whenever to do his work, and
    the wives could get some sleep.

    I also think he tried to round up some money to get another computer.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Wed May 1 03:06:04 2024
    John Levine wrote:

    It appears that Lawrence D'Oliveiro <ldo@nz.invalid> said:
    On Tue, 30 Apr 2024 19:38:54 -0000 (UTC), John Levine wrote:

    The 650s at least ran all night. Alan Perlis told me some amusing stories of tripping in the dark over sleeping grad student wives who were
    holding their husbands' place in line for the 650 in the middle of the
    night. They soon made the scheduling more humane.

    How did they do that, though? Other than by hiring operators to work those shifts, so the users could submit their jobs in a queue and go home?

    Rather than just queueing up, they arranged it so the student could
    sign up ahead of time, and then show up whenever to do his work, and
    the wives could get some sleep.

    I also think he tried to round up some money to get another computer.

    I remember getting up at 3:00 AM to get exclusive access to the IBM 360/67
    to run various student programs with much better response time than when
    30 other people were trying to do the same. {Everybody else, except the
    system operator, had left by then::at least statistically.}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Tue Apr 30 23:58:16 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    John Savard <quadibloc@servername.invalid> schrieb:

    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include
    GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs,
    yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/
    gpus-boost-energy-efficiency-in-supercomputers.html

    Compared the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Total energy consumption by computers in the 1950s was lower
    than today by at least a factor of 10.

    Undoubtedly true, but I think you're missing quite a few
    orders of magnitude there.

    Probably not as many as you think. :)

    It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    https://smithsonianeducation.org/scitech/carbons/1960.html states
    that, in 1954, there were 15 computers in the US. That seems low
    (did they only count IBM 701 machines?), but it reportedly went up to
    17000 in 1964.

    Even if you put the number of computers at 100 for the mid-1950s, at
    100 kW each, you only get 10 MW of power when they ran (wich they often didn't; due to maintenance, these early computers seem to have been
    day shift only).

    Oh boy, numbers.

    First your question asked about the late 1950s, not the mid 1950s.

    I estimated between 10,000 and 20,000 computers by the end of
    the 1950s, and chose 5 KW as an average consumption. In those
    days computers were big. Probably the estimate for number of
    machines is a bit on the high side, and the average consumption
    is a bit on the low side. I'm only estimating.

    The most popular computer in the 1950s was the IBM 650. 2,000
    units sold (or in some cases given away).

    In contrast, the LGP-30 turned out only 500 units, at a mere
    1500 W each.

    Towards the end of the 1950s both the IBM 1620 and the IBM 1401
    came out. Of course neither of these was delivered
    until the 1960s, but the IBM 1401 delivered 10,000 units all
    on its own.

    I looked up a few other IBM models, didn't get any unit numbers on
    any of them. I didn't even try to look up models or numbers of
    units from other manufacturers (not counting the LGP-30, since I
    happened to have a wikipedia page open already for that). But
    based on just the number of different IBM models, and knowing that
    the 650 produced 2,000 units, and keeping in mind the number of
    different computer manufacturers at that time, suggests that 10,000
    systems overall is a plausible guess.

    Also there is a noteworthy computer system developed in the 1950s
    that is often overlooked. Only 24 units were installed. Each
    installation occupied 22,000 square feet, weighed 250 tons, had
    60,000 tubes, and used 3 MW. So that's 72 MW all by itself (to be
    fair some parts were turned off at times for maintenance, but at
    least half of each installation was up at all times).

    I did a very different kind of calculation to estimate how much
    power is used in today's computers. The result was more than
    ten times as much, but less than 100 times as much. Remember,
    I'm just estimating. But I had enough confidence in the estimates
    to say at least a factor of 10, which seems more than adequate to
    answer the question asked (and that's all I was doing).

    What's the largest computer ever built? The AN/FSQ-7. Only 24
    installed, for an aggregate weight of 6,000 tons.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Wed May 1 08:56:47 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    John Savard <quadibloc@servername.invalid> schrieb:

    On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:

    One of the things that those supercomputers that _do_ include
    GPUs are praised for is being energy-efficient.

    That I never heard before. I heard it in relation to ARM CPUs,
    yes, GPUs, no.

    Here's one example of an item about this:

    https://www.infoworld.com/article/2627720/
    gpus-boost-energy-efficiency-in-supercomputers.html

    Compared the late 1950s, was the total energy consumption by
    computers higher or lower than today? :-)

    Total energy consumption by computers in the 1950s was lower
    than today by at least a factor of 10.

    Undoubtedly true, but I think you're missing quite a few
    orders of magnitude there.

    Probably not as many as you think. :)

    It wouldn't surprise
    me to discover the energy consumption of just the servers in
    Amazon Web Services datacenters exceeds the 1950s total, and
    that's only AWS (reportedly more than 1.4 million servers).

    https://smithsonianeducation.org/scitech/carbons/1960.html states
    that, in 1954, there were 15 computers in the US. That seems low
    (did they only count IBM 701 machines?), but it reportedly went up to
    17000 in 1964.

    Even if you put the number of computers at 100 for the mid-1950s, at
    100 kW each, you only get 10 MW of power when they ran (wich they often
    didn't; due to maintenance, these early computers seem to have been
    day shift only).

    Oh boy, numbers.

    First your question asked about the late 1950s, not the mid 1950s.

    I estimated between 10,000 and 20,000 computers by the end of
    the 1950s, and chose 5 KW as an average consumption. In those
    days computers were big. Probably the estimate for number of
    machines is a bit on the high side, and the average consumption
    is a bit on the low side. I'm only estimating.

    The number of computers is probably high, the power maybe somewhat
    low, but let us take it as a basis - 2*10^4 computers with 5*10^3
    Watt, total power if they are all on at the same time 10^8 Watt.
    Let's assume an operating time of 4000 hours, so total energy
    consumption would be around 1.44*10^15 J or 4*10^8 kWh, or
    0.4 Terawatt-hours.
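
    The same estimate as a few lines of C, just so the units stay explicit
    (the 2*10^4 machines, 5 kW and 4000 h figures are the assumptions
    above, nothing more):

        #include <stdio.h>

        int main(void)
        {
            double machines  = 2e4;     /* installed computers (upper estimate) */
            double avg_power = 5e3;     /* Watt per machine                     */
            double hours     = 4000.0;  /* operating hours per year             */

            double watts = machines * avg_power;   /* 1e8 W            */
            double kwh   = watts * hours / 1e3;    /* 4e8 kWh per year */
            double twh   = kwh / 1e9;              /* 0.4 TWh per year */

            printf("%.0e W, %.0e kWh/year, %.1f TWh/year\n", watts, kwh, twh);
            return 0;
        }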

    For today, we don't need to make an estimate
    ourselves, we can use other people's. Looking at https://frontiergroup.org/resources/fact-file-computing-is-using-more-energy-than-ever/
    one finds that data centers alone use around 240-340 Terawatt-hours,
    so we have a factor of a bit less than 1000 already. The total
    sector, according to the same source, and also according to https://researchbriefings.files.parliament.uk/documents/POST-PN-0677/POST-PN-0677.pdf
    is around three times that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed May 1 08:20:58 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Thu May 2 10:13:33 2024
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for
    Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?
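
    Right -- for one double-precision complex value per 128-bit register
    ({re,im}, real part in the low lane) the usual SSE3 idiom is roughly
    the following (a sketch, not tuned code; needs SSE3 enabled):

        #include <pmmintrin.h>   /* SSE3, for _mm_addsub_pd */

        /* x = {a,b}, y = {c,d}; returns {a*c - b*d, b*c + a*d}. */
        __m128d cmul(__m128d x, __m128d y)
        {
            __m128d yre = _mm_unpacklo_pd(y, y);    /* {c, c}     */
            __m128d yim = _mm_unpackhi_pd(y, y);    /* {d, d}     */
            __m128d xsw = _mm_shuffle_pd(x, x, 1);  /* {b, a}     */
            __m128d t1  = _mm_mul_pd(x,   yre);     /* {a*c, b*c} */
            __m128d t2  = _mm_mul_pd(xsw, yim);     /* {b*d, a*d} */
            return _mm_addsub_pd(t1, t2);
        }

    i.e. the unpacks/shuffle are exactly the flipped copies described
    above, and they cost extra instructions on every single product.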

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Thu May 2 10:58:12 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    I'm actually not sure how that would help. Could you elaborate?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Thu May 2 20:10:35 2024
    Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for
    Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    I'm actually not sure how that would help. Could you elaborate?


    VVM synthesizes SIMD (lanes) and strip-mining (Cray-like vectors) while processing SCALAR code. So, as long as the compiler knows which operands
    are participating, almost any amount of <strange> Complexity drops out
    for free -- including things like Quaternions.
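
    As an illustration of how little the source has to contain, here is the
    Hamilton product written as plain scalar C (just a sketch, not My 66000
    code); a loop applying it over arrays is all a VVM-style implementation
    would need to see:

    typedef struct { double w, x, y, z; } quat;

    /* Hamilton product r = a*b, written as ordinary scalar code. */
    static quat qmul(quat a, quat b)
    {
        quat r;
        r.w = a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z;
        r.x = a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y;
        r.y = a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x;
        r.z = a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w;
        return r;
    }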

    Physicists like quaternions because it means they don't have to worry
    about whether to add or subtract; the {i,j,k} does it for them. Complex
    is OK for flat spaces, but when one is dealing with non-Cartesian
    coordinates (like within the radius of the proton) other effects make
    quaternions a better path.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu May 2 20:14:09 2024
    BGB wrote:

    On 4/30/2024 8:22 PM, John Levine wrote:


    Sometimes it seems odd that people manage to find wives at all, with
    as many difficulties and prerequisites as there seem to be in being
    seen as "worthy of attention", etc...


    Then again, it seems that there is a split:
    Many people seem to marry off between their early to mid 20s;
    Like, somehow, they find someone where there is mutual interest.

    More than ½ of whom end up divorced within 7 years.

    Others, not so quickly, if at all.

    You mean the lucky ones ?!?

    On the female side, it seems there are several subgroups:
    Those who are waiting for "the perfect romance".
    Those who want someone with at least a "6 figure income", etc.
    Then there are the asexual females.
    And also lesbians.

    If you don't know what you are looking for, how do you know when
    you find it ?!!!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Fri May 3 03:46:22 2024
    On Thu, 2 May 2024 20:14:09 +0000, MitchAlsup1 wrote:

    If you don't know what you are looking for, how do you know when you
    find it ?!!!

    Maybe the procedure for determining that you’ve found it is recursively enumerable, but that for doing the search is not? ;)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Fri May 3 10:23:43 2024
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for
    Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    I'm actually not sure how that would help. Could you elaborate?

    Just that all his code is scalar, but when you have a bunch of these
    complex mul/mac operations in a loop, his hw will figure out the
    recurrences and run them as fast as possible, with all the (Re,Im) SIMD
    flips becoming NOPs.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Fri May 3 09:40:33 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Then contemplate for an instant that one would want SIMD instructions for
    Complex numbers and Hamiltonian Quaternions......

    Quaternions would be a bit over the top, I think. Complex
    multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is

    fmul Rt1,Rc,Rb
    fmac Rf,Rd,Ra,Rt1

    fmul Rt2,Rd,Rb
    fmac Re,Rc,Ra,-Rt2

    So, you'd need both operands on both lanes. Not very SIMD-friendly,
    I would assume, but (probably) not impossible, either.

    If you have the four operands spread across two SIMD registers, so
    (Re,Im) in each, then you need an initial pair of permutes to make
    flipped copies before you can start the fmul/fmac ops, right?

    This is exactly the kind of code where Mitch's transparent vector
    processing would be very nice to have.

    I'm actually not sure how that would help. Could you elaborate?

    Just that all his code is scalar, but when you have a bunch of these
    complex mul/mac operations in a loop, his hw will figure out the
    recurrences and run them as fast as possible, with all the (Re,Im) SIMD
    flips becoming NOPs.

    Sure.

    This would then be something like (in the loop)

    vec r6,{}                 // mark the start of the vectorized loop
    ldd r7,[r1,r5,0]          // r7  = Re(x[i]) = a
    ldd r8,[r1,r5,8]          // r8  = Im(x[i]) = b
    ldd r9,[r2,r5,0]          // r9  = Re(y[i]) = c
    ldd r10,[r2,r5,8]         // r10 = Im(y[i]) = d
    fmul r11,r9,r8            // r11 = c*b
    fmac r11,r10,r7,r11       // r11 = a*d + b*c  (imaginary part)
    fmul r8,r10,r8            // r8  = b*d
    fmac r7,r9,r7,-r8         // r7  = a*c - b*d  (real part)
    std r7,[r3,r5,0]          // store real part
    std r11,[r3,r5,8]         // store imaginary part
    loop1 lt,r5,r4,#16        // r5 += 16; repeat while r5 < r4

    but it would not help in a case where previous results were already
    in registers.
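
    Read back as C (assuming r1 and r2 point at interleaved (re,im) source
    arrays, r3 at the destination, and r4 holds a byte count; the names are
    only for illustration), the loop computes:

    #include <math.h>
    #include <stddef.h>

    static void cmul_array(const double *x, const double *y,
                           double *z, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes / 16; i++) {   /* 16 bytes per complex */
            double a = x[2*i], b = x[2*i + 1];       /* ldd r7, ldd r8  */
            double c = y[2*i], d = y[2*i + 1];       /* ldd r9, ldd r10 */
            z[2*i]     = fma(c, a, -(b * d));        /* a*c - b*d */
            z[2*i + 1] = fma(d, a,   b * c);         /* a*d + b*c */
        }
    }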

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat May 4 21:19:32 2024
    BGB wrote:

    On 5/2/2024 10:46 PM, Lawrence D'Oliveiro wrote:
    On Thu, 2 May 2024 20:14:09 +0000, MitchAlsup1 wrote:

    If you don't know what you are looking for, how do you know when you
    find it ?!!!

    Maybe the procedure for determining that you’ve found it is recursively
    enumerable, but that for doing the search is not? ;)

    I think it is a case of determining if someone responds in favorable
    ways to interactions, does not respond in unfavorable ways, and does not
    present any obvious "deal breakers".

    Presumably other people are doing something similar, but with different metrics.

    Different definitions !!

    Though, granted, the whole process tends to be horribly inefficient.

    If you are anything like a normal male, you are compatible with about
    1% of women of your age group. Likewise, any given woman in your age
    group has about a 1% chance of being compatible with you ±.

    So, you (and her) will have to pass over 10,000 of the others to end
    up with a compatible partner.

    There generally doesn't exist any good way to determine who exists in a
    given area, or to get a general idea for who may or may not be worth the
    time/effort of interacting with them.

    Women, by and large, do the picking:: men, by and large, do the
    acquiescing.

    Many dating sites (and people on them) seem to operate under the
    assumption of "will post pictures, good enough".

    Dating sites are for losers. P E R I O D

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sat May 4 22:34:24 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    BGB wrote:

    If you are anything like a normal male, you are compatible with about
    1% of women of your age group. Likewise, any given woman in your age
    group has about a 1% chance of being compatible with you ±.

    So, you (and her) will have to pass over 10,000 of the others to end
    up with a compatible partner.

    Even assuming that the numbers are true (far too low, IMHO), the
    calculation assumes that both quantities are uncorrelated.

    If it were really true, humans would long since have died out
    (unless "compatible" means something else :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sun May 5 00:07:57 2024
    BGB wrote:

    On 5/4/2024 4:19 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 5/2/2024 10:46 PM, Lawrence D'Oliveiro wrote:
    On Thu, 2 May 2024 20:14:09 +0000, MitchAlsup1 wrote:

    If you don't know what you are looking for, how do you know when you
    find it ?!!!

    Maybe the procedure for determining that you’ve found it is recursively
    enumerable, but that for doing the search is not? ;)

    I think it is a case of determining if someone responds in favorable
    ways to interactions, does not respond in unfavorable ways, and does
    not present any obvious "deal breakers".

    Presumably other people are doing something similar, but with
    different metrics.

    Different definitions !!


    Not sure what you mean by this, exactly.

    The things that make a man attractive to a woman are completely different
    from the things that make a woman attractive to a man.

    Though, granted, the whole process tends to be horribly inefficient.

    If you are anything like a normal male, you are compatible with about
    1% of women of your age group. Likewise, any given woman in your age
    group has about a 1% chance of being compatible with you ±.

    So, you (and her) will have to pass over 10,000 of the others to end
    up with a compatible partner.

    At this point, looks almost like it could be closer to 0.

    Or, at least, most of the ones I might be interested in talking to,
    aren't in the same geographic area.

    Or you are not frequenting the areas that those who might be compatible
    with you frequent.

    There generally doesn't exist any good way to determine who exists in
    a given area, or to get a general idea for who may or may not be worth
    the time/effort of interacting with them.

    Women, by and large, do the picking:: men, by and large, do the
    acquiescing.


    Not much point in trying to interact with them though if there is no
    reason to think it might be worth the effort of doing so.



    Many dating sites (and people on them) seem to operate under the
    assumption of "will post pictures, good enough".

    Dating sites are for losers. P E R I O D


    Somehow, the actual sites still manage to be more dignified than the
    Facebook groups or phone apps, which lean much more heavily into the pointless aspects...

    Apps are no different than dating sites:: see above.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sun May 5 18:38:19 2024
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    BGB wrote:

    If you are anything like a normal male, you are compatible with about
    1% of women of your age group. Likewise, any given woman in your age
    group has about a 1% chance of being compatible with you ±.

    So, you (and her) will have to pass over 10,000 of the others to end
    up with a compatible partner.

    Even assuming that the numbers are true (far too low, IMHO), the
    calculation assumes that both quantities are uncorrelated.

    If it were really true, humans would long since have died out
    (unless "compatible" means something else :-)

    The 1% number is for me. {smart enough, pretty enough, frugal enough,
    sane enough, low maintenance.} I know of more typical males/females
    whose number is closer to 20%.

    1% may be a "little high" for BGB and whoever might be mutually acceptable.

    There is one thing worse than being alone--and that is being with someone
    you seriously dislike.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)