Adding the typical kind of vector-processing instructions to an
instruction set inevitably leads to a combinatorial explosion in the
number of opcodes. This kind of thing makes a mockery of the “R” in “RISC”.
Interesting to see that the RISC-V folks are staying off this path;
instead, they are reviving an old idea from Seymour Cray’s original machines: a vector pipeline. Instead of being limited
to processing 4 or 8 operands at a time, the Cray machines could operate (sequentially, but rapidly) on variable-length vectors of up to 64
elements with a single setup sequence. RISC-V seems to make the limit on vector length an implementation choice, with a value of 32 being mentioned
in the spec.
The way it avoids having separate instructions for each combination of operand types is to have operand-type registers as part of the vector
unit. This way, only a small number of instructions is required to set up
all the combinations of operand/result types. You then give it a kick in
the guts and off it goes.
Detailed spec here: <https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc>.
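To make that concrete, here is roughly what a strip-mined loop in this style looks like, written with the RISC-V vector C intrinsics (intrinsic names as best I recall from the v1.0 intrinsics spec; treat the exact spellings as an assumption). The point is that vsetvl and the vtype state it writes carry the element width and the vector length, so the same add opcode serves every element type and whatever VLEN an implementation happens to pick:

    #include <riscv_vector.h>
    #include <stddef.h>

    void vadd_f32(size_t n, const float *a, const float *b, float *c)
    {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m1(n);            /* elements handled this pass */
            vfloat32m1_t va = __riscv_vle32_v_f32m1(a, vl); /* vector loads */
            vfloat32m1_t vb = __riscv_vle32_v_f32m1(b, vl);
            vfloat32m1_t vc = __riscv_vfadd_vv_f32m1(va, vb, vl);
            __riscv_vse32_v_f32m1(c, vc, vl);               /* vector store */
            a += vl; b += vl; c += vl; n -= vl;
        }
    }

At the ISA level the add is a single vfadd.vv opcode; the element width lives in the vtype register that vsetvl wrote, which is the "operand-type registers in the vector unit" idea described above.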
CRAY machines stayed "in style" as long as memory latency remained
smaller than the length of a vector (64 cycles) and fell out of favor
when the cores got fast enough that memory could no longer keep up.
On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
CRAY machines stayed "in style" as long as memory latency remained
smaller than the length of a vector (64 cycles) and fell out of favor
when the cores got fast enough that memory could no longer keep up.
So why would conventional short vectors work better, then? Surely the
latency discrepancy would be even worse for them.
So what's the benefit of using vector/SIMD instructions at all rather
than doing it with scalar code?
If your OoO capabilities are limited (and I think
they are on the Cray machines), you cannot start the second iteration of
the doall loop before the processing step of the first iteration has
finished with the register.
On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:
If your OoO capabilities are limited (and I think
they are on the Cray machines), you cannot start the second iteration of
the doall loop before the processing step of the first iteration has
finished with the register.
How would out-of-order execution help, anyway, given all the operations on the vector elements are supposed to be identical?
On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
CRAY machines stayed "in style" as long as memory latency remained
smaller than the length of a vector (64 cycles) and fell out of favor
when the cores got fast enough that memory could no longer keep up.
So why would conventional short vectors work better, then? Surely the
latency discrepancy would be even worse for them.
On Tue, 23 Apr 2024 06:22:38 GMT, Anton Ertl wrote:
So what's the benefit of using vector/SIMD instructions at all rather
than doing it with scalar code?
On the original Cray machines, I read somewhere the benefit of using the vector versions over the scalar ones was a net positive for a vector
length as low as 2.
If your OoO capabilities are limited (and I think
they are on the Cray machines), you cannot start the second iteration of
the doall loop before the processing step of the first iteration has
finished with the register.
How would out-of-order execution help, anyway, given all the operations on the vector elements are supposed to be identical? Unless it’s just greater parallelism.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
CRAY machines stayed "in style" as long as memory latency remained
smaller than the length of a vector (64 cycles) and fell out of favor
when the cores got fast enough that memory could no longer keep up.
Mitch Alsup repeatedly makes this claim without giving any
justification. Your question may shed some light on that.
So why would conventional short vectors work better, then? Surely the latency discrepancy would be even worse for them.
Thinking about it, they probably don't work better. They just don't
work worse, so why spend area on 4096-bit vector registers like the
Cray-1 did when 128-512-bit vector registers work just as well?
Plus,
they have 200 or so of these registers, so 4096-bit registers would be
really expensive. How many vector registers does the Cray-1 (and its successors) have?
On modern machines OoO machinery bridges the latency gap between the
L2 cache, maybe even the L3 cache and the core for data-parallel code.
For the latency gap to main memory there are the hardware prefetchers,
and they use the L1 or L2 cache as intermediate buffer, while the
Cray-1 and follow-ons use vector registers.
So what's the benefit of using vector/SIMD instructions at all rather
than doing it with scalar code? A SIMD instruction that replaces n
scalar instructions consumes fewer resources for instruction fetching, decoding, register renaming, administering the instruction in the OoO
engine, and in retiring the instruction.
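As a concrete (hedged) illustration in C with SSE intrinsics: the loop body below issues one load per operand, one add and one store for every four floats, so the front end fetches, decodes, renames and retires roughly a quarter of the instructions of the equivalent scalar loop (n is assumed to be a multiple of 4 just to keep the sketch short):

    #include <xmmintrin.h>
    #include <stddef.h>

    void add_f32x4(size_t n, const float *a, const float *b, float *c)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* one add instead of four */
        }
    }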
So why not use SIMD instructions with longer vector registers? The progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
suggests that this is happening, but with every doubling the cost in
area doubles but the returns are diminishing thanks to Amdahl's law.
So at some point you stop. Intel introduced AVX-512 for Larrabee (a special-purpose machine), and now is backpedaling with desktop, laptop
and small-server CPUs (even though only the Golden/Raptor Cove cores
are enabled on the small-server CPUs) only supporting AVX, and with
AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
vector registers are already too costly for the benefit they give in general-purpose computing.
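To put a rough number on those diminishing returns: if, say, 80% of a program's work vectorizes perfectly (an illustrative assumption), 8 single-precision lanes (256 bits) give a speedup of 1/(0.2 + 0.8/8) = 3.3, while 16 lanes (512 bits) give 1/(0.2 + 0.8/16) = 4.0; doubling the datapath area buys only about 20% more overall speedup.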
Back to old-style vector processors. There have been machines that
supported longer vector registers and AFAIK also memory-to-memory
machines. The question is why have they not been the answer of the vector-processor community to the problem of covering the latency? Or
maybe they have? AFAIK NEC SX has been available in some form even in
recent years, maybe still.
Anyway, after thinking about this, the reason behind Mitch Alsup's
statement is that in a
doall(load process store)
computation (like what SIMD is good at), the loads precede the
corresponding processing by the load latency (i.e., memory latency on
the Cray machines). If your OoO capabilities are limited (and I think
they are on the Cray machines), you cannot start the second iteration
of the doall loop before the processing step of the first iteration
has finished with the register.
You can do a bit of software
pipelining and software register renaming by transforming this into
load1 doall(load2 process1 store1 load1 process2 store2)
but at some point you run out of vector registers.
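A hedged scalar-C sketch of that transformation, with two buffers standing in for two vector registers, so the load for chunk k+1 overlaps the processing of chunk k (the buffer length and the scale-by-a kernel are just illustrative choices, and n is assumed to be a multiple of VL):

    #include <stddef.h>
    #define VL 64                       /* Cray-1-style vector length */

    void scale_pipelined(size_t n, float a, const float *x, float *y)
    {
        float buf[2][VL];
        size_t nchunks = n / VL;
        if (nchunks == 0) return;

        for (size_t j = 0; j < VL; j++)                 /* load1 */
            buf[0][j] = x[j];

        for (size_t c = 0; c < nchunks; c++) {
            size_t cur = c & 1, nxt = cur ^ 1;
            if (c + 1 < nchunks)                        /* load for the next chunk */
                for (size_t j = 0; j < VL; j++)
                    buf[nxt][j] = x[(c + 1) * VL + j];
            for (size_t j = 0; j < VL; j++)             /* process + store this chunk */
                y[c * VL + j] = a * buf[cur][j];
        }
    }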
One thing that comes to mind is tracking individual parts of the
vector registers, which allows starting the next iteration as soon
as the first part of the vector register no longer has any readers.
However, it's probably not that far off in complexity to tracking
shorter vector registers in an OoO engine. And if you support
exceptions (the Crays probably don't), this becomes messy, while with
short vector registers it's easier to implement the (ISA)
architecture.
- anton
As to why RISC-V went shorter ...
On 4/23/2024 1:22 AM, Anton Ertl wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
<big snip>
As can be noted, SIMD is easy to implement.
Main obvious drawback is the potential for combinatorial explosions of instructions. One needs to keep a fairly careful watch over this.
Like, if one is faced with an NxN or NxM grid of possibilities, naive strategy is to be like "I will define an instruction for every
possibility in the grid.", but this is bad. More reasonable to devise a minimal set of instructions that will allow the operation to be done
within a reasonable number of instructions.
But, then again, I can also note that I axed things like packed-byte operations and saturating arithmetic, which are pretty much de-facto in packed-integer SIMD.
Likewise, a lot of the gaps are filled in with specialized converter and helper ops. Even here, some conversion chains will require multiple instructions.
Well, and if there is no practical difference between a scalar and SIMD version of an instruction, may well just use the SIMD version for scalar.
....
- anton
On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:
As to why RISC-V went shorter ...
They didn’t fix a length.
Lawrence D'Oliveiro wrote:
On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:
As to why RISC-V went shorter ...
They didn’t fix a length.
Nor do they want to have to save a page of VRF at context switch.
On Tue, 23 Apr 2024 22:40:25 +0000, MitchAlsup1 wrote:
Lawrence D'Oliveiro wrote:
On Tue, 23 Apr 2024 17:34:12 +0000, MitchAlsup1 wrote:
As to why RISC-V went shorter ...
They didn’t fix a length.
Nor do they want to have to save a page of VRF at context switch.
But then, you don’t need a whole array of registers, do you: you just need address registers (one for each operand and one for the destination), plus a counter.
On 4/23/2024 5:39 PM, MitchAlsup1 wrote:
BGB wrote:
MANY SIMD algorithms need saturating arithmetic because they cannot do
b + b -> h and avoid the overflow. And they cannot do B + b -> h because
that would consume vast amounts of encoding space.
There are ways to fake it.
Though, granted, most end up involving extra instructions and 1 bit of dynamic range.
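One common way to "fake" a saturating byte add when the ISA lacks it, as a hedged sketch (this is the generic widen-and-clamp approach, not necessarily the exact trick meant above): widen to 16 bits so b + b -> h cannot overflow, clamp, then narrow. On packed-SIMD hardware each step would itself be a packed operation; it is shown per element here for clarity.

    #include <stdint.h>

    int8_t sat_add_i8(int8_t a, int8_t b)
    {
        int16_t s = (int16_t)a + (int16_t)b;   /* b + b -> h: cannot overflow */
        if (s >  127) s =  127;                /* clamp to the int8 range */
        if (s < -128) s = -128;
        return (int8_t)s;
    }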
DIV:
Didn't bother with this.
Typically faked using multiply-by-reciprocal and taking the high result.
CRAY machines stayed "in style" as long as memory latency remained smaller than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.
I wish them well, but I expect it will not work out as they desire.....
On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
CRAY machines stayed "in style" as long as memory latency remained smaller than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.
I wish them well, but I expect it will not work out as they desire.....
I know that you've said this about Cray-style vectors.
I had thought the cause was much simpler. As soon as chips like the
486 DX and then the Pentium II became available, a Cray-style machine
would have had to be implemented from smaller-scale integrated
circuits, so it would have been wildly uneconomic for the performance
it provided; it made much more sense to use off-the-shelf
microprocessors. Despite their shortcomings theoretically in
architectural terms compared to a Cray-style machine, they offered
vastly more FLOPS for the dollar.
After all, the reason the Cray I succeeded where the STAR-100 failed
was that it had those big vector registers - so it did calculations on
a register-to-register basis, rather than on a memory-to-memory basis.
That doesn't make it immune to considerations of memory bandwidth, but
that does mean that it was designed correctly for the circumstance
where memory bandwidth is an issue. So if you have the kind of
calculation to perform that is suited to a vector machine, wouldn't it
still be better to use a vector machine than a whole bunch of scalar
cores with no provision for vectors?
And if memory bandwidth issues make Cray-style vector machines
impractical, then wouldn't it be even worse for GPUs?
There are ways to increase memory bandwidth. Use HBM. Use static RAM.
Use graphics DRAM. The vector CPU of the last gasp of the Cray-style architecture, the NEC SX-Aurora TSUBASA, is even packaged like a GPU.
Also, the original Cray I did useful work with a memory no larger than
many L3 caches these days. So a vector machine today wouldn't be as
fast as it would be if it could have, say, a 1024-bit wide data bus to
a terabyte of DRAM. That doesn't necessarily mean that such a CPU,
even when throttled by memory bandwidth, isn't an improvement over an ordinary CPU.
Of course, though, the question is, is it an improvement enough? If
most problems anyone would want to use a vector CPU for today do
involve a large amount of memory, used in a random fashion, so as to
fit poorly in cache, then it might well be that memory bandwidth would
mean that even with a vector architecture well suited to doing a lot
of work, the net result would be only a slight improvement over what
an ordinary CPU could do with the same memory bandwidth.
I would think that a chip is still useful if it can only provide an improvement for some problems, and that there are ways to increase
memory bandwidth from what ordinary CPUs offer, making it seem likely
that Cray-style vectors are worth doing as a way to improve what a CPU
can do.
John Savard
There is an instruction to calculate an approximate reciprocal (say, for
dividing two FP-SIMD vectors), after which a person can use Newton-Raphson
to either get a more accurate version, or use it directly (possibly
using N-R to fix up the result of the division).
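A hedged sketch of that refinement in C (rcp_approx() stands in for whatever approximate-reciprocal instruction the ISA exposes, say around 12 good bits; it is an assumed helper, not a real API):

    float rcp_approx(float x);              /* assumed hardware estimate of 1/x */

    float div_via_recip(float a, float b)
    {
        float r = rcp_approx(b);            /* initial estimate of 1/b */
        r = r * (2.0f - b * r);             /* one Newton-Raphson step roughly doubles the good bits */
        return a * r;                       /* a/b, to within the remaining estimate error */
    }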
After all, the reason the Cray I succeeded where the STAR-100 failed was
that it had those big vector registers ...
On Tue, 23 Apr 2024 02:14:32 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
CRAY machines stayed "in style" as long as memory latency remained smaller than the length of a vector (64 cycles) and fell out of favor when the cores got fast enough that memory could no longer keep up.
I wish them well, but I expect it will not work out as they desire.....
I know that you've said this about Cray-style vectors.
I had thought the cause was much simpler. As soon as chips like the
486 DX and then the Pentium II became available,
a Cray-style machine
would have had to be implemented from smaller-scale integrated
circuits, so it would have been wildly uneconomic for the performance
it provided;
And if memory bandwidth issues make Cray-style vector machines
impractical, then wouldn't it be even worse for GPUs?
If
most problems anyone would want to use a vector CPU for today do
involve a large amount of memory, used in a random fashion, so as to
fit poorly in cache
On Tue, 23 Apr 2024 19:25:22 -0600, John Savard wrote:
Looking at an old Cray-1 manual, it mentions, among other things, sixty-four
64-bit intermediate scalar “T” registers, and eight 64-element vector “V” registers of 64 bits per element. That’s a lot of registers.
RISC-V has nothing like this, as far as I can tell. Right at the top of
the spec I linked earlier, it says:
The vector extension adds 32 architectural vector registers,
v0-v31 to the base scalar RISC-V ISA.
Each vector register has a fixed VLEN bits of state.
So, no “big vector registers” that I can see? It says that VLEN must be a power of two no bigger than 2**16, which does sound like a lot, but then
the example they give only has VLEN = 128.
Everyone has to have hope on something.
But they've managed to get GPUs to multiply matrices - and they're quite
good at it, which is why we're having all this amazing progress in AI recently.
For that kind of stuff you better use GPUs, which have memory systems
with more bandwidth.
On Wed, 24 Apr 2024 06:16:58 GMT, Anton Ertl wrote:
For that kind of stuff you better use GPUs, which have memory systems
with more bandwidth.
But with more limited memory, which is typically not upgradeable.
On Wed, 24 Apr 2024 02:00:10 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:
Everyone has to have hope on something.
But false hopes are a waste of time.
The reason for my interest in long vectors is primarily because I
imagine that, if the Cray I was an improvement on the IBM System/360
Model 195, then, apparently, today a chip like the Cray I would be
the next logical step after the Pentium II (OoO plus cache, just like
a Model 195).
Well, apparently they do things like multiply 2048 by 2048 matrices.
Which is why they need stride.
What do vector machines do?
John Savard <quadibloc@servername.invalid> writes:
And if memory bandwidth issues make Cray-style vector machines
impractical, then wouldn't it be even worse for GPUs?
The claim by Mitch Alsup is that latency makes the Crays impractical,
because of chaining issues. Do GPUs have chaining? My understanding
is that GPUs deal with latency in the barrel processor way: use
another data-parallel thread while waiting for memory. Tera also
pursued this idea, but the GPUs succeeded with it.
- anton
Lawrence D'Oliveiro wrote:
On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:
DIV:
Didn't bother with this.
Typically faked using multiply-by-reciprocal and taking the high result.
Another Cray-ism! ;)
Not IEEE 754 legal.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Wed, 24 Apr 2024 06:16:58 GMT, Anton Ertl wrote:
For that kind of stuff you better use GPUs, which have memory systems
with more bandwidth.
But with more limited memory, which is typically not upgradeable.
And yet, supercomputers these days often have lots of GPUs.
But the Cray-1 is not an improvement on the Model 195.
MitchAlsup1 wrote:
Lawrence D'Oliveiro wrote:
On Tue, 23 Apr 2024 18:36:49 -0500, BGB wrote:
DIV:
Didn't bother with this.
Typically faked using multiply-by-reciprocal and taking the high result.
Another Cray-ism! ;)
Not IEEE 754 legal.
Well, it _is_ legal if you carry enough bits in your reciprocal...
but at
that point you would instead use a better algorithm to get the correct
result both faster and using less power.
Terje
On Wed, 24 Apr 2024 09:28:06 GMT, Anton Ertl wrote:
And yet, supercomputers these days often have lots of GPUs.
Some do, some don’t. I’m not sure that GPUs are accepted as de rigueur in supercomputer design yet. I think this is just another instance of Ivan Sutherland’s wheel of reincarnation <http://www.cap-lore.com/Hardware/Wheel.html>.
One of the things that those supercomputers that _do_ include GPUs are praised for is being energy-efficient.
On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:
But the Cray-1 is not an improvement on the Model 195.
The Cray-1 was widely regarded as the fastest computer in the world, when
it came out. Cruising speed of something over 80 megaflops, hitting bursts
of about 120.
IBM did try to compete in the “supercomputer” field for a while longer, but I think by about ten years later, it had given up.
On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:
One of the things that those supercomputers that _do_ include GPUs
are praised for is being energy-efficient.
That I never heard before. I heard it in relation to ARM CPUs, yes,
GPUs, no.
On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:
One of the things that those supercomputers that _do_ include GPUs are
praised for is being energy-efficient.
That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs,
no.
On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:
But the Cray-1 is not an improvement on the Model 195.
The Cray-1 was widely regarded as the fastest computer in the world,
when it came out. Cruising speed of something over 80 megaflops,
hitting bursts of about 120.
IBM did try to compete in the “supercomputer” field for a while
longer, but I think by about ten years later, it had given up.
On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:
One of the things that those supercomputers that _do_ include GPUs are
praised for is being energy-efficient.
That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs,
no.
According to Lawrence D'Oliveiro <ldo@nz.invalid>:
On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:
But the Cray-1 is not an improvement on the Model 195.
The Cray-1 was widely regarded as the fastest computer in the world, when it came out. Cruising speed of something over 80 megaflops, hitting bursts of about 120.
Its main practical improvement was that you could get two Crays for the price of one 360/195. (Not exactly, but close enough.)
IBM did try to compete in the “supercomputer” field for a while longer, but I think by about ten years later, it had given up.
IBM had tried to make computers very fast by making them very
complicated. STRETCH was fantastically complex for something built out
of individual transistors. The /91 and /195 had instruction queues and reservation stations and loop mode. Cray went in the opposite
direction, making a much simpler computer where every individual bit,
down to the chips and the wires, was as fast as possible.
In many ways it was a preview of RISC.
Seymour only did fast and simple, starting before the CDC 6600.....
On Thu, 25 Apr 2024 15:52:36 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
Seymour only did fast and simple, starting before the CDC 6600.....
Do you attribute not exactly simple 6600 Scoreboard to Thornton?
On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:
One of the things that those supercomputers that _do_ include GPUs are
praised for is being energy-efficient.
That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs, no.
Here's one example of an item about this:
https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html
According to Lawrence D'Oliveiro <ldo@nz.invalid>:
On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:
One of the things that those supercomputers that _do_ include GPUs are
praised for is being energy-efficient.
That I never heard before. I heard it in relation to ARM CPUs, yes, GPUs, no.
NVIDIA says their new Blackwell GPU takes 2000 watts, and is between
7x and 25x more power efficient than the current H100, but that's
still a heck of a lot of power. Data centers have had to come up with
higher capacity power and cooling when each rack can use 40 to 50KW.
I mean, my entire house is wired for 24KW and usually runs at more
like 4KW including a heat pump that heats the house.
Michael S wrote:
On Thu, 25 Apr 2024 15:52:36 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
Seymour only did fast and simple, starting before the CDC
6600.....
Do you attribute not exactly simple 6600 Scoreboard to Thornton?
If you measure simplicity by gate count--the scoreboard was
considerably simpler than the reservation station design of Tomasulo.
https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html
Compared to the late 1950s, was the total energy consumption by
computers higher or lower than today? :-)
On Thu, 25 Apr 2024 15:52:36 +0000, MitchAlsup1 wrote:
[Seymour] only did fast and simple, starting before the CDC 6600.....
And he didn’t seem to have much truck with “memory management” and “operating systems”, did he? He probably saw them as just getting in the way of sheer speed.
And he didn’t care for some of the niceties of floating-point arithmetic either, for the same reason.
[Seymour] only did fast and simple, starting before the CDC 6600.....
On the other hand, NOS did things no other OS did.....
Back when Fugaku was new, it was highly praised for being a GPU-less
design that matched and slightly exceeded the efficiency of GPU-based (and other vector-accelerator-based) supercomputers. But that was possible
only because Nvidia had an unusually long pause between successive
generations of Tesla, and at the same moment AMD and Intel GPGPUs were
not yet considered fit for serious supercomputing.
That was in November 2019. Never before or since.
On Thu, 25 Apr 2024 22:29:30 +0000, MitchAlsup1 wrote:
On the other hands, NOS did things no other OS did.....
Like what? I thought the original Cray OS was just a batch OS.
Then they added this Unix-like “UNICOS” thing, but that seemed to me like an interactive front-end to the batch OS.
On Thu, 25 Apr 2024 14:57:53 +0300, Michael S wrote:
Back when Fugaku was new, it was highly praised for being a GPU-less
design that matched and slightly exceeded the efficiency of GPU-based
(and other vector-accelerator-based) supercomputers. But that was
possible only because Nvidia had an unusually long pause between
successive generations of Tesla, and at the same moment AMD and
Intel GPGPUs were not yet considered fit for serious supercomputing.
That was in November 2019. Never before or since.
Fugaku is still at number 4 on the Top500, though--even after all
these years. And don’t forget the Chinese systems, using their
home-grown CPUs without access to Nvidia GPUs. There’s one at number
11.
Should we be looking at the Green500 list instead?
On Wed, 24 Apr 2024 22:29:36 -0000 (UTC)
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
On Wed, 24 Apr 2024 09:18:56 GMT, Anton Ertl wrote:
But the Cray-1 is not an improvement on the Model 195.
The Cray-1 was widely regarded as the fastest computer in the world,
when it came out. Cruising speed of something over 80 megaflops,
hitting bursts of about 120.
IBM did try to compete in the “supercomputer” field for a while longer, but I think by about ten years later, it had given up.
BTW, in the latest Top500 list you can see IBM at the #7 spot.
Let us face facts:: in the large, vector machines are DMA devices
that happen to mangle the data on the way through.
John Savard wrote:
And if memory bandwidth issues make Cray-style vector machines
impractical, then wouldn't it be even worse for GPUs?
a) It is not pure BW but BW at a latency less than K. CRAY-1 was
about 16-cycles (DRAM)
Say, seemingly no one built an 8/16 bit mainframe,
or say using 24-bit
floats (Say: S.E7.F16) rather than bigger formats, ...
Like, seemingly, the smallest point of computers was seemingly things
like the 6502 and similar...
According to Thomas Koenig <tkoenig@netcologne.de>:
https://www.infoworld.com/article/2627720/gpus-boost-energy-efficiency-in-supercomputers.html
Compared to the late 1950s, was the total energy consumption by
computers higher or lower than today? :-)
Well, compared to what?
In 1960 the total power generated in the US was about 750 TWh. In
recent years it's over 4000 TWh.
MitchAlsup1 <mitchalsup@aol.com> wrote:
Let us face facts:: in the large, vector machines are DMA devices
that happen to mangle the data on the way through.
John Savard wrote:
And if memory bandwidth issues make Cray-style vector machines
impractical, then wouldn't it be even worse for GPUs?
a) It is not pure BW but BW at a latency less than K. CRAY-1 was
about 16-cycles (DRAM)
DRAM for CRAY-1 doesn't sound right. Intel made 1024-bit DRAM in 1970,
but it was pretty flaky and not very fast. I think the CRAY-1 used
Fairchild 10K ECL 10ns SRAM.
Andrew,
Like, seemingly, the smallest point of computers was seemingly things
like the 6502 and similar...
That was probably the PDP 8/S, which had (if Wikipedia is to be
believed) around 519 logic gates. The 6502 had more.
According to Thomas Koenig <tkoenig@netcologne.de>:
That was probably the PDP 8/S, which had (if Wikipedia is to be
believed) around 519 logic gates. The 6502 had more.
I can believe it.
If people make the claim that GPUs are more power-efficient than CPUs,
yes, they are for equal performance (if they can be programmed
efficiently enough for the application at hand). In practice, this will
not be used for energy savings, but for doing more calculations.
Same thing happened with steam engines - Watt's engines were a huge improvement in fuel efficiency over the previous Newcomen models, which
led to many more steam engines being built.
That was probably the PDP 8/S, which had (if Wikipedia is to be
believed) around 519 logic gates. The 6502 had more.
I can believe it.
You can probably find detailed schematics, on Bitsavers or elsewhere, to confirm it. DEC published that sort of thing as a matter of course, back
in those days.
BGB <cr88192@gmail.com> schrieb:
Say, seemingly no one built an 8/16 bit mainframe,
The IBM 360/30 and 360/40 actually had an 8-bit and a 16-bit
microarchitecture, respectively. Of course, they hid it cleverly
behind the user-visible architecture which was 32 bits.
But then, the Nova was a 4-bit system cleverly disguising itself
as a 16-bit system, and the Z80 had a 4-bit ALU, as well.
or say using 24-bit
floats (Say: S.E7.F16) rather than bigger formats, ...
Konrad Zuse used 22-bit floats.
Like, seemingly, the smallest point of computers was seemingly things
like the 6502 and similar...
That was probably the PDP 8/S, which had (if Wikipedia is to be
believed) around 519 logic gates. The 6502 had more.
John Savard <quadibloc@servername.invalid> schrieb:
On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:
One of the things that those supercomputers that _do_ include
GPUs are praised for is being energy-efficient.
That I never heard before. I heard it in relation to ARM CPUs,
yes, GPUs, no.
Here's one example of an item about this:
https://www.infoworld.com/article/2627720/
gpus-boost-energy-efficiency-in-supercomputers.html
Compared to the late 1950s, was the total energy consumption by
computers higher or lower than today? :-)
John Levine <johnl@taugh.com> schrieb:
I mean, my entire house is wired for 24KW and usually runs at more
like 4KW including a heat pump that heats the house.
Good thing you're not living in Germany, your electricity bill
would be enormous...
On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:
John Levine <johnl@taugh.com> schrieb:
I mean, my entire house is wired for 24KW and usually runs at more
like 4KW including a heat pump that heats the house.
Good thing you're not living in Germany, your electricity bill
would be enormous...
Possibly John meant to say "4Kwh", which actually would be a bit on
the high side for the *average* home in the US.
If he really meant 4Kw continuous ... wow!
On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:
John Levine <johnl@taugh.com> schrieb:
I mean, my entire house is wired for 24KW and usually runs at more
like 4KW including a heat pump that heats the house.
Good thing you're not living in Germany, your electricity bill
would be enormous...
Possibly John meant to say "4Kwh", which actually would be a bit on
the high side for the *average* home in the US.
George Neuner wrote:
On Thu, 25 Apr 2024 17:49:11 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:
John Levine <johnl@taugh.com> schrieb:
I mean, my entire house is wired for 24KW and usually runs at more
like 4KW including a heat pump that heats the house.
Good thing you're not living in Germany, your electricity bill
would be enormous...
Possibly John meant to say "4Kwh", which actually would be a bit on
the high side for the *average* home in the US.
If he really meant 4Kw continuous ... wow!
Here in Norway we abuse our hydro power as our primary house heating
source, in our previous home we used about 60K KWh per year, which
corresponds to 60K/(24*365.24) = 6.84 KW average, day & night.
This was in fact while having a heat pump to handle the main part of the
heating needs.
The new house, which is from the same era (1962 vs 1963), uses
significantly less, but probably still 30-40K KWh/year.
Electric power used to cost just under 1 NOK (about 9 cents at current exchange rates), including both primary power cost and transmission
cost, but then we started exporting too much to Denmark/Sweden/Germany,
which means that we also imported their sometimes much higher power prices.
Terje
Adding the typical kind of vector-processing instructions to an
instruction set inevitably leads to a combinatorial explosion in the
number of opcodes.
This kind of thing makes a mockery of the R in RISC.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Adding the typical kind of vector-processing instructions to an
instruction set inevitably leads to a combinatorial explosion in the
number of opcodes.
Why is that a problem that needs solving?
This kind of thing makes a mockery of the R in RISC.
So what?
Thomas Koenig <tkoenig@netcologne.de> writes:
John Savard <quadibloc@servername.invalid> schrieb:
On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:
One of the things that those supercomputers that _do_ include
GPUs are praised for is being energy-efficient.
That I never heard before. I heard it in relation to ARM CPUs,
yes, GPUs, no.
Here's one example of an item about this:
https://www.infoworld.com/article/2627720/
gpus-boost-energy-efficiency-in-supercomputers.html
Compared to the late 1950s, was the total energy consumption by
computers higher or lower than today? :-)
Total energy consumption by computers in the 1950s was lower
than today by at least a factor of 10.
It wouldn't surprise
me to discover the energy consumption of just the servers in
Amazon Web Services datacenters exceeds the 1950s total, and
that's only AWS (reportedly more than 1.4 million servers).
It wouldn't surprise
me to discover the energy consumption of just the servers in
Amazon Web Services datacenters exceeds the 1950s total, and
that's only AWS (reportedly more than 1.4 million servers).
https://smithsonianeducation.org/scitech/carbons/1960.html states
that, in 1954, there were 15 computers in the US. That seems low
(did they only count IBM 701 machines?), but it reportedly went up to
17000 in 1964.
Even if you put the number of computers at 100 for the mid-1950s, at
100 kW each, you only get 10 MW of power when they ran (which they often didn't; due to maintenance, these early computers seem to have been
day shift only).
The 650s at least ran all night. Alan Perlis told me some amusing stories
of tripping in the dark over sleeping grad student wives who were
holding their husbands' place in line for the 650 in the middle of the
night. They soon made the scheduling more humane.
Scott Lurndal wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Adding the typical kind of vector-processing instructions to an instruction set inevitably leads to a combinatorial explosion in the number of opcodes.
Why is that a problem that needs solving?
When your OpCode encoding space runs out of bits in the instruction.
This kind of thing makes a mockery of the R in RISC.
So what?
Design + verification cost, time to market, Size of test vector set,
and Compiler complexity.
mitchalsup@aol.com (MitchAlsup1) writes:
Scott Lurndal wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Adding the typical kind of vector-processing instructions to an instruction set inevitably leads to a combinatorial explosion in the number of opcodes.
Why is that a problem that needs solving?
When your OpCode encoding space runs out of bits in the instruction.
And has that been a real problem yet? Pretty much every
instruction set can be easily extended (viz. 8086),
particularly with variable-length encodings; nothing prevents
one from adding a special 32-bit encoding that extends the
instruction to 64 bits even in a fixed-size encoding scheme.
This kind of thing makes a mockery of the R in RISC.
So what?
Design + verification cost, time to market, Size of test vector set,
and Compiler complexity.
As contrasted with usability. ARM doesn't add features just
for the sake of adding features, nor does Intel.
ARM doesn't add features just for the sake of adding features, nor does Intel.
On Tue, 30 Apr 2024 20:31:17 GMT, Scott Lurndal wrote:
ARM doesn't add features just for the sake of adding features, nor does
Intel.
There is such a thing as painting yourself into a corner, where every new feature added to the SIMD instruction set involves adding combinations of instructions, not just for the new types, but also for every single old
type as well.
Then contemplate for an instant that one would want SIMD instructions
for Complex numbers and Hamiltonian Quater[n]ions......
On Tue, 30 Apr 2024 19:38:54 -0000 (UTC), John Levine wrote:
The 650s at least ran all night. Alan Perlis told me some amusing stories
of tripping in the dark over sleeping grad student wives who were
holding their husbands' place in line for the 650 in the middle of the
night. They soon made the scheduling more humane.
How did they do that, though? Other than by hiring operators to work those shifts, so the users could submit their jobs in a queue and go home?
It appears that Lawrence D'Oliveiro <ldo@nz.invalid> said:
On Tue, 30 Apr 2024 19:38:54 -0000 (UTC), John Levine wrote:
The 650s at least ran all night. Alan Perlis told me some amusing stories of tripping in the dark over sleeping grad student wives who were
holding their husbands' place in line for the 650 in the middle of the
night. They soon made the scheduling more humane.
How did they do that, though? Other than by hiring operators to work those shifts, so the users could submit their jobs in a queue and go home?
Rather than just queueing up, they arranged it so the student could
sign up ahead of time, and then show up whenever to do his work, and
the wives could get some sleep.
I also think he tried to round up some money to get another computer.
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
John Savard <quadibloc@servername.invalid> schrieb:
On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:
One of the things that those supercomputers that _do_ include
GPUs are praised for is being energy-efficient.
That I never heard before. I heard it in relation to ARM CPUs,
yes, GPUs, no.
Here's one example of an item about this:
https://www.infoworld.com/article/2627720/
gpus-boost-energy-efficiency-in-supercomputers.html
Compared to the late 1950s, was the total energy consumption by
computers higher or lower than today? :-)
Total energy consumption by computers in the 1950s was lower
than today by at least a factor of 10.
Undoubtedly true, but I think you're missing quite a few
orders of magnitude there.
It wouldn't surprise
me to discover the energy consumption of just the servers in
Amazon Web Services datacenters exceeds the 1950s total, and
that's only AWS (reportedly more than 1.4 million servers).
https://smithsonianeducation.org/scitech/carbons/1960.html states
that, in 1954, there were 15 computers in the US. That seems low
(did they only count IBM 701 machines?), but it reportedly went up to
17000 in 1964.
Even if you put the number of computers at 100 for the mid-1950s, at
100 kW each, you only get 10 MW of power when they ran (which they often didn't; due to maintenance, these early computers seem to have been
day shift only).
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
John Savard <quadibloc@servername.invalid> schrieb:
On Thu, 25 Apr 2024 05:39:55 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:
On Wed, 24 Apr 2024 23:10:47 -0600, John Savard wrote:
One of the things that those supercomputers that _do_ include
GPUs are praised for is being energy-efficient.
That I never heard before. I heard it in relation to ARM CPUs,
yes, GPUs, no.
Here's one example of an item about this:
https://www.infoworld.com/article/2627720/
gpus-boost-energy-efficiency-in-supercomputers.html
Compared to the late 1950s, was the total energy consumption by
computers higher or lower than today? :-)
Total energy consumption by computers in the 1950s was lower
than today by at least a factor of 10.
Undoubtedly true, but I think you're missing quite a few
orders of magnitude there.
Probably not as many as you think. :)
It wouldn't surprise
me to discover the energy consumption of just the servers in
Amazon Web Services datacenters exceeds the 1950s total, and
that's only AWS (reportedly more than 1.4 million servers).
https://smithsonianeducation.org/scitech/carbons/1960.html states
that, in 1954, there were 15 computers in the US. That seems low
(did they only count IBM 701 machines?), but it reportedly went up to
17000 in 1964.
Even if you put the number of computers at 100 for the mid-1950s, at
100 kW each, you only get 10 MW of power when they ran (which they often
didn't; due to maintenance, these early computers seem to have been
day shift only).
Oh boy, numbers.
First your question asked about the late 1950s, not the mid 1950s.
I estimated between 10,000 and 20,000 computers by the end of
the 1950s, and chose 5 KW as an average consumption. In those
days computers were big. Probably the estimate for number of
machines is a bit on the high side, and the average consumption
is a bit on the low side. I'm only estimating.
Then contemplate for an instant that one would want SIMD instructions for Complex numbers and Hamiltonian Quaternions......
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Then contemplate for an instant that one would want SIMD instructions for
Complex numbers and Hamiltonian Quaternions......
Quaternions would be a bit over the top, I think.  Complex
multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is
fmul Rt1,Rc,Rb
fmac Rf,Rd,Ra,Rt1
fmul Rt2,Rd,Rb
fmac Re,Rc,Ra,-Rt2
So, you'd need both operands on both lanes. Not very SIMD-friendly,
I would assume, but (probably) not impossible, either.
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Then contemplate for an instant that one would want SIMD instructions for
Complex numbers and Hamiltonian Quaternions......
Quaternions would be a bit over the top, I think.  Complex
multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is
fmul Rt1,Rc,Rb
fmac Rf,Rd,Ra,Rt1
fmul Rt2,Rd,Rb
fmac Re,Rc,Ra,-Rt2
So, you'd need both operands on both lanes.  Not very SIMD-friendly,
I would assume, but (probably) not impossible, either.
If you have the four operands spread across two SIMD registers, so
(Re,Im) in each, then you need an initial pair of permutes to make
flipped copies before you can start the fmul/fmac ops, right?
This is exactly the kind of code where Mitch's transparent vector
processing would be very nice to have.
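For what it's worth, a hedged sketch of those permutes with SSE3 intrinsics, multiplying two complex floats packed as (re, im, re, im) per register; the duplicated-real copy, the duplicated-imaginary copy and the swapped copy of b are exactly the extra shuffles being discussed (this is just an x86 illustration, not anyone's proposed ISA):

    #include <pmmintrin.h>   /* SSE3: _mm_addsub_ps, _mm_moveldup_ps, _mm_movehdup_ps */

    static __m128 cmul2(__m128 a, __m128 b)
    {
        __m128 ar  = _mm_moveldup_ps(a);           /* (a.re, a.re, ...)           */
        __m128 ai  = _mm_movehdup_ps(a);           /* (a.im, a.im, ...)           */
        __m128 bsw = _mm_shuffle_ps(b, b, 0xB1);   /* (b.im, b.re, ...), the flip */
        return _mm_addsub_ps(_mm_mul_ps(ar, b),    /* even lanes: a.re*b.re - a.im*b.im */
                             _mm_mul_ps(ai, bsw)); /* odd lanes:  a.re*b.im + a.im*b.re */
    }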
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Then contemplate for an instant that one would want SIMD instructions for
Complex numbers and Hamiltonian Quaternions......
Quaternions would be a bit over the top, I think.  Complex
multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is
fmul Rt1,Rc,Rb
fmac Rf,Rd,Ra,Rt1
fmul Rt2,Rd,Rb
fmac Re,Rc,Ra,-Rt2
So, you'd need both operands on both lanes.  Not very SIMD-friendly,
I would assume, but (probably) not impossible, either.
If you have the four operands spread across two SIMD registers, so
(Re,Im) in each, then you need an initial pair of permutes to make
flipped copies before you can start the fmul/fmac ops, right?
This is exactly the kind of code where Mitch's transparent vector
processing would be very nice to have.
I'm actually not sure how that would help. Could you elaborate?
On 4/30/2024 8:22 PM, John Levine wrote:
Sometimes seems odd in a way that people can manage to find wives, with
as many difficulties and prerequisites there seem to be in being seen as
"worthy of attention", etc...
Then again, it seems that there is a split:
Many people seem to marry off between their early to mid 20s;
Like, somehow, they find someone where there is mutual interest.
Others, not so quickly, if at all.
On the female side, it seems there are several subgroups:
Those who are waiting for "the perfect romance".
Those who want someone with at least a "6 figure income", etc.
Then there are the asexual females.
And also lesbians.
If you don't know what you are looking for, how do you know when you
find it ?!!!
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Then contemplate for an instant that one would want SIMD instructions for
Complex numbers and Hamiltonian Quaternions......
Quaternions would be a bit over the top, I think.  Complex
multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is
fmul Rt1,Rc,Rb
fmac Rf,Rd,Ra,Rt1
fmul Rt2,Rd,Rb
fmac Re,Rc,Ra,-Rt2
So, you'd need both operands on both lanes.  Not very SIMD-friendly,
I would assume, but (probably) not impossible, either.
If you have the four operands spread across two SIMD registers, so
(Re,Im) in each, then you need an initial pair of permutes to make
flipped copies before you can start the fmul/fmac ops, right?
This is exactly the kind of code where Mitch's transparent vector
processing would be very nice to have.
I'm actually not sure how that would help. Could you elaborate?
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Then contemplate for an instant that one would want SIMD instructions for
Complex numbers and Hamiltonian Quaternions......
Quaternions would be a bit over the top, I think.  Complex
multiplication... implementing (e,f) = (a*c-b*d,a*d+b*c) is
fmul Rt1,Rc,Rb
fmac Rf,Rd,Ra,Rt1
fmul Rt2,Rd,Rb
fmac Re,Rc,Ra,-Rt2
So, you'd need both operands on both lanes.  Not very SIMD-friendly,
I would assume, but (probably) not impossible, either.
If you have the four operands spread across two SIMD registers, so
(Re,Im) in each, then you need an initial pair of permutes to make
flipped copies before you can start the fmul/fmac ops, right?
This is exactly the kind of code where Mitch's transparent vector
processing would be very nice to have.
I'm actually not sure how that would help. Could you elaborate?
Just that all his code is scalar, but when you have a bunch of these
complex mul/mac operations in a loop, his hw will figure out the
recurrences and run them as fast as possible, with all the (Re,Im) SIMD
flips becoming NOPs.
On 5/2/2024 10:46 PM, Lawrence D'Oliveiro wrote:
On Thu, 2 May 2024 20:14:09 +0000, MitchAlsup1 wrote:
If you don't know what you are looking for, how do you know when you
find it ?!!!
Maybe the procedure for determining that you’ve found it is recursively
enumerable, but that for doing the search is not? ;)
I think it is a case of determining if someone responds in favorable
ways to interactions, does not respond in unfavorable ways, and does not
present any obvious "deal breakers".
Presumably other people are doing something similar, but with different metrics.
Though, granted, the whole process tends to be horribly inefficient.
There generally doesn't exist any good way to determine who exists in a
given area, or to get a general idea for who may or may not be worth the
time/effort of interacting with them.
Many dating sites (and people on them) seem to operate under the
assumption of "will post pictures, good enough".
BGB wrote:
If you are anything like a normal male, you are compatible with about
1% of women of your age group. Likewise, 1% of any given women in your
age group will be compatible with you ±.
So, you (and her) will have to pass over 10,000 of the others to end
up with a compatible partner.
On 5/4/2024 4:19 PM, MitchAlsup1 wrote:
BGB wrote:
On 5/2/2024 10:46 PM, Lawrence D'Oliveiro wrote:
On Thu, 2 May 2024 20:14:09 +0000, MitchAlsup1 wrote:
If you don't know what you are looking for, how do you know when you
find it ?!!!
Maybe the procedure for determining that you’ve found it is recursively
enumerable, but that for doing the search is not? ;)
I think it is a case of determining if someone responds in favorable
ways to interactions, does not respond in unfavorable ways, and does not
present any obvious "deal breakers".
Presumably other people are doing something similar, but with
different metrics.
Different definitions !!
Not sure what you mean by this, exactly.
Though, granted, the whole process tends to be horribly inefficient.
If you are anything like a normal male, you are compatible with about
1% of women of your age group. Likewise, 1% of any given women in your
age group will be compatible with you ±.
So, you (and her) will have to pass over 10,000 of the others to end
up with a compatible partner.
At this point, looks almost like it could be closer to 0.
Or, at least, most of the ones I might be interested in talking to,
aren't in the same geographic area.
There generally doesn't exist any good way to determine who exists in
a given area, or to get a general idea for who may or may not be worth
the time/effort of interacting with them.
Women, by and large, do the picking:: men, by and large, do the
acquiescing.
Not much point in trying to interact with them though if there is no
reason to think it might be worth the effort of doing so.
Many dating sites (and people on them) seem to operate under the
assumption of "will post pictures, good enough".
Dating sites are for losers. P E R I O D
Somehow, the actual sites still manage to be more dignified than the
Facebook groups or phone apps, which lean much more heavily into the pointless aspects...
MitchAlsup1 <mitchalsup@aol.com> schrieb:
BGB wrote:
If you are anything like a normal male, you are compatible with about
1% of women of your age group. Likewise, 1% of any given women in your
age group will be compatible with you ±.
So, you (and her) will have to pass over 10,000 of the others to end
up with a compatible partner.
Even assuming that the numbers are true (far too low, IMHO), the
calculation assumes that both quantities are uncorrelated.
If it were really true, humans would long since have died out
(unless "compatible" means something else :-)