I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.
But try as I might, I can't see a useful application for such
a chip. The restricted access to memory would basically hobble
it for anything but a narrow class of embarrassingly parallel
applications. The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.
John Savard
On 2/5/2024 12:48 AM, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.
One doesn't need to disallow access to external RAM, but maybe:
Memory coherence is fairly weak for these cores;
The local RAM addresses are treated as "strongly preferable".
Or, say, there is a region on RAM that is divided among the cores, where
the core has fast access to its own local chunk, but slow access to any
of the other chunks (which are treated more like external RAM).
Here, threads would be assigned to particular cores, and the scheduler
may not move a thread from one core to another if it is assigned to a
given core.
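To make that concrete, here is a rough C sketch of the ownership rule; the region base, chunk size, and core count are purely illustrative and not taken from any real design:

#include <stdint.h>

#define NUM_CORES   8
#define CHUNK_SIZE  (2u * 1024u * 1024u)   /* 2 MiB of "local" RAM per core, illustrative */
#define REGION_BASE 0x80000000u            /* start of the divided region, illustrative   */

/* Which core "owns" (has fast access to) an address inside the region. */
static inline int owner_core(uint32_t addr)
{
    return (int)((addr - REGION_BASE) / CHUNK_SIZE);
}

/* A scheduler following the scheme above would keep a thread on the core
   that owns the thread's working data, rather than migrating it freely. */
static inline int may_run_on(int core, uint32_t thread_data_base)
{
    return owner_core(thread_data_base) == core;
}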
As for SIMD vs vectors, as I see it, SIMD seems to make sense in that it
is cheap and simple.
The Cell cores were, if anything, more of a "SIMD First, ALU Second" approach,
built around 128-bit registers but only using part of these for integer code.
I went a slightly different direction, using 64-bit registers that may
be used in pairs for 128-bit ops. This may make more sense if one
assumes that the core is going to be used for a lot more general purpose code, rather than used almost entirely for SIMD.
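As a rough illustration of what pairing looks like at the operation level, a 128-bit integer add decomposes into two 64-bit adds plus a carry; the struct and function names here are invented for the example:

#include <stdint.h>

/* A 128-bit value held as a pair of 64-bit "registers" (illustrative). */
typedef struct { uint64_t lo, hi; } u128_pair;

/* 128-bit add built from two 64-bit adds plus carry propagation, which is
   roughly what using paired 64-bit registers for 128-bit ops amounts to. */
static inline u128_pair add128(u128_pair a, u128_pair b)
{
    u128_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low half */
    return r;
}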
I have some hesitation about "vector processing", as it seems fairly
alien to how this stuff normally sort of works; seems more complicated
than SIMD for an implementation; ...
It is arguably more scalable, but as I see it, much past 64 or 128 bit vectors, SIMD rapidly goes into diminishing returns, and it makes more
sense to be like "128-bit is good enough" than to try to chase after
ever wider SIMD vectors.
But, I can also note that even for semi-general use, an ISA design like
RV64G suffers a significant disadvantage, say, vs my own ISA, in the
Quadibloc <quadibloc@servername.invalid> writes:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
To some extent, it is: Zen4 performs 512-bit SIMD by feeding its
512-bit registers to the 256-bit units in two successive cycles.
Earlier Zen used 2 physical 128-bit registers as one logical 256-bit
register and AFAIK it split 256-bit operations into two 128-bit
operations that could be scheduled arbitrarily by the OoO engine
(while Zen4 treats the 512-bit operation as a unit that consumes two
cycles of a pipelined 256-bit unit). Similar things have been done by
Intel and AMD in other CPUs, implementing 256-bit operations with
128-bit units (Gracemont, Bulldozer-Excavator, Jaguar and Puma), or implementing 128-bit operations with 64-bit units (e.g., on the K8).
Why are they not using longer vectors with the same FUs or narrower
FUs? For Gracemont, that's really the question; they even disabled
AVX-512 on Alder Lake and Raptor Lake completely (even on Xeon CPUs
with disabled Gracemont) because Gracemont does not do AVX-512.
Supposedly the reason is that Gracemont does not have enough physical
128-bit registers for AVX-512 (128 such registers would be needed to implement the 32 logical ZMM registers, and probably some more to
avoid deadlocks and maybe for some microcoded operations; <https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/> reports 191+16 XMM registers and 95+16 YMM registers, which makes me
doubt that explanation).
Anyway, the size of the register files is one reason for avoiding
longer vectors.
Also, the question is how much it buys. For Zen4, I remember seeing
results that coding the same stuff as using two 256-bit instructions
rather than one 512-bit instruction increased power consumption a
little, resulting in the CPU (running at the power limit) lowering the
clock rate of the cores from IIRC 3700MHz to 3600MHz; not a very big
benefit. How much would the benefit be from longer vectors? Probably
not more than another 100MHz: going from 256-bit instructions to 512-bit instructions already halves the number of instructions to process in
the front end; eliminating the other half would require infinitely
long vectors.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues.
My memory is that he mentioned memory latency. He did not
explain why he thinks so, but caches and prefetchers seem to be doing
ok for bridging the latency from DRAM to L2 or L1.
As for main memory bandwidth, that is certainly a problem for
applications that have frequent cache misses (many, but not all HPC applications are among them). And once you are limited by main memory bandwidth, the ISA makes little difference.
But for those applications where caches work (e.g., dense matrix multiplication in the HPC realm), I don't see a reason why a
long-vector architecture would be unworkable. It's just that, as
discussed above, the benefits are small.
The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
Caches work well for most applications. So mainstream CPUs are
designed with a certain amount of cache and enough main-memory
bandwidth to satisfy most applications. For the niche that needs more main-memory bandwidth, there are GPGPUs which have high bandwidth
because their original application needs it (and AFAIK GPGPUs have
long vectors). For the remaining niche, having a CPU with several
stacks of HBM memory attached (like the NEC vector CPUs) is a good
idea; and given that there is legacy software for NEC vector CPUs,
providing that ISA also covers that need.
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth; and why would you go for explicitly managed
local memory (which deservedly vanished from the market, see below)
rather than the well-working setup of cache and prefetchers? BTW,
Raptor Cove gives you 2MB of private L2.
The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.
The Playstation 3 has a separate GPU based on the Nvidia G70 <https://en.wikipedia.org/wiki/PlayStation_3_technical_specifications#Graphics_processing_unit>.
What I heard/read about the Cell CPU is that the SPEs were too hard to
make good use of and that consequently they were not used much.
- anton
On Mon, 05 Feb 2024 07:44:24 +0000, Anton Ertl wrote:
Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth;
Well, the original Cray I had a main memory of eight megabytes, and the
Cray Y-MP had up to 512 megabytes of memory.
I was keeping as close to the original CELL design as possible, but
certainly one could try to improve. After all, if Intel could make
a device like the Xeon Phi, having multiple CPUs on a chip all sharing
access to external memory, however inadequate, could still be done (but
then I wouldn't be addressing Mitch Alsup's objection).
Instead of imitating the CELL, or the Xeon Phi, for that matter, what
I think of as a more practical way to make a consumer Cray-like chip
would be to put only one core in a package, and give that core an eight-channel memory bus.
Some older NEC designs used a sixteen-channel memory bus, but I felt
that eight channels will already be expensive for a consumer product.
Given Mitch Alsup's objection, though, I threw out the opposite kind
of design, one patterned after the CELL, as one that maybe could allow
a vector CPU to churn out more FLOPs. But as I noted, it seems to have
the fatal flaw of very little capacity for any kind of useful work...
which is kind of the whole point of any CPU.
John Savard
On Mon, 5 Feb 2024 06:48:59 -0000 (UTC), Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and similar
machines, because it seems to me the one way of increasing computer
performance that proved effective in the past that still isn't being
applied to microprocessors today.
Mitch Alsup, however, has noted that such an architecture is unworkable
today due to memory bandwidth issues.
RISC-V has a long-vector feature very consciously modelled on the Cray
one. It eschews the short-vector SIMD fashion that has infested so many architectures these days precisely because the resulting combinatorial explosion in added instructions makes a mockery of the “R” in “RISC”.
On 2024-02-05, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of
64-bit).
My machine can start and finish at most one 32-bit operation on every
clock cycle, so it is very simple. The same thing goes for vector
operations: at most one 32-bit vector element per clock cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards, reduced
I$ pressure and possibly improved cache locality and reduced register pressure.
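For readers who want something concrete, the transformation amounts to strip-mining a scalar loop; here is a rough C sketch in which the inner loop stands for one vector instruction over up to VL elements. The saxpy kernel and the VL value are just for illustration, not MRISC32 specifics:

#define VL 16   /* elements per vector register, illustrative */

/* Scalar version: per-element loop overhead on every iteration. */
void saxpy_scalar(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Strip-mined version: each outer iteration corresponds to one vector
   load / multiply-add / store sequence over up to VL elements, using only
   a couple of vector registers. */
void saxpy_vectorized(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i += VL) {
        int len = (n - i < VL) ? n - i : VL;   /* vector length for the tail */
        for (int j = 0; j < len; j++)          /* "one vector instruction"   */
            y[i + j] = a * x[i + j] + y[i + j];
    }
}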
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities in
the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited number
of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
I also envision that for most cases
you never have to preserve vector registers over function calls. I.e.
there's really no need to push/pop vector registers to the stack,
except for context switches (which I believe should be optimized by
tagging unused vector registers to save on stack bandwidth).
/Marcus
Quadibloc <quadibloc@servername.invalid> writes:
On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?
The obvious answer is that it's internal storage, rather than main
memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances.
The Cray systems weren't used as general purpose timesharing systems.
Scott Lurndal <scott@slp53.sl.home> schrieb:
Quadibloc <quadibloc@servername.invalid> writes:
On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?
The obvious answer is that it's internal storage, rather than main memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances.
The Cray systems weren't used as general purpose timesharing systems.
They were used as database servers, though - fast I/O, cheaper than
an IBM machine of the same performance.
Or so I heard, ~ 30 years ago.
Thomas Koenig wrote:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Quadibloc <quadibloc@servername.invalid> writes:
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under certain circumstances.
The Cray systems weren't used as general purpose timesharing systems.
They were used as database server, though - fast I/O, cheaper than
an IBM machine of the same performance.
The only thing they lacked for timesharing was paging:: CRAYs had a
base and bounds memory map. They made up for lack of paging with a
stupidly fast I/O system.
On Tue, 13 Feb 2024 19:57:28 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
On 2024-02-05, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of
64-bit).
Does it mean that you have 8 VRs and each VR is 2048 bits?
My machine can start and finish at most one 32-bit operation on every
clock cycle, so it is very simple. The same thing goes for vector
operations: at most one 32-bit vector element per clock cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards, reduced
I$ pressure and possibly improved cache locality and reduced register
pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities in
the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited number
of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix multiply type of algorithm then
8 VRs would not take you very far. 16 VRs are A LOT better. More than 16
VR can help somewhat, but the difference between 32 and 16 (in this
type of kernels) is much much smaller than difference between 8 and
16.
Radix-4 and mixed-radix FFT are probably similar, except that I never
profiled them as thoroughly as I did SGEMM.
I also envision that for most cases
you never have to preserve vector registers over function calls. I.e.
there's really no need to push/pop vector registers to the stack,
except for context switches (which I believe should be optimized by
tagging unused vector registers to save on stack bandwidth).
/Marcus
If CRAY-style VRs work for you, it's no proof that lighter VRs, e.g. ARM Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far from optimal. A ratio of 8 looks more optimal when performance of vectorized loops is a top priority. A ratio of 4 is a wise choice
otherwise.
On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?
The obvious answer is that it's internal storage, rather than main
memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
So, the vector register file being a _large shared resource_, one
faces the dilemma... make extra copies for as many programs as may
be running, or save and restore it.
I've come up with _one_ possible solution. Remember the Texas Instruments
9900, which kept its registers in memory because it was a 16-bit CPU
back when there weren't really enough gates on a die to put the register
file on-chip... which, incidentally, gave it fast context switching?
Well, why not have an on-chip memory, smaller than L2 cache but made
of similar memory cells, and use it for multiple vector register files, indicated by a pointer register?
But then the on-chip memory has to be divided into areas locked off
from different users, just like external DRAM, and _that_ becomes
a bit painful to contemplate.
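A rough C model of the idea, with invented names and sizes; the point is that a context switch changes a single pointer instead of copying the vector registers, at the cost of having to protect each bank:

#include <stdint.h>

#define NUM_CONTEXTS   8    /* hardware contexts, each with its own VRF bank (illustrative) */
#define VREGS          8
#define ELEMS_PER_VREG 64

typedef struct {
    uint64_t v[VREGS][ELEMS_PER_VREG];
} vrf_t;

static vrf_t onchip_vrf[NUM_CONTEXTS];   /* lives in the on-chip memory   */
static vrf_t *current_vrf;               /* the "VRF pointer register"    */

/* Context switch: no copying of vector state, just repoint the VRF -
   in the spirit of the TI 9900 workspace pointer.  The unsolved part is
   keeping each bank inaccessible to the other contexts. */
static inline void switch_vector_context(int ctx)
{
    current_vrf = &onchip_vrf[ctx];
}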
The Cray I was intended to be used basically in *batch* mode. Having
a huge vector register file in an ISA meant for *timesharing* is the
problem.
Perhaps what is really needed is VVM combined with some very good
cache hinting mechanisms. I don't have the expertise needed to work
that out, so I'll have to settle for something rather more kludgey
instead.
Of course, if a Cray I is a *batch* processing computer, that sort
of justifies the notion I came up with earlier - in a thread I
aptly titled "A Very Bad Idea" - of making a Cray I-like CPU with
vector registers an auxiliary processor after the fashion of those
in the IBM/Sony CELL processor. But one wants high-bandwidth access
to DRAM, not no access to DRAM!
The NEC SX-Aurora TSUBASA solves the issue by putting all its DRAM
inside a module that looks a lot like a video card. You just have to
settle for 48 gigabytes of memory that won't be expandable.
Some database computers, of course, have as much as a terabyte of
DRAM - which used to be the size of a large magnetic hard drive.
People who can afford a terabyte of DRAM can also afford an eight-channel memory bus, so it should be possible to manage something.
John Savard
On 2024-02-14, Quadibloc wrote:
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
My current vision (not MRISC32), which is a very simple
microcontroller type implementation (basically in the same ballpark as Cortex-M or small RV32I implementations), would have a relatively
limited vector register file.
I scribbled down a suggestion here:
* https://gitlab.com/-/snippets/3673883
In particular, pay attention to the sections "Vector state on context switches" and "Thread context".
My idea is not new, but I think that it takes some old ideas a few steps further. So here goes...
There are four vector registers (V1-V4), each consisting of 8 x 32 bits,
for a grand total of 128 bytes of vector thread context state. To start
with, this is not an enormous amount of state (it's the same size as the integer register file of RV32I).
Each vector register is associated with a "vector in use" flag, which is
set as soon as the vector register is written to.
The novel part (AFAIK) is that all "vector in use" flags are cleared as
soon as a function returns (rts) or another function is called (bl/jl),
which takes advantage of the ABI that says that all vector registers are scratch registers.
I then predict that the ISA will have some sort of intelligent store
and restore state instructions, that will only waste memory cycles
for vector registers that are marked as "in use". I also predict that
most vector registers will be unused most of the time (except for
threads that use up 100% CPU time with heavy data processing, which
should hopefully be in minority - especially in the kind of systems
where you want to put a microcontroller style CPU).
I do not yet know if this will fly, though...
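As a rough sketch of how the "in use" flags could cut context-switch traffic (the register count and widths are taken from the description above, everything else is invented for illustration):

#include <stdint.h>
#include <string.h>

#define NVREGS 4    /* V1-V4            */
#define VELEMS 8    /* 8 x 32 bits each */

typedef struct {
    uint32_t v[NVREGS][VELEMS];
    uint8_t  in_use;            /* bit i set => vector register i has live contents */
} vec_state_t;

/* Save only the registers marked in use; the mask records which slots
   of the context block are valid for the later restore. */
static void save_vector_state(const vec_state_t *hw, uint32_t *ctx, uint8_t *mask)
{
    *mask = hw->in_use;
    for (int i = 0; i < NVREGS; i++)
        if (hw->in_use & (1u << i))
            memcpy(&ctx[i * VELEMS], hw->v[i], sizeof hw->v[i]);
}

static void restore_vector_state(vec_state_t *hw, const uint32_t *ctx, uint8_t mask)
{
    hw->in_use = mask;
    for (int i = 0; i < NVREGS; i++)
        if (mask & (1u << i))
            memcpy(hw->v[i], &ctx[i * VELEMS], sizeof hw->v[i]);
}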
On 2024-02-14, Michael S wrote:
On Tue, 13 Feb 2024 19:57:28 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
On 2024-02-05, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of
64-bit).
Does it mean that you have 8 VRs and each VR is 2048 bits?
No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
number of registers as I have five-bit vector address fields in the instruction encoding (because 32 scalar registers). I have been
thinking about reducing it to 16 vector registers, and find some
clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
there yet.
The number of vector elements in each register is implementation
defined, but currently the minimum number of vector elements is set to
16 (I wanted to set it relatively high to push myself to come up with solutions to problems related to large vector registers).
Each vector element is 32 bits wide.
So, in total: 32 x 16 x 32 bits = 16384 bits
This is, incidentally, exactly the same as for AVX-512.
My machine can start and finish at most one 32-bit operation on
every clock cycle, so it is very simple. The same thing goes for
vector operations: at most one 32-bit vector element per clock
cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards,
reduced I$ pressure and possibly improved cache locality and
reduced register pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities
in the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited
number of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix multiply type of algorithm
then 8 VRs would not take you very far. 16 VRs are A LOT better.
More than 16 VR can help somewhat, but the difference between 32
and 16 (in this type of kernels) is much much smaller than
difference between 8 and 16.
Radix-4 and mixed-radix FFT are probably similar except that I never profiled as thoroughly as I did SGEMM.
I expect that people will want to do such things with an MRISC32 core. However, for the "small cores" that I'm talking about, I doubt that
they would even have floating-point support. It's more a question of
simple loop optimizations - e.g. the kinds you find in libc or
software rasterization kernels. For those you will often get lots of
work done with just four vector registers.
I also envision that for most cases
you never have to preserve vector registers over function calls.
I.e. there's really no need to push/pop vector registers to the
stack, except for context switches (which I believe should be
optimized by tagging unused vector registers to save on stack
bandwidth).
/Marcus
If CRAY-style VRs work for you, it's no proof that lighter VRs, e.g.
ARM Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far
from optimal. A ratio of 8 looks more optimal when
performance of vectorized loops is a top priority. A ratio of 4 is a
wise choice otherwise.
For MRISC32 I'm aiming for splitting a vector operation into four.
That seems to eliminate most RAW hazards as execution pipelines tend
to be at most four stages long (or thereabout). So, with a pipeline
width of 128 bits (which seems to be the go-to width for many implementations), you want registers that have 4 x 128 = 512 bits,
which is one of the reasons that I mandate at least 512-bit vector
registers in MRISC32.
Of course, nothing is set in stone, but so far that has been my
thinking.
/Marcus
On Thu, 15 Feb 2024 20:00:20 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
On 2024-02-14, Michael S wrote:
On Tue, 13 Feb 2024 19:57:28 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
On 2024-02-05, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of
64-bit).
Does it mean that you have 8 VRs and each VR is 2048 bits?
No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
number of registers as I have five-bit vector address fields in the
instruction encoding (because 32 scalar registers). I have been
thinking about reducing it to 16 vector registers, and find some
clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
there yet.
The number of vector elements in each register is implementation
defined, but currently the minimum number of vector elements is set to
16 (I wanted to set it relatively high to push myself to come up with
solutions to problems related to large vector registers).
Each vector element is 32 bits wide.
So, in total: 32 x 16 x 32 bits = 16384 bits
This is, incidentally, exactly the same as for AVX-512.
My machine can start and finish at most one 32-bit operation on
every clock cycle, so it is very simple. The same thing goes for
vector operations: at most one 32-bit vector element per clock
cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards,
reduced I$ pressure and possibly improved cache locality and
reduced register pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities
in the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited
number of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix multiply type of algorithm
then 8 VRs would not take you very far. 16 VRs are A LOT better.
More than 16 VR can help somewhat, but the difference between 32
and 16 (in this type of kernels) is much much smaller than
difference between 8 and 16.
Radix-4 and mixed-radix FFT are probably similar except that I never
profiled as thoroughly as I did SGEMM.
I expect that people will want to do such things with an MRISC32 core.
However, for the "small cores" that I'm talking about, I doubt that
they would even have floating-point support. It's more a question of
simple loop optimizations - e.g. the kinds you find in libc or
software rasterization kernels. For those you will often get lots of
work done with just four vector registers.
I also envision that for most cases
you never have to preserve vector registers over function calls.
I.e. there's really no need to push/pop vector registers to the
stack, except for context switches (which I believe should be
optimized by tagging unused vector registers to save on stack
bandwidth).
/Marcus
If CRAY-style VRs work for you, it's no proof that lighter VRs, e.g.
ARM Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far
from optimal. A ratio of 8 looks more optimal when
performance of vectorized loops is a top priority. A ratio of 4 is a
wise choice otherwise.
For MRISC32 I'm aiming for splitting a vector operation into four.
That seems to eliminate most RAW hazards as execution pipelines tend
to be at most four stages long (or thereabout). So, with a pipeline
width of 128 bits (which seems to be the go-to width for many
implementations), you want registers that have 4 x 128 = 512 bits,
which is one of the reasons that I mandate at least 512-bit vector
registers in MRISC32.
Of course, nothing is set in stone, but so far that has been my
thinking.
/Marcus
Sounds quite reasonable, but I wouldn't call it "Cray-style".
On Fri, 16 Feb 2024 12:37:55 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
Then what would you call it?
I just use the term "Cray-style" to differentiate the style of vector
ISA from explicit SIMD ISA:s, GPU-style vector ISA:s and STAR-style
memory-memory vector ISA:s, etc.
/Marcus
I'd call it a variant of SIMD.
For me everything with vector register width to ALU width ratio <= 4 is
SIMD. 8 is borderline, above 8 is vector.
It means that sometimes I classify by implementation instead of by
architecture, which in theory is problematic. But I don't care; I am not
in academia.
Quadibloc <quadibloc@servername.invalid> writes:
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
I don't think that's a proper characterization of VVM. One advantage
that vector registers have over memory-memory machines is that vector registers, once loaded, can be used several times. And AFAIK VVM has
that advantage, too. E.g., if you have the loop
for (i=0; i<n; i++) {
  double b = a[i];
  c[i] = b;
  d[i] = b;
}
a[i] is loaded only once (also in VVM), while a memory-memory
formulation would load a[i] twice. And on the microarchitectural
level, VVM may work with vector registers, but the nice part is that
it's only microarchitecture, and it avoids all the nasty consequences
of making it architectural, such as more expensive context switches.
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
  r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
On 2024-02-16, Quadibloc wrote:
On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
On 2024-02-14, Quadibloc wrote:
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
But because the historical precedent seems to indicate otherwise, and
because while data forwarding is very definitely a good thing (and,
indeed, necessary to have for best performance _on_ a vector register
machine too) it has its limits.
What _could_ substitute for vector registers isn't data forwarding,
it's the cache, since that does the same thing vector registers do:
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.
My first question, though, is whether or not we can really make caches
that good.
I think that you are missing some of the points that I'm trying to make.
In my recent comments I have been talking about very low end machines,
the kinds that can execute at most one instruction per clock cycle, or
maybe less, and that may not even have a cache at all.
I'm saying that I believe that within this category there is an
opportunity for improving performance with very little cost by adding
vector operations.
E.g. imagine a non-pipelined implementation with a single memory port,
shared by instruction fetch and data load/store, that requires perhaps
two cycles to fetch and decode an instruction, and executes the
instruction in the third cycle (possibly accessing the memory, which precludes fetching a new instruction until the fourth or even fifth
cycle).
Now imagine if a single instruction could iterate over several elements
of a vector register. This would mean that the execution unit could
execute up to one operation every clock cycle, approaching similar performance levels as a pipelined 1 CPI machine. The memory port would
be free for data traffic as no new instructions have to be fetched
during the vector loop. And so on.
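As a rough worked example under those assumptions (two cycles to fetch and decode, one to execute): processing 16 elements with scalar instructions costs on the order of 16 x 3 = 48 cycles for the arithmetic alone, before counting the increment/compare/branch overhead, while a single vector instruction over the same 16 elements costs roughly 2 + 16 = 18 cycles and leaves the memory port free during the 16 execute cycles.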
Similarly, imagine a very simple strictly in-order pipelined
implementation, where you have to resolve hazards by stalling the
pipeline every time there is RAW hazard for instance, and you have to
throw away cycles every time you mispredict a branch (which may be
quite often if you only have a very primitive predictor).
With vector operations you pause the front end (fetch and decode) while iterating over vector elements, which eliminates branch misprediction penalties. You also magically do away with RAW hazards as by the time
you start issuing a new instruction the vector elements needed from the previous instruction have already been written to the register file.
And of course you do away with loop overhead instructions (increment, compare, branch).
As a bonus, I believe that a vector solution like that would be more
energy efficient, as less work has to be done for each operation than if
you have to fetch and decode an instruction for every operation that you
do.
As I said, VVM has many similar properties, but I am currently exploring
if a VRF solution can be made sufficiently cheap to be feasible in this
very low end space, where I believe that VVM may be a bit too much (this assumption is mostly based on my own ignorance, so take it with a grain
of salt).
For reference, the microarchitectural complexity that I'm thinking about
is comparable to FemtoRV32 by Bruno Levy (400 LOC, with comments):
https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v
/Marcus
On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
On 2024-02-14, Quadibloc wrote:
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
But because the historical precedent seems to indicate otherwise, and
because while data forwarding is very definitely a good thing (and,
indeed, necessary to have for best performance _on_ a vector register
machine too) it has its limits.
What _could_ substitute for vector registers isn't data forwarding,
it's the cache, since that does the same thing vector registers do:
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.
My first question, though, is whether or not we can really make caches
that good.
But skepticism about VVM isn't actually helpful if Cray-style vectors
now simply can't be made to work given current memory speeds.
The basic way in which I originally felt I could make it work was really quite simple. The operating system, from privileged code, could set a
bit in the PSW that turns on, or off, the ability to run instructions that access the vector registers.
The details of how one may have to make use of that capability... well, that's software. So maybe the OS has to stipulate that one can only have
one process at a time that uses these vectors - and that process has to
run as a batch process!
Hey, the GPU in a computer these days is also a singular resource.
Having resources that have to be treated that way is not really what
people are used to, but a computer that _can_ run your CFD codes
efficiently is better than a computer that *can't* run your CFD codes.
Given _that_, obviously if VVM is a better fit to the regular computer
model, and it offers nearly the same performance, then what I should do
is offer VVM or something very much like it _in addition_ to Cray-style vectors, so that the best possible vector performance for conventional non-batch programs is also available.
Now, what would I think of as being "something very much like VVM" without actually being VVM?
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
So this makes those exact combinations part of the... ISA syntax...
which I think is too hard for assembler programmers to remember, and
I think it's also too hard for at least some implementors. I see it
as asking for trouble in a way that I'd rather avoid.
So my substitute for VVM should now be obvious - explicit memory-to-memory vector instructions, like on an old STAR-100.
John Savard
On 2/15/2024 11:27 PM, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
I don't think that's a proper characterization of VVM. One advantage
that vector registers have over memory-memory machines is that vector
registers, once loaded, can be used several times. And AFAIK VVM has
that advantage, too. E.g., if you have the loop
for (i=0; i<n; i++) {
  double b = a[i];
  c[i] = b;
  d[i] = b;
}
a[i] is loaded only once (also in VVM), while a memory-memory
formulation would load a[i] twice. And on the microarchitectural
level, VVM may work with vector registers, but the nice part is that
it's only microarchitecture, and it avoids all the nasty consequences
of making it architectural, such as more expensive context switches.
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the instructions in the loop can be fetched and decoded only once, it allows the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by analyzing a "normal" instruction stream, but that seems much harder.
Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty. Maybe we'll see it in some Intel or AMD CPU
in the coming years.
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Sure, predication is not a control structure.
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware. For FP addition that should give the same
result as the sequential code, it's probably much harder. Of course,
you can ask the programmer to write:
double r;
double r0=0.0;
....
double r15=0.0;
for (i=0; i<n-15; i+=16) {
r0 += a[i];
...
r15 += a[i+15];
}
.... deal with the remaining iterations ...
r = r0+...+r15;
But then the point of auto-vectorization is that the programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like above.
- anton
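For what it's worth, here is one way to fill in the elided parts of that transformation as ordinary C (my own completion, for illustration only); note that the result can differ from the strictly sequential sum, since FP addition is not associative:

#include <stddef.h>

double sum16(const double *a, size_t n)
{
    double r[16] = {0.0};
    size_t i = 0;

    for (; i + 16 <= n; i += 16)        /* main loop, 16 partial sums */
        for (size_t j = 0; j < 16; j++)
            r[j] += a[i + j];

    double s = 0.0;
    for (; i < n; i++)                  /* deal with the remaining iterations */
        s += a[i];

    for (size_t j = 0; j < 16; j++)     /* r = r0 + ... + r15 */
        s += r[j];
    return s;
}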
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a
single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty.
Maybe we'll see it in some Intel or AMD CPU
in the coming years.
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Sure, predication is not a control structure.
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
For FP addition that should give the same
result as the sequential code, it's probably much harder. Of course,
you can ask the programmer to write:
double r;
double r0=0.0;
...
double r15=0.0;
for (i=0; i<n-15; i+=16) {
r0 += a[i];
...
r15 += a[i+15];
}
... deal with the remaining iterations ...
r = r0+...+r15;
But then the point of auto-vectorization is that the programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like above.
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
MitchAlsup1 wrote:
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
Is that done by disambiguation logic, checking for same cache line access?
On 2/16/2024 6:23 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
Basically, Mitch has his architecture designed for implementation on CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty.
IANAHG, but if it were that simple, I would think Mitch would have implemented it that way.
Maybe we'll see it in some Intel or AMD CPU
in the coming years.
One can hope!
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Sure, predication is not a control structure.
OK, but my point is that you can do conditional execution within a VVM loop.
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
Sure. ISTM, and again, IANAHG, that the problem for VVM is the hardware recognizing that the loop contains no instructions that can't be parallelized. There are also some issues like doing a sum of signed
integer values and knowing whether overflow occurred, etc. The
programmer may know that overflow cannot occur, but the HW doesn't.
For FP addition that should give the same
result as the sequential code, it's probably much harder. Of course,
you can ask the programmer to write:
double r;
double r0=0.0;
...
double r15=0.0;
for (i=0; i<n-15; i+=16) {
r0 += a[i];
...
r15 += a[i+15];
}
... deal with the remaining iterations ...
r = r0+...+r15;
But then the point of auto-vectorization is that the programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like above.
Agreed.
Quadibloc wrote:
So my substitute for VVM should now be obvious - explicit memory-to-memory vector instructions, like on an old STAR-100.
Gasp........
Stephen Fuld wrote:
On 2/16/2024 6:23 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative; but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
Sure. ISTM, and again, IANAHG, that the problem for VVM is the
hardware recognizing that the loop contains no instructions that can't
be parallelized. There are also some issues like doing a sum of
signed integer values and knowing whether overflow occurred, etc. The
programmer may know that overflow cannot occur, but the HW doesn't.
The HW does not need preceding knowledge. If an exception happens, the vectorized loop collapses into a scalar loop precisely, and can be
handled in the standard fashion.
On 2/16/2024 3:22 PM, MitchAlsup wrote:
Stephen Fuld wrote:
On 2/16/2024 6:23 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
snip
I think he also allows
recurrences (in particular, reductions), but I don't understand how his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
   r += a[i];
This is particularly nasty given that FP addition is not associative; but even if you allow fast-math-style reassociation, doing this in hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
Sure. ISTM, and again, IANAHG, that the problem for VVM is the
hardware recognizing that the loop contains no instructions that
can't be parallelized. There are also some issues like doing a sum
of signed integer values and knowing whether overflow occurred,
etc. The programmer may know that overflow cannot occur, but the HW doesn't.
The HW does not need preceding knowledge. If an exception happens, the
vectorized loop collapses into a scalar loop precisely, and can be
handled in the standard fashion.
I think you might have missed my point. If you are summing the signed integer elements of an array, whether you get an overflow or not can
depend on the order the additions are done. Thus, without knowledge
that only the programmer has (i.e. that with the size of the actual data used, overflow is impossible) the hardware cannot parallelize such an operation. If the programmer knows that overflow cannot occur, he has
no way to communicate that to the VVM hardware, such that the HW could parallelize the summation.
On 2/16/2024 5:29 AM, Marcus wrote:
I'm saying that I believe that within this category there is an
opportunity for improving performance with very little cost by adding
vector operations.
E.g. imagine a non-pipelined implementation with a single memory port,
shared by instruction fetch and data load/store, that requires perhaps
two cycles to fetch and decode an instruction, and executes the
instruction in the third cycle (possibly accessing the memory, which
precludes fetching a new instruction until the fourth or even fifth
cycle).
Now imagine if a single instruction could iterate over several elements
of a vector register. This would mean that the execution unit could
execute up to one operation every clock cycle, approaching similar
performance levels as a pipelined 1 CPI machine. The memory port would
be free for data traffic as no new instructions have to be fetched
during the vector loop. And so on.
I guess possible.
EricP wrote:
MitchAlsup1 wrote:
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
first: memory is inherently misaligned in My 66000 architecture. So, since the width of the machine is 64-bits, we read or write in 128-bit quantities so that we have enough bits to extract the misaligned data from, or a container large enough to store a 64-bit value into. {{And there are all the associated corner cases}}
Second: over in VVM-land, the implementation can decide to read and write wider, but is architecturally constrained not to shrink below 128-bits.
A 1-wide My66160 would read pairs of double-precision FP values, or quads of 32-bit values, octets of 16-bit values, and sixteens of 8-bit values. This supports loops of 6 IPC or greater in a 1-wide machine. This machine would process suitable loops at 128 bits per cycle--depending on "other things" that are generally allowable.
A 6-wide My66650 would read a cache line at a time, and has 3 cache ports per cycle. This supports 20 IPC or greater in the 6-wide machine. As many as 8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths to support this rate.
Is that done by disambiguation logic, checking for same cache line access?
Before I have said that the front end observes the first iteration of the loop and makes some determinations as to how wide the loop can be run on the machine at hand. One of those observations is whether memory addresses are dense, whether they all go in the same direction, and what registers carry loop-to-loop dependencies.
MitchAlsup wrote:
EricP wrote:
MitchAlsup1 wrote:
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
first: memory is inherently misaligned in My 66000 architecture. So, since the width of the machine is 64-bits, we read or write in 128-bit quantities so that we have enough bits to extract the misaligned data from, or a container large enough to store a 64-bit value into. {{And there are all the associated corner cases}}
Second: over in VVM-land, the implementation can decide to read and write
wider, but is architecturally constrained not to shrink below 128-bits.
A 1-wide My66160 would read pairs of double-precision FP values, or quads of 32-bit values, octets of 16-bit values, and sixteens of 8-bit values. This supports loops of 6 IPC or greater in a 1-wide machine. This machine would process suitable loops at 128 bits per cycle--depending on "other things" that are generally allowable.
A 6-wide My66650 would read a cache line at a time, and has 3 cache ports per cycle. This supports 20 IPC or greater in the 6-wide machine. As many as 8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths to support this rate.
Ah, so it can emit Load/Store Pair LDP/STP (or wider) uOps inside the loop. That's more straightforward than fusing LD's or ST's in LSQ.
Is that done by disambiguation logic, checking for same cache line
access?
Before I have said that the front end observes the first iteration of the loop and makes some determinations as to how wide the loop can be run on the machine at hand. One of those observations is whether memory addresses are dense, whether they all go in the same direction, and what registers carry loop-to-loop dependencies.
How does it know when to use LDP/STP uOps?
That decision would have to be made early in the front end, likely Decode
and before Rename because you have to know how many dest registers you need.
But the decision on the legality to use LDP/STP depends on knowing the current loop counter >= 2 and address(es) aligned on a 16 byte boundary, which are multiple dynamic, possibly calculated, values only available
much later to the back end.
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Anton Ertl wrote:
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
while if you implement it with
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
then you would be OK, and the final result would be mathematically correct.
For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.
Terje
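A small harness for the two variants above (the names sum_narrow and sum_wide are mine, for illustration); on a typical two's-complement implementation, where the out-of-range conversion wraps, both calls print 126 for this input, which is consistent with the remark about wrapping arithmetic:

#include <stdint.h>
#include <stdio.h>

static int8_t sum_narrow(int len, const int8_t data[])
{
    int8_t s = 0;
    for (int i = 0; i < len; i++)
        s += data[i];           /* intermediate value wraps via the int8_t conversion */
    return s;
}

static int8_t sum_wide(int len, const int8_t data[])
{
    int s = 0;
    for (int i = 0; i < len; i++)
        s += data[i];           /* accumulates in int, no intermediate wrap */
    return (int8_t)s;
}

int main(void)
{
    int8_t data[] = {127, 1, -2};
    printf("%d %d\n", sum_narrow(3, data), sum_wide(3, data));
    return 0;
}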
On 2/17/2024 3:20 AM, Terje Mathisen wrote:
BGB wrote:
But, I am not entirely sure how one would go about implementing it, as
VADD.H would need to do the equivalent of:
MOV.Q (R4), R16
MOV.Q (R5), R17
ADD 8, R4
ADD 8, R5
PADD.H R16, R17, R18
MOV.Q R18, (R6)
ADD 8, R6
All in a single instruction.
Though, could be reduced if auto-increment were re-added:
MOV.Q @R4+, R16
MOV.Q @R5+, R17
PADD.H R16, R17, R18
MOV.Q R18, @R6+
On 2/17/2024 12:03 PM, Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly don't trap.
Yes.
Trap on overflow is not really a thing in the JVM, the basic integer
types are modulo, and don't actually distinguish signed from unsigned (unsigned arithmetic is merely faked in some cases with special
operators, with signed arithmetic assumed as the default).
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Yeah. No traps, only NaNs.
FWIW: My own languages, and BGBCC, also partly followed Java's model in
this area. But, it wasn't hard: This is generally how C behaves as well
on most targets.
Well, except that C will often trap for things like divide by zero and similar, at least on x86. Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.
On BJX2, there isn't currently any divide-by-zero trap, since:
This case doesn't happen in normal program execution;
Handling it with a trap would cost more than not bothering.
So, IIRC, integer divide-by-zero will just give 0, and FP divide-by-zero
will give Inf or NaN.
- anton
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
On 2/17/2024 4:08 PM, MitchAlsup1 wrote:
BGB wrote:
On BJX2, there isn't currently any divide-by-zero trap, since:
This case doesn't happen in normal program execution;
Handling it with a trap would cost more than not bothering.
This sounds like it should make your machine safe to program and use,
but it does not.
It is more concerned with "cheap" than "safe".
Trap on divide-by-zero would require having a way for the divider unit
to signal divide-by-zero has been encountered (say, so some external
logic can raise the corresponding exception code). This is not free.
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
Anton Ertl wrote:...
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
Well, except that C will often trap for things like divide by zero and similar, at least on x86.
Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
BGB <cr88192@gmail.com> writes:
Well, except that C will often trap for things like divide by zero
and similar, at least on x86.
The division instructions of IA-32 and AMD64 trap on divide-by-zero
and when the result is out of range. Unsurprisingly, C compilers
usually use these instructions when compiling division on these architectures. One interesting case is what C compilers do when you
write
long foo(long x)
{
return x/-1;
}
Both gcc and clang compile this to
0: 48 89 f8 mov %rdi,%rax
3: 48 f7 d8 neg %rax
6: c3 retq
and you don't get a trap when you call foo(LONG_MIN), while you would
if the compiler did not know that the divisor is -1 (and it was -1 at run-time).
By contrast, when I implemented division-by-constant optimization in
Gforth, I decided not "optimize" the division by -1 case, so you get
the ordinary division operation and its behaviour. If a programmer
codes a division by -1 rather than just NEGATE, they probably want
something other than NEGATE.
Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.
Reading up on Java, <https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2> says:
|if the dividend is the negative integer of largest possible magnitude
|for its type, and the divisor is -1, then integer overflow occurs and
|the result is equal to the dividend. Despite the overflow, no
|exception is thrown in this case. On the other hand, if the value of
|the divisor in an integer division is 0, then an ArithmeticException
|is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather than just generating the
divide instruction. Alternatively, the generated code could just
produce a division instruction, and the signal handler (on Unix) or equivalent could then check if the divisor was 0 (and then throw an ArithmeticException) or -1 (and then produce a long_min result and
continue execution).
- anton
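A sketch, in C rather than JIT output, of the guard Anton describes (throw_arithmetic_exception is an assumed runtime helper, not a real JVM entry point):

#include <stdint.h>

extern void throw_arithmetic_exception(void);    /* assumed runtime helper */

int64_t java_ldiv(int64_t dividend, int64_t divisor)
{
    if (divisor == 0) {
        throw_arithmetic_exception();
        return 0;                                /* not reached */
    }
    if (divisor == -1)                           /* LONG_MIN / -1 must yield LONG_MIN, not trap */
        return (int64_t)(0ULL - (uint64_t)dividend);
    return dividend / divisor;                   /* plain CQO/IDIV sequence otherwise */
}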
On Sun, 18 Feb 2024 08:00:18 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2> >> says:
|if the dividend is the negative integer of largest possible magnitude
|for its type, and the divisor is -1, then integer overflow occurs and
|the result is equal to the dividend. Despite the overflow, no
|exception is thrown in this case. On the other hand, if the value of
|the divisor in an integer division is 0, then an ArithmeticException
|is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather than just generating the
divide instruction. Alternatively, the generated code could just
produce a division instruction, and the signal handler (on Unix) or
equivalent could then check if the divisor was 0 (and then throw an
ArithmeticException) or -1 (and then produce a long_min result and
continue execution).
- anton
I don't understand why case of LONG_MIN/-1 would possibly require
special handling. IMHO, regular iAMD64 64-bit integer division sequence,
i.e. CQO followed by IDIV, will produce result expected by Java spec
without any overflow.
Michael S <already5chosen@yahoo.com> writes:
On Sun, 18 Feb 2024 08:00:18 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
says:
|if the dividend is the negative integer of largest possible
magnitude |for its type, and the divisor is -1, then integer
overflow occurs and |the result is equal to the dividend. Despite
the overflow, no |exception is thrown in this case. On the other
hand, if the value of |the divisor in an integer division is 0,
then an ArithmeticException |is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather than just generating
the divide instruction. Alternatively, the generated code could
just produce a division instruction, and the signal handler (on
Unix) or equivalent could then check if the divisor was 0 (and
then throw an ArithmeticException) or -1 (and then produce a
long_min result and continue execution).
- anton
I don't understand why case of LONG_MIN/-1 would possibly require
special handling. IMHO, regular iAMD64 64-bit integer division
sequence, i.e. CQO followed by IDIV, will produce result expected by
Java spec without any overflow.
Try it. E.g., in gforth-fast /S performs this sequence:
see /s
Code /s
0x00005614dd33562d <gforth_engine+3213>: add $0x8,%rbx
0x00005614dd335631 <gforth_engine+3217>: mov 0x8(%r13),%rax
0x00005614dd335635 <gforth_engine+3221>: add $0x8,%r13
0x00005614dd335639 <gforth_engine+3225>: cqto
0x00005614dd33563b <gforth_engine+3227>: idiv %r8
0x00005614dd33563e <gforth_engine+3230>: mov %rax,%r8
0x00005614dd335641 <gforth_engine+3233>: mov (%rbx),%rax
0x00005614dd335644 <gforth_engine+3236>: jmp *%rax
end-code
And when I divide LONG_MIN by -1, I get a trap:
$8000000000000000 -1 /s
*the terminal*:12:22: error: Division by zero
$8000000000000000 -1 >>>/s<<<
- anton
LONG_MIN/1 works, but LONG_MIN/-1 crashes, to my surprise.
Seems like I didn't RTFM with regard to IDIV for too many years.
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers
certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
while if you implement it with
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
then you would be OK, and the final result would be mathematically correct.
For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.
On Sun, 18 Feb 2024 22:40:08 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 18 Feb 2024 08:00:18 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
says:
|if the dividend is the negative integer of largest possible
magnitude |for its type, and the divisor is -1, then integer
overflow occurs and |the result is equal to the dividend. Despite
the overflow, no |exception is thrown in this case. On the other
hand, if the value of |the divisor in an integer division is 0,
then an ArithmeticException |is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather than just
generating the divide instruction. Alternatively, the generated
code could just produce a division instruction, and the signal
handler (on Unix) or equivalent could then check if the divisor
was 0 (and then throw an ArithmeticException) or -1 (and then
produce a long_min result and continue execution).
- anton
I don't understand why case of LONG_MIN/-1 would possibly require
special handling. IMHO, regular iAMD64 64-bit integer division
sequence, i.e. CQO followed by IDIV, will produce result expected
by Java spec without any overflow.
Try it. E.g., in gforth-fast /S performs this sequence:
see /s
Code /s
0x00005614dd33562d <gforth_engine+3213>: add $0x8,%rbx
0x00005614dd335631 <gforth_engine+3217>: mov 0x8(%r13),%rax
0x00005614dd335635 <gforth_engine+3221>: add $0x8,%r13
0x00005614dd335639 <gforth_engine+3225>: cqto
0x00005614dd33563b <gforth_engine+3227>: idiv %r8
0x00005614dd33563e <gforth_engine+3230>: mov %rax,%r8
0x00005614dd335641 <gforth_engine+3233>: mov (%rbx),%rax
0x00005614dd335644 <gforth_engine+3236>: jmp *%rax
end-code
And when I divide LONG_MIN by -1, I get a trap:
$8000000000000000 -1 /s
*the terminal*:12:22: error: Division by zero
$8000000000000000 -1 >>>/s<<<
- anton
You are right.
LONG_MIN/1 works, but LONG_MIN/-1 crashes, to my surprise.
Seems like I didn't RTFM with regard to IDIV for too many years.
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers
certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
David Brown wrote:
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers
certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits and long
is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
What I was trying to illustrate was the principle that by having a wider accumulator you could aggregate a series of numbers, both positive and negative, and get the correct (in-range) result, even if the input
happened to be arranged in such a way that it would temporarily overflow
the target int type.
I think it is much better to do it this way and then get a conversion
size trap at the very end when/if the final sum is in fact too large for
the result type.
Terje
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
No. In C, int8_t values will be promoted to "int" (which is always at
least 16 bits, on any target) and the operation will therefore not
overflow.
The conversion of the result of "s + data[i]" from int to
int8_t, implicit in the assignment, also cannot "overflow" since that
term applies only to the evaluation of operators. But if this value is outside the range for int8_t, then the conversion is
implementation-defined behaviour. (That is unlike signed integer
overflow, which is undefined behaviour.)
Terje Mathisen wrote:
If we have a platform where the default integer size is 32 bits and long
is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
Not necessarily:: accumulation rules allow the promotion of int->long
inside a loop if the long is smashed back to int immediately after the
loop terminates. A compiler should be able to perform this transformation.
In effect, this hoists the smashes back to int out of the loop, increasing performance and making it that much harder to tickle the overflow exception.
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
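A sketch of the widening being described, assuming the compiler (or the programmer) applies it; whether a given compiler actually performs this transformation is a separate question:

#include <stdint.h>
#include <stddef.h>

int32_t sum_widened(const int32_t *a, size_t n)
{
    int64_t t = 0;                  /* widened accumulator stays in a 64-bit register */
    for (size_t i = 0; i < n; i++)
        t += a[i];
    return (int32_t)t;              /* "smashed back to int" only once, after the loop */
}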
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
David Brown wrote:
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers
certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits and long
is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
What I was trying to illustrate was the principle that by having a wider accumulator you could aggregate a series of numbers, both positive and negative, and get the correct (in-range) result, even if the input
happened to be arranged in such a way that it would temporarily overflow
the target int type.
I think it is much better to do it this way and then get a conversion
size trap at the very end when/if the final sum is in fact too large for
the result type.
Terje Mathisen wrote:
David Brown wrote:
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly don't trap.
If you are talking about FP (like I did), by default FP addition does not trap in Java, and any mention of "mathematically correct" in connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits and
long is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
Not necessarily:: accumulation rules allow the promotion of int->long
inside a loop if the long is smashed back to int immediately after the
loop terminates. A compiler should be able to perform this transformation.
In effect, this hoists the smashes back to int out of the loop, increasing performance and making it that much harder to tickle the overflow
exception.
What I was trying to illustrate was the principle that by having a
wider accumulator you could aggregate a series of numbers, both
positive and negative, and get the correct (in-range) result, even if
the input happened to be arranged in such a way that it would
temporarily overflow the target int type.
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s. We have
been paying with additional sign-extension and zero-extension
operations ever since then, and it has even deformed architectures:
ARM A64 has addressing modes that include sign- or zero-extending a
32-bit index, and RISC-V selected SLLI, SRLI, SRAI for its
compressed extension, probably because they are so frequent, being
used in RISC-V's idioms for sign and zero extension.
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered at the next J3 meeting; it can be found at https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
No. In C, int8_t values will be promoted to "int" (which is always at
least 16 bits, on any target) and the operation will therefore not
overflow.
Depending on len and the data...
The conversion of the result of "s + data[i]" from int to
int8_t, implicit in the assignment, also cannot "overflow" since that
term applies only to the evaluation of operators. But if this value is
outside the range for int8_t, then the conversion is
implementation-defined behaviour. (That is unlike signed integer
overflow, which is undefined behaviour.)
And that is one of the things that bugs me, in languages like C
and Fortran both.
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
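For reference, the gcc/clang builtin mentioned above is used like this; it reports whether the signed addition overflowed, something standard C still offers no direct way to ask:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    int sum;
    if (__builtin_add_overflow(INT_MAX, 1, &sum))
        printf("overflow, wrapped result = %d\n", sum);
    else
        printf("ok, sum = %d\n", sum);
    return 0;
}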
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the architectures of old.
Those ideas are that integer overflows do not happen and that a
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Thomas Koenig <tkoenig@netcologne.de> writes:
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
 int8_t s = 0;
 for (unsigned i = 0 i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Which is why len should have been declared as size_t. A negative
array length is nonsensical.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960's had a flags register with an overflow bit.
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
We have
been paying with additional sign-extension and zero-extension
operations ever since then, and it has even deformed architectures:
ARM A64 has addressing modes that include sign- or zero-extending a
32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
compressed extension, probably because they are so frequent because
they are used in RISC-V's idioms for sign and zero extension.
Also, these architectures probably would not have the so-called 32-bit arithmetic instructions (like RV64G's addw) if the mainstream C world
had decided to use ILP64. RV64G could have left these instructions
away and replaced them with a sequence of add, slli, srli, i.e., a
64-bit addition followed by a sign-extension idiom. After all, RISC-V
seems to favour sequences of more general instructions over having
more specialized instructions (and addressing modes). But apparently
the frequency of 32-bit additions is so high thanks to I32LP64 that
they added addw and addiw to RV64G; and they even occupy space in the compressed extension (16-bit encodings of frequent instructions).
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
E.g., consider the function
void sext(int M, int *ic, int *is)
{
int k;
for (k = 1; k <= M; k++) {
ic[k] += is[k];
}
}
which is based on the only loop (from 456.hmmer) in SPECint 2006 where
the difference between -fwrapv and the default produces a measurable performance difference (as reported in section 3.3 of <https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf>). I
created variations of this function, where the types of M and k were
changed to b) unsigned, c) intptr_t, d) uintptr_t and compiled the
code with "gcc -Wall -fwrapv -O3 -c -fno-unroll-loops". The loop body
looks as follows on RV64GC:
int                       unsigned                  (u)intptr_t
.L3:                      .L8:                      .L15:
slli a5,a4,0x2            slli a5,a4,0x20           lw a5,0(a1)
add a6,a1,a5              srli a5,a5,0x1e           lw a4,4(a2)
add a5,a5,a2              add a6,a1,a5              addi a1,a1,4
lw a3,0(a6)               add a5,a5,a2              addi a2,a2,4
lw a5,0(a5)               lw a3,0(a6)               addw a5,a5,a4
addiw a4,a4,1             lw a5,0(a5)               sw a5,-4(a1)
addw a5,a5,a3             addiw a4,a4,1             bne a2,a3,54 <.L15>
sw a5,0(a6)               addw a5,a5,a3
bge a0,a4,6 <.L3>         sw a5,0(a6)
                          bgeu a0,a4,28 <.L8>
There is no difference between the intptr_t loop body and the
uintptr_t loop. And without -fwrapv, the int loop looks just like the (u)intptr_t loop (because the C compiler then assumes that signed
integer overflow does not happen).
So, if you don't have a specific reason to choose int or unsigned,
better use intptr_t or uintptr_t, respectively. In this way you can circumvent some of the damage that I32LP64 has done.
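Spelled out, the intptr_t variation referred to above looks like this (the int original is shown earlier in the message); with the loop index and bound full-width, no 32->64-bit extension is left in the loop:

#include <stdint.h>

void sext_intptr(intptr_t M, int *ic, int *is)
{
    intptr_t k;
    for (k = 1; k <= M; k++) {
        ic[k] += is[k];
    }
}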
On 20/02/2024 16:17, Terje Mathisen wrote:
Scott Lurndal wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
And this is reflected in more recent architectures: Many architectures since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960's had a flags register with an overflow bit.
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC, as flag registers are a bottleneck for OOO and superscaling, overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
x86 has had an 'O' (Overflow) flags bit since the very beginning
PS. INTO was removed in AMD64, I don't remember exactly what the opcode
was repurposed for?
Scott Lurndal wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960's had a flags register with an overflow bit.
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Not only that, these cpus also had a dedicated single-byte opcode INTO
(hex 0xCE) to allow you to implement exception-style overflow handling
with very little impact on the mainline program, just emit that INTO
opcode directly after any program sequence where the compiler believed
that an overflow which should be handled, might happen.
Terje
PS. INTO was removed in AMD64, I don't remember exactly what the opcode
was repurposed for?
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
Architectures in the 1960's had a flags register with an overflow bit.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
IIRC S/360 has two modes of operation: One where, on signed addition, overflow traps, and one where it sets some flag; and the flag-setting
is not as consistent as say the NZCV flags on modern architectures;
instead, there are two bits that can mean anything at all, depending
on the instruction that sets them. In any case, if you use a program
that checks for overflows, then you either have to change the mode to non-trapping before the addition and change it back afterwards, or all
signed overflows that are not checked explicitly are ignored.
Moreover, addition with carry-in was only added in ESA/390 in 1990.
So they certainly did not expect multi-precision arithmetic or Bignums
before then.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
This BPL: <https://academic.oup.com/comjnl/article/25/3/289/369715>?
Thomas Koenig <tkoenig@netcologne.de> writes:
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Which is why len should have been declared as size_t. A negative
array length is nonsensical.
It was a pure typo from my side. In Rust array indices are always of
"size_t" type, you have to explicitely convert/cast anything else before
you can use it in a lookup:
opcodes[addr as usize]
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
Wouldn't it be better to forbid mixing of signedness?
I don't know
Fortran, so that might be a silly question!
For my C programming, I like to have "gcc -Wconversion -Wsign-conversion -Wsign-compare" to catch unintended mixes of signedness.
Architectures in the 1960's had a flags register with an overflow bit.
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC,
as flag registers are a bottleneck for OOO and
superscaling
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes, >picked to fit the range you need).
targets. If you want to force the issue, then "int64_t" is IMHO clearer
than "long long int" and does not give a strange impression where you
are using a type aimed at pointer arithmetic for general integer arithmetic.
If you want fast local variables, use C's [u]int_fastN_t types. That's
what they are for.
Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so letting the compiler assume that will not happen is a good thing.
(And
it lets you check for overflow bugs using run-time sanitizers.)
scott@slp53.sl.home (Scott Lurndal) writes:
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
architectures. Given the history of C, that's not surprising: BCPL
and B have a single type, the machine word, and it eventually became
C's int. You see this in "int" declarations being optional in various places. So code portable between 16-bit and 32-bit systems could not
assume that int has a specific size (such as 32 bits), but if it
assumed that sizeof(int) == sizeof(int *), that would port fine
between 16-bit and 32-bit Unixes. There may have been C code that
assumed that sizeof(int)==4, but why cater to this kind of code which
did not even port to 16-bit systems?
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
Implementing a flags register in an OoO microarchitecture does require execution resources, however.
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
That's nonsense. People use carry for implementing multi-precision arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.
- anton
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
IIRC S/360 has two modes of operation: One where, on signed addition, overflow traps, and one where it sets some flag; and the flag-setting
is not as consistent as say the NZCV flags on modern architectures;
instead, there are two bits that can mean anything at all, depending
on the instruction that sets them. In any case, if you use a program
that checks for overflows, then you either have to change the mode to non-trapping before the addition and change it back afterwards, or all
signed overflows that are not checked explicitly are ignored.
What I would like is a compiler flag that did "IFF when an int (or unsigned) ends up in a register, promote it to the 'fast' type".
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
    int8_t s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return s;
}
will overflow if called with data = [127, 1, -2], right?
No. In C, int8_t values will be promoted to "int" (which is always
at least 16 bits, on any target) and the operation will therefore
not overflow.
Depending on len and the data...
[...]
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
I only heard about (u)intptr_t long after my first contact with
I32LP64 in 1995. I don't think it existed at the time.
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
I know no implementation of a 64-bit architecture where ALU operations (except maybe division where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
Guess 1: There was more software that depended on sizeof(int)==4 than software that depended on sizeof(int)==sizeof(char *).
Guess 2: When benchmarketing without adapting the source code (as is
usual), I32LP64 produced better numbers than ILP64 for some
benchmarks, because arrays and other data structures with int elements
are smaller and have better cache hit rates.
My guess is that it was a mixture of 1 and 2, with 2 being the
decisive factor.
I have certainly seen a lot of writing about how
64-bit (pointers) hurt performance, and it even led to the x32
nonsense (which never went anywhere, not surprising to me). These
days support for 32-bit applications is eliminated from ARM cores,
another indication that the performance advantages of 32-bit pointers
are minor.
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes,
picked to fit the range you need).
Sure, and then the program might break when an array has more than 2^31 elements; or it might work on one platform and break on another one.
By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason to use something else for a loop counter.
For a type of which you store many in an array or other data
structure, you probably prefer int32_t rather than int_fast32_t if 32
bits is enough.
So I don't see a reason for int_fast32_t etc.
These adapt suitably for different
targets. If you want to force the issue, then "int64_t" is IMHO clearer
than "long long int" and does not give a strange impression where you
are using a type aimed at pointer arithmetic for general integer arithmetic.
Why do you bring up "long long int"?
As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
If you want fast local variables, use C's [u]int_fastN_t types. That's
what they are for.
I don't see a point in those types. What's wrong with (u)intptr_t IYO?
Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so
letting the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.
Anyway, if I have made a mistake in my code, why would letting the
compiler assume that I did not make a mistake be a good thing?
I OTOH prefer if the compiler behaves consistently, so I use -fwrapv,
and for good performance, I write the code appropriately (e.g., by
using intptr_t instead of int).
(And
it lets you check for overflow bugs using run-time sanitizers.)
If the compiler assumes that overflow does not happen, how do these "sanitizers" work?
Anyway, I certainly have code that relies on modulo arithmetic.
On 2/20/24 10:25, David Brown wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"What I would like is a compiler flag that did "IFF when an int (or
sizes, picked to fit the range you need). These adapt suitably for
different targets. If you want to force the issue, then "int64_t" is
IMHO clearer than "long long int" and does not give a strange
impression where you are using a type aimed at pointer arithmetic for
general integer arithmetic.
unsigned)
ends up in a register, promote it to the 'fast' type". This would be great when compiling dusty C decks. (Was there ever C code on punched cards?)
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
Implementing a flags register in an OoO microarchitecture does require execution resources, however.
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
That's nonsense. People use carry for implementing multi-precision arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.
On 20/02/2024 18:47, Anton Ertl wrote:
And support for 32-bit has /not/ been "eliminated from ARM cores". It
may have been eliminated from the latest AArch64 cores - I don't keep
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 18:47, Anton Ertl wrote:
And support for 32-bit has /not/ been "eliminated from ARM cores".
It may have been eliminated from the latest AArch64 cores
The ARMv8 architecture fully supports the A32 and T32 instruction
sets.
Implementations of the architecture can choose not to implement
the A32 and T32 instruction sets. Some ARMv8 implementations
(e.g. Cavium's) never implemented A32 or T32. Many (if not most) ARM implementations of ARMv8 implemented the A32/T32 instruction
sets for EL0 (user-mode) only - I'm not aware of any that
supported A32 at privileged exception levels (EL1, EL2 or EL3).
Some of the more recent ARM neoverse cores support A32/T32 at EL0,
and some of them don't. Cavium's cores were 64-bit only.
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in
microcontrollers
Indeed, and many ARMv8 SoCs include arm 32-bit M-series
microcontrollers on-chip.
On 20/02/2024 20:18, Brian G. Lucas wrote:
On 2/20/24 10:25, David Brown wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"What I would like is a compiler flag that did "IFF when an int (or unsigned)
sizes, picked to fit the range you need). These adapt suitably
for different targets. If you want to force the issue, then
"int64_t" is IMHO clearer than "long long int" and does not give a
strange impression where you are using a type aimed at pointer
arithmetic for general integer arithmetic.
ends up in a register, promote it to the 'fast' type". This would
be great when compiling dusty C decks. (Was there ever C code on
punched cards?)
Well, that will happen to at least some extent (with an optimising compiler), at least as long as the answer is the same in the end. It
can be done a bit more often with "int" rather than "unsigned int", precisely because you promised the compiler that your arithmetic
won't overflow so it does not need to worry about that possibility.
On Wed, 21 Feb 2024 14:34:59 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 18:47, Anton Ertl wrote:
And support for 32-bit has /not/ been "eliminated from ARM cores".
It may have been eliminated from the latest AArch64 cores
The ARMv8 architecture fully supports the A32 and T32 instruction
sets.
Implementations of the architecture can choose not to implement
the A32 and T32 instruction sets. Some ARMv8 implementations
(e.g. Cavium's) never implemented A32 or T32. Many (if not most) ARM
implementations of ARMv8 implemented the A32/T32 instruction
sets for EL0 (user-mode) only - I'm not aware of any that
supported A32 at privileged exception levels (EL1, EL2 or EL3).
W.r.t. Arm Inc. Cortex-A cores that's simply wrong.
All 64-bit Cortex-A cores from the very first two (A53 and A57, 2012)
and up to A75 support A32/T32 at all four exception levels. Of those,
A53 and A55 are still produced and used in huge quantities.
The first Cortex-A 64-bit core that supports aarch32 only at EL0 is Cortex-A76 (2018).
On Wed, 21 Feb 2024 14:34:59 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
David Brown <david.brown@hesbynett.no> writes:
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in
microcontrollers
Indeed, and many ARMv8 SoCs include arm 32-bit M-series
microcontrollers on-chip.
Still, I don't think that the ratio of all ARM cores combined to cores
in smartphone application processors is really hundreds. I'd say,
something between 15 and 30.
On 20/02/2024 19:42, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning,
along with JO and JNO for Jump on Overflow and Jump if Not
Overflow.
Many processors had something similar. But I think they fell out
of fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
I was thinking more in terms of /using/ these flags, rather than ISA
support for them. ISAs would clearly have to keep the flag registers
and the instructions that used them if they wanted to keep
compatibility with 32-bit code.
But I think it was fairly rare to use the "add 32-bit and update
flags" instruction in 32-bit RSIC systems (except for 64-bit
arithmetic), and much rarer to use the "add 64-bit and update flags"
version in 64-bit versions.
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
I think it would mainly be /despite/ having a flag register, rather
than /because/ of it?
Sometimes having flags for overflows, carries, etc., can be very
handy. So having it in the ISA is useful. But I think you would
normally want your code to avoid setting or reading flags.
Implementing a flags register in an OoO microarchitecture does
require execution resources, however.
It would, I think, be particularly cumbersome to track several
parallel actions that all act on the flag register, as it is
logically a shared resource.
Do you think it is an advantage for a RISC architecture to have a
flags register compared to alternatives? Say we want to have a
double-width addition so that "res_hi:res_lo = 0:reg_a + 0:reg_b".
(I hope my pseudo-code is clear enough here.) With flags and an "add
with carry" instruction you could have :
carry = 0;
carry, res_lo = reg_a + reg_b + carry
carry, res_hi = 0 + 0 + carry
Alternatively, you could have a double-register result, at the cost
of having more complex register banks :
res_hi:res_lo = reg_a + reg_b
Or you could have an "add and take the high word" instruction and use
two additions :
res_hi = (reg_a + reg_b) >> N
res_lo = reg_a + reg_b
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
That's nonsense. People use carry for implementing multi-precision arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.
Fair enough, you do want carry (or an equivalent) for big number
work. But I would still contend that the vast majority of integers
and integer arithmetic used in code will fit within 32 bits, and the
vast majority of those that don't, will fit within 64 bits. Once you
go beyond that, you will need lots of bits (such as, as you say, cryptography).
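For what it's worth, here is a minimal portable-C sketch of the double-width addition discussed above, written without any flags register (64-bit unsigned words assumed; the names are illustrative):
    #include <stdint.h>

    /* res_hi:res_lo = 0:a + 0:b; the carry out of the low word is
       recovered with an unsigned comparison instead of a carry flag. */
    void add_double_width(uint64_t a, uint64_t b,
                          uint64_t *res_hi, uint64_t *res_lo)
    {
        uint64_t lo = a + b;      /* wraps modulo 2^64 */
        *res_lo = lo;
        *res_hi = (lo < a);       /* 1 if the addition wrapped, else 0 */
    }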
On Wed, 21 Feb 2024 13:27:23 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 20/02/2024 20:18, Brian G. Lucas wrote:
On 2/20/24 10:25, David Brown wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"What I would like is a compiler flag that did "IFF when an int (or
sizes, picked to fit the range you need). These adapt suitably
for different targets. If you want to force the issue, then
"int64_t" is IMHO clearer than "long long int" and does not give a
strange impression where you are using a type aimed at pointer
arithmetic for general integer arithmetic.
unsigned)
ends up in a register, promote it to the 'fast' type". This would
be great when compiling dusty C decks. (Was there ever C code on
punched cards?)
Well, that will happen to at least some extent (with an optimising
compiler), at least as long as the answer is the same in the end. It
can be done a bit more often with "int" rather than "unsigned int",
precisely because you promised the compiler that your arithmetic
won't overflow so it does not need to worry about that possibility.
In case of array indices I'd replace your "a bit more" by "a lot
more".
If one wants top performance on 64-bit architectures then avoiding
'unsigned int' indices is a very good idea. Hoping that the compiler will
somehow figure out what you meant instead of doing what you wrote is naive.
On Wed, 21 Feb 2024 13:49:41 +0100
David Brown <david.brown@hesbynett.no> wrote:
On OoO, when you are setting flags almost all the time, you are
effectively telling an engine that flags results of the previous
arithmetic instructions are DNC. In theory, it can be used to avoid
majority of updates of flags in PRF. I don't know whether such
optimization is actually done in real HW.
Reading flags can't really be rare, because conditional branches
are among most common instructions in the real-world code.
On 21/02/2024 17:33, Michael S wrote:
On Wed, 21 Feb 2024 13:27:23 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 20/02/2024 20:18, Brian G. Lucas wrote:
On 2/20/24 10:25, David Brown wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"What I would like is a compiler flag that did "IFF when an int (or
sizes, picked to fit the range you need). These adapt suitably
for different targets. If you want to force the issue, then
"int64_t" is IMHO clearer than "long long int" and does not give a
strange impression where you are using a type aimed at pointer
arithmetic for general integer arithmetic.
unsigned)
ends up in a register, promote it to the 'fast' type". This would
be great when compiling dusty C decks. (Was there ever C code on
punched cards?)
Well, that will happen to at least some extent (with an optimising
compiler), at least as long as the answer is the same in the end. It
can be done a bit more often with "int" rather than "unsigned int",
precisely because you promised the compiler that your arithmetic
won't overflow so it does not need to worry about that possibility.
In case of array indices I'd replace your "a bit more" by "a lot
more".
I haven't measured the real-world performance impact (I am more
interested in performance on microcontroller cores). So I'll believe whatever you and the others here say on that!
If one wants top performance on 64-bit architectures then avoiding
'unsigned int' indices is a very good idea. Hoping that the compiler will
somehow figure out what you meant instead of doing what you wrote is
naive.
Indeed.
On Wed, 21 Feb 2024 13:49:41 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 20/02/2024 19:42, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning,
along with JO and JNO for Jump on Overflow and Jump if Not
Overflow.
Many processors had something similar. But I think they fell out
of fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
I was thinking more in terms of /using/ these flags, rather than ISA
support for them. ISAs would clearly have to keep the flag registers
and the instructions that used them if they wanted to keep
compatibility with 32-bit code.
aarch64 is a completely new and incompatible instruction encoding.
But I think it was fairly rare to use the "add 32-bit and update
flags" instruction in 32-bit RSIC systems (except for 64-bit
arithmetic), and much rarer to use the "add 64-bit and update flags"
version in 64-bit versions.
Of course, the main use of flags is conditional branching. That's true
even on 16-bit.
The 2nd common use is conditional move/selection.
Other uses are just bonus, insignificant in The Great Scheme of Things. However I don't think that "bonus" use of flags for bignum and similar
is any rarer on 64-bit machines than on 32-bit.
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
I think it would mainly be /despite/ having a flag register, rather
than /because/ of it?
That's what Mitch thinks. But he has no proof.
Sometimes having flags for overflows, carries, etc., can be very
handy. So having it in the ISA is useful. But I think you would
normally want your code to avoid setting or reading flags.
On OoO, when you are setting flags almost all the time, you are
effectively telling an engine that flags results of the previous
arithmetic instructions are DNC. In theory, it can be used to avoid
majority of updates of flags in PRF. I don't know whether such
optimization is actually done in real HW.
Reading flags can't really be rare, because conditional branches
are among most common instructions in the real-world code.
Implementing a flags register in an OoO microarchitecture does
require execution resources, however.
It would, I think, be particularly cumbersome to track several
parallel actions that all act on the flag register, as it is
logically a shared resource.
Do you think it is an advantage for a RISC architecture to have a
flags register compared to alternatives? Say we want to have a
double-width addition so that "res_hi:res_lo = 0:reg_a + 0:reg_b".
(I hope my pseudo-code is clear enough here.) With flags and an "add
with carry" instruction you could have :
carry = 0;
carry, res_lo = reg_a + reg_b + carry
carry, res_hi = 0 + 0 + carry
That's not how we do it.
We just use normal add for the first instruction.
Alternatively, you could have a double-register result, at the cost
of having more complex register banks :
res_hi:res_lo = reg_a + reg_b
Or you could have an "add and take the high word" instruction and use
two additions :
res_hi = (reg_a + reg_b) >> N
res_lo = reg_a + reg_b
This variant does not scale to longer additions. Producing a carry, i.e.
an instruction having effectively two outputs, is the smaller part of the
advantage of a flags-based scheme. The bigger part is consuming carry, i.e.
having effectively three inputs. In order to see it, you have to think
about the triple-width (or wider) case.
The words in between the first and the last are the most challenging for
MIPS/Alpha/RISC-V, where they need 5 instructions vs 1 instruction on
x86/Arm/SPARC.
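To make the 5-vs-1 point concrete, here is a sketch in plain C of what one middle word of a multi-word addition costs when there is no flags register (each statement corresponds roughly to one instruction on MIPS/Alpha/RISC-V; on x86/Arm/SPARC the whole body is a single add-with-carry instruction; the names are illustrative):
    #include <stdint.h>

    /* One middle word: consume a carry-in, produce the sum word and a carry-out. */
    uint64_t add_middle_word(uint64_t a, uint64_t b,
                             uint64_t carry_in, uint64_t *carry_out)
    {
        uint64_t t  = a + b;            /* 1: add the two words            */
        uint64_t c1 = (t < a);          /* 2: carry out of that add        */
        uint64_t r  = t + carry_in;     /* 3: add the incoming carry       */
        uint64_t c2 = (r < t);          /* 4: carry out of adding carry-in */
        *carry_out  = c1 | c2;          /* 5: combine the two carries      */
        return r;
    }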
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
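As a side note, the GCC/Clang __builtin_add_overflow mentioned earlier in the thread can be wrapped in the same kind of inline function; a minimal sketch (the wrapper name is illustrative, not standard C):
    #include <stdbool.h>

    /* Returns true if a+b would overflow int; otherwise stores the sum in
       *res and returns false.  Relies on the GCC/Clang builtin. */
    static inline bool checked_add_int(int a, int b, int *res)
    {
        int tmp;
        if (__builtin_add_overflow(a, b, &tmp))
            return true;
        *res = tmp;
        return false;
    }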
I know no implementation of a 64-bit architecture where ALU operations (except maybe division where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
I know no implementation of a 64-bit architecture where ALU operations
(except maybe division where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
A guess: people did not want sizeof(float) != sizeof(float).
float
is certainly faster than double.
It would also have broken Fortran, where storage association rules mean
that both REAL and INTEGER have to have the same size, and DOUBLE
PRECISION twice that. Breaking that would have invalidated just
about every large scientific program at the time.
On Wed, 21 Feb 2024 13:49:41 +0100
David Brown <david.brown@hesbynett.no> wrote:
[...]
However I don't think that "bonus" use of flags for bignum and similar
is any rarer on 64-bit machines than on 32-bit.
Sometimes having flags for overflows, carries, etc., can be very
handy. So having it in the ISA is useful. But I think you would
normally want your code to avoid setting or reading flags.
On OoO, when you are setting flags almost all the time, you are
effectively telling an engine that flags results of the previous
arithmetic instructions are DNC. In theory, it can be used to avoid
majority of updates of flags in PRF. I don't know whether such
optimization is actually done in real HW.
It would, I think, be particularly cumbersome to track several
parallel actions that all act on the flag register, as it is
logically a shared resource.
Still, as I said above, that just a bonus rather than a major reason to
have flags.
Michael S <already5chosen@yahoo.com> writes:
On ARM cores the number of physical flags registers is roughly 1/3 of
the number of physical integer registers (46 vs. 147 on A710, 39
vs. 120 on Neoverse N1/A76
<https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>,
but I guess that this is due to being able to suppress flags updates
rather than ignoring those that are overwritten without being read.
That assumes that all of A32, T32, and A64 have the ability to
suppress flags updates. Do they?
Michael S <already5chosen@yahoo.com> writes:
On Wed, 21 Feb 2024 13:49:41 +0100
David Brown <david.brown@hesbynett.no> wrote:
[...]
However I don't think that "bonus" use of flags for bignum and
similar is any rarer on 64-bit machines than on 32-bit.
I think Bignums are more common on 64-bit machines.
I also think that general-purpose computers do more cryptography
(and multi-precision arithmetic for that) than embedded computers.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
On ARM cores the number of physical flags registers is roughly 1/3 of
the number of physical integer registers (46 vs. 147 on A710, 39
vs. 120 on Neoverse N1/A76
<https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>,
but I guess that this is due to being able to suppress flags updates
rather than ignoring those that are overwritten without being read.
That assumes that all of A32, T32, and A64 have the ability to
suppress flags updates. Do they?
A32, T32 and A64 have a bit in the instruction word that
specifies whether the flags should be updated. T16
only updates flags when the instruction is in an if-then (IT)
block.
I also think that general-purpose computers do more cryptography
(and multi-precision arithmetic for that) than embedded computers.
Hm. That may depend on whether you are comparing absolute numbers or proportions of work-loads.
Cryptography is very important for many
embedded systems (telecom, automatic teller machines, point-of-sale terminals, etc.) Last I looked (several years ago), there were several 8051-based chips (small 8-bit processors) on the market with dedicated on-chip HW accelerators for cryptography.
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 18:47, Anton Ertl wrote:
And support for 32-bit has /not/ been "eliminated from ARM cores". It
may have been eliminated from the latest AArch64 cores
The ARMv8 architecture fully supports the A32 and T32 instruction sets.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
On ARM cores the number of physical flags registers is roughly 1/3
of the number of physical integer registers (46 vs. 147 on A710, 39
vs. 120 on Neoverse N1/A76
<https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>,
but I guess that this is due to being able to suppress flags updates
rather than ignoring those that are overwritten without being read.
That assumes that all of A32, T32, and A64 have the ability to
suppress flags updates. Do they?
A32, T32 and A64 have a bit in the instruction word that
specifies whether the flags should be updated. T16
only updates flags when the instruction is in an if-then (IT)
block.
What is T16? Google does not give me anything appropriate for "ARM
T16"? Do you mean the 16-bit-encoded instructions in T32?
- anton
Terje Mathisen wrote:
If we have a platform where the default integer size is 32 bits
and long is 64 bits, then afaik the C promotion rules will use
int as the accumulator size, right?
Not necessarily:: accumulation rules allow the promotion of
int->long inside a loop
if the long is smashed back to int immediately after the loop
terminates.
David Brown wrote:
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e gripping) hand you could have a language like
Java where it would be illegal to transform a temporarily
trapping loop into one that would not trap and give the
mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct
answer"?
If you are talking about integer arithmetic, the limited integers
in Java have modulo semantics, i.e., they don't trap, and
BigIntegers certainly don't trap.
If you are talking about FP (like I did), by default FP addition
does not trap in Java, and any mention of "mathematically
correct" in connection with FP needs a lot of further
elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few
things here that stand out to me - at least as long as we are
talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits
and long is 64 bits, then afaik the C promotion rules will use int
as the accumulator size, right?
What I was trying to illustrate was the principle that by having a
wider accumulator you could aggregate a series of numbers, both
positive and negative, and get the correct (in-range) result, even
if the input happened to be arranged in such a way that it would
temporarily overflow the target int type.
I think it is much better to do it this way and then get a
conversion size trap at the very end when/if the final sum is in
fact too large for the result type.
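A minimal C sketch of that approach, with the accumulator widened explicitly rather than relying on the promotion rules (the names and the 32/64-bit split are illustrative):
    #include <stddef.h>
    #include <stdint.h>

    /* Sum 32-bit values in a 64-bit accumulator; intermediate sums may go
       outside the 32-bit range as long as the final result fits.  The check
       at the end is where a conversion-size trap or error would belong. */
    int32_t sum32(const int32_t *data, size_t len)
    {
        int64_t s = 0;
        for (size_t i = 0; i < len; i++)
            s += data[i];
        if (s > INT32_MAX || s < INT32_MIN) {
            /* handle the out-of-range final sum here */
        }
        return (int32_t) s;
    }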
On 20/02/2024 19:42, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning, along with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
I was thinking more in terms of /using/ these flags, rather than ISA
support for them. ISAs would clearly have to keep the flag registers
and the instructions that used them if they wanted to keep compatibility
with 32-bit code.
But I think it was fairly rare to use the "add 32-bit and update flags"
instruction in 32-bit RISC systems (except for 64-bit arithmetic), and
much rarer to use the "add 64-bit and update flags" version in 64-bit
versions.
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
I think it would mainly be /despite/ having a flag register, rather than /because/ of it?
Sometimes having flags for overflows, carries, etc., can be very handy.
So having it in the ISA is useful. But I think you would normally want
your code to avoid setting or reading flags.
It would, I think, be particularly cumbersome to track several parallel
actions that all act on the flag register, as it is logically a shared
resource.
Do you think it is an advantage for a RISC architecture to have a flags
register compared to alternatives?
That's nonsense. People use carry for implementing multi-precision
arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.
Fair enough, you do want carry (or an equivalent) for big number work.
But I would still contend that the vast majority of integers and integer arithmetic used in code will fit within 32 bits, and the vast majority
of those that don't, will fit within 64 bits.
Once you go beyond that,
you will need lots of bits (such as, as you say, cryptography).
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:...
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
Sorry, I meant ptrdiff_t, which was used for pointer math.
Tim Rentsch wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
Missing my point:: which was::
The summation loop will not overflow, and overflow is detected at
the smash from int to int8_t.
On 20/02/2024 18:47, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
Another possible reason is that it is very useful to have integer types
with sizes 1, 2, 4 and 8 bytes. C doesn't have many standard integer
types, so if "int" is 64-bit, you have "short" as either 16-bit and have
no 32-bit type, or "short" is 32-bit and you have no 16-bit type. With
32-bit "int", it's easy to have each size without having to add extended
integer types or add new standard integer types (like "short short int"
for 16-bit and "short int" for 32-bit).
I saw benchmarks showing x32 being measurably faster,
but it's not
unlikely that the differences got less with more modern x86-64
processors (with bigger caches)
and it's simply not worth the effort
having another set of libraries and compiler targets just to make some
kinds of code marginally faster.
And support for 32-bit has /not/ been "eliminated from ARM cores".
It
may have been eliminated from the latest AArch64 cores - I don't keep
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers
and embedded systems.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes, >>> picked to fit the range you need).
Sure, and then the program might break when an array has more the 2^31
elements; or it might work on one platform and break on another one.
You need to pick the appropriate size for your data, as I said.
By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason why to use
something else for a loop counter.
I like my types to say what they mean. "uintptr_t" says "this object
holds addresses for converting pointer values back and forth with an
integer type".
Nonetheless, it is good design to use appropriate type names for
appropriate usage. This makes the code clearer, and increases portability.
As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
And that is why it makes sense to use the "fast" types. If you need a
16-bit range, use "int_fast16_t". It will be 64-bit on 64-bit systems, 32-bit on 32-bit systems, and 16-bit on 16-bit and 8-bit systems -
always supporting the range you need, as fast as possible.
Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so
letting the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.
Sorry, that comment went over my head - I don't know what
"demonstration" you are referring to.
For comparison to C, look at the Zig language. (It's not a language I
have used or know in detail, but I know a little about it.) Unsigned
integer arithmetic overflow is UB in Zig, just like signed integer
arithmetic overflow. There are standard options and block settings
(roughly equivalent to pragmas) to control whether these give a run-time error, or are assumed never to happen (for optimisation purposes). And
if you want to add "x" and "y" with wrapping, you use "x +% y" as an
explicit choice of operator.
That seems to me to be the right approach for an efficient high-level language.
If you want "int z = x + y;" with wrapping, write :
int z = (unsigned) x + (unsigned) y;
And even then, I always put it in a
pragma so that the code works even if someone uses different compiler flags.
There weren't instructions to do add or subtract with
carry but it was pretty easy to fake by doing a branch on no carry
around an instruction to add or subtract 1.
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
I am normally writing Rust these days, where UB is far less common,
but casts like this are mandatory.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
On ARM cores the number of physical flags registers is roughly 1/3 of
the number of physical integer registers (46 vs. 147 on A710, 39
vs. 120 on Neoverse N1/A76
<https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>,
but I guess that this is due to being able to suppress flags updates
rather than ignoring those that are overwritten without being read.
That assumes that all of A32, T32, and A64 have the ability to
suppress flags updates. Do they?
A32, T32 and A64 have a bit in the instruction word that
specifies whether the flags should be updated. T16
only updates flags when the instruction is in an if-then (IT)
block.
What is T16? Google does not give me anything appropriate for "ARM
T16"? Do you mean the 16-bit-encoded instructions in T32?
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:...
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
Sorry, I meant ptrdiff_t, which was used for pointer math.
I have seen little code that uses ptrdiff_t; quite a bit that used
size_t (the unsigned brother of ptrdiff_t). But my memory tells me
that even size_t was not very widespread in 1995.
John Levine <johnl@taugh.com> writes:
There weren't instructions to do add or subtract with
carry but it was pretty easy to fake by doing a branch on no carry
around an instruction to add or subtract 1.
That is sufficient for double-word arithmetic, but not for multi-word arithmetic. ESA/390 adds addition-with-carry-in.
On 18/02/2024 05:01, Tim Rentsch wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
Of course the conversion will be done implicitly. C converts almost
anything implicitly. Not that this is its greatest feature.
The explicit cast is still useful: 1/ to express intent (it shows that
the potential loss of data is intentional) and then 2/ to avoid
compiler warnings (if you enable -Wconversion, which I usually
recommend) or warning from any serious static analyzer too (which I
highly recommend using too).
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
I suspect most experienced C programs know that.
Yet, the 'superfluous' cast is also documentation that the
programmer _intended_ that the return value would be narrowed.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
...
Sorry, I meant ptrdiff_t, which was used for pointer math.
I have seen little code that uses ptrdiff_t; quite a bit that used
size_t (the unsigned brother of ptrdiff_t). But my memory tells me
that even size_t was not very widespread in 1995.
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
The next question would be how to do the same for multiplication....
According to the IBM manuals, add with carry was added in z/Series but
you could use it in S/390 mode on a z machine, so I guess someone
really wanted it for some existing code.
Second idea is to compute a double-width product,
or at least part of one, using standard multiple-precision
arithmetic, and speed compare against the floating-point method.
S>D ( n -- d ) sign-extends a single-cell to a double-cell.
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
I know no implementation of a 64-bit architecture where ALU operations
(except maybe division where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
A guess: people did not want sizeof(float) != sizeof(float).
I assume that you mean that people wanted sizeof(int)==sizeof(float).
Why would they want that? That certainly did not hold on the PDP-11
and many other 16-bit systems where sizeof(int)==2 and
sizeof(float)==4.
float
is certainly faster than double.
On the 21064 or MIPS R4000 (the first 64-bit systems after those by
Cray)? I am pretty sure that FP addition, subtraction and
multiplication have the same speed on these CPUs in binary32 and
binary64.
It would also have broken Fortran, where storage association rules mean
that both REAL and INTEGER have to have the same size, and DOUBLE
PRECISION twice that. Breaking that would have invalidated just
about every large scientific program at the time.
C compilers choose their types according to their rules, and Fortran
chooses its types according to its rules. I don't see what C's int
type has to do with Fortran's INTEGER type.
And if the rules you
specify above mean that Fortran on the PDP-11 has a 4-byte INTEGER
type, there is already precedent for int having a different size from INTEGER.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
...
Sorry, I meant ptrdiff_t, which was used for pointer math.
I have seen little code that uses ptrdiff_t; quite a bit that used
size_t (the unsigned brother of ptrdiff_t). But my memory tells me
that even size_t was not very widespread in 1995.
In 1995 a problem with both size_t and ptrdiff_t is that there
were no corresponding length modifiers for those types in
printf() format conversions (corrected in C99).
John Levine <johnl@taugh.com> writes:
According to the IBM manuals, add with carry was added in z/Series but
you could use it in S/390 mode on a z machine, so I guess someone
really wanted it for some existing code.
Interestingly,
<https://en.wikibooks.org/wiki/360_Assembly/360_Instructions> lists
ALC, ALCR, SLB and SLBR as belonging to the 390 instructions, and only
ALCG, ALCGR, SLBG and SLBGR (the 64-bit variants) as Z instructions.
If they added ALC, ALCR, SLB and SLBR only in Z (but in the S/390
mode), that is counterevidence for the claim that add-with-carry is
less important for 64-bit systems than for 32-bit systems.
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
Heavens to Betsy! Are you impugning the quality and excellence
of my code? Of *my* code? I can only hope that you are suitably
chagrined and contrite. ;)
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
Rather that do that I will explain.
An addition overflows if the two operands have the same sign and
the sign of an operand is the opposite of the sign of the sum
(taken mod the width of the operands). Convert the signed
operands to their unsigned counterparts, and form the sum of the
unsigned values. The sign is just the high-order bit in each
case. Thus the overflow condition can be detected with a few
bitwise xors and ands.
Subtraction is similar except now overflow can occur only when
the operands have different signs and the sign of the sum is
the opposite of the sign of the first operand.
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using C preprocessor conditional, as for example
#if UINT_MAX > INT_MAX && INT_MIN == -INT_MAX - 1
// this case is the one outlined above
#elif UINT_MAX > INT_MAX && INT_MIN == -INT_MAX
#elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX - 1
#elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX
Does that all make sense?
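A minimal C rendering of the addition check described above, for the common case of two's-complement int with UINT_MAX > INT_MAX (the function name is illustrative):
    #include <limits.h>
    #include <stdbool.h>

    static inline bool add_overflows(int a, int b)
    {
        unsigned ua = (unsigned) a, ub = (unsigned) b;
        unsigned us = ua + ub;                 /* wraps mod 2^N, well defined */
        unsigned sign = ~(UINT_MAX >> 1);      /* mask for the high-order bit */
        /* overflow iff the operands have the same sign and the sum's sign
           differs from theirs */
        return (~(ua ^ ub) & (ua ^ us) & sign) != 0;
    }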
The next question would be how to do the same for multiplication....
Multiplication is a whole other ball game. First we need to
consider only the widest types, because narrower types can be
carried out in a wider type and the resulting product value
checked. Off the top of my head, for the widest types I would
try converting to float or double, do a floating-point multiply,
and do some trivial accepts and trivial rejects based on the
exponent of the result. Any remaining cases would need more
care, but probably (we hope!) there aren't many of those and they
don't happen very often. So for what it's worth there is my
first idea. Second idea is to compute a double-width product,
or at least part of one, using standard multiple-precision
arithmetic, and speed compare against the floating-point method.
I better stop now or the ideas will probably get worse rather
than better. :/
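For the easy case mentioned above, where the operands are narrower than the widest type, the product can simply be computed in a wider type and checked; a minimal sketch for 32-bit operands with a 64-bit product (names are illustrative):
    #include <stdbool.h>
    #include <stdint.h>

    static inline bool mul_overflows32(int32_t a, int32_t b)
    {
        int64_t p = (int64_t) a * (int64_t) b;   /* exact double-width product */
        return p > INT32_MAX || p < INT32_MIN;
    }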
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
I am normally writing Rust these days, where UB is far less common,
but casts like this are mandatory.
Oh. I didn't know that about Rust. Interesting.
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
understand that it has the support of community sentiment, but
even so it seems like a poor choice here. I'm not a big fan of
the new attribute syntax, but a form like
return [[narrow]] s;
looks to be a better way of asking Rust to allow what is a
normally disallowed conversion. By contrast, using a cast is
overkill. There is unnecessary redundancy, by specifying a type
in two places, and the risk that they might get out of sync. And
on general principles requiring a cast violates good security
principles. If someone needs access to a particular room in a
building, we don't hand over a master key that opens every room
in the building. If someone needs to read some documents that
have classified materials, we don't give them an access code that
lets them read any sensitive material regardless of whether it's
relevant. Maybe Rust is different, but in C a cast allows any
conversion that is possible in the language, even the unsafe
ones. It just seems wrong to use the nuclear option of casting
for every minor infringement.
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all such
casts are in fact legal, i.e. the data will fit in the target container.
This is of course the total opposite of the C "nuclear option", and much more like other languages that try to be secure by default.
Terje Mathisen wrote:
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all
such casts are in fact legal, i.e. the data will fit in the target container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
Just to make sure, I spoke to our resident Rust guru, and he told me I
was wrong:
Rust does have conversion operators/functions for downsizing
variables, and they come with full validity checking, but using "s as
u8" as I suggested will generate exactly the same code as a C
"(uint8_t) s" idiom, i.e. no verification and no safety checks.
Terje
On Tue, 27 Feb 2024 13:50:43 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Terje Mathisen wrote:
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all
such casts are in fact legal, i.e. the data will fit in the target
container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
Just to make sure, I spoke to our resident Rust guru, and he told me I
was wrong:
Rust does have conversion operators/functions for downsizing
variables, and they come with full validity checking, but using "s as
u8" as I suggested will generate exactly the same code as a C
"(uint8_t) s" idiom, i.e. no verification and no safety checks.
Terje
Pretty much in the spirit of Ada's Unchecked_Conversion construct, but with a
less striking visual hint of doing something unusual and potentially dangerous.
Michael S wrote:
On Tue, 27 Feb 2024 13:50:43 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Terje Mathisen wrote:
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all
such casts are in fact legal, i.e. the data will fit in the target
container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
Just to make sure, I spoke to our resident Rust guru, and he told me I
was wrong:
Rust does have conversion operators/functions for downsizing
variables, and they come with full validity checking, but using "s as
u8" as I suggested will generate exactly the same code as a C
"(uint8_t) s" idiom, i.e. no verification and no safety checks.
Terje
Pretty much in the spirit of Ada's Unchecked_Conversion construct, but with a
less striking visual hint of doing something unusual and potentially
dangerous.
No more dangerous than::
if( c >= 'A' and c <= 'Z' ) c -= 'A'-'a';
or
if( table[c] & CAPS ) c -='A'-'a';
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because
uintptr_t had been used extensively rather than assumptions
about
sizeof(int) == sizeof(int *).
...
Sorry, I meant ptrdiff_t, which was used for pointer math.
I have seen little code that uses ptrdiff_t; quite a bit that
used size_t (the unsigned brother of ptrdiff_t). But my memory
tells me that even size_t was not very widespread in 1995.
In 1995 a problem with both size_t and ptrdiff_t is that there
was no standard printf conversion for printing them.
Calling it a "problem" is overstating the case. It was
straightforward enough, if not completely portable, to
use the appropriate number of 'l' modifiers.
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
Second idea is to compute a double-width product,
or at least part of one, using standard multiple-precision
arithmetic, and speed compare against the floating-point method.
What "standard multiple-precision arithmetic" is there in C? I am
not aware of any.
If you have widening multiplication in the language, things are
trivial. [...]
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
I am normally writing Rust these days, where UB is far less common,
but casts like this are mandatory.
Oh. I didn't know that about Rust. Interesting.
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all such
casts are in fact legal, i.e. the data will fit in the target
container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
understand that it has the support of community sentiment, but
even so it seems like a poor choice here. I'm not a big fan of
the new attribute syntax, but a form like
return [[narrow]] s;
looks to be a better way of asking Rust to allow what is a
normally disallowed conversion. By contrast, using a cast is
overkill. There is unnecessary redundancy, by specifying a type
in two places, and the risk that they might get out of sync. And
on general principles requiring a cast violates good security
principles. If someone needs access to a particular room in a
building, we don't hand over a master key that opens every room
in the building. If someone needs to read some documents that
have classified materials, we don't give them an access code that
lets them read any sensitive material regardless of whether it's
relevant. Maybe Rust is different, but in C a cast allows any
conversion that is possible in the language, even the unsafe
ones. It just seems wrong to use the nuclear option of casting
for every minor infringement.
I agree, if Rust did it like C, then it would be very unsafe indeed.
I have not checked the generated asm, but I believe that if I write
code like this:
// x:u64
x = 0x1234567890abcdef;
let y:u8 = (x & 255) as u8;
the compiler will see the mask and realize that the conversion is
safe, so no need to interpose a
cmp x,256
jae trp_conversion
idiom.
OTOH, I have seen C compilers that insist on such a test at the
end of a fully saturated switch statement, even when the mask in
front should prove that no other values are possible.
Terje Mathisen wrote:
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all such
casts are in fact legal, i.e. the data will fit in the target
container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
Just to make sure, I spoke to our resident Rust guru, and he told me I
was wrong:
Rust does have conversion operators/functions for downsizing
variables, and they come with full validity checking, but using "s as
u8" as I suggested will generate exactly the same code as a C
"(uint8_t) s" idiom, i.e. no verification and no safety checks.
Tim Rentsch wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
Heavens to Betsy! Are you impugning the quality and excellence
of my code? Of *my* code? I can only hope that you are suitably
chagrined and contrite. ;)
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
Rather than do that I will explain.
An addition overflows if the two operands have the same sign and
the sign of an operand is the opposite of the sign of the sum
(taken mod the width of the operands). Convert the signed
operands to their unsigned counterparts, and form the sum of the
unsigned values. The sign is just the high-order bit in each
case. Thus the overflow condition can be detected with a few
bitwise xors and ands.
Subtraction is similar except now overflow can occur only when
the operands have different signs and the sign of the sum is
the opposite of the sign of the first operand.
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using a C preprocessor conditional, as for example
#if UINT_MAX > INT_MAX && INT_MIN == -INT_MAX - 1
// this case is the one outlined above
#elif UINT_MAX > INT_MAX && INT_MIN == -INT_MAX
#elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX - 1
#elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX
Does that all make sense?
The next question would be how to do the same for multiplication....
Multiplication is a whole other ball game. First we need to
consider only the widest types, because narrower types can be
carried out in a wider type and the resulting product value
checked. Off the top of my head, for the widest types I would
try converting to float or double, do a floating-point multiply,
and do some trivial accepts and trivial rejects based on the
exponent of the result. Any remaining cases would need more
care, but probably (we hope!) there aren't many of those and they
don't happen very often. So for what it's worth there is my
first idea. Second idea is to compute a double-width product,
or at least part of one, using standard multiple-precision
arithmetic, and speed compare against the floating-point method.
I better stop now or the ideas will probably get worse rather
than better. :/
Your floating point method is pretty bad, imho, since it can give
you both false negatives and false positives, with no way to know
for sure, except doing it all over again.
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
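[Editorial illustration: the code is elided above, but the usual shape of a
64x64->128 unsigned multiply built from 32-bit parts is something like the
following sketch (mine, not Terje's actual routine).]

#include <stdint.h>

/* Unsigned 64x64 -> 128 multiply using 32-bit halves held in u64s. */
static void umul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;

    /* Fold the two middle partial products in, collecting the carries
       that cross the 64-bit boundary. */
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffu) + (hl & 0xffffffffu);

    *lo = (ll & 0xffffffffu) | (mid << 32);
    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}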
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like multiplicative hashes...
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which
hopefully will be considered in the next J3 meeting, it can be
found at https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
Illegal, hence an implementation would be free to trap or start
World War III (with a bit of an expectation that compilers would
trap when supplied with the right options).
My expectation is different: It would then be treated like signed
overflow, which is also illegal in Fortran. So, everybody will
implement it as if it were modular 2^n anyway, plus start optimizing
on the assumption that overflow cannot happen.
And, since in Fortran, arrays can start at arbitrary lower bounds
(an array can have a lower bound of -42 and an upper bound of -21,
for example), the use of unsigned integers for array indices is
somewhat less than in programming languages such as C or (I believe)
Rust where they always start at zero.
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
Heavens to Betsy! Are you impugning the quality and excellence
of my code? Of *my* code? I can only hope that you are suitably
chagrined and contrite. ;)
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
Rather than do that I will explain.
An addition overflows if the two operands have the same sign and
the sign of an operand is the opposite of the sign of the sum
(taken mod the width of the operands). Convert the signed
operands to their unsigned counterparts, and form the sum of the
unsigned values. The sign is just the high-order bit in each
case. Thus the overflow condition can be detected with a few
bitwise xors and ands.
Subtraction is similar except now overflow can occur only when
the operands have different signs and the sign of the sum is
the opposite of the sign of the first operand.
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using a C preprocessor conditional, as for example
On Mon, 11 Mar 2024 18:19:19 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which
hopefully will be considered in the next J3 meeting, it can be
found at https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
Illegal, hence an implementation would be free to trap or start
World War III (with a bit of an expectation that compilers would
trap when supplied with the right options).
So, speaking in C Standard language, UB.
My expectation is different: It would then be treated like signed
overflow, which is also illegal in Fortran. So, everybody will
implement it as if it were modular 2^n anyway, plus start optimizing
on the assumption that overflow cannot happen.
Yes, I'd expect the same.
And, since in Fortran, arrays can start at arbitrary lower bounds
(an array can have a lower bound of -42 and an upper bound of -21,
for example), the use of unsigned integers for array indices is
somewhat less than in programming languages such as C or (I believe)
Rust where they always start at zero.
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t. Now, opinions on which of the 3 is most
preferable, tend to vary.
What is the size of Fortran's default UNSIGNED ?
On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 11 Mar 2024 18:19:19 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which
hopefully will be considered in the next J3 meeting, it can be
found at https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
Illegal, hence an implementation would be free to trap or start
World War III (with a bit of an expectation that compilers would
trap when supplied with the right options).
So, speaking in C Standard language, UB.
Yes, that would be the translation. In Fortran terms, it would
violate a "shall" directive.
My expectation is different: It would then be treated like signed
overflow, which is also illegal in Fortran. So, everybody will
implement it as if it were modular 2^n anyway, plus start optimizing
on the assumption that overflow cannot happen.
Yes, I'd expect the same.
And, since in Fortran, arrays can start at arbitrary lower bounds
(an array can have a lower bound of -42 and an upper bound of -21,
for example), the use of unsigned integers for array indices is
somewhat less than in programming languages such as C or (I believe)
Rust where they always start at zero.
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under
IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t. Now, opinions on which of the 3 is most
preferable, tend to vary.
What is the size of Fortran's default UNSIGNED ?
It is not yet in the language; a paper has been passed by J3,
but it needs to be put to WG5, and WG5 has to agree that J3 should
put it into the standard proper for Fortran 202y (202x just
came out as Fortran 2023).
But if it does go in, it is likely that it will have the same
size as INTEGER, which is usually 32 bits.
However, what I did put in the paper (and what the subsequent
revision by a J3 subcommittee left in) is a prohibition against
using unsigneds in a DO loop. The reason is semantics of
negative strides.
Currently, in Fortran, the number of iterations of the loop
do i=m1,m2,m3
....
end do
is (m2-m1+m3)/m3 unless that value is negative, in which case it
is zero (m3 defaults to 1 if it is not present).
So,
do i=1,3,-1
will be executed zero times, as will
do i=3,1
Translating that into arithmetic with unsigned integers makes
little sense, how many times should
do i=1,3,4294967295
be executed?
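[Editorial illustration, not part of the proposal: the trip-count formula in
C, and how it turns ambiguous once the arithmetic wraps.  Evaluated exactly,
(3 - 1 + 4294967295) / 4294967295 is 1; evaluated modulo 2^32 the numerator
wraps to 1 and the count becomes 0.]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* DO i = 1, 3, -1 with signed arithmetic: max((m2-m1+m3)/m3, 0) trips */
    int32_t m1 = 1, m2 = 3, m3 = -1;
    int32_t trips = (m2 - m1 + m3) / m3;        /* 1 / -1 = -1 */
    if (trips < 0) trips = 0;                   /* zero-trip loop */

    /* the same bounds with a 32-bit unsigned "stride" of 4294967295 */
    uint32_t u1 = 1, u2 = 3, u3 = 4294967295u;
    uint32_t utrips = (u2 - u1 + u3) / u3;      /* numerator wraps to 1, so 0 */

    printf("signed trips: %d, unsigned trips: %u\n", trips, utrips);
    return 0;
}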
Thomas Koenig wrote:
However, what I did put in the paper (and what the subsequent
revision by a J3 subcommittee left in) is a prohibition against
using unsigneds in a DO loop. The reason is semantics of
negative strides.
Currently, in Fortran, the number of iterations of the loop
do i=m1,m2,m3
....
end do
is (m2-m1+m3)/m3 unless that value is negative, in which case it
is zero (m3 defaults to 1 if it is not present).
So,
do i=1,3,-1
will be executed zero times, as will
do i=3,1
Translating that into arithmetic with unsigned integers makes
little sense, how many times should
do i=1,3,4294967295
be executed?
3-1+4294967295 = 4294967297 // (m2-m1+m3)
4294967297 / 4294967295 = 1.0000000004656612874161594750863
So the loop should be executed one time. {{And yes, I know 4294967297 ==
0x1,0000,0001.}} What would you expect on a 36-bit machine (2s-complement)
where 4294967295 is representable naturally?
On 2024-02-25, Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using C preprocessor conditional, as for example
...
but that's implementation-defined behavior, correct?
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under
IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
There is a small gotcha if either of the inputs are of the 0x80000000
form, i.e. MININT, but the naive iabs() conversion will do the right
thing by leaving the input unchanged.
At the other end there cannot be any issues since restoring a negative
output sign cannot overflow/fail.
Terje
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
There is a small gotcha if either of the inputs are of the 0x80000000
form, i.e. MININT, but the naive iabs() conversion will do the right
thing by leaving the input unchanged.
At the other end there cannot be any issues since restoring a negative
output sign cannot overflow/fail.
Michael S <already5chosen@yahoo.com> writes:
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under
IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t.
If Fortran makes unsigned overflow illegal, Fortran compilers can
perform the same shenanigans for unsigned that C compilers do for
signed integers; so if signed int really is preferable because of
these shenanigans, unsigned with the same shenanigans would be
preferable, too.
On 2024-02-25, Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
[...]
Heavens to Betsy! Are you impugning the quality and excellence
of my code? Of *my* code? I can only hope that you are suitably
chagrined and contrite. ;)
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
Rather than do that I will explain.
An addition overflows if the two operands have the same sign and
the sign of an operand is the opposite of the sign of the sum
(taken mod the width of the operands). Convert the signed
operands to their unsigned counterparts, and form the sum of the
unsigned values. The sign is just the high-order bit in each
case. Thus the overflow condition can be detected with a few
bitwise xors and ands.
Subtraction is similar except now overflow can occur only when
the operands have different signs and the sign of the sum is
the opposite of the sign of the first operand.
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using a C preprocessor conditional, as for example
...
but that's implementation-defined behavior, correct?
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Michael S <already5chosen@yahoo.com> writes:
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under
IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t.
If Fortran makes unsigned overflow illegal, Fortran compilers can
perform the same shenanigans for unsigned that C compilers do for
signed integers; so if signed int really is preferable because of
these shenanigans, unsigned with the same shenanigans would be
preferable, too.
One problem is that, without 2^n modulo, something like a
multiplicative hash would be illegal.
People would do it anyway, ignoring the prohibition, because it
is so useful, and subsequent hilarity will ensue.
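[Editorial illustration of the multiplicative-hash point, not code from the
thread: a Fibonacci-style hash depends precisely on the multiply being
reduced mod 2^32.]

#include <stdint.h>

/* Map a key into a table of 2^table_bits slots (table_bits in 1..32).
   The constant is floor(2^32 / golden ratio); the multiply is meant to
   wrap, which is exactly the modular behaviour under discussion. */
static inline uint32_t fib_hash32(uint32_t key, unsigned table_bits)
{
    return (key * 2654435769u) >> (32 - table_bits);
}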
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
There is a small gotcha if either of the inputs are of the 0x80000000
form, i.e. MININT, but the naive iabs() conversion will do the right
thing by leaving the input unchanged.
At the other end there cannot be any issues since restoring a negative
output sign cannot overflow/fail.
It isn't quite that simple. Some of what you describe has a risk
of running afoul of implementation-defined behavior or undefined
behavior (as for example abs( INT_MIN )). I'm pretty sure it's
possible to avoid those pitfalls, but it requires a fair amount
of care and careful thinking.
Note that my goal is only to avoid the possibility of undefined
behavior that comes from signed overflow. My approach is to safely
determine whether the signed multiplication would overflow, and if
it wouldn't then simply use signed arithmetic to get the result.
I use unsigned types to determine the safety, and if it's safe then
use signed types to get a result. For the current problem I don't
care about widening, except as it might help to determine safety.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
In Gforth we use:
DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
DCell res;
res = UD2D(ummul (a, b));
if (a < 0)
res.hi -= b;
if (b < 0)
res.hi -= a;
return res;
}
I have this technique from Andrew Haley. It relies on twos-complement representation.
- anton
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
There is a small gotcha if either of the inputs are of the 0x80000000
form, i.e. MININT, but the naive iabs() conversion will do the right
thing by leaving the input unchanged.
At the other end there cannot be any issues since restoring a negative
output sign cannot overflow/fail.
It isn't quite that simple. Some of what you describe has a risk
of running afoul of implementation-defined behavior or undefined
behavior (as for example abs( INT_MIN )). I'm pretty sure it's
possible to avoid those pitfalls, but it requires a fair amount
of care and careful thinking.
It would be supremely nice if we could go back in time before
computers and reserve an integer encoding that represents the
value of "there is no value here" and mandate if upon integer
arithmetic.
Note that my goal is only to avoid the possibility of undefined
behavior that comes from signed overflow. My approach is to safely
determine whether the signed multiplication would overflow, and if
it wouldn't then simply use signed arithmetic to get the result.
Double width multiplication cannot overflow: an nxn multiply yields a
2n-bit result, so ignoring the top n bits gives you your non-overflowing
multiply.
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the
signs, convert the inputs to unsigned, then do the unsigned
widening mul, before finally restoring the sign as the XOR of the
input signs?
In Gforth we use:
DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
DCell res;
res = UD2D(ummul (a, b));
if (a < 0)
res.hi -= b;
if (b < 0)
res.hi -= a;
return res;
}
I have this technique from Andrew Haley. It relies on twos-complement
representation.
Yeah, that's what Alpha does with UMULH.
I'm still trying to figure out why it works.
Anton Ertl wrote:
DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
DCell res;
res = UD2D(ummul (a, b));
if (a < 0)
res.hi -= b;
if (b < 0)
res.hi -= a;
return res;
}
I have this technique from Andrew Haley. It relies on twos-complement
representation.
- anton
Yeah, that's what Alpha does with UMULH.
I'm still trying to figure out why it works.
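[Editorial note, not from the thread: reading a signed value a as unsigned
adds 2^64 exactly when a is negative, so modulo 2^128 the unsigned product
is the two's-complement encoding of a*b plus 2^64*b when a < 0 plus 2^64*a
when b < 0.  Those extra terms sit entirely in the high word, and the two
conditional subtractions remove them.  An exhaustive check of the same
identity at 8-bit width:]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    for (int a = -128; a < 128; a++)
        for (int b = -128; b < 128; b++) {
            unsigned up = (unsigned)(uint8_t)a * (uint8_t)b;  /* unsigned 8x8 -> 16 */
            int hi = (int)(up >> 8);                          /* unsigned high byte */
            if (a < 0) hi -= b;                               /* the mmul fixup ... */
            if (b < 0) hi -= a;
            hi &= 0xff;                                       /* ... reduced mod 2^8 */
            int want = (uint16_t)(a * b) >> 8;                /* high byte of signed product */
            if (hi != want) { printf("mismatch %d %d\n", a, b); return 1; }
        }
    printf("identity holds for all 8-bit pairs\n");
    return 0;
}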
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
However, what I did put in the paper (and what the subsequent
revision by a J3 subcommittee left in) is a prohibition against
using unsigneds in a DO loop. The reason is semantics of
negative strides.
Currently, in Fortran, the number of iterations of the loop
do i=m1,m2,m3
....
end do
is (m2-m1+m3)/m3 unless that value is negative, in which case it
is zero (m3 defaults to 1 if it is not present).
So,
do i=1,3,-1
will be executed zero times, as will
do i=3,1
Translating that into arithmetic with unsigned integers makes
little sense, how many times should
do i=1,3,4294967295
be executed?
3-1+4294967295 = 4294967297 // (m2-m1+m3)
4294967297 / 4294967295 = 1.0000000004656612874161594750863
So the loop should be executed one time. {{And yes, I know 4294967297 ==
0x1,0000,0001.}} What would you expect on a 36-bit machine (2s-complement)
where 4294967295 is representable naturally?
Correct (of course).
The same result would be expected for
do i=1u,3u,-1u
(assuming a u suffix for unsigned numbers).
The problem is that this violates a Fortran basic assumption since
FORTRAN 77, which is that DO loops can be zero-trip.
This is a can of worms that I would like to leave unopened.
Same goes for array slices. Even assuming that no negative
indices are used, the slice a(1:3:-1) is zero-sized in Fortran,
as is a(3:1) .
For a(1u:3u:-1u) the same logic that you outlined above would apply,
making it a slice with one element.
Not going there :-)
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
In Gforth we use:
DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
DCell res;
res = UD2D(ummul (a, b));
if (a < 0)
res.hi -= b;
if (b < 0)
res.hi -= a;
return res;
}
I have this technique from Andrew Haley. It relies on twos-complement representation.
Here you can probably schedule the fixup to happen in parallel with the
actual multiplication:
;; inputs in r9 & r10, result in rdx:rax, rbx & rcx as scratch
mov rax,r9 ;; All these can start in the first cycle
mul r10 ;; rdx:rax = unsigned r9*r10
mov rbx,r9 ;; The MOV can be handled by the renamer
sar r9,63 ;; r9 = 0 or -1, the sign mask of the first input
mov rcx,r10 ;; Ditto
sar r10,63 ;; r10 = sign mask of the second input
and rbx,r10 ;; Second set of ops: keep first input iff second is negative
and rcx,r9 ;; keep second input iff first is negative
add rbx,rcx ;; Third cycle
sub rdx,rbx ;; Do a single adjustment as soon as the MUL finishes