I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.
But try as I might, I can't see a useful application for such
a chip. The restricted access to memory would basically hobble
it for anything but a narrow class of embarrassingly parallel
applications. The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.
John Savard
On 2/5/2024 12:48 AM, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.
One doesn't need to disallow access to external RAM, but maybe:
Memory coherence is fairly weak for these cores;
The local RAM addresses are treated as "strongly preferable".
Or, say, there is a region on RAM that is divided among the cores, where
the core has fast access to its own local chunk, but slow access to any
of the other chunks (which are treated more like external RAM).
Here, threads would be assigned to particular cores, and the scheduler
may not move a thread from one core to another if it is assigned to a
given core.
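To make that concrete, here is a rough C sketch of the ownership rule; the region base, chunk size, and core count are purely illustrative and not taken from any real design:

#include <stdint.h>

#define NUM_CORES   8
#define CHUNK_SIZE  (2u * 1024u * 1024u)   /* 2 MiB of "local" RAM per core, illustrative */
#define REGION_BASE 0x80000000u            /* start of the divided region, illustrative   */

/* Which core "owns" (has fast access to) an address inside the region. */
static inline int owner_core(uint32_t addr)
{
    return (int)((addr - REGION_BASE) / CHUNK_SIZE);
}

/* A scheduler following the scheme above would keep a thread on the core
   that owns the thread's working data, rather than migrating it freely. */
static inline int may_run_on(int core, uint32_t thread_data_base)
{
    return owner_core(thread_data_base) == core;
}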
As for SIMD vs vectors, as I see it, SIMD seems to make sense in that it
is cheap and simple.
The Cell cores were, if anything, more of a "SIMD First, ALU Second" approach,
built around 128-bit registers but only using part of these for integer code.
I went a slightly different direction, using 64-bit registers that may
be used in pairs for 128-bit ops. This may make more sense if one
assumes that the core is going to be used for a lot more general purpose code, rather than used almost entirely for SIMD.
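As a rough illustration of what pairing looks like at the operation level, a 128-bit integer add decomposes into two 64-bit adds plus a carry; the struct and function names here are invented for the example:

#include <stdint.h>

/* A 128-bit value held as a pair of 64-bit "registers" (illustrative). */
typedef struct { uint64_t lo, hi; } u128_pair;

/* 128-bit add built from two 64-bit adds plus carry propagation, which is
   roughly what using paired 64-bit registers for 128-bit ops amounts to. */
static inline u128_pair add128(u128_pair a, u128_pair b)
{
    u128_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low half */
    return r;
}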
I have some hesitation about "vector processing", as it seems fairly
alien to how this stuff normally sort of works; seems more complicated
than SIMD for an implementation; ...
It is arguably more scalable, but as I see it, much past 64 or 128 bit vectors, SIMD rapidly goes into diminishing returns, and it makes more
sense to be like "128-bit is good enough" than to try to chase after
ever wider SIMD vectors.
But, I can also note that even for semi-general use, an ISA design like
RV64G suffers a significant disadvantage, say, vs my own ISA, in the
Quadibloc <quadibloc@servername.invalid> writes:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
To some extent, it is: Zen4 performs 512-bit SIMD by feeding its
512-bit registers to the 256-bit units in two successive cycles.
Earlier Zen used 2 physical 128-bit registers as one logical 256-bit
register and AFAIK it split 256-bit operations into two 128-bit
operations that could be scheduled arbitrarily by the OoO engine
(while Zen4 treats the 512-bit operation as a unit that consumes two
cycles of a pipelined 256-bit unit). Similar things have been done by
Intel and AMD in other CPUs, implementing 256-bit operations with
128-bit units (Gracemont, Bulldozer-Excavator, Jaguar and Puma), or implementing 128-bit operations with 64-bit units (e.g., on the K8).
Why are they not using longer vectors with the same FUs or narrower
FUs? For Gracemont, that's really the question; they even disabled
AVX-512 on Alder Lake and Raptor Lake completely (even on Xeon CPUs
with disabled Gracemont) because Gracemont does not do AVX-512.
Supposedly the reason is that Gracemont does not have enough physical
128-bit registers for AVX-512 (128 such registers would be needed to implement the 32 logical ZMM registers, and probably some more to
avoid deadlocks and maybe for some microcoded operations; <https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/> reports 191+16 XMM registers and 95+16 YMM registers, which makes me
doubt that explanation).
Anyway, the size of the register files is one reason for avoiding
longer vectors.
Also, the question is how much it buys. For Zen4, I remember seeing
results that coding the same stuff as using two 256-bit instructions
rather than one 512-bit instruction increased power consumption a
little, resulting in the CPU (running at the power limit) lowering the
clock rate of the cores from IIRC 3700MHz to 3600MHz; not a very big
benefit. How much would the benefit be from longer vectors? Probably
not more than another 100MHz: going from 256-bit instructions to 512-bit instructions already halves the number of instructions to process in
the front end; eliminating the other half would require infinitely
long vectors.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues.
My memory is that he mentioned memory latency. He did not
explain why he thinks so, but caches and prefetchers seem to be doing
ok for bridging the latency from DRAM to L2 or L1.
As for main memory bandwidth, that is certainly a problem for
applications that have frequent cache misses (many, but not all HPC applications are among them). And once you are limited by main memory bandwidth, the ISA makes little difference.
But for those applications where caches work (e.g., dense matrix multiplication in the HPC realm), I don't see a reason why a
long-vector architecture would be unworkable. It's just that, as
discussed above, the benefits are small.
The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
Caches work well for most applications. So mainstream CPUs are
designed with a certain amount of cache and enough main-memory
bandwidth to satisfy most applications. For the niche that needs more main-memory bandwidth, there are GPGPUs which have high bandwidth
because their original application needs it (and AFAIK GPGPUs have
long vectors). For the remaining niche, having a CPU with several
stacks of HBM memory attached (like the NEC vector CPUs) is a good
idea; and given that there is legacy software for NEC vector CPUs,
providing that ISA also covers that need.
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth; and why would you go for explicitly managed
local memory (which deservedly vanished from the market, see below)
rather than the well-working setup of cache and prefetchers? BTW,
Raptor Cove gives you 2MB of private L2.
The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.
The Playstation 3 has a separate GPU based on the Nvidia G70 <https://en.wikipedia.org/wiki/PlayStation_3_technical_specifications#Graphics_processing_unit>.
What I heard/read about the Cell CPU is that the SPEs were too hard to
make good use of and that consequently they were not used much.
- anton
On Mon, 05 Feb 2024 07:44:24 +0000, Anton Ertl wrote:
Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth;
Well, the original Cray I had a main memory of eight megabytes, and the
Cray Y-MP had up to 512 megabytes of memory.
I was keeping as close to the original CELL design as possible, but
certainly one could try to improve. After all, if Intel could make
a device like the Xeon Phi, having multiple CPUs on a chip all sharing
access to external memory, however inadequate, could still be done (but
then I wouldn't be addressing Mitch Alsup's objection).
Instead of imitating the CELL, or the Xeon Phi, for that matter, what
I think of as a more practical way to make a consumer Cray-like chip
would be to put only one core in a package, and give that core an eight-channel memory bus.
Some older NEC designs used a sixteen-channel memory bus, but I felt
that eight channels will already be expensive for a consumer product.
Given Mitch Alsup's objection, though, I threw out the opposite kind
of design, one patterned after the CELL, as one that maybe could allow
a vector CPU to churn out more FLOPs. But as I noted, it seems to have
the fatal flaw of very little capacity for any kind of useful work...
which is kind of the whole point of any CPU.
John Savard
On Mon, 5 Feb 2024 06:48:59 -0000 (UTC), Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and similar
machines, because it seems to me the one way of increasing computer
performance that proved effective in the past that still isn't being
applied to microprocessors today.
Mitch Alsup, however, has noted that such an architecture is unworkable
today due to memory bandwidth issues.
RISC-V has a long-vector feature very consciously modelled on the Cray
one. It eschews the short-vector SIMD fashion that has infested so many architectures these days precisely because the resulting combinatorial explosion in added instructions makes a mockery of the “R” in “RISC”.
On 2024-02-05, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of
64-bit).
My machine can start and finish at most one 32-bit operation on every
clock cycle, so it is very simple. The same thing goes for vector
operations: at most one 32-bit vector element per clock cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards, reduced
I$ pressure and possibly improved cache locality and reduced register pressure.
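For readers who want something concrete, the transformation amounts to strip-mining a scalar loop; here is a rough C sketch in which the inner loop stands for one vector instruction over up to VL elements. The saxpy kernel and the VL value are just for illustration, not MRISC32 specifics:

#define VL 16   /* elements per vector register, illustrative */

/* Scalar version: per-element loop overhead on every iteration. */
void saxpy_scalar(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Strip-mined version: each outer iteration corresponds to one vector
   load / multiply-add / store sequence over up to VL elements, using only
   a couple of vector registers. */
void saxpy_vectorized(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i += VL) {
        int len = (n - i < VL) ? n - i : VL;   /* vector length for the tail */
        for (int j = 0; j < len; j++)          /* "one vector instruction"   */
            y[i + j] = a * x[i + j] + y[i + j];
    }
}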
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities in
the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited number
of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
I also envision that for most cases
you never have to preserve vector registers over function calls. I.e.
there's really no need to push/pop vector registers to the stack,
except for context switches (which I believe should be optimized by
tagging unused vector registers to save on stack bandwidth).
/Marcus
Quadibloc <quadibloc@servername.invalid> writes:
On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?
The obvious answer is that it's internal storage, rather than main
memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances.
The Cray systems weren't used as general purpose timesharing systems.
Scott Lurndal <scott@slp53.sl.home> schrieb:
Quadibloc <quadibloc@servername.invalid> writes:
On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?
The obvious answer is that it's internal storage, rather than main memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances.
The Cray systems weren't used as general purpose timesharing systems.
They were used as database servers, though - fast I/O, cheaper than
an IBM machine of the same performance.
Or so I heard, ~ 30 years ago.
Thomas Koenig wrote:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Quadibloc <quadibloc@servername.invalid> writes:
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under certain circumstances.
The Cray systems weren't used as general purpose timesharing systems.
They were used as database server, though - fast I/O, cheaper than
an IBM machine of the same performance.
The only thing they lacked for timesharing was paging:: CRAYs had a
base and bounds memory map. They made up for lack of paging with a
stupidly fast I/O system.
On Tue, 13 Feb 2024 19:57:28 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
On 2024-02-05, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of
64-bit).
Does it mean that you have 8 VRs and each VR is 2048 bits?
My machine can start and finish at most one 32-bit operation on every
clock cycle, so it is very simple. The same thing goes for vector
operations: at most one 32-bit vector element per clock cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards, reduced
I$ pressure and possibly improved cache locality and reduced register
pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities in
the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited number
of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix multiply type of algorithm then
8 VRs would not take you very far. 16 VRs are A LOT better. More than 16
VR can help somewhat, but the difference between 32 and 16 (in this
type of kernels) is much much smaller than difference between 8 and
16.
Radix-4 and mixed-radix FFT are probably similar, except that I never
profiled them as thoroughly as I did SGEMM.
I also envision that for most cases
you never have to preserve vector registers over function calls. I.e.
there's really no need to push/pop vector registers to the stack,
except for context switches (which I believe should be optimized by
tagging unused vector registers to save on stack bandwidth).
/Marcus
If CRAY-style VRs work for you, it's no proof that lighter VRs, e.g. ARM Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far from optimal. A ratio of 8 looks more optimal when performance of vectorized loops is a top priority. A ratio of 4 is a wise choice
otherwise.
On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?
The obvious answer is that it's internal storage, rather than main
memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
So, the vector register file being a _large shared resource_, one
faces the dilemma... make extra copies for as many programs as may
be running, or save and restore it.
I've come up with _one_ possible solution. Remember the Texas Instruments
9900, which kept its registers in memory because it was a 16-bit CPU
back when there weren't really enough gates on a die to put the register
file on-chip... which, incidentally, gave it fast context switching?
Well, why not have an on-chip memory, smaller than L2 cache but made
of similar memory cells, and use it for multiple vector register files, indicated by a pointer register?
But then the on-chip memory has to be divided into areas locked off
from different users, just like external DRAM, and _that_ becomes
a bit painful to contemplate.
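A rough C model of the idea, with invented names and sizes; the point is that a context switch changes a single pointer instead of copying the vector registers, at the cost of having to protect each bank:

#include <stdint.h>

#define NUM_CONTEXTS   8    /* hardware contexts, each with its own VRF bank (illustrative) */
#define VREGS          8
#define ELEMS_PER_VREG 64

typedef struct {
    uint64_t v[VREGS][ELEMS_PER_VREG];
} vrf_t;

static vrf_t onchip_vrf[NUM_CONTEXTS];   /* lives in the on-chip memory   */
static vrf_t *current_vrf;               /* the "VRF pointer register"    */

/* Context switch: no copying of vector state, just repoint the VRF -
   in the spirit of the TI 9900 workspace pointer.  The unsolved part is
   keeping each bank inaccessible to the other contexts. */
static inline void switch_vector_context(int ctx)
{
    current_vrf = &onchip_vrf[ctx];
}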
The Cray I was intended to be used basically in *batch* mode. Having
a huge vector register file in an ISA meant for *timesharing* is the
problem.
Perhaps what is really needed is VVM combined with some very good
cache hinting mechanisms. I don't have the expertise needed to work
that out, so I'll have to settle for something rather more kludgey
instead.
Of course, if a Cray I is a *batch* processing computer, that sort
of justifies the notion I came up with earlier - in a thread I
aptly titled "A Very Bad Idea" - of making a Cray I-like CPU with
vector registers an auxiliary processor after the fashion of those
in the IBM/Sony CELL processor. But one wants high-bandwidth access
to DRAM, not no access to DRAM!
The NEC SX-Aurora TSUBASA solves the issue by putting all its DRAM
inside a module that looks a lot like a video card. You just have to
settle for 48 gigabytes of memory that won't be expandable.
Some database computers, of course, have as much as a terabyte of
DRAM - which used to be the size of a large magnetic hard drive.
People who can afford a terabyte of DRAM can also afford an eight-channel memory bus, so it should be possible to manage something.
John Savard
On 2024-02-14, Quadibloc wrote:
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
My current vision (not MRISC32), which is a very simple
microcontroller type implementation (basically in the same ballpark as Cortex-M or small RV32I implementations), would have a relatively
limited vector register file.
I scribbled down a suggestion here:
* https://gitlab.com/-/snippets/3673883
In particular, pay attention to the sections "Vector state on context switches" and "Thread context".
My idea is not new, but I think that it takes some old ideas a few steps further. So here goes...
There are four vector registers (V1-V4), each consisting of 8 x 32 bits,
for a grand total of 128 bytes of vector thread context state. To start
with, this is not an enormous amount of state (it's the same size as the integer register file of RV32I).
Each vector register is associated with a "vector in use" flag, which is
set as soon as the vector register is written to.
The novel part (AFAIK) is that all "vector in use" flags are cleared as
soon as a function returns (rts) or another function is called (bl/jl),
which takes advantage of the ABI that says that all vector registers are scratch registers.
I then predict that the ISA will have some sort of intelligent store
and restore state instructions, that will only waste memory cycles
for vector registers that are marked as "in use". I also predict that
most vector registers will be unused most of the time (except for
threads that use up 100% CPU time with heavy data processing, which
should hopefully be in minority - especially in the kind of systems
where you want to put a microcontroller style CPU).
I do not yet know if this will fly, though...
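As a rough sketch of how the "in use" flags could cut context-switch traffic (the register count and widths are taken from the description above, everything else is invented for illustration):

#include <stdint.h>
#include <string.h>

#define NVREGS 4    /* V1-V4            */
#define VELEMS 8    /* 8 x 32 bits each */

typedef struct {
    uint32_t v[NVREGS][VELEMS];
    uint8_t  in_use;            /* bit i set => vector register i has live contents */
} vec_state_t;

/* Save only the registers marked in use; the mask records which slots
   of the context block are valid for the later restore. */
static void save_vector_state(const vec_state_t *hw, uint32_t *ctx, uint8_t *mask)
{
    *mask = hw->in_use;
    for (int i = 0; i < NVREGS; i++)
        if (hw->in_use & (1u << i))
            memcpy(&ctx[i * VELEMS], hw->v[i], sizeof hw->v[i]);
}

static void restore_vector_state(vec_state_t *hw, const uint32_t *ctx, uint8_t mask)
{
    hw->in_use = mask;
    for (int i = 0; i < NVREGS; i++)
        if (mask & (1u << i))
            memcpy(hw->v[i], &ctx[i * VELEMS], sizeof hw->v[i]);
}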
On 2024-02-14, Michael S wrote:
On Tue, 13 Feb 2024 19:57:28 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
On 2024-02-05, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of
64-bit).
Does it mean that you have 8 VRs and each VR is 2048 bits?
No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
number of registers as I have five-bit vector address fields in the instruction encoding (because 32 scalar registers). I have been
thinking about reducing it to 16 vector registers, and find some
clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
there yet.
The number of vector elements in each register is implementation
defined, but currently the minimum number of vector elements is set to
16 (I wanted to set it relatively high to push myself to come up with solutions to problems related to large vector registers).
Each vector element is 32 bits wide.
So, in total: 32 x 16 x 32 bits = 16384 bits
This is, incidentally, exactly the same as for AVX-512.
My machine can start and finish at most one 32-bit operation on
every clock cycle, so it is very simple. The same thing goes for
vector operations: at most one 32-bit vector element per clock
cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards,
reduced I$ pressure and possibly improved cache locality and
reduced register pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities
in the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited
number of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix multiply type of algorithm
then 8 VRs would not take you very far. 16 VRs are A LOT better.
More than 16 VR can help somewhat, but the difference between 32
and 16 (in this type of kernels) is much much smaller than
difference between 8 and 16.
Radix-4 and mixed-radix FFT are probably similar except that I never profiled as thoroughly as I did SGEMM.
I expect that people will want to do such things with an MRISC32 core. However, for the "small cores" that I'm talking about, I doubt that
they would even have floating-point support. It's more a question of
simple loop optimizations - e.g. the kinds you find in libc or
software rasterization kernels. For those you will often get lots of
work done with just four vector registers.
I also envision that for most cases
you never have to preserve vector registers over function calls.
I.e. there's really no need to push/pop vector registers to the
stack, except for context switches (which I believe should be
optimized by tagging unused vector registers to save on stack
bandwidth).
/Marcus
If CRAY-style VRs work for you, it's no proof that lighter VRs, e.g.
ARM Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far
from optimal. A ratio of 8 looks more optimal when
performance of vectorized loops is a top priority. A ratio of 4 is a
wise choice otherwise.
For MRISC32 I'm aiming for splitting a vector operation into four.
That seems to eliminate most RAW hazards as execution pipelines tend
to be at most four stages long (or thereabout). So, with a pipeline
width of 128 bits (which seems to be the go-to width for many implementations), you want registers that have 4 x 128 = 512 bits,
which is one of the reasons that I mandate at least 512-bit vector
registers in MRISC32.
Of course, nothing is set in stone, but so far that has been my
thinking.
/Marcus
On Thu, 15 Feb 2024 20:00:20 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
On 2024-02-14, Michael S wrote:
On Tue, 13 Feb 2024 19:57:28 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
On 2024-02-05, Quadibloc wrote:
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of
64-bit).
Does it mean that you have 8 VRs and each VR is 2048 bits?
No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
number of registers as I have five-bit vector address fields in the
instruction encoding (because 32 scalar registers). I have been
thinking about reducing it to 16 vector registers, and find some
clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
there yet.
The number of vector elements in each register is implementation
defined, but currently the minimum number of vector elements is set to
16 (I wanted to set it relatively high to push myself to come up with
solutions to problems related to large vector registers).
Each vector element is 32 bits wide.
So, in total: 32 x 16 x 32 bits = 16384 bits
This is, incidentally, exactly the same as for AVX-512.
My machine can start and finish at most one 32-bit operation on
every clock cycle, so it is very simple. The same thing goes for
vector operations: at most one 32-bit vector element per clock
cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards,
reduced I$ pressure and possibly improved cache locality and
reduced register pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities
in the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited
number of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix multiply type of algorithm
then 8 VRs would not take you very far. 16 VRs are A LOT better.
More than 16 VR can help somewhat, but the difference between 32
and 16 (in this type of kernels) is much much smaller than
difference between 8 and 16.
Radix-4 and mixed-radix FFT are probably similar except that I never
profiled as thoroughly as I did SGEMM.
I expect that people will want to do such things with an MRISC32 core.
However, for the "small cores" that I'm talking about, I doubt that
they would even have floating-point support. It's more a question of
simple loop optimizations - e.g. the kinds you find in libc or
software rasterization kernels. For those you will often get lots of
work done with just four vector registers.
I also envision that for most cases
you never have to preserve vector registers over function calls.
I.e. there's really no need to push/pop vector registers to the
stack, except for context switches (which I believe should be
optimized by tagging unused vector registers to save on stack
bandwidth).
/Marcus
If CRAY-style VRs work for you, it's no proof that lighter VRs, e.g.
ARM Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far
from optimal. A ratio of 8 looks more optimal when
performance of vectorized loops is a top priority. A ratio of 4 is a
wise choice otherwise.
For MRISC32 I'm aiming for splitting a vector operation into four.
That seems to eliminate most RAW hazards as execution pipelines tend
to be at most four stages long (or thereabout). So, with a pipeline
width of 128 bits (which seems to be the go-to width for many
implementations), you want registers that have 4 x 128 = 512 bits,
which is one of the reasons that I mandate at least 512-bit vector
registers in MRISC32.
Of course, nothing is set in stone, but so far that has been my
thinking.
/Marcus
Sounds quite reasonable, but I wouldn't call it "Cray-style".
On Fri, 16 Feb 2024 12:37:55 +0100
Marcus <m.delete@this.bitsnbites.eu> wrote:
Then what would you call it?
I just use the term "Cray-style" to differentiate the style of vector
ISA from explicit SIMD ISA:s, GPU-style vector ISA:s and STAR-style
memory-memory vector ISA:s, etc.
/Marcus
I'd call it a variant of SIMD.
For me everything with vector register width to ALU width ratio <= 4 is
SIMD. 8 is borderline, above 8 is vector.
It means that sometimes I classify by implementation instead of by
architecture, which in theory is problematic. But I don't care; I am not
in academia.
Quadibloc <quadibloc@servername.invalid> writes:
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
I don't think that's a proper characterization of VVM. One advantage
that vector registers have over memory-memory machines is that vector registers, once loaded, can be used several times. And AFAIK VVM has
that advantage, too. E.g., if you have the loop
for (i=0; i<n; i++) {
  double b = a[i];
  c[i] = b;
  d[i] = b;
}
a[i] is loaded only once (also in VVM), while a memory-memory
formulation would load a[i] twice. And on the microarchitectural
level, VVM may work with vector registers, but the nice part is that
it's only microarchitecture, and it avoids all the nasty consequences
of making it architectural, such as more expensive context switches.
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
  r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
On 2024-02-16, Quadibloc wrote:
On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
On 2024-02-14, Quadibloc wrote:
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
But because the historical precedent seems to indicate otherwise, and
because while data forwarding is very definitely a good thing (and,
indeed, necessary to have for best performance _on_ a vector register
machine too) it has its limits.
What _could_ substitute for vector registers isn't data forwarding,
it's the cache, since that does the same thing vector registers do:
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.
My first question, though, is whether or not we can really make caches
that good.
I think that you are missing some of the points that I'm trying to make.
In my recent comments I have been talking about very low end machines,
the kinds that can execute at most one instruction per clock cycle, or
maybe less, and that may not even have a cache at all.
I'm saying that I believe that within this category there is an
opportunity for improving performance with very little cost by adding
vector operations.
E.g. imagine a non-pipelined implementation with a single memory port,
shared by instruction fetch and data load/store, that requires perhaps
two cycles to fetch and decode an instruction, and executes the
instruction in the third cycle (possibly accessing the memory, which precludes fetching a new instruction until the fourth or even fifth
cycle).
Now imagine if a single instruction could iterate over several elements
of a vector register. This would mean that the execution unit could
execute up to one operation every clock cycle, approaching similar performance levels as a pipelined 1 CPI machine. The memory port would
be free for data traffic as no new instructions have to be fetched
during the vector loop. And so on.
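As a rough worked example under those assumptions (two cycles to fetch and decode, one to execute): processing 16 elements with scalar instructions costs on the order of 16 x 3 = 48 cycles for the arithmetic alone, before counting the increment/compare/branch overhead, while a single vector instruction over the same 16 elements costs roughly 2 + 16 = 18 cycles and leaves the memory port free during the 16 execute cycles.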
Similarly, imagine a very simple strictly in-order pipelined
implementation, where you have to resolve hazards by stalling the
pipeline every time there is RAW hazard for instance, and you have to
throw away cycles every time you mispredict a branch (which may be
quite often if you only have a very primitive predictor).
With vector operations you pause the front end (fetch and decode) while iterating over vector elements, which eliminates branch misprediction penalties. You also magically do away with RAW hazards as by the time
you start issuing a new instruction the vector elements needed from the previous instruction have already been written to the register file.
And of course you do away with loop overhead instructions (increment, compare, branch).
As a bonus, I believe that a vector solution like that would be more
energy efficient, as less work has to be done for each operation than if
you have to fetch and decode an instruction for every operation that you
do.
As I said, VVM has many similar properties, but I am currently exploring
if a VRF solution can be made sufficiently cheap to be feasible in this
very low end space, where I believe that VVM may be a bit too much (this assumption is mostly based on my own ignorance, so take it with a grain
of salt).
For reference, the microarchitectural complexity that I'm thinking about
is comparable to FemtoRV32 by Bruno Levy (400 LOC, with comments):
https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v
/Marcus
On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
On 2024-02-14, Quadibloc wrote:
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
But because the historical precedent seems to indicate otherwise, and
because while data forwarding is very definitely a good thing (and,
indeed, necessary to have for best performance _on_ a vector register
machine too) it has its limits.
What _could_ substitute for vector registers isn't data forwarding,
it's the cache, since that does the same thing vector registers do:
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.
My first question, though, is whether or not we can really make caches
that good.
But skepticism about VVM isn't actually helpful if Cray-style vectors
now simply can't be made to work given current memory speeds.
The basic way in which I originally felt I could make it work was really quite simple. The operating system, from privileged code, could set a
bit in the PSW that turns on, or off, the ability to run instructions that access the vector registers.
The details of how one may have to make use of that capability... well, that's software. So maybe the OS has to stipulate that one can only have
one process at a time that uses these vectors - and that process has to
run as a batch process!
Hey, the GPU in a computer these days is also a singular resource.
Having resources that have to be treated that way is not really what
people are used to, but a computer that _can_ run your CFD codes
efficiently is better than a computer that *can't* run your CFD codes.
Given _that_, obviously if VVM is a better fit to the regular computer
model, and it offers nearly the same performance, then what I should do
is offer VVM or something very much like it _in addition_ to Cray-style vectors, so that the best possible vector performance for conventional non-batch programs is also available.
Now, what would I think of as being "something very much like VVM" without actually being VVM?
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
So this makes those exact combinations part of the... ISA syntax...
which I think is too hard for assembler programmers to remember, and
I think it's also too hard for at least some implementors. I see it
as asking for trouble in a way that I'd rather avoid.
So my substitute for VVM should now be obvious - explicit memory-to-memory vector instructions, like on an old STAR-100.
John Savard
On 2/15/2024 11:27 PM, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
I don't think that's a proper characterization of VVM. One advantage
that vector registers have over memory-memory machines is that vector
registers, once loaded, can be used several times. And AFAIK VVM has
that advantage, too. E.g., if you have the loop
for (i=0; i<n; i++) {
  double b = a[i];
  c[i] = b;
  d[i] = b;
}
a[i] is loaded only once (also in VVM), while a memory-memory
formulation would load a[i] twice. And on the microarchitectural
level, VVM may work with vector registers, but the nice part is that
it's only microarchitecture, and it avoids all the nasty consequences
of making it architectural, such as more expensive context switches.
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the instructions in the loop can be fetched and decoded only once, it allows the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by analyzing a "normal" instruction stream, but that seems much harder.
Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty. Maybe we'll see it in some Intel or AMD CPU
in the coming years.
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Sure, predication is not a control structure.
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware. For FP addition that should give the same
result as the sequential code, it's probably much harder. Of course,
you can ask the programmer to write:
double r;
double r0=0.0;
....
double r15=0.0;
for (i=0; i<n-15; i+=16) {
r0 += a[i];
...
r15 += a[i+15];
}
.... deal with the remaining iterations ...
r = r0+...+r15;
But then the point of auto-vectorization is that the programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like above.
- anton
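For what it's worth, here is one way to fill in the elided parts of that transformation as ordinary C (my own completion, for illustration only); note that the result can differ from the strictly sequential sum, since FP addition is not associative:

#include <stddef.h>

double sum16(const double *a, size_t n)
{
    double r[16] = {0.0};
    size_t i = 0;

    for (; i + 16 <= n; i += 16)        /* main loop, 16 partial sums */
        for (size_t j = 0; j < 16; j++)
            r[j] += a[i + j];

    double s = 0.0;
    for (; i < n; i++)                  /* deal with the remaining iterations */
        s += a[i];

    for (size_t j = 0; j < 16; j++)     /* r = r0 + ... + r15 */
        s += r[j];
    return s;
}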
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a
single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty.
Maybe we'll see it in some Intel or AMD CPU
in the coming years.
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Sure, predication is not a control structure.
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
For FP addition that should give the same
result as the sequential code, it's probably much harder. Of course,
you can ask the programmer to write:
double r;
double r0=0.0;
...
double r15=0.0;
for (i=0; i<n-15; i+=16) {
r0 += a[i];
...
r15 += a[i+15];
}
... deal with the remaining iterations ...
r = r0+...+r15;
But then the point of auto-vectorization is that the programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like above.
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
MitchAlsup1 wrote:
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
Is that done by disambiguation logic, checking for same cache line access?
On 2/16/2024 6:23 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
Quadibloc <quadibloc@servername.invalid> writes:
Basically, Mitch has his architecture designed for implementation on CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty.
IANAHG, but if it were that simple, I would think Mitch would have implemented it that way.
Maybe we'll see it in some Intel or AMD CPU
in the coming years.
One can hope!
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Sure, predication is not a control structure.
OK, but my point is that you can do conditional execution within a VVM loop.
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
Sure. ISTM, and again, IANAHG, that the problem for VVM is the hardware recognizing that the loop contains no instructions that can't be parallelized. There are also some issues like doing a sum of signed
integer values and knowing whether overflow occurred, etc. The
programmer may know that overflow cannot occur, but the HW doesn't.
For FP addition that should give the same
result as the sequential code, it's probably much harder. Of course,
you can ask the programmer to write:
double r;
double r0=0.0;
...
double r15=0.0;
for (i=0; i<n-15; i+=16) {
r0 += a[i];
...
r15 += a[i+15];
}
... deal with the remaining iterations ...
r = r0+...+r15;
But then the point of auto-vectorization is that the programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like above.
Agreed.
Quadibloc wrote:
So my substitute for VVM should now be obvious - explicit memory-to-memory vector instructions, like on an old STAR-100.
Gasp........
Stephen Fuld wrote:
On 2/16/2024 6:23 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative; but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
Sure. ISTM, and again, IANAHG, that the problem for VVM is the
hardware recognizing that the loop contains no instructions that can't
be parallelized. There are also some issues like doing a sum of
signed integer values and knowing whether overflow occurred, etc. The
programmer may know that overflow cannot occur, but the HW doesn't.
The HW does not need preceding knowledge. If an exception happens, the vectorized loop collapses into a scalar loop precisely, and can be
handled in the standard fashion.
On 2/16/2024 3:22 PM, MitchAlsup wrote:
Stephen Fuld wrote:
On 2/16/2024 6:23 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 2/15/2024 11:27 PM, Anton Ertl wrote:
snip
I think he also allows
recurrences (in particular, reductions), but I don't understand how his hardware auto-vectorizes that; e.g.:
double r=0.0;
for (i=0; i<n; i++)
   r += a[i];
This is particularly nasty given that FP addition is not associative; but even if you allow fast-math-style reassociation, doing this in hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
Sure. ISTM, and again, IANAHG, that the problem for VVM is the
hardware recognizing that the loop contains no instructions that
can't be parallelized. There are also some issues like doing a sum
of signed integer values and knowing whether overflow occurred,
etc. The programmer may know that overflow cannot occur, but the HW doesn't.
The HW does not need preceding knowledge. If an exception happens, the
vectorized loop collapses into a scalar loop precisely, and can be
handled in the standard fashion.
I think you might have missed my point. If you are summing the signed integer elements of an array, whether you get an overflow or not can
depend on the order the additions are done. Thus, without knowledge
that only the programmer has (i.e. that with the size of the actual data used, overflow is impossible) the hardware cannot parallelize such an operation. If the programmer knows that overflow cannot occur, he has
no way to communicate that to the VVM hardware, such that the HW could parallelize the summation.
On 2/16/2024 5:29 AM, Marcus wrote:
I'm saying that I believe that within this category there is an
opportunity for improving performance with very little cost by adding
vector operations.
E.g. imagine a non-pipelined implementation with a single memory port,
shared by instruction fetch and data load/store, that requires perhaps
two cycles to fetch and decode an instruction, and executes the
instruction in the third cycle (possibly accessing the memory, which
precludes fetching a new instruction until the fourth or even fifth
cycle).
Now imagine if a single instruction could iterate over several elements
of a vector register. This would mean that the execution unit could
execute up to one operation every clock cycle, approaching similar
performance levels as a pipelined 1 CPI machine. The memory port would
be free for data traffic as no new instructions have to be fetched
during the vector loop. And so on.
I guess possible.
EricP wrote:
MitchAlsup1 wrote:
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
first: memory is inherently misaligned in My 66000 architecture. So, since the width of the machine is 64-bits, we read or write in 128-bit quantities so that we have enough bits to extract the misaligned data from, or a container large enough to store a 64-bit value into. {{And there are all the associated corner cases}}
Second: over in VVM-land, the implementation can decide to read and write wider, but is architecturally constrained not to shrink below 128-bits.
A 1-wide My66160 would read pairs of double-precision FP values, or quads of 32-bit values, octets of 16-bit values, and sixteens of 8-bit values. This supports loops of 6 IPC or greater in a 1-wide machine. This machine would process suitable loops at 128 bits per cycle--depending on "other things" that are generally allowable.
A 6-wide My66650 would read a cache line at a time, and has 3 cache ports per cycle. This supports 20 IPC or greater in the 6-wide machine. As many as 8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths to support this rate.
Is that done by disambiguation logic, checking for same cache line access?
Before I have said that the front end observes the first iteration of the loop and makes some determinations as to how wide the loop can be run on the machine at hand. One of those observations is whether memory addresses are dense, whether they all go in the same direction, and what registers carry loop-to-loop dependencies.
MitchAlsup wrote:
EricP wrote:
MitchAlsup1 wrote:
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
first: memory is inherently misaligned in My 66000 architecture. So, since the width of the machine is 64-bits, we read or write in 128-bit quantities so that we have enough bits to extract the misaligned data from, or a container large enough to store a 64-bit value into. {{And there are all the associated corner cases}}
Second: over in VVM-land, the implementation can decide to read and write
wider, but is architecturally constrained not to shrink below 128-bits.
A 1-wide My66160 would read pairs of double-precision FP values, or quads of 32-bit values, octets of 16-bit values, and sixteens of 8-bit values. This supports loops of 6 IPC or greater in a 1-wide machine. This machine would process suitable loops at 128 bits per cycle--depending on "other things" that are generally allowable.
A 6-wide My66650 would read a cache line at a time, and has 3 cache ports per cycle. This supports 20 IPC or greater in the 6-wide machine. As many as 8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths to support this rate.
Ah, so it can emit Load/Store Pair LDP/STP (or wider) uOps inside the loop. That's more straightforward than fusing LD's or ST's in LSQ.
Is that done by disambiguation logic, checking for same cache line
access?
Before I have said that the front end observes the first iteration of the loop and makes some determinations as to how wide the loop can be run on the machine at hand. One of those observations is whether memory addresses are dense, whether they all go in the same direction, and what registers carry loop-to-loop dependencies.
How does it know when to use LDP/STP uOps?
That decision would have to be made early in the front end, likely Decode
and before Rename because you have to know how many dest registers you need.
But the decision on the legality to use LDP/STP depends on knowing the current loop counter >= 2 and address(es) aligned on a 16 byte boundary, which are multiple dynamic, possibly calculated, values only available
much later to the back end.
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Anton Ertl wrote:
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
while if you implement it with
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
then you would be OK, and the final result would be mathematically correct.
For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.
Terje
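A small harness for the two variants above (the names sum_narrow and sum_wide are mine, for illustration); on a typical two's-complement implementation, where the out-of-range conversion wraps, both calls print 126 for this input, which is consistent with the remark about wrapping arithmetic:

#include <stdint.h>
#include <stdio.h>

static int8_t sum_narrow(int len, const int8_t data[])
{
    int8_t s = 0;
    for (int i = 0; i < len; i++)
        s += data[i];           /* intermediate value wraps via the int8_t conversion */
    return s;
}

static int8_t sum_wide(int len, const int8_t data[])
{
    int s = 0;
    for (int i = 0; i < len; i++)
        s += data[i];           /* accumulates in int, no intermediate wrap */
    return (int8_t)s;
}

int main(void)
{
    int8_t data[] = {127, 1, -2};
    printf("%d %d\n", sum_narrow(3, data), sum_wide(3, data));
    return 0;
}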
On 2/17/2024 3:20 AM, Terje Mathisen wrote:
BGB wrote:
But, I am not entirely sure how one would go about implementing it, as
VADD.H would need to do the equivalent of:
MOV.Q (R4), R16
MOV.Q (R5), R17
ADD 8, R4
ADD 8, R5
PADD.H R16, R17, R18
MOV.Q R18, (R6)
ADD 8, R6
All in a single instruction.
Though, could be reduced if auto-increment were re-added:
MOV.Q @R4+, R16
MOV.Q @R5+, R17
PADD.H R16, R17, R18
MOV.Q R18, @R6+
On 2/17/2024 12:03 PM, Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly don't trap.
Yes.
Trap on overflow is not really a thing in the JVM, the basic integer
types are modulo, and don't actually distinguish signed from unsigned (unsigned arithmetic is merely faked in some cases with special
operators, with signed arithmetic assumed as the default).
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Yeah. No traps, only NaNs.
FWIW: My own languages, and BGBCC, also partly followed Java's model in
this area. But, it wasn't hard: This is generally how C behaves as well
on most targets.
Well, except that C will often trap for things like divide by zero and similar, at least on x86. Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.
On BJX2, there isn't currently any divide-by-zero trap, since:
This case doesn't happen in normal program execution;
Handling it with a trap would cost more than not bothering.
So, IIRC, integer divide-by-zero will just give 0, and FP divide-by-zero
will give Inf or NaN.
- anton
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
On 2/17/2024 4:08 PM, MitchAlsup1 wrote:
BGB wrote:
On BJX2, there isn't currently any divide-by-zero trap, since:
This case doesn't happen in normal program execution;
Handling it with a trap would cost more than not bothering.
This sounds like it should make your machine safe to program and use,
but it does not.
It is more concerned with "cheap" than "safe".
Trap on divide-by-zero would require having a way for the divider unit
to signal divide-by-zero has been encountered (say, so some external
logic can raise the corresponding exception code). This is not free.
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
Anton Ertl wrote:...
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
Well, except that C will often trap for things like divide by zero and similar, at least on x86.
Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
BGB <cr88192@gmail.com> writes:
Well, except that C will often trap for things like divide by zero
and similar, at least on x86.
The division instructions of IA-32 and AMD64 trap on divide-by-zero
and when the result is out of range. Unsurprisingly, C compilers
usually use these instructions when compiling division on these architectures. One interesting case is what C compilers do when you
write
long foo(long x)
{
return x/-1;
}
Both gcc and clang compile this to
0: 48 89 f8 mov %rdi,%rax
3: 48 f7 d8 neg %rax
6: c3 retq
and you don't get a trap when you call foo(LONG_MIN), while you would
if the compiler did not know that the divisor is -1 (and it was -1 at run-time).
By contrast, when I implemented division-by-constant optimization in
Gforth, I decided not "optimize" the division by -1 case, so you get
the ordinary division operation and its behaviour. If a programmer
codes a division by -1 rather than just NEGATE, they probably want
something other than NEGATE.
Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.
Reading up on Java, <https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2> says:
|if the dividend is the negative integer of largest possible magnitude
|for its type, and the divisor is -1, then integer overflow occurs and
|the result is equal to the dividend. Despite the overflow, no
|exception is thrown in this case. On the other hand, if the value of
|the divisor in an integer division is 0, then an ArithmeticException
|is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather than just generating the
divide instruction. Alternatively, the generated code could just
produce a division instruction, and the signal handler (on Unix) or equivalent could then check if the divisor was 0 (and then throw an ArithmeticException) or -1 (and then produce a long_min result and
continue execution).
- anton
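A sketch, in C rather than JIT output, of the guard Anton describes (throw_arithmetic_exception is an assumed runtime helper, not a real JVM entry point):

#include <stdint.h>

extern void throw_arithmetic_exception(void);    /* assumed runtime helper */

int64_t java_ldiv(int64_t dividend, int64_t divisor)
{
    if (divisor == 0) {
        throw_arithmetic_exception();
        return 0;                                /* not reached */
    }
    if (divisor == -1)                           /* LONG_MIN / -1 must yield LONG_MIN, not trap */
        return (int64_t)(0ULL - (uint64_t)dividend);
    return dividend / divisor;                   /* plain CQO/IDIV sequence otherwise */
}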
On Sun, 18 Feb 2024 08:00:18 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2> >> says:
|if the dividend is the negative integer of largest possible magnitude
|for its type, and the divisor is -1, then integer overflow occurs and
|the result is equal to the dividend. Despite the overflow, no
|exception is thrown in this case. On the other hand, if the value of
|the divisor in an integer division is 0, then an ArithmeticException
|is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather than just generating the
divide instruction. Alternatively, the generated code could just
produce a division instruction, and the signal handler (on Unix) or
equivalent could then check if the divisor was 0 (and then throw an
ArithmeticException) or -1 (and then produce a long_min result and
continue execution).
- anton
I don't understand why case of LONG_MIN/-1 would possibly require
special handling. IMHO, regular iAMD64 64-bit integer division sequence,
i.e. CQO followed by IDIV, will produce result expected by Java spec
without any overflow.
Michael S <already5chosen@yahoo.com> writes:
On Sun, 18 Feb 2024 08:00:18 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
says:
|if the dividend is the negative integer of largest possible
magnitude |for its type, and the divisor is -1, then integer
overflow occurs and |the result is equal to the dividend. Despite
the overflow, no |exception is thrown in this case. On the other
hand, if the value of |the divisor in an integer division is 0,
then an ArithmeticException |is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather than just generating
the divide instruction. Alternatively, the generated code could
just produce a division instruction, and the signal handler (on
Unix) or equivalent could then check if the divisor was 0 (and
then throw an ArithmeticException) or -1 (and then produce a
long_min result and continue execution).
- anton
I don't understand why case of LONG_MIN/-1 would possibly require
special handling. IMHO, regular iAMD64 64-bit integer division
sequence, i.e. CQO followed by IDIV, will produce result expected by
Java spec without any overflow.
Try it. E.g., in gforth-fast /S performs this sequence:
see /s
Code /s
0x00005614dd33562d <gforth_engine+3213>: add $0x8,%rbx
0x00005614dd335631 <gforth_engine+3217>: mov 0x8(%r13),%rax
0x00005614dd335635 <gforth_engine+3221>: add $0x8,%r13
0x00005614dd335639 <gforth_engine+3225>: cqto
0x00005614dd33563b <gforth_engine+3227>: idiv %r8
0x00005614dd33563e <gforth_engine+3230>: mov %rax,%r8
0x00005614dd335641 <gforth_engine+3233>: mov (%rbx),%rax
0x00005614dd335644 <gforth_engine+3236>: jmp *%rax
end-code
And when I divide LONG_MIN by -1, I get a trap:
$8000000000000000 -1 /s
*the terminal*:12:22: error: Division by zero
$8000000000000000 -1 >>>/s<<<
- anton
LONG_MIN/1 works, but LONG_MIN/-1 crashes, to my surprise.
Seems like I didn't RTFM with regard to IDIV for too many years.
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers
certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
while if you implement it with
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
then you would be OK, and the final result would be mathematically correct.
For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.
On Sun, 18 Feb 2024 22:40:08 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 18 Feb 2024 08:00:18 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
says:
|if the dividend is the negative integer of largest possible
magnitude |for its type, and the divisor is -1, then integer
overflow occurs and |the result is equal to the dividend. Despite
the overflow, no |exception is thrown in this case. On the other
hand, if the value of |the divisor in an integer division is 0,
then an ArithmeticException |is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather than just
generating the divide instruction. Alternatively, the generated
code could just produce a division instruction, and the signal
handler (on Unix) or equivalent could then check if the divisor
was 0 (and then throw an ArithmeticException) or -1 (and then
produce a long_min result and continue execution).
- anton
I don't understand why case of LONG_MIN/-1 would possibly require
special handling. IMHO, regular iAMD64 64-bit integer division
sequence, i.e. CQO followed by IDIV, will produce result expected
by Java spec without any overflow.
Try it. E.g., in gforth-fast /S performs this sequence:
see /s
Code /s
0x00005614dd33562d <gforth_engine+3213>: add $0x8,%rbx
0x00005614dd335631 <gforth_engine+3217>: mov 0x8(%r13),%rax
0x00005614dd335635 <gforth_engine+3221>: add $0x8,%r13
0x00005614dd335639 <gforth_engine+3225>: cqto
0x00005614dd33563b <gforth_engine+3227>: idiv %r8
0x00005614dd33563e <gforth_engine+3230>: mov %rax,%r8
0x00005614dd335641 <gforth_engine+3233>: mov (%rbx),%rax
0x00005614dd335644 <gforth_engine+3236>: jmp *%rax
end-code
And when I divide LONG_MIN by -1, I get a trap:
$8000000000000000 -1 /s
*the terminal*:12:22: error: Division by zero
$8000000000000000 -1 >>>/s<<<
- anton
You are right.
LONG_MIN/1 works, but LONG_MIN/-1 crashes, to my surprise.
Seems like I didn't RTFM with regard to IDIV for too many years.
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers
certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
David Brown wrote:
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers
certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits and long
is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
What I was trying to illustrate was the principle that by having a wider accumulator you could aggregate a series of numbers, both positive and negative, and get the correct (in-range) result, even if the input
happened to be arranged in such a way that it would temporarily overflow
the target int type.
I think it is much better to do it this way and then get a conversion
size trap at the very end when/if the final sum is in fact too large for
the result type.
Terje
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
No. In C, int8_t values will be promoted to "int" (which is always at
least 16 bits, on any target) and the operation will therefore not
overflow.
The conversion of the result of "s + data[i]" from int to
int8_t, implicit in the assignment, also cannot "overflow" since that
term applies only to the evaluation of operators. But if this value is outside the range for int8_t, then the conversion is
implementation-defined behaviour. (That is unlike signed integer
overflow, which is undefined behaviour.)
Terje Mathisen wrote:
If we have a platform where the default integer size is 32 bits and long
is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
Not necessarily:: accumulation rules allow the promotion of int->long
inside a loop if the long is smashed back to int immediately after the
loop terminates. A compiler should be able to perform this transformation.
In effect, this hoists the smashes back to int out of the loop, increasing performance and making it that much harder to tickle the overflow exception.
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
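A sketch of the widening being described, assuming the compiler (or the programmer) applies it; whether a given compiler actually performs this transformation is a separate question:

#include <stdint.h>
#include <stddef.h>

int32_t sum_widened(const int32_t *a, size_t n)
{
    int64_t t = 0;                  /* widened accumulator stays in a 64-bit register */
    for (size_t i = 0; i < n; i++)
        t += a[i];
    return (int32_t)t;              /* "smashed back to int" only once, after the loop */
}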
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
David Brown wrote:
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers
certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits and long
is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
What I was trying to illustrate was the principle that by having a wider accumulator you could aggregate a series of numbers, both positive and negative, and get the correct (in-range) result, even if the input
happened to be arranged in such a way that it would temporarily overflow
the target int type.
I think it is much better to do it this way and then get a conversion
size trap at the very end when/if the final sum is in fact too large for
the result type.
Terje Mathisen wrote:
David Brown wrote:
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e. gripping) hand you could have a language like Java where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly don't trap.
If you are talking about FP (like I did), by default FP addition does not trap in Java, and any mention of "mathematically correct" in connection with FP needs a lot of further elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits and
long is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
Not necessarily:: accumulation rules allow the promotion of int->long
inside a loop if the long is smashed back to int immediately after the
loop terminates. A compiler should be able to perform this transformation.
In effect, this hoists the smashes back to int out of the loop, increasing performance and making it that much harder to tickle the overflow
exception.
What I was trying to illustrate was the principle that by having a
wider accumulator you could aggregate a series of numbers, both
positive and negative, and get the correct (in-range) result, even if
the input happened to be arranged in such a way that it would
temporarily overflow the target int type.
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s. We have
been paying with additional sign-extension and zero-extension
operations ever since then, and it has even deformed architectures:
ARM A64 has addressing modes that include sign- or zero-extending a
32-bit index, and RISC-V selected SLLI, SRLI, SRAI for its
compressed extension, probably because they are so frequent, being
used in RISC-V's idioms for sign and zero extension.
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered at the next J3 meeting; it can be found at https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
No. In C, int8_t values will be promoted to "int" (which is always at
least 16 bits, on any target) and the operation will therefore not
overflow.
Depending on len and the data...
The conversion of the result of "s + data[i]" from int to
int8_t, implicit in the assignment, also cannot "overflow" since that
term applies only to the evaluation of operators. But if this value is
outside the range for int8_t, then the conversion is
implementation-defined behaviour. (That is unlike signed integer
overflow, which is undefined behaviour.)
And that is one of the things that bugs me, in languages like C
and Fortran both.
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
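For reference, the gcc/clang builtin mentioned above is used like this; it reports whether the signed addition overflowed, something standard C still offers no direct way to ask:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    int sum;
    if (__builtin_add_overflow(INT_MAX, 1, &sum))
        printf("overflow, wrapped result = %d\n", sum);
    else
        printf("ok, sum = %d\n", sum);
    return 0;
}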
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the architectures of old.
Those ideas are that integer overflows do not happen and that a
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Thomas Koenig <tkoenig@netcologne.de> writes:
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
 int8_t s = 0;
 for (unsigned i = 0 i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Which is why len should have been declared as size_t. A negative
array length is nonsensical.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960's had a flags register with an overflow bit.
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
We have
been paying with additional sign-extension and zero-extension
operations ever since then, and it has even deformed architectures:
ARM A64 has addressing modes that include sign- or zero-extending a
32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
compressed extension, probably because they are so frequent because
they are used in RISC-V's idioms for sign and zero extension.
Also, these architectures probably would not have the so-called 32-bit arithmetic instructions (like RV64G's addw) if the mainstream C world
had decided to use ILP64. RV64G could have left these instructions
away and replaced them with a sequence of add, slli, srli, i.e., a
64-bit addition followed by a sign-extension idiom. After all, RISC-V
seems to favour sequences of more general instructions over having
more specialized instructions (and addressing modes). But apparently
the frequency of 32-bit additions is so high thanks to I32LP64 that
they added addw and addiw to RV64G; and they even occupy space in the compressed extension (16-bit encodings of frequent instructions).
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
E.g., consider the function
void sext(int M, int *ic, int *is)
{
int k;
for (k = 1; k <= M; k++) {
ic[k] += is[k];
}
}
which is based on the only loop (from 456.hmmer) in SPECint 2006 where
the difference between -fwrapv and the default produces a measurable performance difference (as reported in section 3.3 of <https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf>). I
created variations of this function, where the types of M and k were
changed to b) unsigned, c) intptr_t, d) uintptr_t and compiled the
code with "gcc -Wall -fwrapv -O3 -c -fno-unroll-loops". The loop body
looks as follows on RV64GC:
int                       unsigned                  (u)intptr_t
.L3:                      .L8:                      .L15:
slli a5,a4,0x2            slli a5,a4,0x20           lw a5,0(a1)
add a6,a1,a5              srli a5,a5,0x1e           lw a4,4(a2)
add a5,a5,a2              add a6,a1,a5              addi a1,a1,4
lw a3,0(a6)               add a5,a5,a2              addi a2,a2,4
lw a5,0(a5)               lw a3,0(a6)               addw a5,a5,a4
addiw a4,a4,1             lw a5,0(a5)               sw a5,-4(a1)
addw a5,a5,a3             addiw a4,a4,1             bne a2,a3,54 <.L15>
sw a5,0(a6)               addw a5,a5,a3
bge a0,a4,6 <.L3>         sw a5,0(a6)
                          bgeu a0,a4,28 <.L8>
There is no difference between the intptr_t loop body and the
uintptr_t loop. And without -fwrapv, the int loop looks just like the (u)intptr_t loop (because the C compiler then assumes that signed
integer overflow does not happen).
So, if you don't have a specific reason to choose int or unsigned,
better use intptr_t or uintptr_t, respectively. In this way you can circumvent some of the damage that I32LP64 has done.
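Spelled out, the intptr_t variation referred to above looks like this (the int original is shown earlier in the message); with the loop index and bound full-width, no 32->64-bit extension is left in the loop:

#include <stdint.h>

void sext_intptr(intptr_t M, int *ic, int *is)
{
    intptr_t k;
    for (k = 1; k <= M; k++) {
        ic[k] += is[k];
    }
}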
On 20/02/2024 16:17, Terje Mathisen wrote:
Scott Lurndal wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
And this is reflected in more recent architectures: Many architectures since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960's had a flags register with an overflow bit.
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC, as flag registers are a bottleneck for OOO and superscaling, overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
x86 has had an 'O' (Overflow) flags bit since the very beginning
PS. INTO was removed in AMD64, I don't remember exactly what the opcode
was repurposed for?
Scott Lurndal wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960's had a flags register with an overflow bit.
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Not only that, these cpus also had a dedicated single-byte opcode INTO
(hex 0xCE) to allow you to implement exception-style overflow handling
with very little impact on the mainline program, just emit that INTO
opcode directly after any program sequence where the compiler believed
that an overflow which should be handled, might happen.
Terje
PS. INTO was removed in AMD64, I don't remember exactly what the opcode
was repurposed for?
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
Architectures in the 1960's had a flags register with an overflow bit.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
IIRC S/360 has two modes of operation: One where, on signed addition, overflow traps, and one where it sets some flag; and the flag-setting
is not as consistent as say the NZCV flags on modern architectures;
instead, there are two bits that can mean anything at all, depending
on the instruction that sets them. In any case, if you use a program
that checks for overflows, then you either have to change the mode to non-trapping before the addition and change it back afterwards, or all
signed overflows that are not checked explicitly are ignored.
Moreover, addition with carry-in was only added in ESA/390 in 1990.
So they certainly did not expect multi-precision arithmetic or Bignums
before then.
COBOL:
ADD 1 TO TALLY ON OVERFLOW ...
BPL:
IF OVERFLOW ...
This BPL: <https://academic.oup.com/comjnl/article/25/3/289/369715>?
Thomas Koenig <tkoenig@netcologne.de> writes:
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Which is why len should have been declared as size_t. A negative
array length is nonsensical.
It was a pure typo from my side. In Rust array indices are always of
"size_t" type, you have to explicitely convert/cast anything else before
you can use it in a lookup:
opcodes[addr as usize]
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
Wouldn't it be better to forbid mixing of signedness?
I don't know
Fortran, so that might be a silly question!
For my C programming, I like to have "gcc -Wconversion -Wsign-conversion -Wsign-compare" to catch unintended mixes of signedness.
Architectures in the 1960's had a flags register with an overflow bit.
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC,
as flag registers are a bottleneck for OOO and
superscaling
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes, >picked to fit the range you need).
targets. If you want to force the issue, then "int64_t" is IMHO clearer
than "long long int" and does not give a strange impression where you
are using a type aimed at pointer arithmetic for general integer arithmetic.
If you want fast local variables, use C's [u]int_fastN_t types. That's
what they are for.
Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so letting the compiler assume that will not happen is a good thing.
(And
it lets you check for overflow bugs using run-time sanitizers.)
scott@slp53.sl.home (Scott Lurndal) writes:
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
architectures. Given the history of C, that's not surprising: BCPL
and B have a single type, the machine word, and it eventually became
C's int. You see this in "int" declarations being optional in various places. So code portable between 16-bit and 32-bit systems could not
assume that int has a specific size (such as 32 bits), but if it
assumed that sizeof(int) == sizeof(int *), that would port fine
between 16-bit and 32-bit Unixes. There may have been C code that
assumed that sizeof(int)==4, but why cater to this kind of code which
did not even port to 16-bit systems?
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
Implementing a flags register in an OoO microarchitecture does require execution resources, however.
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
That's nonsense. People use carry for implementing multi-precision arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.
- anton
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
IIRC S/360 has two modes of operation: One where, on signed addition, overflow traps, and one where it sets some flag; and the flag-setting
is not as consistent as say the NZCV flags on modern architectures;
instead, there are two bits that can mean anything at all, depending
on the instruction that sets them. In any case, if you use a program
that checks for overflows, then you either have to change the mode to non-trapping before the addition and change it back afterwards, or all
signed overflows that are not checked explicitly are ignored.
What I would like is a compiler flag that did "IFF when an int (or unsigned) ends up in a register, promote it to the 'fast' type".
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
David Brown <david.brown@hesbynett.no> schrieb:
On 17/02/2024 19:58, Terje Mathisen wrote:
int8_t sum(int len, int8_t data[])
{
    int8_t s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return s;
}
will overflow if called with data = [127, 1, -2], right?
No. In C, int8_t values will be promoted to "int" (which is always
at least 16 bits, on any target) and the operation will therefore
not overflow.
Depending on len and the data...
[...]
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
I only heard about (u)intptr_t long after my first contact with
I32LP64 in 1995. I don't think it existed at the time.
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
We are in an era where long has higher performance than ints (except for cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
I know no implementation of a 64-bit architecture where ALU operations (except maybe division where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
Guess 1: There was more software that depended on sizeof(int)==4 than software that depended on sizeof(int)==sizeof(char *).
Guess 2: When benchmarketing without adapting the source code (as is
usual), I32LP64 produced better numbers than ILP64 for some
benchmarks, because arrays and other data structures with int elements
are smaller and have better cache hit rates.
My guess is that it was a mixture of 1 and 2, with 2 being the
decisive factor.
I have certainly seen a lot of writing about how
64-bit (pointers) hurt performance, and it even led to the x32
nonsense (which never went anywhere, not surprising to me). These
days support for 32-bit applications is eliminated from ARM cores,
another indication that the performance advantages of 32-bit pointers
are minor.
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes,
picked to fit the range you need).
Sure, and then the program might break when an array has more than 2^31 elements; or it might work on one platform and break on another one.
By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason to use something else for a loop counter.
For a type of which you store many in an array or other data
structure, you probably prefer int32_t rather than int_fast32_t if 32
bits is enough.
So I don't see a reason for int_fast32_t etc.
These adapt suitably for different
targets. If you want to force the issue, then "int64_t" is IMHO clearer
than "long long int" and does not give a strange impression where you
are using a type aimed at pointer arithmetic for general integer arithmetic.
Why do you bring up "long long int"?
As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
If you want fast local variables, use C's [u]int_fastN_t types. That's
what they are for.
I don't see a point in those types. What's wrong with (u)intptr_t IYO?
Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so
letting the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.
Anyway, if I have made a mistake in my code, why would letting the
compiler assume that I did not make a mistake be a good thing?
I OTOH prefer if the compiler behaves consistently, so I use -fwrapv,
and for good performance, I write the code appropriately (e.g., by
using intptr_t instead of int).
(And
it lets you check for overflow bugs using run-time sanitizers.)
If the compiler assumes that overflow does not happen, how do these "sanitizers" work?
Anyway, I certainly have code that relies on modulo arithmetic.
On 2/20/24 10:25, David Brown wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"What I would like is a compiler flag that did "IFF when an int (or
sizes, picked to fit the range you need). These adapt suitably for
different targets. If you want to force the issue, then "int64_t" is
IMHO clearer than "long long int" and does not give a strange
impression where you are using a type aimed at pointer arithmetic for
general integer arithmetic.
unsigned)
ends up in a register, promote it to the 'fast' type". This would be great when compiling dusty C decks. (Was there ever C code on punched cards?)
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
Implementing a flags register in an OoO microarchitecture does require execution resources, however.
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
That's nonsense. People use carry for implementing multi-precision arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.
On 20/02/2024 18:47, Anton Ertl wrote:
And support for 32-bit has /not/ been "eliminated from ARM cores". It
may have been eliminated from the latest AArch64 cores - I don't keep
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 18:47, Anton Ertl wrote:
And support for 32-bit has /not/ been "eliminated from ARM cores".
It may have been eliminated from the latest AArch64 cores
The ARMv8 architecture fully supports the A32 and T32 instruction
sets.
Implementations of the architecture can choose not to implement
the A32 and T32 instruction sets. Some ARMv8 implementations
(e.g. Cavium's) never implemented A32 or T32. Many (if not most) ARM implementations of ARMv8 implemented the A32/T32 instruction
sets for EL0 (user-mode) only - I'm not aware of any that
supported A32 at privileged exception levels (EL1, EL2 or EL3).
Some of the more recent ARM neoverse cores support A32/T32 at EL0,
and some of them don't. Cavium's cores were 64-bit only.
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in
microcontrollers
Indeed, and many ARMv8 SoCs include arm 32-bit M-series
microcontrollers on-chip.
On 20/02/2024 20:18, Brian G. Lucas wrote:
On 2/20/24 10:25, David Brown wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"What I would like is a compiler flag that did "IFF when an int (or unsigned)
sizes, picked to fit the range you need). These adapt suitably
for different targets. If you want to force the issue, then
"int64_t" is IMHO clearer than "long long int" and does not give a
strange impression where you are using a type aimed at pointer
arithmetic for general integer arithmetic.
ends up in a register, promote it to the 'fast' type". This would
be great when compiling dusty C decks. (Was there ever C code on
punched cards?)
Well, that will happen to at least some extent (with an optimising compiler), at least as long as the answer is the same in the end. It
can be done a bit more often with "int" rather than "unsigned int", precisely because you promised the compiler that your arithmetic
won't overflow so it does not need to worry about that possibility.
On Wed, 21 Feb 2024 14:34:59 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 18:47, Anton Ertl wrote:
And support for 32-bit has /not/ been "eliminated from ARM cores".
It may have been eliminated from the latest AArch64 cores
The ARMv8 architecture fully supports the A32 and T32 instruction
sets.
Implementations of the architecture can choose not to implement
the A32 and T32 instruction sets. Some ARMv8 implementations
(e.g. Cavium's) never implemented A32 or T32. Many (if not most) ARM
implementations of ARMv8 implemented the A32/T32 instruction
sets for EL0 (user-mode) only - I'm not aware of any that
supported A32 at privileged exception levels (EL1, EL2 or EL3).
W.r.t. Arm Inc. Cortex-A cores that's simply wrong.
All 64-bit Cortex-A cores from the very first two (A53 and A57, 2012)
and up to A75 support A32/T32 at all four exception levels. Of those,
A53 and A55 are still produced and used in huge quantities.
The first Cortex-A 64-bit core that supports aarch32 only at EL0 is Cortex-A76 (2018).
On Wed, 21 Feb 2024 14:34:59 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
David Brown <david.brown@hesbynett.no> writes:
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in
microcontrollers
Indeed, and many ARMv8 SoCs include arm 32-bit M-series
microcontrollers on-chip.
Still, I don't think that the ratio of all ARM cores combined to cores
in smartphone application processors is really hundreds. I'd say,
something between 15 and 30.
On 20/02/2024 19:42, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning,
along with JO and JNO for Jump on Overflow and Jump if Not
Overflow.
Many processors had something similar. But I think they fell out
of fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
I was thinking more in terms of /using/ these flags, rather than ISA
support for them. ISAs would clearly have to keep the flag registers
and the instructions that used them if they wanted to keep
compatibility with 32-bit code.
But I think it was fairly rare to use the "add 32-bit and update
flags" instruction in 32-bit RSIC systems (except for 64-bit
arithmetic), and much rarer to use the "add 64-bit and update flags"
version in 64-bit versions.
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
I think it would mainly be /despite/ having a flag register, rather
than /because/ of it?
Sometimes having flags for overflows, carries, etc., can be very
handy. So having it in the ISA is useful. But I think you would
normally want your code to avoid setting or reading flags.
Implementing a flags register in an OoO microarchitecture does
require execution resources, however.
It would, I think, be particularly cumbersome to track several
parallel actions that all act on the flag register, as it is
logically a shared resource.
Do you think it is an advantage for a RISC architecture to have a
flags register compared to alternatives? Say we want to have a
double-width addition so that "res_hi:res_lo = 0:reg_a + 0:reg_b".
(I hope my pseudo-code is clear enough here.) With flags and an "add
with carry" instruction you could have :
carry = 0;
carry, res_lo = reg_a + reg_b + carry
carry, res_hi = 0 + 0 + carry
Alternatively, you could have a double-register result, at the cost
of having more complex register banks :
res_hi:res_lo = reg_a + reg_b
Or you could have an "add and take the high word" instruction and use
two additions :
res_hi = (reg_a + reg_b) >> N
res_lo = reg_a + reg_b
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
That's nonsense. People use carry for implementing multi-precision arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.
Fair enough, you do want carry (or an equivalent) for big number
work. But I would still contend that the vast majority of integers
and integer arithmetic used in code will fit within 32 bits, and the
vast majority of those that don't, will fit within 64 bits. Once you
go beyond that, you will need lots of bits (such as, as you say, cryptography).
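For what it's worth, here is a minimal portable-C sketch of the double-width addition discussed above, written without any flags register (64-bit unsigned words assumed; the names are illustrative):
    #include <stdint.h>

    /* res_hi:res_lo = 0:a + 0:b; the carry out of the low word is
       recovered with an unsigned comparison instead of a carry flag. */
    void add_double_width(uint64_t a, uint64_t b,
                          uint64_t *res_hi, uint64_t *res_lo)
    {
        uint64_t lo = a + b;      /* wraps modulo 2^64 */
        *res_lo = lo;
        *res_hi = (lo < a);       /* 1 if the addition wrapped, else 0 */
    }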
On Wed, 21 Feb 2024 13:27:23 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 20/02/2024 20:18, Brian G. Lucas wrote:
On 2/20/24 10:25, David Brown wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"What I would like is a compiler flag that did "IFF when an int (or
sizes, picked to fit the range you need). These adapt suitably
for different targets. If you want to force the issue, then
"int64_t" is IMHO clearer than "long long int" and does not give a
strange impression where you are using a type aimed at pointer
arithmetic for general integer arithmetic.
unsigned)
ends up in a register, promote it to the 'fast' type". This would
be great when compiling dusty C decks. (Was there ever C code on
punched cards?)
Well, that will happen to at least some extent (with an optimising
compiler), at least as long as the answer is the same in the end. It
can be done a bit more often with "int" rather than "unsigned int",
precisely because you promised the compiler that your arithmetic
won't overflow so it does not need to worry about that possibility.
In case of array indices I'd replace your "a bit more" by "a lot
more".
If one wants top performance on 64-bit architectures then avoiding
'unsigned int' indices is a very good idea. Hoping that the compiler will
somehow figure out what you meant instead of doing what you wrote is naive.
On Wed, 21 Feb 2024 13:49:41 +0100
David Brown <david.brown@hesbynett.no> wrote:
On OoO, when you are setting flags almost all the time, you are
effectively telling an engine that flags results of the previous
arithmetic instructions are DNC. In theory, it can be used to avoid
majority of updates of flags in PRF. I don't know whether such
optimization is actually done in real HW.
Reading flags can't really be rare, because conditional branches
are among most common instructions in the real-world code.
On 21/02/2024 17:33, Michael S wrote:
On Wed, 21 Feb 2024 13:27:23 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 20/02/2024 20:18, Brian G. Lucas wrote:
On 2/20/24 10:25, David Brown wrote:
I would suggest C "fast" types like int_fast32_t (or other "fast"What I would like is a compiler flag that did "IFF when an int (or
sizes, picked to fit the range you need). These adapt suitably
for different targets. If you want to force the issue, then
"int64_t" is IMHO clearer than "long long int" and does not give a
strange impression where you are using a type aimed at pointer
arithmetic for general integer arithmetic.
unsigned)
ends up in a register, promote it to the 'fast' type". This would
be great when compiling dusty C decks. (Was there ever C code on
punched cards?)
Well, that will happen to at least some extent (with an optimising
compiler), at least as long as the answer is the same in the end. It
can be done a bit more often with "int" rather than "unsigned int",
precisely because you promised the compiler that your arithmetic
won't overflow so it does not need to worry about that possibility.
In case of array indices I'd replace your "a bit more" by "a lot
more".
I haven't measured the real-world performance impact (I am more
interested in performance on microcontroller cores). So I'll believe whatever you and the others here say on that!
If one wants top performance on 64-bit architectures then avoiding
'unsigned int' indices is a very good idea. Hoping that the compiler will
somehow figure out what you meant instead of doing what you wrote is
naive.
Indeed.
On Wed, 21 Feb 2024 13:49:41 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 20/02/2024 19:42, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning,
along with JO and JNO for Jump on Overflow and Jump if Not
Overflow.
Many processors had something similar. But I think they fell out
of fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
I was thinking more in terms of /using/ these flags, rather than ISA
support for them. ISAs would clearly have to keep the flag registers
and the instructions that used them if they wanted to keep
compatibility with 32-bit code.
aarch64 is a completely new and incompatible instruction encoding.
But I think it was fairly rare to use the "add 32-bit and update
flags" instruction in 32-bit RSIC systems (except for 64-bit
arithmetic), and much rarer to use the "add 64-bit and update flags"
version in 64-bit versions.
Of course, the main use of flags is conditional branching. That's true
even on 16-bit.
The 2nd common use is conditional move/selection.
Other uses are just bonus, insignificant in The Great Scheme of Things. However I don't think that "bonus" use of flags for bignum and similar
is any rarer on 64-bit machines than on 32-bit.
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
I think it would mainly be /despite/ having a flag register, rather
than /because/ of it?
That's what Mitch thinks. But he has no proof.
Sometimes having flags for overflows, carries, etc., can be very
handy. So having it in the ISA is useful. But I think you would
normally want your code to avoid setting or reading flags.
On OoO, when you are setting flags almost all the time, you are
effectively telling an engine that flags results of the previous
arithmetic instructions are DNC. In theory, it can be used to avoid
majority of updates of flags in PRF. I don't know whether such
optimization is actually done in real HW.
Reading flags can't really be rare, because conditional branches
are among most common instructions in the real-world code.
Implementing a flags register in an OoO microarchitecture does
require execution resources, however.
It would, I think, be particularly cumbersome to track several
parallel actions that all act on the flag register, as it is
logically a shared resource.
Do you think it is an advantage for a RISC architecture to have a
flags register compared to alternatives? Say we want to have a
double-width addition so that "res_hi:res_lo = 0:reg_a + 0:reg_b".
(I hope my pseudo-code is clear enough here.) With flags and an "add
with carry" instruction you could have :
carry = 0;
carry, res_lo = reg_a + reg_b + carry
carry, res_hi = 0 + 0 + carry
That's not how we do it.
We just use normal add for the first instruction.
Alternatively, you could have a double-register result, at the cost
of having more complex register banks :
res_hi:res_lo = reg_a + reg_b
Or you could have an "add and take the high word" instruction and use
two additions :
res_hi = (reg_a + reg_b) >> N
res_lo = reg_a + reg_b
This variant does not scale to longer additions. Producing a carry, i.e.
an instruction having effectively two outputs, is the smaller part of the
advantage of a flags-based scheme. The bigger part is consuming carry, i.e.
having effectively three inputs. In order to see it, you have to think
about the triple-width (or wider) case.
The words in between the first and the last are the most challenging for
MIPS/Alpha/RISC-V, where they need 5 instructions vs 1 instruction on
x86/Arm/SPARC.
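To make the 5-vs-1 point concrete, here is a sketch in plain C of what one middle word of a multi-word addition costs when there is no flags register (each statement corresponds roughly to one instruction on MIPS/Alpha/RISC-V; on x86/Arm/SPARC the whole body is a single add-with-carry instruction; the names are illustrative):
    #include <stdint.h>

    /* One middle word: consume a carry-in, produce the sum word and a carry-out. */
    uint64_t add_middle_word(uint64_t a, uint64_t b,
                             uint64_t carry_in, uint64_t *carry_out)
    {
        uint64_t t  = a + b;            /* 1: add the two words            */
        uint64_t c1 = (t < a);          /* 2: carry out of that add        */
        uint64_t r  = t + carry_in;     /* 3: add the incoming carry       */
        uint64_t c2 = (r < t);          /* 4: carry out of adding carry-in */
        *carry_out  = c1 | c2;          /* 5: combine the two carries      */
        return r;
    }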
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
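As a side note, the GCC/Clang __builtin_add_overflow mentioned earlier in the thread can be wrapped in the same kind of inline function; a minimal sketch (the wrapper name is illustrative, not standard C):
    #include <stdbool.h>

    /* Returns true if a+b would overflow int; otherwise stores the sum in
       *res and returns false.  Relies on the GCC/Clang builtin. */
    static inline bool checked_add_int(int a, int b, int *res)
    {
        int tmp;
        if (__builtin_add_overflow(a, b, &tmp))
            return true;
        *res = tmp;
        return false;
    }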
I know no implementation of a 64-bit architecture where ALU operations (except maybe division where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
I know no implementation of a 64-bit architecture where ALU operations
(except maybe division where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
A guess: people did not want sizeof(float) != sizeof(float).
float
is certainly faster than double.
It would also have broken Fortran, where storage association rules mean
that both REAL and INTEGER have to have the same size, and DOUBLE
PRECISION twice that. Breaking that would have invalidated just
about every large scientific program at the time.
On Wed, 21 Feb 2024 13:49:41 +0100
David Brown <david.brown@hesbynett.no> wrote:
[...]
However I don't think that "bonus" use of flags for bignum and similar
is any rarer on 64-bit machines than on 32-bit.
Sometimes having flags for overflows, carries, etc., can be very
handy. So having it in the ISA is useful. But I think you would
normally want your code to avoid setting or reading flags.
On OoO, when you are setting flags almost all the time, you are
effectively telling an engine that flags results of the previous
arithmetic instructions are DNC. In theory, it can be used to avoid
majority of updates of flags in PRF. I don't know whether such
optimization is actually done in real HW.
It would, I think, be particularly cumbersome to track several
parallel actions that all act on the flag register, as it is
logically a shared resource.
Still, as I said above, that just a bonus rather than a major reason to
have flags.
Michael S <already5chosen@yahoo.com> writes:
On ARM cores the number of physical flags registers is roughly 1/3 of
the number of physical integer registers (46 vs. 147 on A710, 39
vs. 120 on Neoverse N1/A76
<https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>,
but I guess that this is due to being able to suppress flags updates
rather than ignoring those that are overwritten without being read.
That assumes that all of A32, T32, and A64 have the ability to
suppress flags updates. Do they?
Michael S <already5chosen@yahoo.com> writes:
On Wed, 21 Feb 2024 13:49:41 +0100
David Brown <david.brown@hesbynett.no> wrote:
[...]
However I don't think that "bonus" use of flags for bignum and
similar is any rarer on 64-bit machines than on 32-bit.
I think Bignums are more common on 64-bit machines.
I also think that general-purpose computers do more cryptography
(and multi-precision arithmetic for that) than embedded computers.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
On ARM cores the number of physical flags registers is roughly 1/3 of
the number of physical integer registers (46 vs. 147 on A710, 39
vs. 120 on Neoverse N1/A76
<https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>,
but I guess that this is due to being able to suppress flags updates
rather than ignoring those that are overwritten without being read.
That assumes that all of A32, T32, and A64 have the ability to
suppress flags updates. Do they?
A32, T32 and A64 have a bit in the instruction word that
specifies whether the flags should be updated. T16
only updates flags when the instruction is in an if-then (IT)
block.
I also think that general-purpose computers do more cryptography
(and multi-precision arithmetic for that) than embedded computers.
Hm. That may depend on whether you are comparing absolute numbers or proportions of work-loads.
Cryptography is very important for many
embedded systems (telecom, automatic teller machines, point-of-sale terminals, etc.) Last I looked (several years ago), there were several 8051-based chips (small 8-bit processors) on the market with dedicated on-chip HW accelerators for cryptography.
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 18:47, Anton Ertl wrote:
And support for 32-bit has /not/ been "eliminated from ARM cores". It
may have been eliminated from the latest AArch64 cores
The ARMv8 architecture fully supports the A32 and T32 instruction sets.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
On ARM cores the number of physical flags registers is roughly 1/3
of the number of physical integer registers (46 vs. 147 on A710, 39
vs. 120 on Neoverse N1/A76
<https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>,
but I guess that this is due to being able to suppress flags updates
rather than ignoring those that are overwritten without being read.
That assumes that all of A32, T32, and A64 have the ability to
suppress flags updates. Do they?
A32, T32 and A64 have a bit in the instruction word that
specifies whether the flags should be updated. T16
only updates flags when the instruction is in an if-then (IT)
block.
What is T16? Google does not give me anything appropriate for "ARM
T16"? Do you mean the 16-bit-encoded instructions in T32?
- anton
Terje Mathisen wrote:
If we have a platform where the default integer size is 32 bits
and long is 64 bits, then afaik the C promotion rules will use
int as the accumulator size, right?
Not necessarily:: accumulation rules allow the promotion of
int->long inside a loop
if the long is smashed back to int immediately after the loop
terminates.
David Brown wrote:
On 17/02/2024 19:58, Terje Mathisen wrote:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
On the third (i.e gripping) hand you could have a language like
Java where it would be illegal to transform a temporarily
trapping loop into one that would not trap and give the
mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct
answer"?
If you are talking about integer arithmetic, the limited integers
in Java have modulo semantics, i.e., they don't trap, and
BigIntegers certainly don't trap.
If you are talking about FP (like I did), by default FP addition
does not trap in Java, and any mention of "mathematically
correct" in connection with FP needs a lot of further
elaboration.
Sorry to be unclear:
I haven't really been following this thread, but there's a few
things here that stand out to me - at least as long as we are
talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits
and long is 64 bits, then afaik the C promotion rules will use int
as the accumulator size, right?
What I was trying to illustrate was the principle that by having a
wider accumulator you could aggregate a series of numbers, both
positive and negative, and get the correct (in-range) result, even
if the input happened to be arranged in such a way that it would
temporarily overflow the target int type.
I think it is much better to do it this way and then get a
conversion size trap at the very end when/if the final sum is in
fact too large for the result type.
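A minimal C sketch of that approach, with the accumulator widened explicitly rather than relying on the promotion rules (the names and the 32/64-bit split are illustrative):
    #include <stddef.h>
    #include <stdint.h>

    /* Sum 32-bit values in a 64-bit accumulator; intermediate sums may go
       outside the 32-bit range as long as the final result fits.  The check
       at the end is where a conversion-size trap or error would belong. */
    int32_t sum32(const int32_t *data, size_t len)
    {
        int64_t s = 0;
        for (size_t i = 0; i < len; i++)
            s += data[i];
        if (s > INT32_MAX || s < INT32_MIN) {
            /* handle the out-of-range final sum here */
        }
        return (int32_t) s;
    }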
On 20/02/2024 19:42, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 16:17, Terje Mathisen wrote:
x86 has had an 'O' (Overflow) flags bit since the very beginning, along with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
I was thinking more in terms of /using/ these flags, rather than ISA
support for them. ISAs would clearly have to keep the flag registers
and the instructions that used them if they wanted to keep compatibility
with 32-bit code.
But I think it was fairly rare to use the "add 32-bit and update flags"
instruction in 32-bit RISC systems (except for 64-bit arithmetic), and
much rarer to use the "add 64-bit and update flags" version in 64-bit
versions.
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
I think it would mainly be /despite/ having a flag register, rather than /because/ of it?
Sometimes having flags for overflows, carries, etc., can be very handy.
So having it in the ISA is useful. But I think you would normally want
your code to avoid setting or reading flags.
It would, I think, be particularly cumbersome to track several parallel
actions that all act on the flag register, as it is logically a shared
resource.
Do you think it is an advantage for a RISC architecture to have a flags
register compared to alternatives?
That's nonsense. People use carry for implementing multi-precision
arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.
Fair enough, you do want carry (or an equivalent) for big number work.
But I would still contend that the vast majority of integers and integer arithmetic used in code will fit within 32 bits, and the vast majority
of those that don't, will fit within 64 bits.
Once you go beyond that,
you will need lots of bits (such as, as you say, cryptography).
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:...
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
Sorry, I meant ptrdiff_t, which was used for pointer math.
Tim Rentsch wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
Missing my point:: which was::
The summation loop will not overflow, and overflow is detected at
the smash from int to int8_t.
On 20/02/2024 18:47, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 20/02/2024 13:00, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
mitchalsup@aol.com (MitchAlsup1) writes:
Another possible reason is that it is very useful to have integer types
with sizes 1, 2, 4 and 8 bytes. C doesn't have many standard integer
types, so if "int" is 64-bit, you have "short" as either 16-bit and have
no 32-bit type, or "short" is 32-bit and you have no 16-bit type. With
32-bit "int", it's easy to have each size without having to add extended
integer types or add new standard integer types (like "short short int"
for 16-bit and "short int" for 32-bit).
I saw benchmarks showing x32 being measurably faster,
but it's not
unlikely that the differences got less with more modern x86-64
processors (with bigger caches)
and it's simply not worth the effort
having another set of libraries and compiler targets just to make some
kinds of code marginally faster.
And support for 32-bit has /not/ been "eliminated from ARM cores".
It
may have been eliminated from the latest AArch64 cores - I don't keep
good track of these. But for every such core sold, there will be
hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers
and embedded systems.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes, >>> picked to fit the range you need).
Sure, and then the program might break when an array has more the 2^31
elements; or it might work on one platform and break on another one.
You need to pick the appropriate size for your data, as I said.
By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason why to use
something else for a loop counter.
I like my types to say what they mean. "uintptr_t" says "this object
holds addresses for converting pointer values back and forth with an
integer type".
Nonetheless, it is good design to use appropriate type names for
appropriate usage. This makes the code clearer, and increases portability.
As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
And that is why it makes sense to use the "fast" types. If you need a
16-bit range, use "int_fast16_t". It will be 64-bit on 64-bit systems, 32-bit on 32-bit systems, and 16-bit on 16-bit and 8-bit systems -
always supporting the range you need, as fast as possible.
Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so
letting the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.
Sorry, that comment went over my head - I don't know what
"demonstration" you are referring to.
For comparison to C, look at the Zig language. (It's not a language I
have used or know in detail, but I know a little about it.) Unsigned
integer arithmetic overflow is UB in Zig, just like signed integer
arithmetic overflow. There are standard options and block settings
(roughly equivalent to pragmas) to control whether these give a run-time error, or are assumed never to happen (for optimisation purposes). And
if you want to add "x" and "y" with wrapping, you use "x +% y" as an
explicit choice of operator.
That seems to me to be the right approach for an efficient high-level language.
If you want "int z = x + y;" with wrapping, write :
int z = (unsigned) x + (unsigned) y;
And even then, I always put it in a
pragma so that the code works even if someone uses different compiler flags.
There weren't instructions to do add or subtract with
carry but it was pretty easy to fake by doing a branch on no carry
around an instruction to add or subtract 1.
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
I am normally writing Rust these days, where UB is far less common,
but casts like this are mandatory.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
On ARM cores the number of physical flags registers is roughly 1/3 of
the number of physical integer registers (46 vs. 147 on A710, 39
vs. 120 on Neoverse N1/A76
<https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>,
but I guess that this is due to being able to suppress flags updates
rather than ignoring those that are overwritten without being read.
That assumes that all of A32, T32, and A64 have the ability to
suppress flags updates. Do they?
A32, T32 and A64 have a bit in the instruction word that
specifies whether the flags should be updated. T16
only updates flags when the instruction is in an if-then (IT)
block.
What is T16? Google does not give me anything appropriate for "ARM
T16"? Do you mean the 16-bit-encoded instructions in T32?
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:...
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
Sorry, I meant ptrdiff_t, which was used for pointer math.
I have seen little code that uses ptrdiff_t; quite a bit that used
size_t (the unsigned brother of ptrdiff_t). But my memory tells me
that even size_t was not very widespread in 1995.
John Levine <johnl@taugh.com> writes:
There weren't instructions to do add or subtract with
carry but it was pretty easy to fake by doing a branch on no carry
around an instruction to add or subtract 1.
That is sufficient for double-word arithmetic, but not for multi-word arithmetic. ESA/390 adds addition-with-carry-in.
On 18/02/2024 05:01, Tim Rentsch wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
Of course the conversion will be done implicitly. C converts almost
anything implicitly. Not that this is its greatest feature.
The explicit cast is still useful: 1/ to express intent (it shows that
the potential loss of data is intentional) and then 2/ to avoid
compiler warnings (if you enable -Wconversion, which I usually
recommend) or warning from any serious static analyzer too (which I
highly recommend using too).
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
mitchalsup@aol.com (MitchAlsup1) writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
I suspect most experienced C programs know that.
Yet, the 'superfluous' cast is also documentation that the
programmer _intended_ that the return value would be narrowed.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
...
Sorry, I meant ptrdiff_t, which was used for pointer math.
I have seen little code that uses ptrdiff_t; quite a bit that used
size_t (the unsigned brother of ptrdiff_t). But my memory tells me
that even size_t was not very widespread in 1995.
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
The next question would be how to do the same for multiplication....
According to the IBM manuals, add with carry was added in z/Series but
you could use it in S/390 mode on a z machine, so I guess someone
really wanted it for some existing code.
Second idea is to compute a double-width product,
or at least part of one, using standard multiple-precision
arithmetic, and speed compare against the floating-point method.
S>D ( n -- d ) sign-extends a single-cell to a double-cell.
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
I know no implementation of a 64-bit architecture where ALU operations
(except maybe division where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:
A guess: people did not want sizeof(float) != sizeof(float).
I assume that you mean that people wanted sizeof(int)==sizeof(float).
Why would they want that? That certainly did not hold on the PDP-11
and many other 16-bit systems where sizeof(int)==2 and
sizeof(float)==4.
float
is certainly faster than double.
On the 21064 or MIPS R4000 (the first 64-bit systems after those by
Cray)? I am pretty sure that FP addition, subtraction and
multiplication have the same speed on these CPUs in binary32 and
binary64.
It would also have broken Fortran, where storage association rules mean
that both REAL and INTEGER have to have the same size, and DOUBLE
PRECISION twice that. Breaking that would have invalidated just
about every large scientific program at the time.
C compilers choose their types according to their rules, and Fortran
chooses its types according to its rules. I don't see what C's int
type has to do with Fortran's INTEGER type.
And if the rules you
specify above mean that Fortran on the PDP-11 has a 4-byte INTEGER
type, there is already precedent for int having a different size from INTEGER.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
...
Sorry, I meant ptrdiff_t, which was used for pointer math.
I have seen little code that uses ptrdiff_t; quite a bit that used
size_t (the unsigned brother of ptrdiff_t). But my memory tells me
that even size_t was not very widespread in 1995.
In 1995 a problem with both size_t and ptrdiff_t is that there
were no corresponding length modifiers for those types in
printf() format conversions (corrected in C99).
John Levine <johnl@taugh.com> writes:
According to the IBM manuals, add with carry was added in z/Series but
you could use it in S/390 mode on a z machine, so I guess someone
really wanted it for some existing code.
Interestingly,
<https://en.wikibooks.org/wiki/360_Assembly/360_Instructions> lists
ALC, ALCR, SLB and SLBR as belonging to the 390 instructions, and only
ALCG, ALCGR, SLBG and SLBGR (the 64-bit variants) as Z instructions.
If they added ALC, ALCR, SLB and SLBR only in Z (but in the S/390
mode), that is counterevidence for the claim that add-with-carry is
less important for 64-bit systems than for 32-bit systems.
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
Heavens to Betsy! Are you impugning the quality and excellence
of my code? Of *my* code? I can only hope that you are suitably
chagrined and contrite. ;)
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
Rather that do that I will explain.
An addition overflows if the two operands have the same sign and
the sign of an operand is the opposite of the sign of the sum
(taken mod the width of the operands). Convert the signed
operands to their unsigned counterparts, and form the sum of the
unsigned values. The sign is just the high-order bit in each
case. Thus the overflow condition can be detected with a few
bitwise xors and ands.
Subtraction is similar except now overflow can occur only when
the operands have different signs and the sign of the sum is
the opposite of the sign of the first operand.
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using C preprocessor conditional, as for example
#if UINT_MAX > INT_MAX && INT_MIN == -INT_MAX - 1
// this case is the one outlined above
#elif UINT_MAX > INT_MAX && INT_MIN == -INT_MAX
#elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX - 1
#elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX
Does that all make sense?
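A minimal C rendering of the addition check described above, for the common case of two's-complement int with UINT_MAX > INT_MAX (the function name is illustrative):
    #include <limits.h>
    #include <stdbool.h>

    static inline bool add_overflows(int a, int b)
    {
        unsigned ua = (unsigned) a, ub = (unsigned) b;
        unsigned us = ua + ub;                 /* wraps mod 2^N, well defined */
        unsigned sign = ~(UINT_MAX >> 1);      /* mask for the high-order bit */
        /* overflow iff the operands have the same sign and the sum's sign
           differs from theirs */
        return (~(ua ^ ub) & (ua ^ us) & sign) != 0;
    }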
The next question would be how to do the same for multiplication....
Multiplication is a whole other ball game. First we need to
consider only the widest types, because narrower types can be
carried out in a wider type and the resulting product value
checked. Off the top of my head, for the widest types I would
try converting to float or double, do a floating-point multiply,
and do some trivial accepts and trivial rejects based on the
exponent of the result. Any remaining cases would need more
care, but probably (we hope!) there aren't many of those and they
don't happen very often. So for what it's worth there is my
first idea. Second idea is to compute a double-width product,
or at least part of one, using standard multiple-precision
arithmetic, and speed compare against the floating-point method.
I better stop now or the ideas will probably get worse rather
than better. :/
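For the easy case mentioned above, where the operands are narrower than the widest type, the product can simply be computed in a wider type and checked; a minimal sketch for 32-bit operands with a 64-bit product (names are illustrative):
    #include <stdbool.h>
    #include <stdint.h>

    static inline bool mul_overflows32(int32_t a, int32_t b)
    {
        int64_t p = (int64_t) a * (int64_t) b;   /* exact double-width product */
        return p > INT32_MAX || p < INT32_MIN;
    }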
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
        s += data[i];
    }
    return (int8_t) s;
}
The cast in the return statement is superfluous.
I am normally writing Rust these days, where UB is far less common,
but casts like this are mandatory.
Oh. I didn't know that about Rust. Interesting.
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
understand that it has the support of community sentiment, but
even so it seems like a poor choice here. I'm not a big fan of
the new attribute syntax, but a form like
return [[narrow]] s;
looks to be a better way of asking Rust to allow what is a
normally disallowed conversion. By contrast, using a cast is
overkill. There is unnecessary redundancy, by specifying a type
in two places, and the risk that they might get out of sync. And
on general principles requiring a cast violates good security
principles. If someone needs access to a particular room in a
building, we don't hand over a master key that opens every room
in the building. If someone needs to read some documents that
have classified materials, we don't give them an access code that
lets them read any sensitive material regardless of whether it's
relevant. Maybe Rust is different, but in C a cast allows any
conversion that is possible in the language, even the unsafe
ones. It just seems wrong to use the nuclear option of casting
for every minor infringement.
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all such
casts are in fact legal, i.e. the data will fit in the target container.
This is of course the total opposite of the C "nuclear option", and much more like other languages that try to be secure by default.
Terje Mathisen wrote:
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all
such casts are in fact legal, i.e. the data will fit in the target container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
Just to make sure, I spoke to our resident Rust guru, and he told me I
was wrong:
Rust does have conversion operators/functions for downsizing
variables, and they come with full validity checking, but using "s as
u8" as I suggested will generate exactly the same code as a C
"(uint8_t) s" idiom, i.e. no verification and no safety checks.
Terje
On Tue, 27 Feb 2024 13:50:43 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Terje Mathisen wrote:
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all
such casts are in fact legal, i.e. the data will fit in the target
container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
Just to make sure, I spoke to our resident Rust guru, and he told me I
was wrong:
Rust does have conversion operators/functions for downsizing
variables, and they come with full validity checking, but using "s as
u8" as I suggested will generate exactly the same code as a C
"(uint8_t) s" idiom, i.e. no verification and no safety checks.
Terje
Pretty much in the spirit of Ada's Unchecked_Conversion construct, but with a
less striking visual hint of doing something unusual and potentially dangerous.
Michael S wrote:
On Tue, 27 Feb 2024 13:50:43 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Terje Mathisen wrote:
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all
such casts are in fact legal, i.e. the data will fit in the target
container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
Just to make sure, I spoke to our resident Rust guru, and he told me I
was wrong:
Rust does have conversion operators/functions for downsizing
variables, and they come with full validity checking, but using "s as
u8" as I suggested will generate exactly the same code as a C
"(uint8_t) s" idiom, i.e. no verification and no safety checks.
Terje
Pretty much in the spirit of Ada's Unchecked_Conversion construct, but with a
less striking visual hint of doing something unusual and potentially
dangerous.
No more dangerous than::
if( c >= 'A' and c <= 'Z' ) c -= 'A'-'a';
or
if( table[c] & CAPS ) c -='A'-'a';
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
The Unix code ported relatively easily to I32LP64 because
uintptr_t had been used extensively rather than assumptions
about
sizeof(int) == sizeof(int *).
...
Sorry, I meant ptrdiff_t, which was used for pointer math.
I have seen little code that uses ptrdiff_t; quite a bit that
used size_t (the unsigned brother of ptrdiff_t). But my memory
tells me that even size_t was not very widespread in 1995.
In 1995 a problem with both size_t and ptrdiff_t is that there
was no standard printf conversion for printing them.
Calling it a "problem" is overstating the case. It was
straightforward enough, if not completely portable, to
use the appropriate number of 'l' modifiers.
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
Second idea is to compute a double-width product,
or at least part of one, using standard multiple-precision
arithmetic, and speed compare against the floating-point method.
What "standard multiple-precision arithmetic" is there in C? I am
not aware of any.
If you have widening multiplication in the language, things are
trivial. [...]
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
I am normally writing Rust these days, where UB is far less common,
but casts like this are mandatory.
Oh. I didn't know that about Rust. Interesting.
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all such
casts are in fact legal, i.e. the data will fit in the target
container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
understand that it has the support of community sentiment, but
even so it seems like a poor choice here. I'm not a big fan of
the new attribute syntax, but a form like
return [[narrow]] s;
looks to be a better way of asking Rust to allow what is a
normally disallowed conversion. By contrast, using a cast is
overkill. There is unnecessary redundancy, by specifying a type
in two places, and the risk that they might get out of sync. And
on general principles requiring a cast violates good security
principles. If someone needs access to a particular room in a
building, we don't hand over a master key that opens every room
in the building. If someone needs to read some documents that
have classified materials, we don't give them an access code that
lets them read any sensitive material regardless of whether it's
relevant. Maybe Rust is different, but in C a cast allows any
conversion that is possible in the language, even the unsafe
ones. It just seems wrong to use the nuclear option of casting
for every minor infringement.
I agree, if Rust did it like C, then it would be very unsafe indeed.
I have not checked the generated asm, but I believe that if I write
code like this:
// x:u64
x = 0x1234567890abcdef;
let y:u8 = (x & 255) as u8;
the compiler will see the mask and realize that the conversion is
safe, so no need to interpose a
cmp x,256
jae trp_conversion
idiom.
OTOH, I have seen C compilers that insist on such a test at the
end of a fully saturated switch statement, even when the mask in
front should prove that no other values are possible.
Terje Mathisen wrote:
Tim Rentsch wrote:
I'm always somewhat surprised when someone advocates using a cast
for such things, and now more surprised to learn that Rust has
chosen to impose using a cast as part of its language rules. I
I am not _sure_ but I believe Rust will in fact verify that all such
casts are in fact legal, i.e. the data will fit in the target
container.
This is of course the total opposite of the C "nuclear option", and
much more like other languages that try to be secure by default.
Just to make sure, I spoke to our resident Rust guru, and he told me I
was wrong:
Rust does have conversion operators/functions for downsizing
variables, and they come with full validity checking, but using "s as
u8" as I suggested will generate exactly the same code as a C
"(uint8_t) s" idiom, i.e. no verification and no safety checks.
Tim Rentsch wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
Heavens to Betsy! Are you impugning the quality and excellence
of my code? Of *my* code? I can only hope that you are suitably
chagrined and contrite. ;)
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
Rather than do that I will explain.
An addition overflows if the two operands have the same sign and
the sign of an operand is the opposite of the sign of the sum
(taken mod the width of the operands). Convert the signed
operands to their unsigned counterparts, and form the sum of the
unsigned values. The sign is just the high-order bit in each
case. Thus the overflow condition can be detected with a few
bitwise xors and ands.
Subtraction is similar except now overflow can occur only when
the operands have different signs and the sign of the sum is
the opposite of the sign of the first operand.
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using a C preprocessor conditional, as for example
#if UINT_MAX > INT_MAX && INT_MIN == -INT_MAX - 1
// this case is the one outlined above
#elif UINT_MAX > INT_MAX && INT_MIN == -INT_MAX
#elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX - 1
#elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX
Does that all make sense?
The next question would be how to do the same for multiplication....
Multiplication is a whole other ball game. First we need to
consider only the widest types, because narrower types can be
carried out in a wider type and the resulting product value
checked. Off the top of my head, for the widest types I would
try converting to float or double, do a floating-point multiply,
and do some trivial accepts and trivial rejects based on the
exponent of the result. Any remaining cases would need more
care, but probably (we hope!) there aren't many of those and they
don't happen very often. So for what it's worth there is my
first idea. Second idea is to compute a double-width product,
or at least part of one, using standard multiple-precision
arithmetic, and speed compare against the floating-point method.
I better stop now or the ideas will probably get worse rather
than better. :/
Your floating point method is pretty bad, imho, since it can give
you both false negatives and false positives, with no way to know
for sure, except doing it all over again.
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
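[Editorial illustration: the code is elided above, but the usual shape of a
64x64->128 unsigned multiply built from 32-bit parts is something like the
following sketch (mine, not Terje's actual routine).]

#include <stdint.h>

/* Unsigned 64x64 -> 128 multiply using 32-bit halves held in u64s. */
static void umul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;

    uint64_t ll = a_lo * b_lo;
    uint64_t lh = a_lo * b_hi;
    uint64_t hl = a_hi * b_lo;
    uint64_t hh = a_hi * b_hi;

    /* Fold the two middle partial products in, collecting the carries
       that cross the 64-bit boundary. */
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffu) + (hl & 0xffffffffu);

    *lo = (ll & 0xffffffffu) | (mid << 32);
    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}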
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like multiplicative hashes...
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which
hopefully will be considered in the next J3 meeting, it can be
found at https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
Illegal, hence an implementation would be free to trap or start
World War III (with a bit of an expectation that compilers would
trap when supplied with the right options).
My expectation is different: It would then be treated like signed
overflow, which is also illegal in Fortran. So, everybody will
implement it as if it were modular 2^n anyway, plus start optimizing
on the assumption that overflow cannot happen.
And, since in Fortran, arrays can start at arbitrary lower bounds
(an array can have a lower bound of -42 and an upper bound of -21,
for example), the use of unsigned integers for array indices is
somewhat less than in programming languages such as C or (I believe)
Rust where they always start at zero.
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
Heavens to Betsy! Are you impugning the quality and excellence
of my code? Of *my* code? I can only hope that you are suitably
chagrined and contrite. ;)
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
Rather than do that I will explain.
An addition overflows if the two operands have the same sign and
the sign of an operand is the opposite of the sign of the sum
(taken mod the width of the operands). Convert the signed
operands to their unsigned counterparts, and form the sum of the
unsigned values. The sign is just the high-order bit in each
case. Thus the overflow condition can be detected with a few
bitwise xors and ands.
Subtraction is similar except now overflow can occur only when
the operands have different signs and the sign of the sum is
the opposite of the sign of the first operand.
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using a C preprocessor conditional, as for example
On Mon, 11 Mar 2024 18:19:19 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which
hopefully will be considered in the next J3 meeting, it can be
found at https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
Illegal, hence an implementation would be free to trap or start
World War III (with a bit of an expectation that compilers would
trap when supplied with the right options).
So, speaking in C Standard language, UB.
My expectation is different: It would then be treated like signed
overflow, which is also illegal in Fortran. So, everybody will
implement it as if it were modular 2^n anyway, plus start optimizing
on the assumption that overflow cannot happen.
Yes, I'd expect the same.
And, since in Fortran, arrays can start at arbitrary lower bounds
(an array can have a lower bound of -42 and an upper bound of -21,
for example), the use of unsigned integers for array indices is
somewhat less than in programming languages such as C or (I believe)
Rust where they always start at zero.
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t. Now, opinions on which of the 3 is most
preferable, tend to vary.
What is the size of Fortran's default UNSIGNED ?
On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 11 Mar 2024 18:19:19 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Thomas Koenig <tkoenig@netcologne.de> schrieb:
David Brown <david.brown@hesbynett.no> schrieb:
On 20/02/2024 07:31, Thomas Koenig wrote:
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which
hopefully will be considered in the next J3 meeting, it can be
found at https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
(Testing a new news server, my old one was decommissioned...)
I was quite delighted, but also a little bit surprised that the
proposal (somewhat modified) actually passed.
Now, what's left is the people who do not want modular arithmetic,
for a reason that I am unable to fathom. I guess they don't like
multiplicative hashes...
What do they like?
To declare unsigned overflow UB? Or implementation defined? Or
trapping?
Illegal, hence an implementation would be free to trap or start
World War III (with a bit of an expectation that compilers would
trap when supplied with the right options).
So, speaking in C Standard language, UB.
Yes, that would be the translation. In Fortran terms, it would
violate a "shall" directive.
My expectation is different: It would then be treated like signed
overflow, which is also illegal in Fortran. So, everybody will
implement it as if it were modular 2^n anyway, plus start optimizing
on the assumption that overflow cannot happen.
Yes, I'd expect the same.
And, since in Fortran, arrays can start at arbitrary lower bounds
(an array can have a lower bound of -42 and an upper bound of -21,
for example), the use of unsigned integers for array indices is
somewhat less than in programming languages such as C or (I believe)
Rust where they always start at zero.
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under
IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t. Now, opinions on which of the 3 is most
preferable, tend to vary.
What is the size of Fortran's default UNSIGNED ?
It is not yet in the language; a paper has been passed by J3,
but it needs to be put to WG5, and WG5 has to agree that J3 should
put it into the standard proper for Fortran 202y (202x just
came out as Fortran 2023).
But if it does go in, it is likely that it will have the same
size as INTEGER, which is usually 32 bits.
However, what I did put in the paper (and what the subsequent
revision by a J3 subcommittee left in) is a prohibition against
using unsigneds in a DO loop. The reason is semantics of
negative strides.
Currently, in Fortran, the number of iterations of the loop
do i=m1,m2,m3
....
end do
is (m2-m1+m3)/m3 unless that value is negative, in which case it
is zero (m3 defaults to 1 if it is not present).
So,
do i=1,3,-1
will be executed zero times, as will
do i=3,1
Translating that into arithmetic with unsigned integers makes
little sense, how many times should
do i=1,3,4294967295
be executed?
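[Editorial illustration, not part of the proposal: the trip-count formula in
C, and how it turns ambiguous once the arithmetic wraps.  Evaluated exactly,
(3 - 1 + 4294967295) / 4294967295 is 1; evaluated modulo 2^32 the numerator
wraps to 1 and the count becomes 0.]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* DO i = 1, 3, -1 with signed arithmetic: max((m2-m1+m3)/m3, 0) trips */
    int32_t m1 = 1, m2 = 3, m3 = -1;
    int32_t trips = (m2 - m1 + m3) / m3;        /* 1 / -1 = -1 */
    if (trips < 0) trips = 0;                   /* zero-trip loop */

    /* the same bounds with a 32-bit unsigned "stride" of 4294967295 */
    uint32_t u1 = 1, u2 = 3, u3 = 4294967295u;
    uint32_t utrips = (u2 - u1 + u3) / u3;      /* numerator wraps to 1, so 0 */

    printf("signed trips: %d, unsigned trips: %u\n", trips, utrips);
    return 0;
}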
Thomas Koenig wrote:
However, what I did put in the paper (and what the subsequent
revision by a J3 subcommittee left in) is a prohibition against
using unsigneds in a DO loop. The reason is semantics of
negative strides.
Currently, in Fortran, the number of iterations of the loop
do i=m1,m2,m3
....
end do
is (m2-m1+m3)/m3 unless that value is negative, in which case it
is zero (m3 defaults to 1 if it is not present).
So,
do i=1,3,-1
will be executed zero times, as will
do i=3,1
Translating that into arithmetic with unsigned integers makes
little sense, how many times should
do i=1,3,4294967295
be executed?
3-1+4294967295 = 4294967297 // (m2-m1+m3)
4294967297 / 4294967295 = 1.0000000004656612874161594750863
So the loop should be executed one time. {{And yes, I know 4294967297 ==
0x1,0000,0001.}} What would you expect on a 36-bit machine (2s-complement)
where 4294967295 is representable naturally?
On 2024-02-25, Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using C preprocessor conditional, as for example
...
but that's implementation-defined behavior, correct?
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under
IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
There is a small gotcha if either of the inputs are of the 0x80000000
form, i.e. MININT, but the naive iabs() conversion will do the right
thing by leaving the input unchanged.
At the other end there cannot be any issues since restoring a negative
output sign cannot overflow/fail.
Terje
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
There is a small gotcha if either of the inputs are of the 0x80000000
form, i.e. MININT, but the naive iabs() conversion will do the right
thing by leaving the input unchanged.
At the other end there cannot be any issues since restoring a negative
output sign cannot overflow/fail.
Michael S <already5chosen@yahoo.com> writes:
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under
IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t.
If Fortran makes unsigned overflow illegal, Fortran compilers can
perform the same shenanigans for unsigned that C compilers do for
signed integers; so if signed int really is preferable because of
these shenanigans, unsigned with the same shenanigans would be
preferable, too.
On 2024-02-25, Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. [...]
It isn't hard to write standard C code to determine whether a
proposed addition or subtraction would overflow, and does so
safely and reliably.
Also efficiently and without resorting to implementation-
defined or undefined behavior (and without needing a bigger
type)?
[...]
Heavens to Betsy! Are you impugning the quality and excellence
of my code? Of *my* code? I can only hope that you are suitably
chagrined and contrite. ;)
It's a little bit tedious perhaps but not
difficult. Checking code can be wrapped in an inline function
and invoke whatever handling is desired, within reason.
Maybe you could share such code?
Rather than do that I will explain.
An addition overflows if the two operands have the same sign and
the sign of an operand is the opposite of the sign of the sum
(taken mod the width of the operands). Convert the signed
operands to their unsigned counterparts, and form the sum of the
unsigned values. The sign is just the high-order bit in each
case. Thus the overflow condition can be detected with a few
bitwise xors and ands.
Subtraction is similar except now overflow can occur only when
the operands have different signs and the sign of the sum is
the opposite of the sign of the first operand.
The above description works for two's complement hardware where
unsigned types have the same width as their corresponding signed
types. I think for most people that's all they need. The three
other possibilities are all doable with minor adjustments, and
code appropriate to each particular implementation can be
selected using a C preprocessor conditional, as for example
...
but that's implementation-defined behavior, correct?
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Michael S <already5chosen@yahoo.com> writes:
As discussed here just recently, there are good reasons to avoid
'unsigned' array indices in performance-oriented programs running under
IL32P64 or I32LP64 C environments. Everything else is preferable -
int, ptrdiff_t, size_t.
If Fortran makes unsigned overflow illegal, Fortran compilers can
perform the same shenanigans for unsigned that C compilers do for
signed integers; so if signed int really is preferable because of
these shenanigans, unsigned with the same shenanigans would be
preferable, too.
One problem is that, without 2^n modulo, something like a
multiplicative hash would be illegal.
People would do it anyway, ignoring the prohibition, because it
is so useful, and subsequent hilarity will ensue.
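[Editorial illustration of the multiplicative-hash point, not code from the
thread: a Fibonacci-style hash depends precisely on the multiply being
reduced mod 2^32.]

#include <stdint.h>

/* Map a key into a table of 2^table_bits slots (table_bits in 1..32).
   The constant is floor(2^32 / golden ratio); the multiply is meant to
   wrap, which is exactly the modular behaviour under discussion. */
static inline uint32_t fib_hash32(uint32_t key, unsigned table_bits)
{
    return (key * 2654435769u) >> (32 - table_bits);
}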
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
There is a small gotcha if either of the inputs are of the 0x80000000
form, i.e. MININT, but the naive iabs() conversion will do the right
thing by leaving the input unchanged.
At the other end there cannot be any issues since restoring a negative
output sign cannot overflow/fail.
It isn't quite that simple. Some of what you describe has a risk
of running afoul of implementation-defined behavior or undefined
behavior (as for example abs( INT_MIN )). I'm pretty sure it's
possible to avoid those pitfalls, but it requires a fair amount
of care and careful thinking.
Note that my goal is only to avoid the possibility of undefined
behavior that comes from signed overflow. My approach is to safely
determine whether the signed multiplication would overflow, and if
it wouldn't then simply use signed arithmetic to get the result.
I use unsigned types to determine the safety, and if it's safe then
use signed types to get a result. For the current problem I don't
care about widening, except as it might help to determine safety.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
In Gforth we use:
DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
DCell res;
res = UD2D(ummul (a, b));
if (a < 0)
res.hi -= b;
if (b < 0)
res.hi -= a;
return res;
}
I have this technique from Andrew Haley. It relies on twos-complement representation.
- anton
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
There is a small gotcha if either of the inputs are of the 0x80000000
form, i.e. MININT, but the naive iabs() conversion will do the right
thing by leaving the input unchanged.
At the other end there cannot be any issues since restoring a negative
output sign cannot overflow/fail.
It isn't quite that simple. Some of what you describe has a risk
of running afoul of implementation-defined behavior or undefined
behavior (as for example abs( INT_MIN )). I'm pretty sure it's
possible to avoid those pitfalls, but it requires a fair amount
of care and careful thinking.
It would be supremely nice if we could go back in time before
computers and reserve an integer encoding that represents the
value of "there is no value here" and mandate if upon integer
arithmetic.
Note that my goal is only to avoid the possibility of undefined
behavior that comes from signed overflow. My approach is to safely
determine whether the signed multiplication would overflow, and if
it wouldn't then simply use signed arithmetic to get the result.
Double width multiplication cannot overflow: an nxn multiply yields a
2n-bit result, so ignoring the top n bits gives you your non-overflowing
multiply.
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the
signs, convert the inputs to unsigned, then do the unsigned
widening mul, before finally restoring the sign as the XOR of the
input signs?
In Gforth we use:
DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
DCell res;
res = UD2D(ummul (a, b));
if (a < 0)
res.hi -= b;
if (b < 0)
res.hi -= a;
return res;
}
I have this technique from Andrew Haley. It relies on twos-complement
representation.
Yeah, that's what Alpha does with UMULH.
I'm still trying to figure out why it works.
Anton Ertl wrote:
DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
DCell res;
res = UD2D(ummul (a, b));
if (a < 0)
res.hi -= b;
if (b < 0)
res.hi -= a;
return res;
}
I have this technique from Andrew Haley. It relies on twos-complement
representation.
- anton
Yeah, that's what Alpha does with UMULH.
I'm still trying to figure out why it works.
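[Editorial note, not from the thread: reading a signed value a as unsigned
adds 2^64 exactly when a is negative, so modulo 2^128 the unsigned product
is the two's-complement encoding of a*b plus 2^64*b when a < 0 plus 2^64*a
when b < 0.  Those extra terms sit entirely in the high word, and the two
conditional subtractions remove them.  An exhaustive check of the same
identity at 8-bit width:]

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    for (int a = -128; a < 128; a++)
        for (int b = -128; b < 128; b++) {
            unsigned up = (unsigned)(uint8_t)a * (uint8_t)b;  /* unsigned 8x8 -> 16 */
            int hi = (int)(up >> 8);                          /* unsigned high byte */
            if (a < 0) hi -= b;                               /* the mmul fixup ... */
            if (b < 0) hi -= a;
            hi &= 0xff;                                       /* ... reduced mod 2^8 */
            int want = (uint16_t)(a * b) >> 8;                /* high byte of signed product */
            if (hi != want) { printf("mismatch %d %d\n", a, b); return 1; }
        }
    printf("identity holds for all 8-bit pairs\n");
    return 0;
}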
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Thomas Koenig wrote:
However, what I did put in the paper (and what the subsequent
revision by a J3 subcommittee left in) is a prohibition against
using unsigneds in a DO loop. The reason is semantics of
negative strides.
Currently, in Fortran, the number of iterations of the loop
do i=m1,m2,m3
....
end do
is (m2-m1+m3)/m3 unless that value is negative, in which case it
is zero (m3 defaults to 1 if it is not present).
So,
do i=1,3,-1
will be executed zero times, as will
do i=3,1
Translating that into arithmetic with unsigned integers makes
little sense, how many times should
do i=1,3,4294967295
be executed?
3-1+4294967295 = 4294967297 // (m2-m1+m3)
4294967297 / 4294967295 = 1.0000000004656612874161594750863
So the loop should be executed one time. {{And yes, I know 4294967297 ==
0x1,0000,0001.}} What would you expect on a 36-bit machine (2s-complement)
where 4294967295 is representable naturally?
Correct (of course).
The same result would be expected for
do i=1u,3u,-1u
(assuming a u suffix for unsigned numbers).
The problem is that this violates a Fortran basic assumption since
FORTRAN 77, which is that DO loops can be zero-trip.
This is a can of worms that I would like to leave unopened.
Same goes for array slices. Even assuming that no negative
indices are used, the slice a(1:3:-1) is zero-sized in Fortran,
as is a(3:1) .
For a(1u:3u:-1u) the same logic that you outlined above would apply,
making it a slice with one element.
Not going there :-)
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
If I really had to write a 64x64->128 MUL, with no widening MUL or
MULH which returns the high half, then I would punt and do it using
32-bit parts (all variables are u64): [...]
I wrote some code along the same lines. A difference is you
are considering unsigned multiplication, and I am considering
signed multiplication.
Signed mul is just a special case of unsigned mul, right?
I.e. in case of a signed widening mul, you'd first extract the signs,
convert the inputs to unsigned, then do the unsigned widening mul,
before finally restoring the sign as the XOR of the input signs?
In Gforth we use:
DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
DCell res;
res = UD2D(ummul (a, b));
if (a < 0)
res.hi -= b;
if (b < 0)
res.hi -= a;
return res;
}
I have this technique from Andrew Haley. It relies on twos-complement representation.
Here you can probably schedule the fixup to happen in parallel with the
actual multiplication:
;; inputs in r9 & r10, result in rdx:rax, rbx & rcx as scratch
mov rax,r9 ;; All these can start in the first cycle
mul r10 ;; rdx:rax = unsigned r9*r10
mov rbx,r9 ;; The MOV can be handled by the renamer
sar r9,63 ;; r9 = 0 or -1, the sign mask of the first input
mov rcx,r10 ;; Ditto
sar r10,63 ;; r10 = sign mask of the second input
and rbx,r10 ;; Second set of ops: keep first input iff second is negative
and rcx,r9 ;; keep second input iff first is negative
add rbx,rcx ;; Third cycle
sub rdx,rbx ;; Do a single adjustment as soon as the MUL finishes