• A Very Bad Idea

    From Quadibloc@21:1/5 to All on Mon Feb 5 06:48:59 2024
    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues. The one
    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.

    These days, Moore's Law has limped along well enough to allow
    putting a lot of cache memory on a single die and so on.

    So, perhaps it might be possible to design a chip that is
    basically similar to the IBM/SONY CELL microprocessor,
    except that the satellite processors handle Cray-style vectors,
    and have multiple megabytes of individual local storage.

    It might be possible to design such a chip. The main processor
    with access to external DRAM would be a conventional processor,
    with only ordinary SIMD vector capabilities. And such a chip
    might well be able to execute lots of instructions if one runs
    a suitable benchmark on it.

    But try as I might, I can't see a useful application for such
    a chip. The restricted access to memory would basically hobble
it for anything but a narrow class of embarrassingly parallel
    applications. The original CELL was thought of as being useful
    for graphics applications, but GPUs are much better at that.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Mon Feb 5 07:44:24 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    To some extent, it is: Zen4 performs 512-bit SIMD by feeding its
    512-bit registers to the 256-bit units in two successive cycles.
    Earlier Zen used 2 physical 128-bit registers as one logical 256-bit
    register and AFAIK it split 256-bit operations into two 128-bit
    operations that could be scheduled arbitrarily by the OoO engine
    (while Zen4 treats the 512-bit operation as a unit that consumes two
    cycles of a pipelined 256-bit unit). Similar things have been done by
    Intel and AMD in other CPUs, implementing 256-bit operations with
    128-bit units (Gracemont, Bulldozer-Excavator, Jaguar and Puma), or implementing 128-bit operations with 64-bit units (e.g., on the K8).

    Why are they not using longer vectors with the same FUs or narrower
    FUs? For Gracemont, that's really the question; they even disabled
    AVX-512 on Alder Lake and Raptor Lake completely (even on Xeon CPUs
    with disabled Gracemont) because Gracemont does not do AVX-512.
    Supposedly the reason is that Gracemont does not have enough physical
    128-bit registers for AVX-512 (128 such registers would be needed to
    implement the 32 logical ZMM registers, and probably some more to
    avoid deadlocks and maybe for some microcoded operations; <https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/> reports 191+16 XMM registers and 95+16 YMM registers, which makes me
    doubt that explanation).

    Anyway, the size of the register files is one reason for avoiding
    longer vectors.

    Also, the question is how much it buys. For Zen4, I remember seeing
    results that coding the same stuff as using two 256-bit instructions
    rather than one 512-bit instruction increased power consumption a
    little, resulting in the CPU (running at the power limit) lowering the
    clock rate of the cores from IIRC 3700MHz to 3600MHz; not a very big
    benefit. How much would the benefit be from longer vectors? Probably
    not more than another 100MHz: From 256-bit instructions to 512-bit
    instructions already halves the number of instructions to process in
    the front end; eliminating the other half would require infinitely
    long vectors.
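
    To make the diminishing-returns arithmetic concrete, here is a small
    illustrative calculation (a sketch only; the element count is arbitrary
    and loads, stores and loop overhead are ignored):

        # Front-end instruction count for one pass over N 32-bit elements
        # at different SIMD widths (illustration of the argument above).
        N = 1_000_000
        for width_bits in (256, 512, 1024, 2048):
            lanes = width_bits // 32
            print(f"{width_bits}-bit ops: {N // lanes} instructions")
        # 256-bit: 125000, 512-bit: 62500, 1024-bit: 31250, 2048-bit: 15625.
        # Going from 256 to 512 bits removes half the instructions; each
        # further doubling removes only half of what remains, so the extra
        # front-end saving shrinks toward zero.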

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues.

My memory is that he mentioned memory latency. He did not
    explain why he thinks so, but caches and prefetchers seem to be doing
    ok for bridging the latency from DRAM to L2 or L1.

    As for main memory bandwidth, that is certainly a problem for
    applications that have frequent cache misses (many, but not all HPC applications are among them). And once you are limited by main memory bandwidth, the ISA makes little difference.

    But for those applications where caches work (e.g., dense matrix
    multiplication in the HPC realm), I don't see a reason why a
    long-vector architecture would be unworkable. It's just that, as
    discussed above, the benefits are small.

    The one
    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.

    Caches work well for most applications. So mainstream CPUs are
    designed with a certain amount of cache and enough main-memory
    bandwidth to satisfy most applications. For the niche that needs more main-memory bandwidth, there are GPGPUs which have high bandwidth
    because their original application needs it (and AFAIK GPGPUs have
    long vectors). For the remaining niche, having a CPU with several
    stacks of HBM memory attached (like the NEC vector CPUs) is a good
    idea; and given that there is legacy software for NEC vector CPUs,
    providing that ISA also covers that need.

    So, perhaps it might be possible to design a chip that is
    basically similar to the IBM/SONY CELL microprocessor,
    except that the satellite processors handle Cray-style vectors,
    and have multiple megabytes of individual local storage.

    Who would buy such a microprocessor? Megabytes? Laughable. If
    that's intended to be a buffer for main memory, you need the
    main-memory bandwidth; and why would you go for explicitly managed
    local memory (which deservedly vanished from the market, see below)
    rather than the well-working setup of cache and prefetchers? BTW,
    Raptor Cove gives you 2MB of private L2.

    The original CELL was thought of as being useful
    for graphics applications, but GPUs are much better at that.

    The Playstation 3 has a separate GPU based on the Nvidia G70 <https://en.wikipedia.org/wiki/PlayStation_3_technical_specifications#Graphics_processing_unit>.

    What I heard/read about the Cell CPU is that the SPEs were too hard to
    make good use of and that consequently they were not used much.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Mon Feb 5 13:19:41 2024
    On Mon, 05 Feb 2024 07:44:24 +0000, Anton Ertl wrote:

    Who would buy such a microprocessor? Megabytes? Laughable. If
    that's intended to be a buffer for main memory, you need the
    main-memory bandwidth;

    Well, the original Cray I had a main memory of eight megabytes, and the
    Cray Y-MP had up to 512 megabytes of memory.

    I was keeping as close to the original CELL design as possible, but
    certainly one could try to improve. After all, if Intel could make
    a device like the Xeon Phi, having multiple CPUs on a chip all sharing
    access to external memory, however inadequate, could still be done (but
    then I wouldn't be addressing Mitch Alsup's objection).

    Instead of imitating the CELL, or the Xeon Phi, for that matter, what
    I think of as a more practical way to make a consumer Cray-like chip
    would be to put only one core in a package, and give that core an
    eight-channel memory bus.

    Some older NEC designs used a sixteen-channel memory bus, but I felt
    that eight channels will already be expensive for a consumer product.

    Given Mitch Alsup's objection, though, I threw out the opposite kind
    of design, one patterned after the CELL, as one that maybe could allow
    a vector CPU to churn out more FLOPs. But as I noted, it seems to have
    the fatal flaw of very little capacity for any kind of useful work...
    which is kind of the whole point of any CPU.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Quadibloc on Mon Feb 5 14:34:00 2024
    In article <upqn9d$a0i2$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    I was keeping as close to the original CELL design as possible

    That, I think, was a mistake. Given how unsuccessful it was, it's fair
    evidence that strategy isn't useful for very much.

    There's an approach that isn't specifically for vectors, but gives good
    memory bandwidth, and is having some commercial success. Apple's M-series
    ARM SoCs have limited RAM (8GB to 24GB) on the SoC, but it's much larger
    than caches, or a Cray-1's main memory. They then use fast SSDs for
    swapping, but there's no reason, apart from cost, that you couldn't have
    a layer with 1 TB or so of DRAM between the SoC and the SSD.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Mon Feb 5 13:44:56 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Mon, 05 Feb 2024 07:44:24 +0000, Anton Ertl wrote:

    Who would buy such a microprocessor? Megabytes? Laughable. If
    that's intended to be a buffer for main memory, you need the
    main-memory bandwidth;

    Well, the original Cray I had a main memory of eight megabytes

    If you want to compete with a 1976 supercomputer, Megabytes may be
    enough. However, if you want to compete with something from 2024,
    better look at how much local memory the likes of these NEC cards, or
    Nvidia or AMD GPGPUs provide. And that's Gigabytes.

    I was keeping as close to the original CELL design as possible, but
    certainly one could try to improve. After all, if Intel could make
    a device like the Xeon Phi, having multiple CPUs on a chip all sharing
    access to external memory, however inadequate, could still be done

    You don't have to look for the Xeon Phi. The lowly Athlon 64 X2 or
Pentium D from 2005 already have several cores sharing access to the
external memory (and the UltraSPARC T1 from the same year even has 8
cores).

    The Xeon Phis are interesting:

    * Knight's Corner is a PCIe card with up to 16GB local memory and
bandwidths up to 352GB/s (plus access to the host system's DRAM at
anemic bandwidth (PCIe 2.0 x16)).

    * Knight's Landing was available as PCIe card or as socketed CPU with
    16GB of local memory with "400+ GB/s" bandwidth and up to 384GB of
    DDR4 memory with 102.4GB/s.

    * Knight's Mill was only available in a socketed version with similar
    specs.

    * Eventually they were replaced by the big mainstream Xeons without
    local memory, stuff like the Xeon Platinum 8180 with about 128GB/s
    DRAM bandwidth.

    It seems that running the HPC processor as a coprocessor was not good
    enough for the Xeon Phi, and that the applications that needed lots of bandwidth to local memory also did not provide enough revenue to
    sustain Xeon Phi development; OTOH, Nvidia has great success with its
    GPGPU line, so maybe the market is there, but the Xeon Phi was
uncompetitive.

    If you are interested in such things, the recently announced AMD
    Instinct MI300A (CPUs+GPUs) with 128GB local memory or MI300X (GPUs
    only) with 192GB local memory with 5300GB/s bandwidth may be of
    interest to you.

    Instead of imitating the CELL, or the Xeon Phi, for that matter, what
    I think of as a more practical way to make a consumer Cray-like chip
would be to put only one core in a package, and give that core an eight-channel memory bus.

    IBM sells Power systems with few cores and the full memory system.
    Similarly, you can buy AMD EPYCs with few active cores and the full
    memory system. Some of them have a lot of cache, too (e.g., 72F3 and
    7373X).

    Some older NEC designs used a sixteen-channel memory bus, but I felt
    that eight channels will already be expensive for a consumer product.

    If you want high bandwidth in a consumer product, buy a graphics card.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Quadibloc on Mon Feb 5 19:20:51 2024
    Quadibloc wrote:

    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues. The one

    Memory LATENCY issues not BW issues. The length of the vector
    has to be able to absorb a miss at all cache levels without
    stalling the core. 5GHz processors, 60 ns DRAM access times
    means the minimum vector length is 300 registers in a single
    vector. Which also means it takes a loop of count 300+ to
    reach peak efficiency.
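
    Restating that arithmetic explicitly (a sketch assuming one vector
    element issued per cycle):

        # Minimum vector length needed to cover a DRAM miss, per the
        # figures in the post above.
        clock_hz    = 5e9                           # 5 GHz core
        dram_ns     = 60                            # DRAM access time
        miss_cycles = dram_ns * 1e-9 * clock_hz     # = 300 cycles
        # At one element per cycle, a single vector must carry at least
        # this many elements to keep the pipeline busy across the miss:
        print(int(miss_cycles))                     # 300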

    To a certain extent the B registers of the CRAY 2 were to
    do that (absorb longer and longer memory latencies) but
    this B register set is now considered a failure.

    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.

    They also increased vector length as memory latency increased.
    Ending up at (IIRC) 256 entry VRF[k].

    These days, Moore's Law has limped along well enough to allow
    putting a lot of cache memory on a single die and so on.

    Consider FFT:: sooner or later you are reading and writing
    vastly scattered memory containers considerably smaller than
    any cache line. FFT is one you want peak efficiency on !
    So, if you want FFT to run at core peak efficiency, your
    interconnect has to be able to pass either containers from
    different memory banks on alternating cycles, or whole
cache lines in a single cycle. {{The latter is easier to do}}

    A vector machine (done properly) is a bandwidth machine rather
    than a latency based machine (which can be optimized by cache
    hierarchy).

    So, perhaps it might be possible to design a chip that is
    basically similar to the IBM/SONY CELL microprocessor,
    except that the satellite processors handle Cray-style vectors,
    and have multiple megabytes of individual local storage.

    Generally microprocessors are pin limited as are DRAM chips,
    so in order to get the required BW--2LD1ST per cycle continuously
    with latency less than vector length--you end up needing a way
to access 16-to-64 DRAM DIMMs simultaneously. You might be able
to do this with PCIe 6.0 if you have 64 twisted quads, one for
each DRAM DIMM. Minimum memory size is 64 DIMMs!
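
    A rough worked example of that bandwidth demand; the clock rate and
    per-DIMM figure below are assumptions for illustration, not numbers
    from the post:

        # 2 loads + 1 store of 64-bit containers per cycle, continuously.
        clock_hz      = 5e9                          # assumed core clock
        bytes_per_cyc = 3 * 8                        # 2 LD + 1 ST, 8 bytes each
        need_gbs      = clock_hz * bytes_per_cyc / 1e9
        dimm_gbs      = 51.2                         # assumed DDR5-6400 DIMM peak
        print(need_gbs, need_gbs / dimm_gbs)         # 120 GB/s, ~2.3 DIMMs at peak
        # Peak figures understate the problem: scattered accesses keep each
        # bank busy well below peak, so sustaining 120 GB/s of small random
        # transfers takes many more independent banks/DIMMs than ~2-3,
        # which is where counts like 16-to-64 DIMMs come from.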

    A processor box with 64 DIMMs (as its minimum) is not mass market.

    One reason CRAY sold a lot of supercomputers is that its I/O
    system was also up to the task--CRAY YMP had 4× the I/O BW
    of NEC SX{4,5,6} so when the application became I/O bound
    the 6ns YMP was faster than the SX.

    It is perfectly OK to try to build a CRAY-like vector processor.
    But designing a vector processor is a lot more about the memory
    system (feeding the beast) than about the processor (the beast).

    It might be possible to design such a chip. The main processor
    with access to external DRAM would be a conventional processor,
    with only ordinary SIMD vector capabilities. And such a chip
    might well be able to execute lots of instructions if one runs
    a suitable benchmark on it.

    If you figure this out, there is a market for 100-200 vector
supercomputer mainframes per year. If you can build a company
    that makes money on this volume-- go for it !

    But try as I might, I can't see a useful application for such
    a chip. The restricted access to memory would basically hobble
it for anything but a narrow class of embarrassingly parallel
    applications. The original CELL was thought of as being useful
    for graphics applications, but GPUs are much better at that.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Mon Feb 5 19:43:15 2024
    BGB wrote:

    On 2/5/2024 12:48 AM, Quadibloc wrote:
    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues. The one
    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.

    These days, Moore's Law has limped along well enough to allow
    putting a lot of cache memory on a single die and so on.

    So, perhaps it might be possible to design a chip that is
    basically similar to the IBM/SONY CELL microprocessor,
    except that the satellite processors handle Cray-style vectors,
    and have multiple megabytes of individual local storage.

    It might be possible to design such a chip. The main processor
    with access to external DRAM would be a conventional processor,
    with only ordinary SIMD vector capabilities. And such a chip
    might well be able to execute lots of instructions if one runs
    a suitable benchmark on it.


    One doesn't need to disallow access to external RAM, but maybe:
    Memory coherence is fairly weak for these cores;
    The local RAM addresses are treated as "strongly preferable".

    Or, say, there is a region on RAM that is divided among the cores, where
    the core has fast access to its own local chunk, but slow access to any
    of the other chunks (which are treated more like external RAM).

Large FFTs do not fit in this category. FFTs are one of the most valuable
means of calculating Great Big Physics "stuff". We used FFT back in the
NMR lab to change a BigO( n^3 ) problem into a 2×BigO( n×log(n) ) problem.
    VERY Many big physics simulations do similarly.

    That problem was matrix-matrix multiplication !!

    MultipliedMatrix = IFFT( ConjugateMultiply( FFT( matrix ), pattern ) );

    {where pattern was FFTd earlier }

Look up the data access pattern and apply that knowledge to TB-sized
matrices and then ask yourself if caches bring anything to the party?
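
    A toy NumPy rendering of the recipe above (array sizes and names are
    illustrative; the point at TB scale is the access pattern, which no
    cache-sized toy captures):

        import numpy as np

        n = 1024
        matrix  = np.random.rand(n, n)
        kernel  = np.random.rand(n, n)

        pattern = np.fft.fft2(kernel)        # "pattern was FFTd earlier"
        result  = np.fft.ifft2(np.fft.fft2(matrix) * np.conj(pattern)).real

        # The pointwise (conjugate) multiply in the frequency domain replaces
        # a far more expensive direct evaluation. The FFT butterflies, though,
        # stride through memory at ever-changing distances, touching
        # containers much smaller than a cache line -- the access pattern
        # the post is talking about.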

    Here, threads would be assigned to particular cores, and the scheduler
    may not move a thread from one core to another if it is assigned to a
    given core.


    As for SIMD vs vectors, as I see it, SIMD seems to make sense in that it
    is cheap and simple.

    If you are happy adding 1,000+ instructions to your ISA, then yes.

    The Cell cores were, if anything, more of a "SIMD First, ALU Second" approach, building it around 128-bit registers but only using part of
    these for integer code.

    I went a slightly different direction, using 64-bit registers that may
    be used in pairs for 128-bit ops. This may make more sense if one
    assumes that the core is going to be used for a lot more general purpose code, rather than used almost entirely for SIMD.


    I have some hesitation about "vector processing", as it seems fairly
    alien to how this stuff normally sort of works; seems more complicated
    than SIMD for an implementation; ...

    Vector design is a lot more about the memory system (feeding the beast)
    than the core (the beast) consuming memory BW.

    It is arguably more scalable, but as I see it, much past 64 or 128 bit vectors, SIMD rapidly goes into diminishing returns, and it makes more
    sense to be like "128-bit is good enough" than to try to chase after
    ever wider SIMD vectors.

    Architecture is more about "what to leave OUT" as about "what to put in".

    But, I can also note that even for semi-general use, an ISA design like
RV64G suffers a significant disadvantage, say, vs my own ISA, in the

    They disobeyed the "what to leave out" and "what to put in" rules.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Feb 5 19:30:20 2024
    Anton Ertl wrote:

    Quadibloc <quadibloc@servername.invalid> writes:
    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    To some extent, it is: Zen4 performs 512-bit SIMD by feeding its
    512-bit registers to the 256-bit units in two successive cycles.
    Earlier Zen used 2 physical 128-bit registers as one logical 256-bit
    register and AFAIK it split 256-bit operations into two 128-bit
    operations that could be scheduled arbitrarily by the OoO engine
    (while Zen4 treats the 512-bit operation as a unit that consumes two
    cycles of a pipelined 256-bit unit). Similar things have been done by
    Intel and AMD in other CPUs, implementing 256-bit operations with
    128-bit units (Gracemont, Bulldozer-Excavator, Jaguar and Puma), or implementing 128-bit operations with 64-bit units (e.g., on the K8).

    Why are they not using longer vectors with the same FUs or narrower
    FUs? For Gracemont, that's really the question; they even disabled
    AVX-512 on Alder Lake and Raptor Lake completely (even on Xeon CPUs
    with disabled Gracemont) because Gracemont does not do AVX-512.

They wanted to keep core power under some <thermal> limit; 256 bits
fit under this limit, 512 did not.

    Supposedly the reason is that Gracemont does not have enough physical
    128-bit registers for AVX-512 (128 such registers would be needed to implement the 32 logical ZMM registers, and probably some more to
    avoid deadlocks and maybe for some microcoded operations; <https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/> reports 191+16 XMM registers and 95+16 YMM registers, which makes me
    doubt that explanation).

    Anyway, the size of the register files is one reason for avoiding
    longer vectors.

    Also, the question is how much it buys. For Zen4, I remember seeing
    results that coding the same stuff as using two 256-bit instructions
    rather than one 512-bit instruction increased power consumption a
    little, resulting in the CPU (running at the power limit) lowering the
    clock rate of the cores from IIRC 3700MHz to 3600MHz; not a very big
    benefit. How much would the benefit be from longer vectors? Probably
    not more than another 100MHz: From 256-bit instructions to 512-bit instructions already halves the number of instructions to process in
    the front end; eliminating the other half would require infinitely
    long vectors.

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues.

My memory is that he mentioned memory latency. He did not
    explain why he thinks so, but caches and prefetchers seem to be doing
    ok for bridging the latency from DRAM to L2 or L1.

    As seen by scalar cores, yes, as seen by vector cores (like CRAY) no.

    I might note:: RISC-V has a CRAY-like vector extension and a SIMD-like
    vector extension. ... make of that what you may.

    As for main memory bandwidth, that is certainly a problem for
    applications that have frequent cache misses (many, but not all HPC applications are among them). And once you are limited by main memory bandwidth, the ISA makes little difference.

    My point in the previous post.

    But for those applications where caches work (e.g., dense matrix multiplication in the HPC realm), I don't see a reason why a
    long-vector architecture would be unworkable. It's just that, as
    discussed above, the benefits are small.

    TeraByte 2D and 3D FFTs are not cache friendly...

    The one
    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.

    Caches work well for most applications. So mainstream CPUs are
    designed with a certain amount of cache and enough main-memory
    bandwidth to satisfy most applications. For the niche that needs more main-memory bandwidth, there are GPGPUs which have high bandwidth
    because their original application needs it (and AFAIK GPGPUs have

    And can afford to absorb the latency.

    long vectors). For the remaining niche, having a CPU with several
    stacks of HBM memory attached (like the NEC vector CPUs) is a good
    idea; and given that there is legacy software for NEC vector CPUs,
    providing that ISA also covers that need.

    So, perhaps it might be possible to design a chip that is
    basically similar to the IBM/SONY CELL microprocessor,
    except that the satellite processors handle Cray-style vectors,
    and have multiple megabytes of individual local storage.

    Who would buy such a microprocessor? Megabytes? Laughable. If
    that's intended to be a buffer for main memory, you need the
    main-memory bandwidth; and why would you go for explicitly managed
    local memory (which deservedly vanished from the market, see below)
    rather than the well-working setup of cache and prefetchers? BTW,
    Raptor Cove gives you 2MB of private L2.

    The original CELL was thought of as being useful
    for graphics applications, but GPUs are much better at that.

    The Playstation 3 has a separate GPU based on the Nvidia G70 <https://en.wikipedia.org/wiki/PlayStation_3_technical_specifications#Graphics_processing_unit>.

    What I heard/read about the Cell CPU is that the SPEs were too hard to
    make good use of and that consequently they were not used much.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Quadibloc on Mon Feb 5 19:46:55 2024
    Quadibloc wrote:

    On Mon, 05 Feb 2024 07:44:24 +0000, Anton Ertl wrote:

    Who would buy such a microprocessor? Megabytes? Laughable. If
    that's intended to be a buffer for main memory, you need the
    main-memory bandwidth;

    Well, the original Cray I had a main memory of eight megabytes, and the
    Cray Y-MP had up to 512 megabytes of memory.

CRAY-1 could access one 64-bit memory container per cycle continuously. CRAY-XMP could access 3 64-bit memory containers in 2LD, 1 ST per cycle continuously.
    Where memory started at about 16 cycles away (12.5ns version) and ended
    up about 30 cycles away (6ns version) and a memory bank could be accessed
    about every 7 cycles.

    I was keeping as close to the original CELL design as possible, but
    certainly one could try to improve. After all, if Intel could make
    a device like the Xeon Phi, having multiple CPUs on a chip all sharing
    access to external memory, however inadequate, could still be done (but
    then I wouldn't be addressing Mitch Alsup's objection).

    Instead of imitating the CELL, or the Xeon Phi, for that matter, what
    I think of as a more practical way to make a consumer Cray-like chip
    would be to put only one core in a package, and give that core an eight-channel memory bus.

    Some older NEC designs used a sixteen-channel memory bus, but I felt
    that eight channels will already be expensive for a consumer product.

    Given Mitch Alsup's objection, though, I threw out the opposite kind
    of design, one patterned after the CELL, as one that maybe could allow
    a vector CPU to churn out more FLOPs. But as I noted, it seems to have
    the fatal flaw of very little capacity for any kind of useful work...
    which is kind of the whole point of any CPU.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Quadibloc on Mon Feb 5 22:56:20 2024
    On Mon, 5 Feb 2024 06:48:59 -0000 (UTC), Quadibloc wrote:

    I am very fond of the vector architecture of the Cray I and similar
    machines, because it seems to me the one way of increasing computer performance that proved effective in the past that still isn't being
    applied to microprocessors today.

    Mitch Alsup, however, has noted that such an architecture is unworkable
    today due to memory bandwidth issues.

    RISC-V has a long-vector feature very consciously modelled on the Cray
    one. It eschews the short-vector SIMD fashion that has infested so many architectures these days precisely because the resulting combinatorial explosion in added instructions makes a mockery of the “R” in “RISC”.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Feb 10 23:27:35 2024
    Lawrence D'Oliveiro wrote:

    On Mon, 5 Feb 2024 06:48:59 -0000 (UTC), Quadibloc wrote:

    I am very fond of the vector architecture of the Cray I and similar
    machines, because it seems to me the one way of increasing computer
    performance that proved effective in the past that still isn't being
    applied to microprocessors today.

    Mitch Alsup, however, has noted that such an architecture is unworkable
    today due to memory bandwidth issues.

    RISC-V has a long-vector feature very consciously modelled on the Cray
    one. It eschews the short-vector SIMD fashion that has infested so many architectures these days precisely because the resulting combinatorial explosion in added instructions makes a mockery of the “R” in “RISC”.

So does the C extension--it's all redundant...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to Quadibloc on Tue Feb 13 19:57:28 2024
    On 2024-02-05, Quadibloc wrote:
    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues. The one
    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.


    FWIW I would just like to share my positive experience with MRISC32
    style vectors (very similar to Cray 1, except 32-bit instead of 64-bit).

    My machine can start and finish at most one 32-bit operation on every
    clock cycle, so it is very simple. The same thing goes for vector
    operations: at most one 32-bit vector element per clock cycle.

    Thus, it always feels like using vector instructions would not give any performance gains. Yet, every time I vectorize a scalar loop (basically
    change scalar registers for vector registers), I see a very healthy
    performance increase.

    I attribute this to reduced loop overhead, eliminated hazards, reduced
    I$ pressure and possibly improved cache locality and reduced register
    pressure.

    (I know very well that VVM gives similar gains without the VRF)
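
    A crude instruction-count model of where those gains come from on a
    one-element-per-cycle machine (the per-iteration counts below are
    assumptions for illustration, not MRISC32 measurements):

        N, VL = 1024, 16     # elements to process, elements per vector register

        # Scalar loop: load, op, store, index update, compare, branch per element.
        scalar_instrs = N * 6

        # Vector loop: the same six instructions, but issued once per VL elements.
        vector_instrs = (N // VL) * 6

        print(scalar_instrs, vector_instrs)   # 6144 vs 384 instructions fetched
        # The datapath still performs N element operations either way, but the
        # fetch/decode stream (and the hazards between the bookkeeping
        # instructions) shrinks by roughly a factor of VL.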

    I guess my point here is that I think that there are opportunities in
    the very low end space (e.g. in order) to improve performance by simply
    adding MRISC32-style vector support. I think that the gains would be
    even bigger for non-pipelined machines, that could start "pumping" the
    execute stage on every cycle when processing vectors, skipping the fetch
    and decode cycles.

    BTW, I have also noticed that I often only need a very limited number of
    vector registers in the core vectorized loops (e.g. 2-4 registers), so I
    don't think that the VRF has to be excruciatingly big to add value to a
    small core. I also envision that for most cases you never have to
    preserve vector registers over function calls. I.e. there's really no
    need to push/pop vector registers to the stack, except for context
    switches (which I believe should be optimized by tagging unused vector registers to save on stack bandwidth).

    /Marcus

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Marcus on Wed Feb 14 05:24:27 2024
    On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:

    (I know very well that VVM gives similar gains without the VRF)

    Other than the Cray I being around longer than VVM, what good is
    a vector register file?

    The obvious answer is that it's internal storage, rather than main
    memory, so it's useful for the same reason that cache memory is
    useful - access to frequently used values is much faster.

    But there's also one very bad thing about a vector register file.

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances. Most especially, it has to be saved before,
    and restored after, other user-mode programs run, even if they
    aren't _expected_ to use vectors, as a program interrupted by
    a real-time-clock interrupt to let other users do stuff has to
    be able to *rely* on its registers all staying undisturbed, as if
    no interrupts happened.

    So, the vector register file being a _large shared resource_, one
    faces the dilemma... make extra copies for as many programs as may
    be running, or save and restore it.

    I've come up with _one_ possible solution. Remember the Texas Instruments
    9900, which kept its registers in memory, because it was a 16-bit CPU
    back when there weren't really enough gates on a die to make one
    possible... leading to fast context switching?

    Well, why not have an on-chip memory, smaller than L2 cache but made
    of similar memory cells, and use it for multiple vector register files, indicated by a pointer register?

    But then the on-chip memory has to be divided into areas locked off
    from different users, just like external DRAM, and _that_ becomes
    a bit painful to contemplate.

    The Cray I was intended to be used basically in *batch* mode. Having
    a huge vector register file in an ISA meant for *timesharing* is the
    problem.

    Perhaps what is really needed is VVM combined with some very good
    cache hinting mechanisms. I don't have the expertise needed to work
    that out, so I'll have to settle for something rather more kludgey
    instead.

    Of course, if a Cray I is a *batch* processing computer, that sort
    of justifies the notion I came up with earlier - in a thread I
    aptly titled "A Very Bad Idea" - of making a Cray I-like CPU with
vector registers an auxiliary processor after the fashion of those
    in the IBM/Sony CELL processor. But one wants high-bandwidth access
    to DRAM, not no access to DRAM!

    The NEC SX-Aurora TSUBASA solves the issue by putting all its DRAM
    inside a module that looks a lot like a video card. You just have to
    settle for 48 gigabytes of memory that won't be expandable.

    Some database computers, of course, have as much as a terabyte of
    DRAM - which used to be the size of a large magnetic hard drive.

    People who can afford a terabyte of DRAM can also afford an eight-channel memory bus, so it should be possible to manage something.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Wed Feb 14 05:53:02 2024
    On Wed, 14 Feb 2024 05:24:27 +0000, Quadibloc wrote:

    Of course, if a Cray I is a *batch* processing computer, that sort
    of justifies the notion I came up with earlier - in a thread I
    aptly titled "A Very Bad Idea"

    Didn't look very carefully. That's _this_ thread.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Marcus on Wed Feb 14 11:14:22 2024
    On Tue, 13 Feb 2024 19:57:28 +0100
    Marcus <m.delete@this.bitsnbites.eu> wrote:

    On 2024-02-05, Quadibloc wrote:
    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues. The one
    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.


    FWIW I would just like to share my positive experience with MRISC32
    style vectors (very similar to Cray 1, except 32-bit instead of
    64-bit).


    Does it means that you have 8 VRs and each VR is 2048 bits?

    My machine can start and finish at most one 32-bit operation on every
    clock cycle, so it is very simple. The same thing goes for vector
    operations: at most one 32-bit vector element per clock cycle.

    Thus, it always feels like using vector instructions would not give
    any performance gains. Yet, every time I vectorize a scalar loop
    (basically change scalar registers for vector registers), I see a
    very healthy performance increase.

    I attribute this to reduced loop overhead, eliminated hazards, reduced
    I$ pressure and possibly improved cache locality and reduced register pressure.

    (I know very well that VVM gives similar gains without the VRF)

    I guess my point here is that I think that there are opportunities in
    the very low end space (e.g. in order) to improve performance by
    simply adding MRISC32-style vector support. I think that the gains
    would be even bigger for non-pipelined machines, that could start
    "pumping" the execute stage on every cycle when processing vectors,
    skipping the fetch and decode cycles.

    BTW, I have also noticed that I often only need a very limited number
    of vector registers in the core vectorized loops (e.g. 2-4
    registers), so I don't think that the VRF has to be excruciatingly
    big to add value to a small core.

    It depends on what you are doing.
    If you want good performance in matrix multiply type of algorithm then
8 VRs would not take you very far. 16 VRs are a LOT better. More than 16
VRs can help somewhat, but the difference between 32 and 16 (in this
type of kernel) is much, much smaller than the difference between 8 and
16.
    Radix-4 and mixed-radix FFT are probably similar except that I never
    profiled as thoroughly as I did SGEMM.

    I also envision that for most cases
    you never have to preserve vector registers over function calls. I.e.
    there's really no need to push/pop vector registers to the stack,
    except for context switches (which I believe should be optimized by
    tagging unused vector registers to save on stack bandwidth).

    /Marcus

If CRAY-style VRs work for you it's no proof that lighter VRs, e.g. ARM Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far from
optimal. A ratio of 8 looks more optimal when performance of vectorized loops is a top priority. A ratio of 4 is a wise choice
otherwise.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Quadibloc on Wed Feb 14 15:37:30 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:

    (I know very well that VVM gives similar gains without the VRF)

    Other than the Cray I being around longer than VVM, what good is
    a vector register file?

    The obvious answer is that it's internal storage, rather than main
    memory, so it's useful for the same reason that cache memory is
    useful - access to frequently used values is much faster.

    But there's also one very bad thing about a vector register file.

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances.

    The Cray systems weren't used as general purpose timesharing systems.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Wed Feb 14 17:13:28 2024
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:

    (I know very well that VVM gives similar gains without the VRF)

    Other than the Cray I being around longer than VVM, what good is
    a vector register file?

    The obvious answer is that it's internal storage, rather than main
    memory, so it's useful for the same reason that cache memory is
    useful - access to frequently used values is much faster.

    But there's also one very bad thing about a vector register file.

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances.

    The Cray systems weren't used as general purpose timesharing systems.

They were used as database servers, though - fast I/O, cheaper than
    an IBM machine of the same performance.

    Or so I heard, ~ 30 years ago.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Feb 14 20:45:36 2024
    Thomas Koenig wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Quadibloc <quadibloc@servername.invalid> writes:
    On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:

    (I know very well that VVM gives similar gains without the VRF)

    Other than the Cray I being around longer than VVM, what good is
    a vector register file?

The obvious answer is that it's internal storage, rather than main memory, so it's useful for the same reason that cache memory is
    useful - access to frequently used values is much faster.

    But there's also one very bad thing about a vector register file.

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances.

    The Cray systems weren't used as general purpose timesharing systems.

They were used as database servers, though - fast I/O, cheaper than
    an IBM machine of the same performance.

    The only thing they lacked for timesharing was paging:: CRAYs had a
base and bounds memory map. They made up for lack of paging with a
stupidly fast I/O system.

    Or so I heard, ~ 30 years ago.

    Should be closer to 40 years ago.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Thu Feb 15 11:21:14 2024
    On Wed, 14 Feb 2024 20:45:36 +0000, MitchAlsup1 wrote:
    Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Quadibloc <quadibloc@servername.invalid> writes:

    But there's also one very bad thing about a vector register file.

Like any register file, it has to be *saved* and *restored* under certain circumstances.

    The Cray systems weren't used as general purpose timesharing systems.

    I wasn't intending this as a criticism of the Cray systems, but
    of my plan to copy their vector architecture in a chip intended
    for general purpose desktop computer use.

They were used as database servers, though - fast I/O, cheaper than
    an IBM machine of the same performance.

    Interesting.

    The only thing they lacked for timesharing was paging:: CRAYs had a
base and bounds memory map. They made up for lack of paging with a
stupidly fast I/O system.

    Good to know; the Cray I was a success, so it's good to learn from
    it.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to Michael S on Thu Feb 15 20:00:20 2024
    On 2024-02-14, Michael S wrote:
    On Tue, 13 Feb 2024 19:57:28 +0100
    Marcus <m.delete@this.bitsnbites.eu> wrote:

    On 2024-02-05, Quadibloc wrote:
    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues. The one
    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.


    FWIW I would just like to share my positive experience with MRISC32
    style vectors (very similar to Cray 1, except 32-bit instead of
    64-bit).


    Does it means that you have 8 VRs and each VR is 2048 bits?

    No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
    number of registers as I have five-bit vector address fields in the
    instruction encoding (because 32 scalar registers). I have been thinking
    about reducing it to 16 vector registers, and find some clever use for
    the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not there yet.

    The number of vector elements in each register is implementation
    defined, but currently the minimum number of vector elements is set to
    16 (I wanted to set it relatively high to push myself to come up with
    solutions to problems related to large vector registers).

    Each vector element is 32 bits wide.

    So, in total: 32 x 16 x 32 bits = 16384 bits

    This is, incidentally, exactly the same as for AVX-512.

    My machine can start and finish at most one 32-bit operation on every
    clock cycle, so it is very simple. The same thing goes for vector
    operations: at most one 32-bit vector element per clock cycle.

    Thus, it always feels like using vector instructions would not give
    any performance gains. Yet, every time I vectorize a scalar loop
    (basically change scalar registers for vector registers), I see a
    very healthy performance increase.

    I attribute this to reduced loop overhead, eliminated hazards, reduced
    I$ pressure and possibly improved cache locality and reduced register
    pressure.

    (I know very well that VVM gives similar gains without the VRF)

    I guess my point here is that I think that there are opportunities in
    the very low end space (e.g. in order) to improve performance by
    simply adding MRISC32-style vector support. I think that the gains
    would be even bigger for non-pipelined machines, that could start
    "pumping" the execute stage on every cycle when processing vectors,
    skipping the fetch and decode cycles.

    BTW, I have also noticed that I often only need a very limited number
    of vector registers in the core vectorized loops (e.g. 2-4
    registers), so I don't think that the VRF has to be excruciatingly
    big to add value to a small core.

    It depends on what you are doing.
    If you want good performance in matrix multiply type of algorithm then
    8 VRs would not take you very far. 16 VRs are ALOT better. More than 16
    VR can help somewhat, but the difference between 32 and 16 (in this
    type of kernels) is much much smaller than difference between 8 and
    16.
    Radix-4 and mixed-radix FFT are probably similar except that I never
    profiled as thoroughly as I did SGEMM.


    I expect that people will want to do such things with an MRISC32 core.
    However, for the "small cores" that I'm talking about, I doubt that they
    would even have floating-point support. It's more a question of simple
    loop optimizations - e.g. the kinds you find in libc or software
    rasterization kernels. For those you will often get lots of work done
    with just four vector registers.

    I also envision that for most cases
    you never have to preserve vector registers over function calls. I.e.
    there's really no need to push/pop vector registers to the stack,
    except for context switches (which I believe should be optimized by
    tagging unused vector registers to save on stack bandwidth).

    /Marcus

If CRAY-style VRs work for you it's no proof that lighter VRs, e.g. ARM Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far from optimal. A ratio of 8 looks more optimal when performance of vectorized loops is a top priority. A ratio of 4 is a wise choice
otherwise.

    For MRISC32 I'm aiming for splitting a vector operation into four. That
    seems to eliminate most RAW hazards as execution pipelines tend to be at
    most four stages long (or thereabout). So, with a pipeline width of 128
    bits (which seems to be the goto width for many implementations), you
    want registers that have 4 x 128 = 512 bits, which is one of the reasons
    that I mandate at least 512-bit vector registers in MRISC32.
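
    A back-of-the-envelope check of why four beats hide a four-stage
    execution pipeline, under the assumptions stated above:

        PIPE_DEPTH = 4   # result available 4 cycles after an element issues
        BEATS      = 4   # 512-bit register / 128-bit datapath

        # For dependent vector ops A -> B: A's first beat issues at cycle 0
        # and its result is ready at cycle PIPE_DEPTH; A occupies the issue
        # slot for BEATS cycles, so B's first beat naturally starts at cycle
        # BEATS.
        stall = max(0, PIPE_DEPTH - BEATS)
        print(stall)     # 0 -- the RAW hazard is hidden whenever BEATS >= PIPE_DEPTH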

    Of course, nothing is set in stone, but so far that has been my
    thinking.

    /Marcus

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to Quadibloc on Thu Feb 15 19:44:27 2024
    On 2024-02-14, Quadibloc wrote:
    On Tue, 13 Feb 2024 19:57:28 +0100, Marcus wrote:

    (I know very well that VVM gives similar gains without the VRF)

    Other than the Cray I being around longer than VVM, what good is
    a vector register file?

    The obvious answer is that it's internal storage, rather than main
    memory, so it's useful for the same reason that cache memory is
    useful - access to frequently used values is much faster.

    But there's also one very bad thing about a vector register file.

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances. Most especially, it has to be saved before,
    and restored after, other user-mode programs run, even if they
    aren't _expected_ to use vectors, as a program interrupted by
    a real-time-clock interrupt to let other users do stuff has to
    be able to *rely* on its registers all staying undisturbed, as if
    no interrupts happened.


    Yes, that is the major drawback of a vector register file, so it has to
    be dealt with somehow.

    My current vision (not MRISC32), which is a very simple
    microcontroller type implementation (basically in the same ballpark as
    Cortex-M or small RV32I implementations), would have a relatively
    limited vector register file.

    I scribbled down a suggestion here:

    * https://gitlab.com/-/snippets/3673883

    In particular, pay attention to the sections "Vector state on context
    switches" and "Thread context".

    My idea is not new, but I think that it takes some old ideas a few steps further. So here goes...

    There are four vector registers (V1-V4), each consisting of 8 x 32 bits,
    for a grand total of 128 bytes of vector thread context state. To start
    with, this is not an enormous amount of state (it's the same size as the integer register file of RV32I).

    Each vector register is associated with a "vector in use" flag, which is
    set as soon as the vector register is written to.

    The novel part (AFAIK) is that all "vector in use" flags are cleared as
    soon as a function returns (rts) or another function is called (bl/jl),
    which takes advantage of the ABI that says that all vector registers are scratch registers.

    I then predict that the ISA will have some sort of intelligent store
    and restore state instructions, that will only waste memory cycles
    for vector registers that are marked as "in use". I also predict that
    most vector registers will be unused most of the time (except for
    threads that use up 100% CPU time with heavy data processing, which
    should hopefully be in minority - especially in the kind of systems
    where you want to put a microcontroller style CPU).
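
    A minimal sketch of that bookkeeping (hypothetical names; nothing here
    is actual MRISC32 code, it just models the flags and the lazy save):

        NUM_VREGS, VLEN_WORDS = 4, 8            # V1-V4, 8 x 32 bits each

        class VectorState:
            def __init__(self):
                self.regs   = [[0] * VLEN_WORDS for _ in range(NUM_VREGS)]
                self.in_use = [False] * NUM_VREGS

            def write(self, r, value):          # any write marks the register in use
                self.regs[r] = value
                self.in_use[r] = True

            def call_or_return(self):           # bl/jl/rts: vectors are scratch per ABI
                self.in_use = [False] * NUM_VREGS

            def save_context(self):             # store only registers actually live
                return [(r, self.regs[r]) for r in range(NUM_VREGS) if self.in_use[r]]

        v = VectorState()
        v.write(0, [1] * VLEN_WORDS)
        print(len(v.save_context()))            # 1 register saved, not 4
        v.call_or_return()
        print(len(v.save_context()))            # 0 -- a context switch right after
                                                # a call/return stores no vector state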

    I do not yet know if this will fly, though...

    So, the vector register file being a _large shared resource_, one
    faces the dilemma... make extra copies for as many programs as may
    be running, or save and restore it.

    I've come up with _one_ possible solution. Remember the Texas Instruments 9900, which kept its registers in memory, because it was a 16-bit CPU
    back when there weren't really enough gates on a die to make one
    possible... leading to fast context switching?

    Well, why not have an on-chip memory, smaller than L2 cache but made
    of similar memory cells, and use it for multiple vector register files, indicated by a pointer register?


    I have had a similar idea for "big" implementations that have a huge
    vector register file. My idea, though, is more of a hybrid: Basically
    keep a few copies (e.g. 4-8 copies?) of vector registers for hot threads
    that can be quickly switched between (no cost - just a logical "vector
    register file ID" that is changed), and then have a more or less
    separate memory path to a bigger vector register file cache, and swap
    register file copies in/out of the hot storage asynchronously.

    I'm not sure if it would be feasible to either implement next-thread
    prediction in hardware, or get help from the OS in the form of hints
    about the next likely thread(s) to execute, but the idea is that it
    should be possible to hide most of the context switch overhead this way.

    But then the on-chip memory has to be divided into areas locked off
    from different users, just like external DRAM, and _that_ becomes
    a bit painful to contemplate.


    Wouldn't a kernel space "thread ID" or "vector register file ID" do?

    The Cray I was intended to be used basically in *batch* mode. Having
    a huge vector register file in an ISA meant for *timesharing* is the
    problem.

    Perhaps what is really needed is VVM combined with some very good
    cache hinting mechanisms. I don't have the expertise needed to work
    that out, so I'll have to settle for something rather more kludgey
    instead.

    Of course, if a Cray I is a *batch* processing computer, that sort
    of justifies the notion I came up with earlier - in a thread I
    aptly titled "A Very Bad Idea" - of making a Cray I-like CPU with
vector registers an auxiliary processor after the fashion of those
    in the IBM/Sony CELL processor. But one wants high-bandwidth access
    to DRAM, not no access to DRAM!

    The NEC SX-Aurora TSUBASA solves the issue by putting all its DRAM
    inside a module that looks a lot like a video card. You just have to
    settle for 48 gigabytes of memory that won't be expandable.

    Some database computers, of course, have as much as a terabyte of
    DRAM - which used to be the size of a large magnetic hard drive.

    People who can afford a terabyte of DRAM can also afford an eight-channel memory bus, so it should be possible to manage something.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Marcus on Thu Feb 15 19:12:00 2024
    Marcus wrote:

    On 2024-02-14, Quadibloc wrote:

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances. Most especially, it has to be saved before,
    and restored after, other user-mode programs run, even if they
    aren't _expected_ to use vectors, as a program interrupted by
    a real-time-clock interrupt to let other users do stuff has to
    be able to *rely* on its registers all staying undisturbed, as if
    no interrupts happened.


    Yes, that is the major drawback of a vector register file, so it has to
    be dealt with somehow.

    My current vision (not MRISC32), which is a very simple
    microcontroller type implementation (basically in the same ballpark as Cortex-M or small RV32I implementations), would have a relatively
    limited vector register file.

    I scribbled down a suggestion here:

    * https://gitlab.com/-/snippets/3673883

    In particular, pay attention to the sections "Vector state on context switches" and "Thread context".

    My idea is not new, but I think that it takes some old ideas a few steps further. So here goes...

    There are four vector registers (V1-V4), each consisting of 8 x 32 bits,
    for a grand total of 128 bytes of vector thread context state. To start
    with, this is not an enormous amount of state (it's the same size as the integer register file of RV32I).

    Each vector register is associated with a "vector in use" flag, which is
    set as soon as the vector register is written to.

    The novel part (AFAIK) is that all "vector in use" flags are cleared as
    soon as a function returns (rts) or another function is called (bl/jl),
    which takes advantage of the ABI that says that all vector registers are scratch registers.

    I then predict that the ISA will have some sort of intelligent store
    and restore state instructions, that will only waste memory cycles
    for vector registers that are marked as "in use". I also predict that
    most vector registers will be unused most of the time (except for
    threads that use up 100% CPU time with heavy data processing, which
    should hopefully be in the minority - especially in the kind of systems
    where you want to put a microcontroller-style CPU).

    VVM is designed such that even ISRs can use the vectorized parts of the implementation. Move data, clear pages, string.h, ... so allowing GuestOSs
    to use vectorization falls out for free.

    I do not yet know if this will fly, though...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Marcus on Thu Feb 15 23:00:33 2024
    On Thu, 15 Feb 2024 20:00:20 +0100
    Marcus <m.delete@this.bitsnbites.eu> wrote:

    On 2024-02-14, Michael S wrote:
    On Tue, 13 Feb 2024 19:57:28 +0100
    Marcus <m.delete@this.bitsnbites.eu> wrote:

    On 2024-02-05, Quadibloc wrote:
    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues. The one
    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.


    FWIW I would just like to share my positive experience with MRISC32
    style vectors (very similar to Cray 1, except 32-bit instead of
    64-bit).


    Does it mean that you have 8 VRs and each VR is 2048 bits?

    No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
    number of registers as I have five-bit vector address fields in the instruction encoding (because 32 scalar registers). I have been
    thinking about reducing it to 16 vector registers, and find some
    clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
    there yet.

    The number of vector elements in each register is implementation
    defined, but currently the minimum number of vector elements is set to
    16 (I wanted to set it relatively high to push myself to come up with solutions to problems related to large vector registers).

    Each vector element is 32 bits wide.

    So, in total: 32 x 16 x 32 bits = 16384 bits

    This is, incidentally, exactly the same as for AVX-512.

    My machine can start and finish at most one 32-bit operation on
    every clock cycle, so it is very simple. The same thing goes for
    vector operations: at most one 32-bit vector element per clock
    cycle.

    Thus, it always feels like using vector instructions would not give
    any performance gains. Yet, every time I vectorize a scalar loop
    (basically change scalar registers for vector registers), I see a
    very healthy performance increase.

    I attribute this to reduced loop overhead, eliminated hazards,
    reduced I$ pressure and possibly improved cache locality and
    reduced register pressure.

    (I know very well that VVM gives similar gains without the VRF)

    I guess my point here is that I think that there are opportunities
    in the very low end space (e.g. in order) to improve performance by
    simply adding MRISC32-style vector support. I think that the gains
    would be even bigger for non-pipelined machines, that could start
    "pumping" the execute stage on every cycle when processing vectors,
    skipping the fetch and decode cycles.

    BTW, I have also noticed that I often only need a very limited
    number of vector registers in the core vectorized loops (e.g. 2-4
    registers), so I don't think that the VRF has to be excruciatingly
    big to add value to a small core.

    It depends on what you are doing.
    If you want good performance in matrix multiply type of algorithm
    then 8 VRs would not take you very far. 16 VRs are a lot better.
    More than 16 VR can help somewhat, but the difference between 32
    and 16 (in this type of kernels) is much much smaller than
    difference between 8 and 16.
    Radix-4 and mixed-radix FFT are probably similar, except that I never profiled them as thoroughly as I did SGEMM.


    I expect that people will want to do such things with an MRISC32 core. However, for the "small cores" that I'm talking about, I doubt that
    they would even have floating-point support. It's more a question of
    simple loop optimizations - e.g. the kinds you find in libc or
    software rasterization kernels. For those you will often get lots of
    work done with just four vector registers.

    I also envision that for most cases
    you never have to preserve vector registers over function calls.
    I.e. there's really no need to push/pop vector registers to the
    stack, except for context switches (which I believe should be
    optimized by tagging unused vector registers to save on stack
    bandwidth).

    /Marcus

    If CRAY-style VRs work for you, that's no proof that lighter VRs, e.g.
    ARM Helium-style, would not work as well or better.
    My personal opinion is that even for low-end in-order cores the
    CRAY-like huge ratio between VR width and execution width is far
    from optimal. A ratio of 8 looks more optimal when
    performance of vectorized loops is a top priority. A ratio of 4 is a
    wise choice otherwise.

    For MRISC32 I'm aiming for splitting a vector operation into four.
    That seems to eliminate most RAW hazards as execution pipelines tend
    to be at most four stages long (or thereabout). So, with a pipeline
    width of 128 bits (which seems to be the go-to width for many implementations), you want registers that have 4 x 128 = 512 bits,
    which is one of the reasons that I mandate at least 512-bit vector
    registers in MRISC32.

    Of course, nothing is set in stone, but so far that has been my
    thinking.

    /Marcus

    Sounds quite reasonable, but I wouldn't call it "Cray-style".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Marcus on Fri Feb 16 04:10:21 2024
    On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
    On 2024-02-14, Quadibloc wrote:

    But there's also one very bad thing about a vector register file.

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances. Most especially, it has to be saved before,
    and restored after, other user-mode programs run, even if they
    aren't _expected_ to use vectors, as a program interrupted by
    a real-time-clock interrupt to let other users do stuff has to
    be able to *rely* on its registers all staying undisturbed, as if
    no interrupts happened.

    Yes, that is the major drawback of a vector register file, so it has to
    be dealt with somehow.

    Yes, and therefore I am looking into ways to deal with it somehow.

    Why not just use Mitch Alsup's wonderful VVM?

    It is true that the state of the art has advanced since the Cray I
    was first introduced. So, perhaps Mitch Alsup has indeed found,
    through improving data forwarding, as I understand it, a way to make
    the performance of a memory-memory vector machine (like the Control
    Data STAR-100) match that of one with vector registers (like the
    Cray I, which succeeded where the STAR-100 failed).

    But the historical precedent seems to indicate otherwise, and
    while data forwarding is very definitely a good thing (and,
    indeed, necessary to have for best performance _on_ a vector register
    machine too), it has its limits.

    What _could_ substitute for vector registers isn't data forwarding,
    it's the cache, since that does the same thing vector registers do:
    it brings in vector operands closer to the CPU where they're more
    quickly accessible. So a STAR-100 with a *really good cache* as well
    as data forwarding could, I suppose, compete with a Cray I.

    My first question, though, is whether or not we can really make caches
    that good.

    But skepticism about VVM isn't actually helpful if Cray-style vectors
    are now impossible to make work given current memory speeds.

    The basic way in which I originally felt I could make it work was really
    quite simple. The operating system, from privileged code, could set a
    bit in the PSW that turns on, or off, the ability to run instructions that access the vector registers.

    The details of how one may have to make use of that capability... well,
    that's software. So maybe the OS has to stipulate that one can only have
    one process at a time that uses these vectors - and that process has to
    run as a batch process!
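
    One conventional way to use such a PSW bit - sketched below in C, with the
    privileged operations reduced to stand-in functions that this ISA would
    have to provide - is lazy state switching in the style of lazy FPU context
    switching: every switch disables the vector unit, and vector state is only
    saved and restored when a thread actually executes a vector instruction
    and traps.

    struct thread;
    extern struct thread *current, *vec_owner; /* vec_owner: whose vectors are live */

    extern void psw_set_vec_enable(int on);
    extern void save_vregs(struct thread *t);
    extern void restore_vregs(struct thread *t);

    /* On every context switch: just turn the vector unit off.  No state is
     * saved yet, so switches between non-vector threads cost nothing extra. */
    void on_context_switch(struct thread *next)
    {
        psw_set_vec_enable(0);
        current = next;
    }

    /* Trap handler for "vector instruction executed while disabled". */
    void on_vector_disabled_trap(void)
    {
        if (vec_owner != current) {
            if (vec_owner)
                save_vregs(vec_owner);  /* spill the previous owner's VRF */
            restore_vregs(current);     /* load this thread's VRF         */
            vec_owner = current;
        }
        psw_set_vec_enable(1);          /* then retry the faulting instruction */
    }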

    Hey, the GPU in a computer these days is also a singular resource.

    Having resources that have to be treated that way is not really what
    people are used to, but a computer that _can_ run your CFD codes
    efficiently is better than a computer that *can't* run your CFD codes.

    Given _that_, obviously if VVM is a better fit to the regular computer
    model, and it offers nearly the same performance, then what I should do
    is offer VVM or something very much like it _in addition_ to Cray-style vectors, so that the best possible vector performance for conventional non-batch programs is also available.

    Now, what would I think of as being "something very much like VVM" without actually being VVM?

    Basically, Mitch has his architecture designed for implementation on
    CPUs that are smart enough to notice certain combinations of instructions
    and execute them as though they're single instructions doing the same
    thing, which can then be executed more efficiently.

    So this makes those exact combinations part of the... ISA syntax...
    which I think is too hard for assembler programmers to remember, and
    I think it's also too hard for at least some implementors. I see it
    as asking for trouble in a way that I'd rather avoid.

    So my substitute for VVM should now be obvious - explicit memory-to-memory vector instructions, like on an old STAR-100.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Fri Feb 16 04:30:34 2024
    On Fri, 16 Feb 2024 04:10:21 +0000, Quadibloc wrote:

    The basic way in which I originally felt I could make it work was really quite simple. The operating system, from privileged code, could set a
    bit in the PSW that turns on, or off, the ability to run instructions that access the vector registers.

    The details of how one may have to make use of that capability... well, that's software. So maybe the OS has to stipulate that one can only have
    one process at a time that uses these vectors - and that process has to
    run as a batch process!

    and then I also wrote...

    So my substitute for VVM should now be obvious - explicit memory-to-memory vector instructions, like on an old STAR-100.

    However, an obvious objection can be raised.

    Vector programs that can only be run one at a time on a computer using
    your new chip? That's a throwback to ancient times; people using today's computers with GUI operating systems aren't used to that sort of thing,
    and will therefore end up tossing your computer out, thinking that it's
    broken!

    So there is one more stratagem that I need to employ to avoid that
    disaster.

    Nothing is stopping the operating system and compilers from supporting
    a particular kind of *fat binaries* that addresses this issue, making
    it all invisible to the user.

    Vector programs would come in a form that includes _both_ Cray I
    style code and STAR-100 style code, and the highest-priority
    vector program on the machine would get to run in Cray I mode until
    it finishes.

    Yes, that means that later programs with even higher priority would
    be doomed to run slowly, but one can't change horses in midstream,
    and so one just has to live with this limitation.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Quadibloc on Fri Feb 16 07:27:36 2024
    Quadibloc <quadibloc@servername.invalid> writes:
    Why not just use Mitch Alsup's wonderful VVM?

    It is true that the state of the art has advanced since the Cray I
    was first introduced. So, perhaps Mitch Alsup has indeed found,
    through improving data forwarding, as I understand it, a way to make
    the performance of a memory-memory vector machine (like the Control
    Data STAR-100) match that of one with vector registers (like the
    Cray I, which succeeded where the STAR-100 failed).

    I don't think that's a proper characterization of VVM. One advantage
    that vector registers have over memory-memory machines is that vector registers, once loaded, can be used several times. And AFAIK VVM has
    that advantage, too. E.g., if you have the loop

    for (i=0; i<n; i++) {
        double b = a[i];
        c[i] = b;
        d[i] = b;
    }

    a[i] is loaded only once (also in VVM), while a memory-memory
    formulation would load a[i] twice. And on the microarchitectural
    level, VVM may work with vector registers, but the nice part is that
    it's only microarchitecture, and it avoids all the nasty consequences
    of making it architectural, such as more expensive context switches.
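
    For contrast, a plain C rendering of the memory-to-memory formulation makes
    the double load visible - the same loop has to become two vector copy
    operations, each of which reads a[i] from memory:

    void copy_twice(double *c, double *d, const double *a, int n)
    {
        for (int i = 0; i < n; i++)   /* first memory-to-memory vector op */
            c[i] = a[i];
        for (int i = 0; i < n; i++)   /* second one re-reads a[i]         */
            d[i] = a[i];
    }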

    Basically, Mitch has his architecture designed for implementation on
    CPUs that are smart enough to notice certain combinations of instructions
    and execute them as though they're single instructions doing the same
    thing, which can then be executed more efficiently.

    My understanding is that he requires explicit marking (why?), and that
    the loop can do almost anything, but (I think) it must be a simple
    loop without further control structures. I think he also allows
    recurrences (in particular, reductions), but I don't understand how
    his hardware auto-vectorizes that; e.g.:

    double r=0.0;
    for (i=0; i<n; i++)
        r += a[i];

    This is particularly nasty given that FP addition is not associative;
    but even if you allow fast-math-style reassociation, doing this in
    hardware seems to be quite a bit harder than the rest of VVM.
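
    A tiny standalone C example (the values are chosen only to make the effect
    visible) of why the order of the additions matters:

    #include <stdio.h>

    int main(void)
    {
        double a = 1e16, b = -1e16, c = 1.0;

        double sequential   = (a + b) + c;   /* (1e16 + -1e16) + 1.0 = 1.0            */
        double reassociated = a + (b + c);   /* b + c rounds to -1e16, so the sum is 0.0 */

        printf("sequential = %g, reassociated = %g\n", sequential, reassociated);
        return 0;
    }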

    So this makes those exact combinations part of the... ISA syntax...
    which I think is too hard for assembler programmers to remember,

    My understanding is that there is no need to remember much. Just
    remember that it has to be a simple loop, and mark it. But, as in all auto-vectorization schemes, there are cases where it works better than
    in others.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to Quadibloc on Fri Feb 16 12:29:33 2024
    On 2024-02-16, Quadibloc wrote:
    On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
    On 2024-02-14, Quadibloc wrote:

    But there's also one very bad thing about a vector register file.

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances. Most especially, it has to be saved before,
    and restored after, other user-mode programs run, even if they
    aren't _expected_ to use vectors, as a program interrupted by
    a real-time-clock interrupt to let other users do stuff has to
    be able to *rely* on its registers all staying undisturbed, as if
    no interrupts happened.

    Yes, that is the major drawback of a vector register file, so it has to
    be dealt with somehow.

    Yes, and therefore I am looking into ways to deal with it somehow.

    Why not just use Mitch Alsup's wonderful VVM?

    It is true that the state of the art has advanced since the Cray I
    was first introduced. So, perhaps Mitch Alsup has indeed found,
    through improving data forwarding, as I understand it, a way to make
    the performance of a memory-memory vector machine (like the Control
    Data STAR-100) match that of one with vector registers (like the
    Cray I, which succeeded where the STAR-100 failed).

    But because the historical precedent seems to indicate otherwise, and
    because while data forwarding is very definitely a good thing (and,
    indeed, necessary to have for best performance _on_ a vector register
    machine too) it has its limits.

    What _could_ substitute for vector registers isn't data forwarding,
    it's the cache, since that does the same thing vector registers do:
    it brings in vector operands closer to the CPU where they're more
    quickly accessible. So a STAR-100 with a *really good cache* as well
    as data forwarding could, I suppose, compete with a Cray I.

    My first question, though, is whether or not we can really make caches
    that good.


    I think that you are missing some of the points that I'm trying to make.
    In my recent comments I have been talking about very low end machines,
    the kinds that can execute at most one instruction per clock cycle, or
    maybe less, and that may not even have a cache at all.

    I'm saying that I believe that within this category there is an
    opportunity for improving performance with very little cost by adding
    vector operations.

    E.g. imagine a non-pipelined implementation with a single memory port,
    shared by instruction fetch and data load/store, that requires perhaps
    two cycles to fetch and decode an instruction, and executes the
    instruction in the third cycle (possibly accessing the memory, which
    precludes fetching a new instruction until the fourth or even fifth
    cycle).

    Now imagine if a single instruction could iterate over several elements
    of a vector register. This would mean that the execution unit could
    execute up to one operation every clock cycle, approaching similar
    performance levels as a pipelined 1 CPI machine. The memory port would
    be free for data traffic as no new instructions have to be fetched
    during the vector loop. And so on.

    Similarly, imagine a very simple strictly in-order pipelined
    implementation, where you have to resolve hazards by stalling the
    pipeline every time there is RAW hazard for instance, and you have to
    throw away cycles every time you mispredict a branch (which may be
    quite often if you only have a very primitive predictor).

    With vector operations you pause the front end (fetch and decode) while iterating over vector elements, which eliminates branch misprediction penalties. You also magically do away with RAW hazards as by the time
    you start issuing a new instruction the vector elements needed from the previous instruction have already been written to the register file.
    And of course you do away with loop overhead instructions (increment,
    compare, branch).

    As a bonus, I believe that a vector solution like that would be more
    energy efficient, as less work has to be done for each operation than if
    you have to fetch and decode an instruction for every operation that you
    do.

    As I said, VVM has many similar properties, but I am currently exploring
    if a VRF solution can be made sufficiently cheap to be feasible in this
    very low end space, where I believe that VVM may be a bit too much (this assumption is mostly based on my own ignorance, so take it with a grain
    of salt).

    For reference, the microarchitectural complexity that I'm thinking about
    is comparable to FemtoRV32 by Bruno Levy (400 LOC, with comments):

    https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v

    /Marcus

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Marcus on Fri Feb 16 14:04:25 2024
    On Fri, 16 Feb 2024 12:37:55 +0100
    Marcus <m.delete@this.bitsnbites.eu> wrote:


    Then what would you call it?

    I just use the term "Cray-style" to differentiate the style of vector
    ISA from explicit SIMD ISAs, GPU-style vector ISAs and STAR-style memory-memory vector ISAs, etc.

    /Marcus

    I'd call it a variant of SIMD.
    For me everything with vector register width to ALU width ratio <= 4 is
    SIMD. 8 is borderline, above 8 is vector.
    It means that sometimes I classify by implementation instead of by
    architecture, which in theory is problematic. But I don't care, I am not
    in academia.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to Michael S on Fri Feb 16 12:37:55 2024
    On 2024-02-15, Michael S wrote:
    On Thu, 15 Feb 2024 20:00:20 +0100
    Marcus <m.delete@this.bitsnbites.eu> wrote:

    On 2024-02-14, Michael S wrote:
    On Tue, 13 Feb 2024 19:57:28 +0100
    Marcus <m.delete@this.bitsnbites.eu> wrote:

    On 2024-02-05, Quadibloc wrote:
    I am very fond of the vector architecture of the Cray I and
    similar machines, because it seems to me the one way of
    increasing computer performance that proved effective in
    the past that still isn't being applied to microprocessors
    today.

    Mitch Alsup, however, has noted that such an architecture is
    unworkable today due to memory bandwidth issues. The one
    extant example of this architecture these days, the NEC
    SX-Aurora TSUBASA, keeps its entire main memory of up to 48
    gigabytes on the same card as the CPU, with a form factor
    resembling a video card - it doesn't try to use the main
    memory bus of a PC motherboard. So that seems to confirm
    this.


    FWIW I would just like to share my positive experience with MRISC32
    style vectors (very similar to Cray 1, except 32-bit instead of
    64-bit).


    Does it mean that you have 8 VRs and each VR is 2048 bits?

    No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
    number of registers as I have five-bit vector address fields in the
    instruction encoding (because 32 scalar registers). I have been
    thinking about reducing it to 16 vector registers, and find some
    clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
    there yet.

    The number of vector elements in each register is implementation
    defined, but currently the minimum number of vector elements is set to
    16 (I wanted to set it relatively high to push myself to come up with
    solutions to problems related to large vector registers).

    Each vector element is 32 bits wide.

    So, in total: 32 x 16 x 32 bits = 16384 bits

    This is, incidentally, exactly the same as for AVX-512.

    My machine can start and finish at most one 32-bit operation on
    every clock cycle, so it is very simple. The same thing goes for
    vector operations: at most one 32-bit vector element per clock
    cycle.

    Thus, it always feels like using vector instructions would not give
    any performance gains. Yet, every time I vectorize a scalar loop
    (basically change scalar registers for vector registers), I see a
    very healthy performance increase.

    I attribute this to reduced loop overhead, eliminated hazards,
    reduced I$ pressure and possibly improved cache locality and
    reduced register pressure.

    (I know very well that VVM gives similar gains without the VRF)

    I guess my point here is that I think that there are opportunities
    in the very low end space (e.g. in order) to improve performance by
    simply adding MRISC32-style vector support. I think that the gains
    would be even bigger for non-pipelined machines, that could start
    "pumping" the execute stage on every cycle when processing vectors,
    skipping the fetch and decode cycles.

    BTW, I have also noticed that I often only need a very limited
    number of vector registers in the core vectorized loops (e.g. 2-4
    registers), so I don't think that the VRF has to be excruciatingly
    big to add value to a small core.

    It depends on what you are doing.
    If you want good performance in matrix multiply type of algorithm
    then 8 VRs would not take you very far. 16 VRs are a lot better.
    More than 16 VR can help somewhat, but the difference between 32
    and 16 (in this type of kernels) is much much smaller than
    difference between 8 and 16.
    Radix-4 and mixed-radix FFT are probably similar except that I never
    profiled as thoroughly as I did SGEMM.


    I expect that people will want to do such things with an MRISC32 core.
    However, for the "small cores" that I'm talking about, I doubt that
    they would even have floating-point support. It's more a question of
    simple loop optimizations - e.g. the kinds you find in libc or
    software rasterization kernels. For those you will often get lots of
    work done with just four vector registers.

    I also envision that for most cases
    you never have to preserve vector registers over function calls.
    I.e. there's really no need to push/pop vector registers to the
    stack, except for context switches (which I believe should be
    optimized by tagging unused vector registers to save on stack
    bandwidth).

    /Marcus

    If CRAY-style VRs work for you, that's no proof that lighter VRs, e.g.
    ARM Helium-style, would not work as well or better.
    My personal opinion is that even for low-end in-order cores the
    CRAY-like huge ratio between VR width and execution width is far
    from optimal. A ratio of 8 looks more optimal when
    performance of vectorized loops is a top priority. A ratio of 4 is a
    wise choice otherwise.

    For MRISC32 I'm aiming for splitting a vector operation into four.
    That seems to eliminate most RAW hazards as execution pipelines tend
    to be at most four stages long (or thereabout). So, with a pipeline
    width of 128 bits (which seems to be the go-to width for many
    implementations), you want registers that have 4 x 128 = 512 bits,
    which is one of the reasons that I mandate at least 512-bit vector
    registers in MRISC32.

    Of course, nothing is set in stone, but so far that has been my
    thinking.

    /Marcus

    Sounds quite reasonable, but I wouldn't call it "Cray-style".


    Then what would you call it?

    I just use the term "Cray-style" to differentiate the style of vector
    ISA from explicit SIMD ISAs, GPU-style vector ISAs and STAR-style memory-memory vector ISAs, etc.

    /Marcus

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcus@21:1/5 to Michael S on Fri Feb 16 13:27:00 2024
    On 2024-02-16, Michael S wrote:
    On Fri, 16 Feb 2024 12:37:55 +0100
    Marcus <m.delete@this.bitsnbites.eu> wrote:


    Then what would you call it?

    I just use the term "Cray-style" to differentiate the style of vector
    ISA from explicit SIMD ISAs, GPU-style vector ISAs and STAR-style
    memory-memory vector ISAs, etc.

    /Marcus

    I'd call it a variant of SIMD.
    For me everything with vector register width to ALU width ratio <= 4 is
    SIMD. 8 is borderline, above 8 is vector.
    It means that sometimes I classify by implementation instead of by architecture, which in theory is problematic. But I don't care, I am not
    in academia.

    Ok, I am generally talking about the ISA, which dictates the semantics
    and what kinds of implementations are possible (or at least
    feasible).

    For my current MRISC32-A1 implementation, the vector register width to
    ALU width ratio is 16, so it would definitely qualify as "vector" then.

    The ISA is designed, however, to support wider execution, but the idea
    is to *not require* very wide execution, but rather encourage sequential execution (up to a point where things like hazard resolution become
    less of a problem and OoO is not really a necessity for high
    throughput).

    /Marcus

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Anton Ertl on Fri Feb 16 05:11:42 2024
    On 2/15/2024 11:27 PM, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    Why not just use Mitch Alsup's wonderful VVM?

    It is true that the state of the art has advanced since the Cray I
    was first introduced. So, perhaps Mitch Alsup has indeed found,
    through improving data forwarding, as I understand it, a way to make
    the performance of a memory-memory vector machine (like the Control
    Data STAR-100) match that of one with vector registers (like the
    Cray I, which succeeded where the STAR-100 failed).

    I don't think that's a proper characterization of VVM. One advantage
    that vector registers have over memory-memory machines is that vector registers, once loaded, can be used several times. And AFAIK VVM has
    that advantage, too. E.g., if you have the loop

    for (i=0; i<n; i++) {
    double b = a[i];
    c[i] = b;
    d[i] = b;
    }

    a[i] is loaded only once (also in VVM), while a memory-memory
    formulation would load a[i] twice. And on the microarchitectural
    level, VVM may work with vector registers, but the nice part is that
    it's only microarchitecture, and it avoids all the nasty consequences
    of making it architectural, such as more expensive context switches.

    Basically, Mitch has his architecture designed for implementation on
    CPUs that are smart enough to notice certain combinations of instructions
    and execute them as though they're single instructions doing the same
    thing, which can then be executed more efficiently.

    My understanding is that he requires explicit marking (why?),

    Of course, Mitch can answer for himself, but ISTM that the explicit
    marking allows a more efficient implementation, specifically the
    instructions in the loop can be fetched and decoded only once, it allows
    the HW to elide some register writes, and saves an instruction by
    combining the loop count decrement and test and the return branch into a
    single instruction. Perhaps the HW could figure out all of that by
    analyzing a "normal" instruction stream, but that seems much harder.



    and that
    the loop can do almost anything, but (I think) it must be a simple
    loop without further control structures.

    It allows predicated instructions within the loop

    I think he also allows
    recurrences (in particular, reductions), but I don't understand how
    his hardware auto-vectorizes that; e.g.:

    double r=0.0;
    for (i=0; i<n; i++)
    r += a[i];

    This is particularly nasty given that FP addition is not associative;
    but even if you allow fast-math-style reassociation, doing this in
    hardware seems to be quite a bit harder than the rest of VVM.

    From what I understand, while you can do reductions in a VVM loop, and
    it takes advantage of wide fetch etc., it doesn't auto-parallelize the reduction, and thus avoids the problem you mention. That does cost
    performance if the reduction could be parallelized, e.g. find the max
    value in an array.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Fri Feb 16 14:23:20 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/15/2024 11:27 PM, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    Basically, Mitch has his architecture designed for implementation on
    CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
    thing, which can then be executed more efficiently.

    My understanding is that he requires explicit marking (why?),

    Of course, Mitch can answer for himself, but ISTM that the explicit
    marking allows a more efficient implementation, specifically the
    instructions in the loop can be fetched and decoded only once, it allows
    the HW to elide some register writes, and saves an instruction by
    combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by
    analyzing a "normal" instruction stream, but that seems much harder.

    Compared to the rest of the VVM stuff, recognizing it in hardware does
    not add much difficulty. Maybe we'll see it in some Intel or AMD CPU
    in the coming years.

    and that
    the loop can do almost anything, but (I think) it must be a simple
    loop without further control structures.

    It allows predicated instructions within the loop

    Sure, predication is not a control structure.

    I think he also allows
    recurrences (in particular, reductions), but I don't understand how
    his hardware auto-vectorizes that; e.g.:

    double r=0.0;
    for (i=0; i<n; i++)
    r += a[i];

    This is particularly nasty given that FP addition is not associative;
    but even if you allow fast-math-style reassociation, doing this in
    hardware seems to be quite a bit harder than the rest of VVM.

    From what I understand, while you can do reductions in a VVM loop, and
    it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
    performance if the reduction could be parallelized, e.g. find the max
    value in an array.

    My feeling is that, for max it's relatively easy to perform a wide
    reduction in hardware. For FP addition that should give the same
    result as the sequential code, it's probably much harder. Of course,
    you can ask the programmer to write:

    double r;
    double r0=0.0;
    ...
    double r15=0.0;
    for (i=0; i<n-15; i+=16) {
        r0 += a[i];
        ...
        r15 += a[i+15];
    }
    ... deal with the remaining iterations ...
    r = r0+...+r15;

    But then the point of auto-vectorization is that the programmers are
    unaware of what's going on behind the curtain, and that promise is not
    kept if they have to write code like above.
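
    For completeness, a short C version of the same idea (4 partial sums
    instead of 16, to keep it readable), including the leftover iterations:

    double sum4(const double *a, int n)
    {
        double r0 = 0.0, r1 = 0.0, r2 = 0.0, r3 = 0.0;
        int i;

        for (i = 0; i + 3 < n; i += 4) {   /* main unrolled loop: 4 partial sums */
            r0 += a[i];
            r1 += a[i + 1];
            r2 += a[i + 2];
            r3 += a[i + 3];
        }
        for (; i < n; i++)                 /* deal with the remaining iterations */
            r0 += a[i];

        return (r0 + r1) + (r2 + r3);      /* combine the partial sums           */
    }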

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Marcus on Fri Feb 16 18:45:59 2024
    Marcus wrote:

    On 2024-02-16, Quadibloc wrote:
    On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
    On 2024-02-14, Quadibloc wrote:

    But there's also one very bad thing about a vector register file.

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances. Most especially, it has to be saved before,
    and restored after, other user-mode programs run, even if they
    aren't _expected_ to use vectors, as a program interrupted by
    a real-time-clock interrupt to let other users do stuff has to
    be able to *rely* on its registers all staying undisturbed, as if
    no interrupts happened.

    Yes, that is the major drawback of a vector register file, so it has to
    be dealt with somehow.

    Yes, and therefore I am looking into ways to deal with it somehow.

    Why not just use Mitch Alsup's wonderful VVM?

    It is true that the state of the art has advanced since the Cray I
    was first introduced. So, perhaps Mitch Alsup has indeed found,
    through improving data forwarding, as I understand it, a way to make
    the performance of a memory-memory vector machine (like the Control
    Data STAR-100) match that of one with vector registers (like the
    Cray I, which succeeded where the STAR-100 failed).

    But because the historical precedent seems to indicate otherwise, and
    because while data forwarding is very definitely a good thing (and,
    indeed, necessary to have for best performance _on_ a vector register
    machine too) it has its limits.

    What _could_ substitute for vector registers isn't data forwarding,
    it's the cache, since that does the same thing vector registers do:
    it brings in vector operands closer to the CPU where they're more
    quickly accessible. So a STAR-100 with a *really good cache* as well
    as data forwarding could, I suppose, compete with a Cray I.

    My first question, though, is whether or not we can really make caches
    that good.


    I think that you are missing some of the points that I'm trying to make.
    In my recent comments I have been talking about very low end machines,
    the kinds that can execute at most one instruction per clock cycle, or
    maybe less, and that may not even have a cache at all.

    I'm saying that I believe that within this category there is an
    opportunity for improving performance with very little cost by adding
    vector operations.

    E.g. imagine a non-pipelined implementation with a single memory port,
    shared by instruction fetch and data load/store, that requires perhaps
    two cycles to fetch and decode an instruction, and executes the
    instruction in the third cycle (possibly accessing the memory, which precludes fetching a new instruction until the fourth or even fifth
    cycle).

    Now imagine if a single instruction could iterate over several elements
    of a vector register. This would mean that the execution unit could
    execute up to one operation every clock cycle, approaching similar performance levels as a pipelined 1 CPI machine. The memory port would
    be free for data traffic as no new instructions have to be fetched
    during the vector loop. And so on.

    You should think of it like:: VVM can execute as many operations per
    cycle as it has function units. In particular, the low end machine
    can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
    cycle. LDs operate at 128 bits wide, so one can execute a LD on even
    cycles and a ST on odd cycles--giving 6 IPC on a 1-wide machine.

    Bigger implementations can have more cache ports and more FMAC units;
    and include "lanes" in SIMD-like fashion.

    Similarly, imagine a very simple strictly in-order pipelined
    implementation, where you have to resolve hazards by stalling the
    pipeline every time there is RAW hazard for instance, and you have to
    throw away cycles every time you mispredict a branch (which may be
    quite often if you only have a very primitive predictor).

    With vector operations you pause the front end (fetch and decode) while iterating over vector elements, which eliminates branch misprediction penalties. You also magically do away with RAW hazards as by the time
    you start issuing a new instruction the vector elements needed from the previous instruction have already been written to the register file.
    And of course you do away with loop overhead instructions (increment, compare, branch).

    VVM does not use branch prediction--it uses a zero-loss ADD-CMP-BC
    instruction I call LOOP.

    And you do not have to lose precise exceptions, either.

    As a bonus, I believe that a vector solution like that would be more
    energy efficient, as less work has to be done for each operation than if
    you have to fetch and decode an instruction for every operation that you
    do.

    More energy efficient, but it draws more power because it is running
    more data in less time.

    As I said, VVM has many similar properties, but I am currently exploring
    if a VRF solution can be made sufficiently cheap to be feasible in this
    very low end space, where I believe that VVM may be a bit too much (this assumption is mostly based on my own ignorance, so take it with a grain
    of salt).

    For reference, the microarchitectural complexity that I'm thinking about
    is comparable to FemtoRV32 by Bruno Levy (400 LOC, with comments):

    https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v

    /Marcus

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Quadibloc on Fri Feb 16 18:35:01 2024
    Quadibloc wrote:

    On Thu, 15 Feb 2024 19:44:27 +0100, Marcus wrote:
    On 2024-02-14, Quadibloc wrote:

    But there's also one very bad thing about a vector register file.

    Like any register file, it has to be *saved* and *restored* under
    certain circumstances. Most especially, it has to be saved before,
    and restored after, other user-mode programs run, even if they
    aren't _expected_ to use vectors, as a program interrupted by
    a real-time-clock interrupt to let other users do stuff has to
    be able to *rely* on its registers all staying undisturbed, as if
    no interrupts happened.

    Yes, that is the major drawback of a vector register file, so it has to
    be dealt with somehow.

    Yes, and therefore I am looking into ways to deal with it somehow.

    Why not just use Mitch Alsup's wonderful VVM?

    It is true that the state of the art has advanced since the Cray I
    was first introduced. So, perhaps Mitch Alsup has indeed found,
    through improving data forwarding, as I understand it, a way to make
    the performance of a memory-memory vector machine (like the Control
    Data STAR-100) match that of one with vector registers (like the
    Cray I, which succeeded where the STAR-100 failed).

    VVM on My 66000 remains a RISC ISA--what it does is provide an
    implementation freedom to perform multiple loop iterations (SIMD-style) at the same time.
    CRAY nomenclature would call this "lanes".

    But because the historical precedent seems to indicate otherwise, and
    because while data forwarding is very definitely a good thing (and,
    indeed, necessary to have for best performance _on_ a vector register
    machine too) it has its limits.

    What _could_ substitute for vector registers isn't data forwarding,
    it's the cache, since that does the same thing vector registers do:
    it brings in vector operands closer to the CPU where they're more
    quickly accessible. So a STAR-100 with a *really good cache* as well
    as data forwarding could, I suppose, compete with a Cray I.

    Cache buffers to be more precise.

    My first question, though, is whether or not we can really make caches
    that good.

    Once a memory reference in a vectorized loop starts to miss, you quit
    storing the data in the cache and just strip mine it through the cache
    buffers, and avoid polluting the DCache with data that will be displaced
    before the loop completes.

    But skepticism about VVM isn't actually helpful if Cray-style vectors
    are now impossible to be made to work given current memory speeds.

    The basic way in which I originally felt I could make it work was really quite simple. The operating system, from privileged code, could set a
    bit in the PSW that turns on, or off, the ability to run instructions that access the vector registers.

    The details of how one may have to make use of that capability... well, that's software. So maybe the OS has to stipulate that one can only have
    one process at a time that uses these vectors - and that process has to
    run as a batch process!

    Hey, the GPU in a computer these days is also a singular resource.

    Having resources that have to be treated that way is not really what
    people are used to, but a computer that _can_ run your CFD codes
    efficiently is better than a computer that *can't* run your CFD codes.

    Given _that_, obviously if VVM is a better fit to the regular computer
    model, and it offers nearly the same performance, then what I should do
    is offer VVM or something very much like it _in addition_ to Cray-style vectors, so that the best possible vector performance for conventional non-batch programs is also available.

    Now, what would I think of as being "something very much like VVM" without actually being VVM?

    Basically, Mitch has his architecture designed for implementation on
    CPUs that are smart enough to notice certain combinations of instructions
    and execute them as though they're single instructions doing the same
    thing, which can then be executed more efficiently.

    So this makes those exact combinations part of the... ISA syntax...
    which I think is too hard for assembler programmers to remember, and
    I think it's also too hard for at least some implementors. I see it
    as asking for trouble in a way that I'd rather avoid.

    So my substitute for VVM should now be obvious - explicit memory-to-memory vector instructions, like on an old STAR-100.

    Gasp........

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Fri Feb 16 18:53:00 2024
    Stephen Fuld wrote:

    On 2/15/2024 11:27 PM, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    Why not just use Mitch Alsup's wonderful VVM?

    It is true that the state of the art has advanced since the Cray I
    was first introduced. So, perhaps Mitch Alsup has indeed found,
    through improving data forwarding, as I understand it, a way to make
    the performance of a memory-memory vector machine (like the Control
    Data STAR-100) match that of one with vector registers (like the
    Cray I, which succeeded where the STAR-100 failed).

    I don't think that's a proper characterization of VVM. One advantage
    that vector registers have over memory-memory machines is that vector
    registers, once loaded, can be used several times. And AFAIK VVM has
    that advantage, too. E.g., if you have the loop

    for (i=0; i<n; i++) {
    double b = a[i];
    c[i] = b;
    d[i] = b;
    }

    a[i] is loaded only once (also in VVM), while a memory-memory
    formulation would load a[i] twice. And on the microarchitectural
    level, VVM may work with vector registers, but the nice part is that
    it's only microarchitecture, and it avoids all the nasty consequences
    of making it architectural, such as more expensive context switches.

    Basically, Mitch has his architecture designed for implementation on
    CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
    thing, which can then be executed more efficiently.

    My understanding is that he requires explicit marking (why?),

    Bookends on the loop provide the information the HW needs, the VEC
    instruction at the top provides the IP for the LOOP instruction at
    the bottom to branch to, and also provides a bit map of registers
    which are live-out of the loop, discarding other used loop registers.

    Of course, Mitch can answer for himself, but ISTM that the explicit
    marking allows a more efficient implementation, specifically the
    instructions in the loop can be fetched and decoded only once, it allows
    the HW to elide some register writes, and saves an instruction by
    combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by
    analyzing a "normal" instruction stream, but that seems much harder.

    All of that is correct.

    and that
    the loop can do almost anything, but (I think) it must be a simple
    loop without further control structures.

    It allows predicated instructions within the loop

    Predicated control flow--yes, branch flow-control no.

    I think he also allows
    recurrences (in particular, reductions), but I don't understand how
    his hardware auto-vectorizes that; e.g.:

    double r=0.0;
    for (i=0; i<n; i++)
    r += a[i];

    This is particularly nasty given that FP addition is not associative;
    but even if you allow fast-math-style reassociation, doing this in
    hardware seems to be quite a bit harder than the rest of VVM.

    From what I understand, while you can do reductions in a VVM loop, and
    it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
    performance if the reduction could be parallelized, e.g. find the max
    value in an array.

    Right, register and memory dependencies are observed and obeyed. So,
    in the above loop, the recurrence slows the loop down to the latency of
    FADD, but the LD and ADD-CMP-BC run concurrently; so, you are still faster
    than if you did not use VVM on the loop.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Feb 16 18:57:11 2024
    Anton Ertl wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/15/2024 11:27 PM, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    Basically, Mitch has his architecture designed for implementation on
    CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
    thing, which can then be executed more efficiently.

    My understanding is that he requires explicit marking (why?),

    Of course, Mitch can answer for himself, but ISTM that the explicit
    marking allows a more efficient implementation, specifically the instructions in the loop can be fetched and decoded only once, it allows the HW to elide some register writes, and saves an instruction by
    combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by analyzing a "normal" instruction stream, but that seems much harder.

    Compared to the rest of the VVM stuff, recognizing it in hardware does
    not add much difficulty. Maybe we'll see it in some Intel or AMD CPU
    in the coming years.

    and that
    the loop can do almost anything, but (I think) it must be a simple
    loop without further control structures.

    It allows predicated instructions within the loop

    Sure, predication is not a control structure.

    I think he also allows
    recurrences (in particular, reductions), but I don't understand how
    his hardware auto-vectorizes that; e.g.:

    double r=0.0;
    for (i=0; i<n; i++)
    r += a[i];

    This is particularly nasty given that FP addition is not associative;
    but even if you allow fast-math-style reassociation, doing this in
    hardware seems to be quite a bit harder than the rest of VVM.

    From what I understand, while you can do reductions in a VVM loop, and
    it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention. That does cost
    performance if the reduction could be parallelized, e.g. find the max
    value in an array.

    My feeling is that, for max it's relatively easy to perform a wide
    reduction in hardware. For FP addition that should give the same
    result as the sequential code, it's probably much harder. Of course,
    you can ask the programmer to write:

    double r;
    double r0=0.0;
    ...
    double r15=0.0;
    for (i=0; i<n-15; i+=16) {
        r0 += a[i];
        ...
        r15 += a[i+15];
    }
    ... deal with the remaining iterations ...
    r = r0+...+r15;

    But then the point of auto-vectorization is that the programmers are
    unaware of what's going on behind the curtain, and that promise is not
    kept if they have to write code like above.

    VVM is also adept at vectorizing str* and mem* functions from the C
    library, and as such, you have to do it in a way that even ISRs can
    use VVM (when it is to their advantage).

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Anton Ertl on Fri Feb 16 11:03:26 2024
    On 2/16/2024 6:23 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/15/2024 11:27 PM, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    Basically, Mitch has his architecture designed for implementation on
    CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same
    thing, which can then be executed more efficiently.

    My understanding is that he requires explicit marking (why?),

    Of course, Mitch can answer for himself, but ISTM that the explicit
    marking allows a more efficient implementation, specifically the
    instructions in the loop can be fetched and decoded only once, it allows
    the HW to elide some register writes, and saves an instruction by
    combining the loop count decrement and test and the return branch into a
    single instruction. Perhaps the HW could figure out all of that by
    analyzing a "normal" instruction stream, but that seems much harder.

    Compared to the rest of the VVM stuff, recognizing it in hardware does
    not add much difficulty.

    IANAHG, but if it were that simple, I would think Mitch would have
    implemented it that way.


    Maybe we'll see it in some Intel or AMD CPU
    in the coming years.

    One can hope!



    and that
    the loop can do almost anything, but (I think) it must be a simple
    loop without further control structures.

    It allows predicated instructions within the loop

    Sure, predication is not a control structure.

    OK, but my point is that you can do conditional execution within a VVM loop.


    I think he also allows
    recurrences (in particular, reductions), but I don't understand how
    his hardware auto-vectorizes that; e.g.:

    double r=0.0;
    for (i=0; i<n; i++)
    r += a[i];

    This is particularly nasty given that FP addition is not associative;
    but even if you allow fast-math-style reassociation, doing this in
    hardware seems to be quite a bit harder than the rest of VVM.

    From what I understand, while you can do reductions in a VVM loop, and
    it takes advantage of wide fetch etc., it doesn't auto parallelize the
    reduction, thus avoids the problem you mention. That does cost
    performance if the reduction could be parallelized, e.g. find the max
    value in an array.

    My feeling is that, for max it's relatively easy to perform a wide
    reduction in hardware.

    Sure. ISTM, and again, IANAHG, that the problem for VVM is the hardware recognizing that the loop contains no instructions that can't be
    parallelized. There are also some issues like doing a sum of signed
    integer values and knowing whether overflow occurred, etc. The
    programmer may know that overflow cannot occur, but the HW doesn't.



    For FP addition that should give the same
    result as the sequential code, it's probably much harder. Of course,
    you can ask the programmer to write:

    double r;
    double r0=0.0;
    ...
    double r15=0.0;
    for (i=0; i<n-15; i+=16) {
    r0 += a[i];
    ...
    r15 += a[i+15];
    }
    ... deal with the remaining iterations ...
    r = r0+...+r15;

    But then the point of auto-vectorization is that the programmers are
    unaware of what's going on behind the curtain, and that promise is not
    kept if they have to write code like above.

    Agreed.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Feb 16 14:27:19 2024
    MitchAlsup1 wrote:

    You should think of it like:: VVM can execute as many operations per
    cycle as it has function units. In particular, the low end machine
    can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
    cycle. LDs operate at 128-bits wide, so one can execute a LD on even
    cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.

    Bigger implementations can have more cache ports and more FMAC units;
    and include "lanes" in SIMD-like fashion.

    Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
    two consecutive 64-bit LD or ST to consecutive addresses and merges
    them into a single cache access?
    Is that done by disambiguation logic, checking for same cache line access?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Fri Feb 16 23:34:33 2024
    EricP wrote:

    MitchAlsup1 wrote:

    You should think of it like:: VVM can execute as many operations per
    cycle as it has function units. In particular, the low end machine
    can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
    cycle. LDs operate at 128-bits wide, so one can execute a LD on even
    cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.

    Bigger implementations can have more cache ports and more FMAC units;
    and include "lanes" in SIMD-like fashion.

    Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
    two consecutive 64-bit LD or ST to consecutive addresses and merges
    them into a single cache access?

    first: memory is inherently misaligned in My 66000 architecture. So, since
    the width of the machine is 64-bits, we read or write in 128-bit quantities
    so that we have enough bits to extract the misaligned data from or a container large enough to store a 64-bit value into. {{And there are all the associated corner cases}}

    Second: over in VVM-land, the implementation can decide to read and write wider, but is architecturally constrained not to shrink below 128-bits.

    A 1-wide My66160 would read pairs of double precision FP values, or quads
    of 32-bit values, octets of 16-bit values, and hexadecimals of 8-bit values. This supports loops of 6 IPC or greater in a 1-wide machine. This machine
    would process suitable loops at 128-bits per cycle--depending on "other
    things" that are generally allowable.

    A 6-wide My66650 would read a cache line at a time, and has 3 cache ports
    per cycle. This supports 20 IPC or greater in the 6-wide machine. As many as
    8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths
    to support this rate.

    Is that done by disambiguation logic, checking for same cache line access?

    Before I have said that the front end observes the first iteration of the
    loop and makes some determinations as to how wide the loop can be run on
    the machine at hand. One of those observations is whether memory addresses
    are dense, whether they all go in the same direction, and what registers
    carry loop-to-loop dependencies.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Stephen Fuld on Fri Feb 16 23:22:08 2024
    Stephen Fuld wrote:

    On 2/16/2024 6:23 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/15/2024 11:27 PM, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:
    Basically, Mitch has his architecture designed for implementation on CPUs that are smart enough to notice certain combinations of instructions and execute them as though they're single instructions doing the same thing, which can then be executed more efficiently.

    My understanding is that he requires explicit marking (why?),

    Of course, Mitch can answer for himself, but ISTM that the explicit
    marking allows a more efficient implementation, specifically the
    instructions in the loop can be fetched and decoded only once, it allows the HW to elide some register writes, and saves an instruction by
    combining the loop count decrement and test and the return branch into a single instruction. Perhaps the HW could figure out all of that by
    analyzing a "normal" instruction stream, but that seems much harder.

    Compared to the rest of the VVM stuff, recognizing it in hardware does
    not add much difficulty.

    IANAHG, but if it were that simple, I would think Mitch would have implemented it that way.


    Maybe we'll see it in some Intel or AMD CPU
    in the coming years.

    One can hope!



    and that
    the loop can do almost anything, but (I think) it must be a simple
    loop without further control structures.

    It allows predicated instructions within the loop

    Sure, predication is not a control structure.

    OK, but my point is that you can do conditional execution within a VVM loop.


    I think he also allows
    recurrences (in particular, reductions), but I don't understand how
    his hardware auto-vectorizes that; e.g.:

    double r=0.0;
    for (i=0; i<n; i++)
    r += a[i];

    This is particularly nasty given that FP addition is not associative;
    but even if you allow fast-math-style reassociation, doing this in
    hardware seems to be quite a bit harder than the rest of VVM.

    From what I understand, while you can do reductions in a VVM loop, and
    it takes advantage of wide fetch etc., it doesn't auto parallelize the
    reduction, thus avoids the problem you mention. That does cost
    performance if the reduction could be parallelized, e.g. find the max
    value in an array.

    My feeling is that, for max it's relatively easy to perform a wide
    reduction in hardware.

    Sure. ISTM, and again, IANAHG, that the problem for VVM is the hardware recognizing that the loop contains no instructions that can't be parallelized. There are also some issues like doing a sum of signed
    integer values and knowing whether overflow occurred, etc. The
    programmer may know that overflow cannot occur, but the HW doesn't.

    The HW does not need preceding knowledge. If an exception happens, the vectorized loop collapses into a scalar loop precisely, and can be
    handled in the standard fashion.

    For FP addition that should give the same
    result as the sequential code, it's probably much harder. Of course,
    you can ask the programmer to write:

    double r;
    double r0=0.0;
    ...
    double r15=0.0;
    for (i=0; i<n-15; i+=16) {
    r0 += a[i];
    ...
    r15 += a[i+15];
    }
    ... deal with the remaining iterations ...
    r = r0+...+r15;

    But then the point of auto-vectorization is that the programmers are
    unaware of what's going on behind the curtain, and that promise is not
    kept if they have to write code like above.

    Agreed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to All on Sat Feb 17 04:30:37 2024
    On Fri, 16 Feb 2024 18:35:01 +0000, MitchAlsup1 wrote:
    Quadibloc wrote:

    So my substitute for VVM should now be obvious - explicit memory-to-memory vector instructions, like on an old STAR-100.

    Gasp........

    Oh, dear. But, yes, old-style memory-to-memory vector instructions omit
    at least one very important thing that VVM provides, which I do indeed
    want to make sure I include.

    So there would need to be instructions like

    multiply v1 by v2 giving scratch-1
    add scratch-1 to scratch-2 giving scratch-3
    divide scratch-2 by v1 giving v4

    ... that is, instead of vector registers, there would still be another
    kind of thing that isn't a vector in memory, but instead an *explicit* reference to a forwarding node.

    And so these vector instructions would have to be in explicitly
    delimited groups (since forwarding nodes, unlike vector registers, aren't intended to be _persistent_, so a group of vector instructions would have
    to combine into a clause which for some purposes acts like a single instruction)... which then makes it look a whole lot _more_ like VVM,
    even though the inside of the sandwich is now special instructions,
    instead of ordinary arithmetic instructions as in VVM.

    I think there _may_ have been something like this already in the
    original Concertina.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Fri Feb 16 23:33:32 2024
    On 2/16/2024 3:22 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 2/16/2024 6:23 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/15/2024 11:27 PM, Anton Ertl wrote:

    snip

    I think he also allows
    recurrences (in particular, reductions), but I don't understand how
    his hardware auto-vectorizes that; e.g.:

    double r=0.0;
    for (i=0; i<n; i++)
        r += a[i];

    This is particularly nasty given that FP addition is not associative; >>>>> but even if you allow fast-math-style reassociation, doing this in
    hardware seems to be quite a bit harder than the rest of VVM.

    From what I understand, while you can do reductions in a VVM loop, and it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention.  That does cost
    performance if the reduction could be parallelized, e.g. find the max
    value in an array.

    My feeling is that, for max it's relatively easy to perform a wide
    reduction in hardware.

    Sure.  ISTM, and again, IANAHG, that the problem for VVM is the
    hardware recognizing that the loop contains no instructions that can't
    be parallelized.  There are also some issues like doing a sum of
    signed integer values and knowing whether overflow occurred, etc.  The
    programmer may know that overflow cannot occur, but the HW doesn't.

    The HW does not need preceding knowledge. If an exception happens, the vectorized loop collapses into a scalar loop precisely, and can be
    handled in the standard fashion.

    I think you might have missed my point. If you are summing the signed
    integer elements of an array, whether you get an overflow or not can
    depend on the order the additions are done. Thus, without knowledge
    that only the programmer has (i.e. that with the size of the actual data
    used, overflow is impossible) the hardware cannot parallelize such an operation. If the programmer knows that overflow cannot occur, he has
    no way to communicate that to the VVM hardware, such that the HW could parallelize the summation.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Sat Feb 17 10:34:05 2024
    Stephen Fuld wrote:
    On 2/16/2024 3:22 PM, MitchAlsup wrote:
    Stephen Fuld wrote:

    On 2/16/2024 6:23 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 2/15/2024 11:27 PM, Anton Ertl wrote:

    snip

    I think he also allows
    recurrences (in particular, reductions), but I don't understand how his hardware auto-vectorizes that; e.g.:

    double r=0.0;
    for (i=0; i<n; i++)
        r += a[i];

    This is particularly nasty given that FP addition is not associative; but even if you allow fast-math-style reassociation, doing this in hardware seems to be quite a bit harder than the rest of VVM.

     From what I understand, while you can do reductions in a VVM
    loop, and
    it takes advantage of wide fetch etc., it doesn't auto parallelize the reduction, thus avoids the problem you mention.  That does cost
    performance if the reduction could be parallelized, e.g. find the max value in an array.

    My feeling is that, for max it's relatively easy to perform a wide
    reduction in hardware.

    Sure.  ISTM, and again, IANAHG, that the problem for VVM is the
    hardware recognizing that the loop contains no instructions that
    can't be parallelized.  There are also some issues like doing a sum
    of signed integer values and knowing whether overflow occurred,
    etc.  The programmer may know that overflow cannot occur, but the HW doesn't.

    The HW does not need preceding knowledge. If an exception happens, the
    vectorized loop collapses into a scalar loop precisely, and can be
    handled in the standard fashion.

    I think you might have missed my point.  If you are summing the signed integer elements of an array, whether you get an overflow or not can
    depend on the order the additions are done.  Thus, without knowledge
    that only the programmer has (i.e. that with the size of the actual data used, overflow is impossible) the hardware cannot parallelize such an operation.  If the programmer knows that overflow cannot occur, he has
    no way to communicate that to the VVM hardware, such that the HW could parallelize the summation.

    I am not sure, but I strongly believe that VVM cannot be caught out this
    way, simply because it would observe the accumulator loop dependency.

    I.e. it could do all the other loop instructions (load/add/loop counter decrement & branch) completely overlapped, but the actual adds to the accumulator register would limit total throughput to the ADD-to-ADD latency.

    So, on the first hand, VVM cannot automagically parallelize this to use multiple accumulators, on the other hand a programmer would be free to
    use a pair of wider accumulators to sidestep the issue.
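
    A minimal sketch of that sidestep (an added illustration, not Terje's
    code; it assumes the true sum fits in the result type):

    #include <stdint.h>
    #include <stddef.h>

    int32_t sum32(size_t len, const int32_t data[])
    {
        /* two wide accumulators over even/odd elements: the partial sums
           carry no dependency on each other, so they can proceed in
           parallel, and the 64-bit width absorbs intermediate overflow */
        int64_t s0 = 0, s1 = 0;
        size_t i = 0;
        for (; i + 1 < len; i += 2) {
            s0 += data[i];
            s1 += data[i + 1];
        }
        if (i < len)
            s0 += data[i];
        return (int32_t)(s0 + s1);   /* assumes the final sum is in range */
    }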

    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop into
    one that would not trap and give the mathematically correct answer.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to BGB on Sat Feb 17 10:20:44 2024
    BGB wrote:
    On 2/16/2024 5:29 AM, Marcus wrote:
    I'm saying that I believe that within this category there is an
    opportunity for improving performance with very little cost by adding
    vector operations.

    E.g. imagine a non-pipelined implementation with a single memory port,
    shared by instruction fetch and data load/store, that requires perhaps
    two cycles to fetch and decode an instruction, and executes the
    instruction in the third cycle (possibly accessing the memory, which
    precludes fetching a new instruction until the fourth or even fifth
    cycle).

    Now imagine if a single instruction could iterate over several elements
    of a vector register. This would mean that the execution unit could
    execute up to one operation every clock cycle, approaching similar
    performance levels as a pipelined 1 CPI machine. The memory port would
    be free for data traffic as no new instructions have to be fetched
    during the vector loop. And so on.


    I guess possible.

    Absolutely possible. After all, the IBM block move and all the 1978 x86
    string ops were designed to make an internal, interruptible, loop. No
    need to load more instructions, just let the internal state machine run
    until completion.

    The current state of the art (i.e. VVM) is of course far more capable,
    but the original idea is old.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Sat Feb 17 07:49:34 2024
    MitchAlsup wrote:
    EricP wrote:

    MitchAlsup1 wrote:

    You should think of it like:: VVM can execute as many operations per
    cycle as it has function units. In particular, the low end machine
    can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
    cycle. LDs operate at 128-bits wide, so one can execute a LD on even
    cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.

    Bigger implementations can have more cache ports and more FMAC units;
    and include "lanes" in SIMD-like fashion.

    Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
    two consecutive 64-bit LD or ST to consecutive addresses and merges
    them into a single cache access?

    first: memory is inherently misaligned in My 66000 architecture. So, since the width of the machine is 64-bits, we read or write in 128-bit quantities so that we have enough bits to extract the misaligned data from or a container
    large enough to store a 64-bit value into. {{And there are all the
    associated
    corner cases}}

    Second: over in VVM-land, the implementation can decide to read and write wider, but is architecturally constrained not to shrink below 128-bits.

    A 1-wide My66160 would read pairs of double precision FP values, or quads
    of 32-bit values, octets of 16-bit values, and hexadecimals of 8-bit values. This supports loops of 6 IPC or greater in a 1-wide machine. This machine would process suitable loops at 128-bits per cycle--depending on "other things" that are generally allowable.

    A 6-wide My66650 would read a cache line at a time, and has 3 cache ports
    per cycle. This supports 20 IPC or greater in the 6-wide machine. As
    many as
    8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths to support this rate.

    Ah, so it can emit Load/Store Pair LDP/STP (or wider) uOps inside the loop. That's more straightforward than fusing LD's or ST's in LSQ.

    Is that done by disambiguation logic, checking for same cache line
    access?

    Before I have said that the front end observes the first iteration of
    the loop and makes some determinations as to how wide the loop can be
    run on
    the machine at hand. One of those observations is whether memory addresses are dense, whether they all go in the same direction, and what registers carry loop-to-loop dependencies.

    How does it know when to use LDP/STP uOps?
    That decision would have to be made early in the front end, likely Decode
    and before Rename because you have to know how many dest registers you need.

    But the decision on the legality to use LDP/STP depends on knowing the
    current loop counter >= 2 and address(es) aligned on a 16 byte boundary,
    which are multiple dynamic, possibly calculated, values only available
    much later to the back end.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sat Feb 17 17:18:36 2024
    EricP wrote:

    MitchAlsup wrote:
    EricP wrote:

    MitchAlsup1 wrote:

    You should think of it like:: VVM can execute as many operations per
    cycle as it has function units. In particular, the low end machine
    can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
    cycle. LDs operate at 128-bits wide, so one can execute a LD on even
    cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.

    Bigger implementations can have more cache ports and more FMAC units;
    and include "lanes" in SIMD-like fashion.

    Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
    two consecutive 64-bit LD or ST to consecutive addresses and merges
    them into a single cache access?

    first: memory is inherently misaligned in My 66000 architecture. So, since the width of the machine is 64-bits, we read or write in 128-bit quantities so that we have enough bits to extract the misaligned data from or a
    container
    large enough to store a 64-bit value into. {{And there are all the
    associated
    corner cases}}

    Second: over in VVM-land, the implementation can decide to read and write
    wider, but is architecturally constrained not to shrink below 128-bits.

    A 1-wide My66160 would read pairs of double precision FP values, or quads
    of 32-bit values, octets of 16-bit values, and hexadecimals of 8-bit values. This supports loops of 6 IPC or greater in a 1-wide machine. This machine
    would process suitable loops at 128-bits per cycle--depending on "other
    things" that are generally allowable.

    A 6-wide My66650 would read a cache line at a time, and has 3 cache ports
    per cycle. This supports 20 IPC or greater in the 6-wide machine. As
    many as
    8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths to support this rate.

    Ah, so it can emit Load/Store Pair LDP/STP (or wider) uOps inside the loop. That's more straightforward than fusing LD's or ST's in LSQ.

    Is that done by disambiguation logic, checking for same cache line
    access?

    Before I have said that the front end observes the first iteration of
    the loop and makes some determinations as to how wide the loop can be
    run on
    the machine at hand. One of those observations is whether memory addresses are dense, whether they all go in the same direction, and what registers
    carry loop-to-loop dependencies.

    How does it know when to use LDP/STP uOps?

    It does not have LDP/STP ops to use.
    It uses the width of the cache port it has.
    It just so happens that the low end machine has a cache width of 128-bits.
    But each implementation gets to choose its own width.

    That decision would have to be made early in the front end, likely Decode
    and before Rename because you have to know how many dest registers you need.

    It is not using a register, although it is using flip-flops. It is not
    using something that is visible to SW but is visible to HW.

    But the decision on the legality to use LDP/STP depends on knowing the current loop counter >= 2 and address(es) aligned on a 16 byte boundary, which are multiple dynamic, possibly calculated, values only available
    much later to the back end.

    It does not need to see the address aligned to a 16-byte boundary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Sat Feb 17 18:03:53 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct answer"?

    If you are talking about integer arithmetic, the limited integers in
    Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly don't trap.

    If you are talking about FP (like I did), by default FP addition does
    not trap in Java, and any mention of "mathematically correct" in
    connection with FP needs a lot of further elaboration.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Sat Feb 17 19:58:19 2024
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct answer"?

    If you are talking about integer arithmetic, the limited integers in
    Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly don't trap.

    If you are talking about FP (like I did), by default FP addition does
    not trap in Java, and any mention of "mathematically correct" in
    connection with FP needs a lot of further elaboration.

    Sorry to be unclear:

    I was specifically talking about adding a bunch of integers together,
    some positive and some negative, so that by doing them in program order
    you will get an overflow, but if you did them in some other order, or
    with a double-wide accumulator, the final result would in fact fit in
    the designated target variable.

    int8_t sum(int len, int8_t data[])
    {
    int8_t s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return s;
    }

    will overflow if called with data = [127, 1, -2], right?

    while if you implement it with

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    then you would be OK, and the final result would be mathematically correct.

    For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.
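
    Spelling that out for the [127, 1, -2] input (a worked example added
    here, assuming the usual two's-complement wrapping on conversion):

    #include <stdint.h>
    #include <assert.h>

    int main(void)
    {
        int8_t s = 0;
        s = (int8_t)(s + 127);    /* s == 127 */
        s = (int8_t)(s + 1);      /* 128 wraps to -128 */
        s = (int8_t)(s + (-2));   /* -130 wraps to 126 == 127 + 1 - 2 */
        assert(s == 126);
        return 0;
    }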

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Sat Feb 17 20:03:01 2024
    Terje Mathisen wrote:

    Anton Ertl wrote:

    If you are talking about FP (like I did), by default FP addition does
    not trap in Java, and any mention of "mathematically correct" in
    connection with FP needs a lot of further elaboration.

    Sorry to be unclear:

    I was specifically talking about adding a bunch of integers together,
    some positive and some negative, so that by doing them in program order
    you will get an overflow, but if you did them in some other order, or
    with a double-wide accumulator, the final result would in fact fit in
    the designated target variable.

    int8_t sum(int len, int8_t data[])
    {
    int8_t s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return s;
    }

    will overflow if called with data = [127, 1, -2], right?

    Yes, and it should not be vectorized when your vector resource has
    CRAY-like vector registers--however, it can be vectorized with VVM
    like resources.

    while if you implement it with

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    then you would be OK, and the final result would be mathematically correct.

    when len > 2^24 it may still not be mathematically correct for 32-bit ints
    or len > 2^56 for 64-bit ints.

    For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat Feb 17 22:03:16 2024
    BGB wrote:

    On 2/17/2024 3:20 AM, Terje Mathisen wrote:
    BGB wrote:


    But, I am not entirely sure how one would go about implementing it, as
    VADD.H would need to do the equivalent of:
    MOV.Q (R4), R16
    MOV.Q (R5), R17
    ADD 8, R4
    ADD 8, R5
    PADD.H R16, R17, R18
    MOV.Q R18, (R6)
    ADD 8, R6
    All in a single instruction.

    With the proper instruction set, the above is::

    VEC R9,{}
    LDSH R10,[R1,Ri<<1]
    LDSH R11,[R2,Ri<<1]
    ADD R12,R10,R11
    STH R12,[R3,Ri<<1]
    LOOP LT,Ri,#1,Rmax

    Once you see that there is no loop recurrence, then the loops can be run concurrently as wide as you have arithmetic capabilities and cache BW--
    in this case we have an arithmetic capability of 4 Halfword ADDs per cycle
    and a memory capability of 128-bits every cycle creating a BW of 4×5 inst every 1.5 cycles or 13.3 IPC and we are memory limited, not arithmetic
    limited.

    Though, could be reduced if auto-increment were re-added:
    MOV.Q @R4+, R16
    MOV.Q @R5+, R17
    PADD.H R16, R17, R18
    MOV.Q R18, @R6+

    You will find the requisite patterns harder to recognize when the memory reference size is NOT the calculation size. In your case, the calculation
    is .H while memory reference is .Q .

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat Feb 17 22:08:30 2024
    BGB wrote:

    On 2/17/2024 12:03 PM, Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct answer"?

    If you are talking about integer arithmetic, the limited integers in
    Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly don't trap.


    Yes.

    Trap on overflow is not really a thing in the JVM, the basic integer
    types are modulo, and don't actually distinguish signed from unsigned (unsigned arithmetic is merely faked in some cases with special
    operators, with signed arithmetic assumed as the default).


    If you are talking about FP (like I did), by default FP addition does
    not trap in Java, and any mention of "mathematically correct" in
    connection with FP needs a lot of further elaboration.

    People skilled in numerical analysis hate java FP semantics.

    Yeah. No traps, only NaNs.


    FWIW: My own languages, and BGBCC, also partly followed Java's model in
    this area. But, it wasn't hard: This is generally how C behaves as well
    on most targets.

    Well, except that C will often trap for things like divide by zero and similar, at least on x86. Though, off-hand, I don't remember whether or
    not JVM throws an exception on divide-by-zero.


    On BJX2, there isn't currently any divide-by-zero trap, since:
    This case doesn't happen in normal program execution;
    Handling it with a trap would cost more than not bothering.

    This sounds like it should make your machine safe to program and use,
    but it does not.

    So, IIRC, integer divide-by-zero will just give 0, and FP divide-by-zero
    will give Inf or NaN.

    Can I volunteer this as the worst possible value for int/0, [un]signedMAX
    is trivially harder to implement.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Sat Feb 17 15:36:31 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Feb 18 01:03:23 2024
    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    But the return statement is where overflow (if any) is detected.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sun Feb 18 01:06:46 2024
    BGB wrote:

    On 2/17/2024 4:08 PM, MitchAlsup1 wrote:
    BGB wrote:


    On BJX2, there isn't currently any divide-by-zero trap, since:
       This case doesn't happen in normal program execution;
       Handling it with a trap would cost more than not bothering.

    This sounds like it should make your machine safe to program and use,
    but it does not.


    It is more concerned with "cheap" than "safe".

    Trap on divide-by-zero would require having a way for the divider unit
    to signal divide-by-zero has been encountered (say, so some external
    logic can raise the corresponding exception code). This is not free.

    Most result busses have a bit that carries exception to the retire end
    of the pipeline. The retire stage looks at the bit, sees a DIV instruction
    and knows what exception was raised. FP generally needs 3-such bits on
    the result bus.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Sat Feb 17 20:01:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    But the return statement is where overflow (if any) is detected.

    The cast is superfluous because a conversion to int8_t will be
    done in any case, since the return type of the function is
    int8_t.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Sun Feb 18 07:47:13 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop into one that would not trap and give the mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct answer"?
    ...
    I was specifically talking about adding a bunch of integers together,
    some positive and some negative, so that by doing them in program order
    you will get an overflow, but if you did them in some other order, or
    with a double-wide accumulator, the final result would in fact fit in
    the designated target variable.

    As mentioned, Java defines addition of the integral base types to use
    modulo (aka wrapping) arithmetic, i.e., overflow is fully defined with
    nice properties. In particular, the associative law holds for modulo
    addition, which allows all kinds of reassociations that are helpful
    for parallelizing reduction.

    int8_t sum(int len, int8_t data[])
    {
    int8_t s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return s;
    }

    will overflow if called with data = [127, 1, -2], right?

    I don't think that int8_t or unsigned are Java types.

    If that is C code: C standard lawyers will tell you what the C
    standard says about storing 128 into s (the addition itself does not
    overflow, because it uses ints).

    For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.

    The standardized subset of C is not relevant for discussing Java.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Opus@21:1/5 to Tim Rentsch on Sun Feb 18 09:24:23 2024
    On 18/02/2024 05:01, Tim Rentsch wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    But the return statement is where overflow (if any) is detected.

    The cast is superfluous because a conversion to int8_t will be
    done in any case, since the return type of the function is
    int8_t.

    Of course the conversion will be done implicitly. C converts almost
    anything implicitly. Not that this is its greatest feature.

    The explicit cast is still useful: 1/ to express intent (it shows that
    the potential loss of data is intentional) and then 2/ to avoid compiler warnings (if you enable -Wconversion, which I usually recommend) or
    warnings from any serious static analyzer (which I also highly recommend using).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sun Feb 18 08:00:18 2024
    BGB <cr88192@gmail.com> writes:
    Well, except that C will often trap for things like divide by zero and similar, at least on x86.

    The division instructions of IA-32 and AMD64 trap on divide-by-zero
    and when the result is out of range. Unsurprisingly, C compilers
    usually use these instructions when compiling division on these
    architectures. One interesting case is what C compilers do when you
    write

    long foo(long x)
    {
    return x/-1;
    }

    Both gcc and clang compile this to

    0: 48 89 f8 mov %rdi,%rax
    3: 48 f7 d8 neg %rax
    6: c3 retq

    and you don't get a trap when you call foo(LONG_MIN), while you would
    if the compiler did not know that the divisor is -1 (and it was -1 at run-time).

    By contrast, when I implemented division-by-constant optimization in
    Gforth, I decided not to "optimize" the division by -1 case, so you get
    the ordinary division operation and its behaviour. If a programmer
    codes a division by -1 rather than just NEGATE, they probably want
    something other than NEGATE.

    Though, off-hand, I don't remember whether or
    not JVM throws an exception on divide-by-zero.

    Reading up on Java, <https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2> says:

    |if the dividend is the negative integer of largest possible magnitude
    |for its type, and the divisor is -1, then integer overflow occurs and
    |the result is equal to the dividend. Despite the overflow, no
    |exception is thrown in this case. On the other hand, if the value of
    |the divisor in an integer division is 0, then an ArithmeticException
    |is thrown.

    I expect that the JVM has matching wording.

    So on, e.g., AMD64 the JVM has to generate code that catches the
    long_min/-1 case and produces long_min rather than just generating the
    divide instruction. Alternatively, the generated code could just
    produce a division instruction, and the signal handler (on Unix) or
    equivalent could then check if the divisor was 0 (and then throw an ArithmeticException) or -1 (and then produce a long_min result and
    continue execution).
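
    A minimal sketch of the first option (added for illustration, not code
    from any actual JVM; throw_arithmetic_exception is a hypothetical
    runtime hook, and an LP64 target is assumed so that C long matches
    Java long):

    #include <limits.h>

    extern void throw_arithmetic_exception(void);  /* hypothetical, does not return */

    long java_ldiv(long dividend, long divisor)
    {
        /* explicit checks so that AMD64's idiv never faults:
           JLS 15.17.2 requires LONG_MIN / -1 == LONG_MIN, and an
           ArithmeticException for a zero divisor */
        if (divisor == 0)
            throw_arithmetic_exception();
        if (dividend == LONG_MIN && divisor == -1)
            return LONG_MIN;
        return dividend / divisor;
    }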

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Sun Feb 18 11:26:09 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    I am normally writing Rust these days, where UB is far less common, but
    casts like this are mandatory.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Tim Rentsch on Sun Feb 18 16:10:52 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    But the return statement is where overflow (if any) is detected.

    The cast is superfluous because a conversion to int8_t will be
    done in any case, since the return type of the function is
    int8_t.

    I suspect most experienced C programmers know that.

    Yet, the 'superfluous' cast is also documentation that the
    programmer _intended_ that the return value would be narrowed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Feb 18 17:48:09 2024
    Tim Rentsch wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    But the return statement is where overflow (if any) is detected.

    The cast is superfluous because a conversion to int8_t will be
    done in any case, since the return type of the function is
    int8_t.

    Missing my point:: which was::

    The summation loop will not overflow, and overflow is detected at
    the smash from int to int8_t.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sun Feb 18 20:14:05 2024
    On Sun, 18 Feb 2024 08:00:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    BGB <cr88192@gmail.com> writes:
    Well, except that C will often trap for things like divide by zero
    and similar, at least on x86.

    The division instructions of IA-32 and AMD64 trap on divide-by-zero
    and when the result is out of range. Unsurprisingly, C compilers
    usually use these instructions when compiling division on these architectures. One interesting case is what C compilers do when you
    write

    long foo(long x)
    {
    return x/-1;
    }

    Both gcc and clang compile this to

    0: 48 89 f8 mov %rdi,%rax
    3: 48 f7 d8 neg %rax
    6: c3 retq

    and you don't get a trap when you call foo(LONG_MIN), while you would
    if the compiler did not know that the divisor is -1 (and it was -1 at run-time).

    By contrast, when I implemented division-by-constant optimization in
    Gforth, I decided not to "optimize" the division by -1 case, so you get
    the ordinary division operation and its behaviour. If a programmer
    codes a division by -1 rather than just NEGATE, they probably want
    something other than NEGATE.

    Though, off-hand, I don't remember whether or
    not JVM throws an exception on divide-by-zero.

    Reading up on Java, <https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2> says:

    |if the dividend is the negative integer of largest possible magnitude
    |for its type, and the divisor is -1, then integer overflow occurs and
    |the result is equal to the dividend. Despite the overflow, no
    |exception is thrown in this case. On the other hand, if the value of
    |the divisor in an integer division is 0, then an ArithmeticException
    |is thrown.

    I expect that the JVM has matching wording.

    So on, e.g., AMD64 the JVM has to generate code that catches the
    long_min/-1 case and produces long_min rather than just generating the
    divide instruction. Alternatively, the generated code could just
    produce a division instruction, and the signal handler (on Unix) or equivalent could then check if the divisor was 0 (and then throw an ArithmeticException) or -1 (and then produce a long_min result and
    continue execution).

    - anton

    I don't understand why case of LONG_MIN/-1 would possibly require
    special handling. IMHO, regular iAMD64 64-bit integer division sequence,
    i.e. CQO followed by IDIV, will produce result expected by Java spec
    without any overflow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Sun Feb 18 22:40:08 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 18 Feb 2024 08:00:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Reading up on Java,
    <https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2> says:

    |if the dividend is the negative integer of largest possible magnitude
    |for its type, and the divisor is -1, then integer overflow occurs and
    |the result is equal to the dividend. Despite the overflow, no
    |exception is thrown in this case. On the other hand, if the value of
    |the divisor in an integer division is 0, then an ArithmeticException
    |is thrown.

    I expect that the JVM has matching wording.

    So on, e.g., AMD64 the JVM has to generate code that catches the
    long_min/-1 case and produces long_min rather than just generating the
    divide instruction. Alternatively, the generated code could just
    produce a division instruction, and the signal handler (on Unix) or
    equivalent could then check if the divisor was 0 (and then throw an
    ArithmeticException) or -1 (and then produce a long_min result and
    continue execution).

    - anton

    I don't understand why case of LONG_MIN/-1 would possibly require
    special handling. IMHO, regular iAMD64 64-bit integer division sequence,
    i.e. CQO followed by IDIV, will produce result expected by Java spec
    without any overflow.

    Try it. E.g., in gforth-fast /S performs this sequence:

    see /s
    Code /s
    0x00005614dd33562d <gforth_engine+3213>: add $0x8,%rbx
    0x00005614dd335631 <gforth_engine+3217>: mov 0x8(%r13),%rax
    0x00005614dd335635 <gforth_engine+3221>: add $0x8,%r13
    0x00005614dd335639 <gforth_engine+3225>: cqto
    0x00005614dd33563b <gforth_engine+3227>: idiv %r8
    0x00005614dd33563e <gforth_engine+3230>: mov %rax,%r8
    0x00005614dd335641 <gforth_engine+3233>: mov (%rbx),%rax
    0x00005614dd335644 <gforth_engine+3236>: jmp *%rax
    end-code

    And when I divide LONG_MIN by -1, I get a trap:

    $8000000000000000 -1 /s
    *the terminal*:12:22: error: Division by zero
    $8000000000000000 -1 >>>/s<<<

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Mon Feb 19 01:20:09 2024
    On Sun, 18 Feb 2024 22:40:08 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 18 Feb 2024 08:00:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Reading up on Java,
    <https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
    says:

    |if the dividend is the negative integer of largest possible
    magnitude |for its type, and the divisor is -1, then integer
    overflow occurs and |the result is equal to the dividend. Despite
    the overflow, no |exception is thrown in this case. On the other
    hand, if the value of |the divisor in an integer division is 0,
    then an ArithmeticException |is thrown.

    I expect that the JVM has matching wording.

    So on, e.g., AMD64 the JVM has to generate code that catches the
    long_min/-1 case and produces long_min rather than just generating
    the divide instruction. Alternatively, the generated code could
    just produce a division instruction, and the signal handler (on
    Unix) or equivalent could then check if the divisor was 0 (and
    then throw an ArithmeticException) or -1 (and then produce a
    long_min result and continue execution).

    - anton

    I don't understand why case of LONG_MIN/-1 would possibly require
    special handling. IMHO, regular iAMD64 64-bit integer division
    sequence, i.e. CQO followed by IDIV, will produce result expected by
    Java spec without any overflow.

    Try it. E.g., in gforth-fast /S performs this sequence:

    see /s
    Code /s
    0x00005614dd33562d <gforth_engine+3213>: add $0x8,%rbx
    0x00005614dd335631 <gforth_engine+3217>: mov 0x8(%r13),%rax
    0x00005614dd335635 <gforth_engine+3221>: add $0x8,%r13
    0x00005614dd335639 <gforth_engine+3225>: cqto
    0x00005614dd33563b <gforth_engine+3227>: idiv %r8
    0x00005614dd33563e <gforth_engine+3230>: mov %rax,%r8
    0x00005614dd335641 <gforth_engine+3233>: mov (%rbx),%rax
    0x00005614dd335644 <gforth_engine+3236>: jmp *%rax
    end-code

    And when I divide LONG_MIN by -1, I get a trap:

    $8000000000000000 -1 /s
    *the terminal*:12:22: error: Division by zero
    $8000000000000000 -1 >>>/s<<<

    - anton

    You are right.
    LONG_MIN/1 works, but LONG_MIN/-1 crashes, to my surprise.
    Seems like I didn't RTFM with regard to IDIV for too many years.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Mon Feb 19 09:06:23 2024
    Michael S <already5chosen@yahoo.com> writes:
    LONG_MIN/1 works, but LONG_MIN/-1 crashes, to my surprize.
    Seems like I didn't RTFM with regard to IDIV for too many years.

    The result of LONG_MIN/1 is LONG_MIN, which is in range, while the
    result of LONG_MIN/-1 is LONG_MAX+1, which is not in range.
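
    The same thing is easy to reproduce from C (a sketch added here; the
    division is UB as far as ISO C is concerned, it assumes an LP64
    target, and the volatile stops gcc/clang from folding x/-1 into a NEG
    as discussed earlier):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        volatile long divisor = -1;     /* force a real idiv at run time */
        long q = LONG_MIN / divisor;    /* quotient 2^63 is out of range:
                                           AMD64 idiv raises #DE -> SIGFPE */
        printf("%ld\n", q);
        return 0;
    }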

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Mon Feb 19 14:11:51 2024
    On 17/02/2024 19:58, Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop into
    one that would not trap and give the mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct answer"?

    If you are talking about integer arithmetic, the limited integers in
    Java have modulo semantics, i.e., they don't trap, and BigIntegers
    certainly
    don't trap.

    If you are talking about FP (like I did), by default FP addition does
    not trap in Java, and any mention of "mathematically correct" in
    connection with FP needs a lot of further elaboration.

    Sorry to be unclear:

    I haven't really been following this thread, but there's a few things
    here that stand out to me - at least as long as we are talking about C.


    I was specifically talking about adding a bunch of integers together,
    some positive and some negative, so that by doing them in program order
    you will get an overflow, but if you did them in some other order, or
    with a double-wide accumulator, the final result would in fact fit in
    the designated target variable.

    int8_t sum(int len, int8_t data[])
    {
      int8_t s = 0;
      for (unsigned i = 0; i < len; i++) {
        s += data[i];
      }
      return s;
    }

    will overflow if called with data = [127, 1, -2], right?

    No. In C, int8_t values will be promoted to "int" (which is always at
    least 16 bits, on any target) and the operation will therefore not
    overflow. The conversion of the result of "s + data[i]" from int to
    int8_t, implicit in the assignment, also cannot "overflow" since that
    term applies only to the evaluation of operators. But if this value is
    outside the range for int8_t, then the conversion is
    implementation-defined behaviour. (That is unlike signed integer
    overflow, which is undefined behaviour.)

    All real-life implementations will define the conversion as modulo/truncation/wrapping, however you prefer to think of it, though it
    is not specified in the standards.


    while if you implement it with

    int8_t sum(int len, int8_t data[])
    {
      int s = 0;
      for (unsigned i = 0; i < len; i++) {
        s += data[i];
      }
      return (int8_t) s;
    }

    then you would be OK, and the final result would be mathematically correct.

    Converting the "int" to "int8_t" will give the correct value whenever it
    is in the range of int8_t. But if we assume that the implementation
    does out-of-range conversions as two's complement wrapping, then the
    result will be the same no matter when the modulo operations are done.


    For this particular example, you would also get the correct answer with wrapping arithmetic, even if that by default is UB in modern C.


    There's no UB in either case. Only IB.
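
    A small illustration of that distinction (an added example, not from
    the post): the arithmetic itself happens in int, so only the narrowing
    conversion is implementation-defined:

    #include <stdint.h>

    void example(void)
    {
        int8_t a = 127, b = 1;
        int wide = a + b;       /* operands promote to int: wide == 128, no overflow */
        int8_t narrow = wide;   /* out-of-range conversion: implementation-defined,
                                   wraps to -128 on every mainstream compiler */
        (void)narrow;
    }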

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Michael S on Mon Feb 19 18:47:32 2024
    On Mon, 19 Feb 2024 01:20:09 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Sun, 18 Feb 2024 22:40:08 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 18 Feb 2024 08:00:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Reading up on Java,
    <https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
    says:

    |if the dividend is the negative integer of largest possible
    magnitude |for its type, and the divisor is -1, then integer
    overflow occurs and |the result is equal to the dividend. Despite
    the overflow, no |exception is thrown in this case. On the other
    hand, if the value of |the divisor in an integer division is 0,
    then an ArithmeticException |is thrown.

    I expect that the JVM has matching wording.

    So on, e.g., AMD64 the JVM has to generate code that catches the
    long_min/-1 case and produces long_min rather than just
    generating the divide instruction. Alternatively, the generated
    code could just produce a division instruction, and the signal
    handler (on Unix) or equivalent could then check if the divisor
    was 0 (and then throw an ArithmeticException) or -1 (and then
    produce a long_min result and continue execution).

    - anton

    I don't understand why case of LONG_MIN/-1 would possibly require
    special handling. IMHO, regular iAMD64 64-bit integer division
    sequence, i.e. CQO followed by IDIV, will produce result expected
    by Java spec without any overflow.

    Try it. E.g., in gforth-fast /S performs this sequence:

    see /s
    Code /s
    0x00005614dd33562d <gforth_engine+3213>: add $0x8,%rbx
    0x00005614dd335631 <gforth_engine+3217>: mov
    0x8(%r13),%rax 0x00005614dd335635 <gforth_engine+3221>: add
    $0x8,%r13 0x00005614dd335639 <gforth_engine+3225>: cqto
    0x00005614dd33563b <gforth_engine+3227>: idiv %r8
    0x00005614dd33563e <gforth_engine+3230>: mov %rax,%r8
    0x00005614dd335641 <gforth_engine+3233>: mov (%rbx),%rax
    0x00005614dd335644 <gforth_engine+3236>: jmp *%rax
    end-code

    And when I divide LONG_MIN by -1, I get a trap:

    $8000000000000000 -1 /s
    *the terminal*:12:22: error: Division by zero
    $8000000000000000 -1 >>>/s<<<

    - anton

    You are right.
    LONG_MIN/1 works, but LONG_MIN/-1 crashes, to my surprise.
    Seems like I didn't RTFM with regard to IDIV for too many years.



    Most likely, back when I was reading the manual for the first time, I
    read DIV paragraph thoroughly and then just looked briefly at IDIV
    assuming that it is about the same and not paying attention to the
    differences in corner cases.
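
    For illustration, a minimal C sketch (names are ours, not from the
    posts) of the check Anton describes: on AMD64, IDIV raises #DE both for
    a zero divisor and for LONG_MIN / -1, so a JVM-style runtime has to
    filter the overflow case out before the hardware divide, or fix it up
    in the trap handler afterwards:

    #include <stdint.h>

    /* Hypothetical helper giving Java semantics for 64-bit division.
       divisor == 0 is assumed to have been handled already (Java throws
       ArithmeticException for it). */
    int64_t java_style_ldiv(int64_t dividend, int64_t divisor)
    {
        if (dividend == INT64_MIN && divisor == -1)
            return INT64_MIN;        /* overflow wraps, no exception */
        return dividend / divisor;   /* safe to use the hardware divide */
    }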

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Mon Feb 19 23:21:57 2024
    David Brown wrote:
    On 17/02/2024 19:58, Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop
    into one that would not trap and give the mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct answer"?

    If you are talking about integer arithmetic, the limited integers in
    Java have modulo semantics, i.e., they don't trap, and BigIntegers
    certainly
    don't trap.

    If you are talking about FP (like I did), by default FP addition does
    not trap in Java, and any mention of "mathematically correct" in
    connection with FP needs a lot of further elaboration.

    Sorry to be unclear:

    I haven't really been following this thread, but there's a few things
    here that stand out to me - at least as long as we are talking about C.

    I realized a bunch of messages ago that it was a bad idea to write
    (pseudo-)C to illustrate a general problem. :-(

    If we have a platform where the default integer size is 32 bits and long
    is 64 bits, then afaik the C promotion rules will use int as the
    accumulator size, right?

    What I was trying to illustrate was the principle that by having a wider accumulator you could aggregate a series of numbers, both positive and negative, and get the correct (in-range) result, even if the input
    happened to be arranged in such a way that it would temporarily overflow
    the target int type.

    I think it is much better to do it this way and then get a conversion
    size trap at the very end when/if the final sum is in fact too large for
    the result type.
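
    A minimal sketch of that approach in C (function and variable names are
    ours, and assert() merely stands in for whatever end-of-loop range check
    or trap one prefers):

    #include <assert.h>
    #include <stdint.h>

    int32_t sum32(int len, const int32_t data[])
    {
        int64_t acc = 0;                 /* wide accumulator */
        for (int i = 0; i < len; i++)
            acc += data[i];              /* temporary excursions outside the
                                            int32_t range are harmless */
        assert(acc >= INT32_MIN && acc <= INT32_MAX);  /* check once, at the end */
        return (int32_t)acc;
    }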

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Feb 20 01:10:23 2024
    Terje Mathisen wrote:

    David Brown wrote:
    On 17/02/2024 19:58, Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop
    into one that would not trap and give the mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct answer"?

    If you are talking about integer arithmetic, the limited integers in
    Java have modulo semantics, i.e., they don't trap, and BigIntegers
    certainly
    don't trap.

    If you are talking about FP (like I did), by default FP addition does
    not trap in Java, and any mention of "mathematically correct" in
    connection with FP needs a lot of further elaboration.

    Sorry to be unclear:

    I haven't really been following this thread, but there's a few things
    here that stand out to me - at least as long as we are talking about C.

    I realized a bunch of messages ago that it was a bad idea to write
    (pseudo-)C to illustrate a general problem. :-(

    If we have a platform where the default integer size is 32 bits and long
    is 64 bits, then afaik the C promotion rules will use int as the
    accumulator size, right?

    Not necessarily:: accumulation rules allow the promotion of int->long
    inside a loop if the long is smashed back to int immediately after the
    loop terminates. A compiler should be able to perform this transformation.
    In effect, this hoists the smashes back to int out of the loop, increasing performance and making it that much harder to tickle the overflow exception.

    What I was trying to illustrate was the principle that by having a wider accumulator you could aggregate a series of numbers, both positive and negative, and get the correct (in-range) result, even if the input
    happened to be arranged in such a way that it would temporarily overflow
    the target int type.

    We are in an era where long has higher performance than ints (except for
    cache footprint overheads.)

    We are also in an era where the dust on dusty decks is starting to show
    its accumulated depth.

    I think it is much better to do it this way and then get a conversion
    size trap at the very end when/if the final sum is in fact too large for
    the result type.

    Same argument holds for Kahan-Babuška summation.
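
    For reference, a minimal sketch of the Kahan-Babuška (Neumaier)
    summation being referred to, with the compensation likewise applied
    only once at the end; the function name is ours:

    #include <math.h>

    double kb_sum(int n, const double x[])
    {
        double s = 0.0, c = 0.0;          /* running sum and compensation */
        for (int i = 0; i < n; i++) {
            double t = s + x[i];
            if (fabs(s) >= fabs(x[i]))
                c += (s - t) + x[i];      /* low-order bits lost from x[i] */
            else
                c += (x[i] - t) + s;      /* low-order bits lost from s */
            s = t;
        }
        return s + c;                     /* apply compensation at the end */
    }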

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Tue Feb 20 06:31:59 2024
    David Brown <david.brown@hesbynett.no> schrieb:
    On 17/02/2024 19:58, Terje Mathisen wrote:

    int8_t sum(int len, int8_t data[])
    {
      int8_t s = 0;
      for (unsigned i = 0; i < len; i++) {

    Just a side remark: This loop can get very long for len < 0.

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which hopefully
    will be considered in the next J3 meeting, it can be found at https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.

        s += data[i];
      }
      return s;
    }

    will overflow if called with data = [127, 1, -2], right?

    No. In C, int8_t values will be promoted to "int" (which is always at
    least 16 bits, on any target) and the operation will therefore not
    overflow.

    Depending on len and the data...

    The conversion of the result of "s + data[i]" from int to
    int8_t, implicit in the assignment, also cannot "overflow" since that
    term applies only to the evaluation of operators. But if this value is outside the range for int8_t, then the conversion is
    implementation-defined behaviour. (That is unlike signed integer
    overflow, which is undefined behaviour.)

    And that is one of the things that bugs me, in languages like C
    and Fortran both.

    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. Gcc adds stuff like __builtin_add_overflow,
    but this kind of thing really belongs in the core language.
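
    For illustration, typical use of the gcc/clang builtin mentioned above
    (the wrapper name is ours): it stores the wrapped result and returns
    true if the mathematical result did not fit the destination type:

    #include <stdbool.h>

    bool add_checked(int a, int b, int *result)
    {
        return __builtin_add_overflow(a, b, result);  /* true on overflow */
    }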

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Feb 20 07:32:40 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Terje Mathisen wrote:
    If we have a platform where the default integer size is 32 bits and long
    is 64 bits, then afaik the C promotion rules will use int as the
    accumulator size, right?

    Not necessarily:: accumulation rules allow the promotion of int->long
    inside a loop if the long is smashed back to int immediately after the
    loop terminates. A compiler should be able to perform this transformation.

    What "accumulation rules"?

    Certainly with twos-complement modulo arithmetic, the following
    distributive laws hold:

    (a+b) mod 2^n = ((a mod 2^n) + (b mod 2^n)) mod 2^n

    and this also holds for a "signed mod" operator smod that represents
    the congruence classes modulo 2^n by the numbers -2^(n-1)..2^(n-1)-1
    instead of 0..2^n-1. I actually would prefer to write the equivalence
    above as a congruence modulo 2^n, which would avoid the need to
    explain that separately, but I don't see a good way to do it. Maybe:

    a+b is congruent with (a mod 2^n) + (b mod 2^n) modulo 2^n

    but of course this still uses the mod operator that produces values
    0..2^n-1.

    We also have:

    a is congruent to a mod 2^m modulo 2^n if m>=n

    So, the result is that, yes, if we only have a wider addition
    instruction, and need a narrower result at some point, we can
    undistribute the narrowing operation (sign extension or zero
    extension) and just apply it to the end result.
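
    As an illustration of that argument (our code, assuming the usual
    implementation-defined wrapping conversion), the following two
    functions return the same int8_t result for any input:

    #include <stdint.h>

    int8_t sum_narrow_each_step(int len, const int8_t data[])
    {
        int8_t s = 0;
        for (int i = 0; i < len; i++)
            s = (int8_t)(s + data[i]);    /* narrow after every addition */
        return s;
    }

    int8_t sum_narrow_at_end(int len, const int8_t data[])
    {
        int s = 0;
        for (int i = 0; i < len; i++)
            s += data[i];                 /* keep the wide sum */
        return (int8_t)s;                 /* narrow once, at the end */
    }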

    In effect, this hoists the smashes back to int out of the loop, increasing performance and making it that much harder to tickle the overflow exception.

    What overflow exception? I have yet to see a C compiler use an
    addition instruction that causes an overflow exception on integer
    overflow, unless specifically asked to do so with -ftrapv. And if the programmer explicitly asked for traps on overflow, then the C compiler
    should not try to "optimize" them away. Note that the stuff above is
    true for modulo arithmetic, not (in general) for trapping arithmetic.

    We are in an era where long has higher performance than ints (except for cache footprint overheads.)

    C has been in that era since the bad I32LP64 decision of the people
    who did the first 64-bit Unix compilers in the early 1990s. We have
    been paying with additional sign-extension and zero-extension
    operations ever since then, and it has even deformed architectures:
    ARM A64 has addressing modes that include sign- or zero-extending a
    32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
    compressed extension, probably because they are so frequent because
    they are used in RISC-V's idioms for sign and zero extension.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Tue Feb 20 08:15:22 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. Gcc adds stuff like __builtin_add_overflow,
    but this kind of thing really belongs in the core language.

    It seems to me that this is based on the ideas people in the old days
    had about integer overflows, and these ideas are also reflected in the architectures of old.

    Those ideas are that integer overflows do not happen and that a
    competent programmer proactively prevents them from happening, by
    sizing the types accordingly, and checking the inputs. Therefore,
    there is no need to check if an addition overflows, and many
    architectures do not provide an easy way:

    * The old ones-complement and sign-magnitude machines trapped on
    overflow AFAIK.

    * S/360 has some mode bits that allow trapping on overflow, or setting
    some bits in some register (where the meaning of these bits depends
    on the instructions that set them).

    * MIPS and Alpha provide trap-on-signed-overflow addition,
    subtraction, and (Alpha) multiplication, but no easy way to check
    whether one of these operations would overflow. In MIPS64r6 (2014)
    MIPS finally added BOVC/BNVC, which branches if adding two registers
    would produce a signed overflow. RISC-V eliminates the
    trap-on-signed-overflow instructions, but otherwise follows the
    original MIPS and Alpha approach to overflow.

    Over time both unsigned and signed overflow have become more
    important: High-level programming languages nowadays often support
    Bignums (arbitrarily long (signed) integers), and cryptography needs fixed-width wide arithmetics. For Bignums, you need to detect
    (without trapping) overflows of signed-integer arithmetics, for wide
    addition and subtraction (for both the wide path of Bignums, and for cryptography), you need carry/borrow, i.e. unsigned
    overflow/underflow.

    And this is reflected in more recent architectures: Many architectures
    since about 1970 have a flags register with carry and overflow bits,
    MIPS64r6 has added BOVC/BNVC, the 88000 and Power have a carry bit
    (and, for Power, IIRC a sticky overflow bit) outside their usual
    handling of comparison results.

    But of course, C standardization is barely moving into the 1970s; from
    what I read, they finally managed to standardize twos-complement
    arithmetics (as present in the S/360 in 1964 and pretty much every
    architecture that was not just an extension of an earlier architecture
    since then). Leave some time until the importance of overflow
    handling dawns on them; it was not properly present in the S/360, and
    is not really supported in RISC-V (and in MIPS only since 2014), so
    it's obviously much too early to standardize such things in the
    standardized subset of C, a language subset that targeted
    ones-complement and sign-magnitude dinosaurs until recently.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Tue Feb 20 12:39:50 2024
    On 19/02/2024 23:21, Terje Mathisen wrote:
    David Brown wrote:
    On 17/02/2024 19:58, Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop
    into one that would not trap and give the mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct answer"?

    If you are talking about integer arithmetic, the limited integers in
    Java have modulo semantics, i.e., they don't trap, and BigIntegers
    certainly
    don't trap.

    If you are talking about FP (like I did), by default FP addition does
    not trap in Java, and any mention of "mathematically correct" in
    connection with FP needs a lot of further elaboration.

    Sorry to be unclear:

    I haven't really been following this thread, but there's a few things
    here that stand out to me - at least as long as we are talking about C.

    I realized a bunch of messages ago that it was a bad idea to write
    (pseudo-)C to illustrate a general problem. :-(

    Someone had asked for a comment from a C language lawyer. He might have
    been joking, but when lawyer mode is active, all sense of humour is
    deactivated :-)

    More seriously, the exact rules for C can be complicated, especially
    when you must consider unusual machines (IIRC the first Cray had 64-bit "short", "int" and "long", for example).


    If we have a platform where the default integer size is 32 bits and long
    is 64 bits, then afaik the C promotion rules will use int as the
    accumulator size, right?

    Yes. Any integer type smaller than "int" gets automatically promoted to
    "int" in many situations. (If the platform has "short int" or "char"
    that is the same size as "int", then values of those types also
    technically get promoted to "int", but that has little practical
    effect.) Note that this means unsigned types smaller than "int" get
    promoted to /signed/ int.

    Then you get the "usual arithmetic conversions" for most binary
    operators, picking a common type for the two operands. Basically, this
    picks the first it can in the list "int", "unsigned int", "long int",
    "unsigned long int", "long long int", "unsigned long long int". And
    note that "<int> op <unsigned int>" will result in the "int" operand
    being converted to "unsigned int", and the operation being carried out
    in unsigned ints. (I'm ignoring floating point for brevity.)
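
    A small example of those rules in action (mine, not from the post):

    #include <stdio.h>

    int main(void)
    {
        unsigned char a = 200, b = 100;
        int r = a + b;          /* both promoted to int: r == 300, no wrap */

        int i = -1;
        unsigned int u = 1;
        /* i is converted to unsigned int, so the comparison is done on
           unsigned values; (unsigned)-1 is UINT_MAX, which is not < 1. */
        printf("%d %d\n", r, i < u);   /* prints "300 0" */
        return 0;
    }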

    (Personally, I don't think this is the best way to handle this kind of
    thing. I'd rather not have any promotions, and that any "usual
    arithmetic conversions" picked a common type that spanned the full
    ranges of both operands, or simply didn't allow mixed signed operations.
    I use gcc "-Wconversion" to complain about possible problems.)

    So on a typical 32-bit int system, any arithmetic that looks like it is
    being done on smaller types, is actually done in 32-bit signed
    arithmetic. C arithmetic operators don't exist for smaller types. The generated code can, of course, use 64-bit registers and operations - as
    long as the results are the same.


    What I was trying to illustrate was the principle that by having a wider accumulator you could aggregate a series of numbers, both positive and negative, and get the correct (in-range) result, even if the input
    happened to be arranged in such a way that it would temporarily overflow
    the target int type.

    Yes, and your code was fine for that.

    But I think it is helpful to understand that there is no overflow -
    temporary or otherwise - according to the C meaning of the term. First,
    the arithmetic is done as "int", not the type of the operands (when they
    are smaller than "int"), and certainly not the type of the target
    integer type. Signed integer arithmetic overflow should be avoided in
    C, because there is no "right" answer, and it is therefore UB. But the narrowing conversion to a smaller integer type is defined behaviour
    (defined by the implementation), and fine to use as long as you are
    happy with the potentially non-portable results.


    I think it is much better to do it this way and then get a conversion
    size trap at the very end when/if the final sum is in fact too large for
    the result type.


    I agree that it is better to use a larger type for the accumulator. If
    you are sure the calculation will never exceed the range of 16 bits, the
    ideal type to use would perhaps be "int_fast16_t" (or use
    "int_fast32_t", or "int_fast64_t" if you may need a bigger value). On
    x86-64, these types are all 64-bit since that is more efficient, so they
    are good choices for local variables.

    Your final conversion to int8_t is still implementation-dependent (but
    not UB) if the value in your accumulator is outside that range. And
    while the implementation-dependent behaviour for out-of-range
    conversions is allowed to raise a signal (or trap), I have never heard
    of a system which actually does that. So if you want to check for
    out-of-range conditions, you'll have to do it manually. (And then there
    are obviously big advantages in doing that once at the end of the function.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Tue Feb 20 13:42:22 2024
    On 20/02/2024 02:10, MitchAlsup1 wrote:
    Terje Mathisen wrote:

    David Brown wrote:
    On 17/02/2024 19:58, Terje Mathisen wrote:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    On the third (i.e gripping) hand you could have a language like Java
    where it would be illegal to transform a temporarily trapping loop
    into one that would not trap and give the mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct answer"?

    If you are talking about integer arithmetic, the limited integers in
    Java have modulo semantics, i.e., they don't trap, and BigIntegers
    certainly don't trap.

    If you are talking about FP (like I did), by default FP addition does
    not trap in Java, and any mention of "mathematically correct" in
    connection with FP needs a lot of further elaboration.

    Sorry to be unclear:

    I haven't really been following this thread, but there's a few things
    here that stand out to me - at least as long as we are talking about C.

    I realized a bunch of messages ago that it was a bad idea to write
    (pseudo-)C to illustrate a general problem. :-(

    If we have a platform where the default integer size is 32 bits and
    long is 64 bits, then afaik the C promotion rules will use int as the
    accumulator size, right?

    Not necessarily:: accumulation rules allow the promotion of int->long
    inside a loop if the long is smashed back to int immediately after the
    loop terminates. A compiler should be able to perform this transformation.
    In effect, this hoists the smashes back to int out of the loop, increasing performance and making it that much harder to tickle the overflow
    exception.


    A compiler can make any transformations it wants, as long as the final observable behaviour is unchanged. And since signed integer arithmetic overflow is undefined behaviour, the compiler can do whatever it likes
    if that happens. So if your target has 32-bit int and that is the type
    you use in the source code for the accumulator, the compiler can
    certainly use a 64-bit integer type for implementation. It could also
    use a double floating point type. All that matters is that if the user
    feeds in numbers that never lead to an 32-bit signed integer overflow,
    the final output is correct.

    It is not normal to have exceptions on integer overflow. If a compiler supports that as an extension (perhaps for debugging), then that is
    giving defined behaviour to signed integer overflow, and now it is
    observable - so the compiler cannot make optimisations that "make it
    much harder to tickle". It would stop quite a few optimisations, in
    fact - but certainly can be useful for debugging code.

    What I was trying to illustrate was the principle that by having a
    wider accumulator you could aggregate a series of numbers, both
    positive and negative, and get the correct (in-range) result, even if
    the input happened to be arranged in such a way that it would
    temporarily overflow the target int type.

    We are in an era where long has higher performance than ints (except for cache footprint overheads.)


    "long" on many systems (Windows, and all 32-bit systems - which I think
    have now overtaken 8-bit systems as the biggest market segment based on
    unit volumes) is the same size as "int". But assuming you mean that
    64-bit arithmetic has higher performance than 32-bit arithmetic on
    modern "big" processors, that is sometimes correct. As well as the aforementioned cache and memory bandwidth differences, 32-bit can still
    be faster for some types of operation (such as division) or if
    operations can be vectorised. But it is not surprising that on 64-bit
    systems, "int_fast32_t" is usually 64-bit, as that is typically faster
    for most operations on local variables.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Tue Feb 20 12:00:29 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    We are in an era where long has higher performance than ints (except for cache footprint overheads.)

    C has been in that era since the bad I32LP64 decision of the people
    who did the first 64-bit Unix compilers in the early 1990s. We have
    been paying with additional sign-extension and zero-extension
    operations ever since then, and it has even deformed architectures:
    ARM A64 has addressing modes that include sign- or zero-extending a
    32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
    compressed extension, probably because they are so frequent because
    they are used in RISC-V's idioms for sign and zero extension.

    Also, these architectures probably would not have the so-called 32-bit arithmetic instructions (like RV64G's addw) if the mainstream C world
    had decided to use ILP64. RV64G could have left these instructions
    away and replaced them with a sequence of add, slli, srli, i.e., a
    64-bit addition followed by a sign-extension idiom. After all, RISC-V
    seems to favour sequences of more general instructions over having
    more specialized instructions (and addressing modes). But apparently
    the frequency of 32-bit additions is so high thanks to I32LP64 that
    they added addw and addiw to RV64G; and they even occupy space in the compressed extension (16-bit encodings of frequent instructions).

    BTW, some people here have advocated the use of unsigned instead of
    int. Which of the two results in better code depends on the
    architecture. On AMD64 where the so-called 32-bit instructions
    perform a 32->64-bit zero-extension, unsigned is better. On RV64G
    where the so-called 32-bit instructions perform a 32->64-bit sign
    extension, signed int is better. But actually the best way is to use
    a full-width type like intptr_t or uintptr_t, which gives better
    results than either. E.g., consider the function

    void sext(int M, int *ic, int *is)
    {
      int k;
      for (k = 1; k <= M; k++) {
        ic[k] += is[k];
      }
    }

    which is based on the only loop (from 456.hmmer) in SPECint 2006 where
    the difference between -fwrapv and the default produces a measurable performance difference (as reported in section 3.3 of <https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf>). I
    created variations of this function, where the types of M and k were
    changed to b) unsigned, c) intptr_t, d) uintptr_t and compiled the
    code with "gcc -Wall -fwrapv -O3 -c -fno-unroll-loops". The loop body
    looks as follows on RV64GC:

    int                     unsigned                (u)intptr_t
    .L3:                    .L8:                    .L15:
    slli a5,a4,0x2          slli a5,a4,0x20         lw a5,0(a1)
    add a6,a1,a5            srli a5,a5,0x1e         lw a4,4(a2)
    add a5,a5,a2            add a6,a1,a5            addi a1,a1,4
    lw a3,0(a6)             add a5,a5,a2            addi a2,a2,4
    lw a5,0(a5)             lw a3,0(a6)             addw a5,a5,a4
    addiw a4,a4,1           lw a5,0(a5)             sw a5,-4(a1)
    addw a5,a5,a3           addiw a4,a4,1           bne a2,a3,54 <.L15>
    sw a5,0(a6)             addw a5,a5,a3
    bge a0,a4,6 <.L3>       sw a5,0(a6)
                            bgeu a0,a4,28 <.L8>

    There is no difference between the intptr_t loop body and the
    uintptr_t loop. And without -fwrapv, the int loop looks just like the (u)intptr_t loop (because the C compiler then assumes that signed
    integer overflow does not happen).

    So, if you don't have a specific reason to choose int or unsigned,
    better use intptr_t or uintptr_t, respectively. In this way you can
    circumvent some of the damage that I32LP64 has done.
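
    For reference, a reconstruction (ours) of variation c), with M and k
    widened from int to intptr_t and compiled exactly as above:

    #include <stdint.h>

    void sext_intptr(intptr_t M, int *ic, int *is)
    {
        intptr_t k;
        for (k = 1; k <= M; k++) {
            ic[k] += is[k];
        }
    }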

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Tue Feb 20 13:58:46 2024
    On 20/02/2024 07:31, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 17/02/2024 19:58, Terje Mathisen wrote:

    int8_t sum(int len, int8_t data[])
    {
      int8_t s = 0;
      for (unsigned i = 0; i < len; i++) {

    Just a side remark: This loop can get very long for len < 0.


    Yes. "len" would be converted to "unsigned" by addition of 2^n (2^32
    for the sizes given here) before the comparison. (It is not an infinite
    loop, however.)

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which hopefully
    will be considered in the next J3 meeting, it can be found at https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.


    Wouldn't it be better to forbid mixing of signedness? I don't know
    Fortran, so that might be a silly question!

    For my C programming, I like to have "gcc -Wconversion -Wsign-conversion -Wsign-compare" to catch unintended mixes of signedness.


        s += data[i];
      }
      return s;
    }

    will overflow if called with data = [127, 1, -2], right?

    No. In C, int8_t values will be promoted to "int" (which is always at
    least 16 bits, on any target) and the operation will therefore not
    overflow.

    Depending on len and the data...

    Yes, but I think that was given by Terje in the specification for the
    code that the intermediary calculations could be too big for "int8_t",
    but not too big for "int".


    The conversion of the result of "s + data[i]" from int to
    int8_t, implicit in the assignment, also cannot "overflow" since that
    term applies only to the evaluation of operators. But if this value is
    outside the range for int8_t, then the conversion is
    implementation-defined behaviour. (That is unlike signed integer
    overflow, which is undefined behaviour.)

    And that is one of the things that bugs me, in languages like C
    and Fortran both.

    Me too.

    Even now (from when C23 is officially released) that two's complement is
    the only allowed representation for signed integers in C, conversion
    does not need to be wrapping - conversion to a signed integer type from
    a value that is outside its range is "implementation-defined or an implementation-defined signal is raised". I like that there is the
    option of raising a signal - that lets you have debug options in the
    compiler to find run-time issues. But I'd prefer that the alternative
    to that was specified as modulo or wrapping behaviour.

    (I like that signed integer overflow is UB, however.)


    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. Gcc adds stuff like __builtin_add_overflow,
    but this kind of thing really belongs in the core language.

    You'll be glad that this is now in C23:

    #include <stdckdint.h>
    bool ckd_add(type1 *result, type2 a, type3 b);
    bool ckd_sub(type1 *result, type2 a, type3 b);
    bool ckd_mul(type1 *result, type2 a, type3 b);

    (Basically, it's the gcc/clang extensions that have been standardised.)
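
    A small usage sketch (ours), assuming a C23 toolchain that ships
    <stdckdint.h>:

    #include <stdckdint.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int32_t r;
        if (ckd_add(&r, INT32_MAX, 1))
            puts("overflow detected");   /* wrapped result is still stored in r */
        else
            printf("sum = %d\n", (int)r);
        return 0;
    }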

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Tue Feb 20 14:46:10 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. Gcc adds stuff like __builtin_add_overflow,
    but this kind of thing really belongs in the core language.

    It seems to me that this is based on the ideas people in the old days
    had about integer overflows, and these ideas are also reflected in the architectures of old.

    Architectures of old _expected_ integer overflows and had
    mechanisms in the languages to test for them.

    COBOL:
    ADD 1 TO TALLY ON OVERFLOW ...

    BPL:
    IF OVERFLOW ...


    Those ideas are that integer overflows do not happen and that a

    Can't say that I've known a programmer who thought that way.


    And this is reflected in more recent architectures: Many architectures
    since about 1970 have a flags register with carry and overflow bits,

    Architectures in the 1960's had a flags register with an overflow bit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Feb 20 14:39:10 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 17/02/2024 19:58, Terje Mathisen wrote:

    int8_t sum(int len, int8_t data[])
    {
      int8_t s = 0;
      for (unsigned i = 0; i < len; i++) {

    Just a side remark: This loop can get very long for len < 0.

    Which is why len should have been declared as size_t. A negative
    array length is nonsensical.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Scott Lurndal on Tue Feb 20 16:09:36 2024
    Scott Lurndal wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 17/02/2024 19:58, Terje Mathisen wrote:

    int8_t sum(int len, int8_t data[])
    {
      int8_t s = 0;
      for (unsigned i = 0; i < len; i++) {

    Just a side remark: This loop can get very long for len < 0.

    Which is why len should have been declared as size_t. A negative
    array length is nonsensical.


    It was a pure typo from my side. In Rust array indices are always of
    "size_t" type, you have to explicitely convert/cast anything else before
    you can use it in a lookup:

    opcodes[addr as usize]

    when I needed addr (an i64 variable) to be able to take on negative
    values, but here (knowing that it is now in fact positive) I am using it
    as an array index.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Scott Lurndal on Tue Feb 20 16:17:21 2024
    Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. Gcc adds stuff like __builtin_add_overflow,
    but this kind of thing really belongs in the core language.

    It seems to me that this is based on the ideas people in the old days
    had about integer overflows, and these ideas are also reflected in the
    architectures of old.

    Architectures of old _expected_ integer overflows and had
    mechanisms in the languages to test for them.

    COBOL:
    ADD 1 TO TALLY ON OVERFLOW ...

    BPL:
    IF OVERFLOW ...


    Those ideas are that integer overflows do not happen and that a

    Can't say that I've known a programmer who thought that way.


    And this is reflected in more recent architectures: Many architectures
    since about 1970 have a flags register with carry and overflow bits,

    Architectures in the 1960's had a flags register with an overflow bit.


    x86 has had an 'O' (Overflow) flags bit since the very beginning, along
    with JO and JNO for Jump on Overflow and Jump if Not Overflow.

    Not only that, these cpus also had a dedicated single-byte opcode INTO
    (hex 0xCE) to allow you to implement exception-style overflow handling
    with very little impact on the mainline program: just emit that INTO
    opcode directly after any program sequence where the compiler believed
    that an overflow which should be handled might happen.

    Terje
    PS. INTO was removed in AMD64; I don't remember exactly what the opcode
    was repurposed for?

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Tue Feb 20 16:27:56 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 13:00, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    We are in an era where long has higher performance than ints (except for cache footprint overheads.)

    C has been in that era since the bad I32LP64 decision of the people
    who did the first 64-bit Unix compilers in the early 1990s.

    I presume the main reason for this was the size and cost of memory at
    the time? Or do you know any other reason? Maybe some of the early
    64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
    faster at 16-bit.

    Or maybe changing int from 32-bit to 64-bit would have caused
    as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
    day.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Tue Feb 20 17:25:18 2024
    On 20/02/2024 13:00, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    We are in an era where long has higher performance than ints (except for cache footprint overheads.)

    C has been in that era since the bad I32LP64 decision of the people
    who did the first 64-bit Unix compilers in the early 1990s.

    I presume the main reason for this was the size and cost of memory at
    the time? Or do you know any other reason? Maybe some of the early
    64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
    faster at 16-bit.

    We have
    been paying with additional sign-extension and zero-extension
    operations ever since then, and it has even deformed architectures:
    ARM A64 has addressing modes that include sign- or zero-extending a
    32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
    compressed extension, probably because they are so frequent because
    they are used in RISC-V's idioms for sign and zero extension.

    Also, these architectures probably would not have the so-called 32-bit arithmetic instructions (like RV64G's addw) if the mainstream C world
    had decided to use ILP64. RV64G could have left these instructions
    away and replaced them with a sequence of add, slli, srli, i.e., a
    64-bit addition followed by a sign-extension idiom. After all, RISC-V
    seems to favour sequences of more general instructions over having
    more specialized instructions (and addressing modes). But apparently
    the frequency of 32-bit additions is so high thanks to I32LP64 that
    they added addw and addiw to RV64G; and they even occupy space in the compressed extension (16-bit encodings of frequent instructions).

    BTW, some people here have advocated the use of unsigned instead of
    int. Which of the two results in better code depends on the
    architecture. On AMD64 where the so-called 32-bit instructions
    perform a 32->64-bit zero-extension, unsigned is better. On RV64G
    where the so-called 32-bit instructions perform a 32->64-bit sign
    extension, signed int is better. But actually the best way is to use
    a full-width type like intptr_t or uintptr_t, which gives better
    results than either.

    I would suggest C "fast" types like int_fast32_t (or other "fast" sizes,
    picked to fit the range you need). These adapt suitably for different
    targets. If you want to force the issue, then "int64_t" is IMHO clearer
    than "long long int" and does not give a strange impression where you
    are using a type aimed at pointer arithmetic for general integer arithmetic.


    E.g., consider the function

    void sext(int M, int *ic, int *is)
    {
      int k;
      for (k = 1; k <= M; k++) {
        ic[k] += is[k];
      }
    }

    which is based on the only loop (from 456.hmmer) in SPECint 2006 where
    the difference between -fwrapv and the default produces a measurable performance difference (as reported in section 3.3 of <https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf>). I
    created variations of this function, where the types of M and k were
    changed to b) unsigned, c) intptr_t, d) uintptr_t and compiled the
    code with "gcc -Wall -fwrapv -O3 -c -fno-unroll-loops". The loop body
    looks as follows on RV64GC:

    int                     unsigned                (u)intptr_t
    .L3:                    .L8:                    .L15:
    slli a5,a4,0x2          slli a5,a4,0x20         lw a5,0(a1)
    add a6,a1,a5            srli a5,a5,0x1e         lw a4,4(a2)
    add a5,a5,a2            add a6,a1,a5            addi a1,a1,4
    lw a3,0(a6)             add a5,a5,a2            addi a2,a2,4
    lw a5,0(a5)             lw a3,0(a6)             addw a5,a5,a4
    addiw a4,a4,1           lw a5,0(a5)             sw a5,-4(a1)
    addw a5,a5,a3           addiw a4,a4,1           bne a2,a3,54 <.L15>
    sw a5,0(a6)             addw a5,a5,a3
    bge a0,a4,6 <.L3>       sw a5,0(a6)
                            bgeu a0,a4,28 <.L8>

    There is no difference between the intptr_t loop body and the
    uintptr_t loop. And without -fwrapv, the int loop looks just like the (u)intptr_t loop (because the C compiler then assumes that signed
    integer overflow does not happen).

    So, if you don't have a specific reason to choose int or unsigned,
    better use intptr_t or uintptr_t, respectively. In this way you can circumvent some of the damage that I32LP64 has done.


    I would say the takeaways here are :

    If you want fast local variables, use C's [u]int_fastN_t types. That's
    what they are for.

    Don't use unsigned types for counters and indexes unless you actually
    need them. Don't use "-fwrapv" unless you actually need it - in most
    code, if your arithmetic overflows, you have a mistake in your code, so
    letting the compiler assume that will not happen is a good thing. (And
    it lets you check for overflow bugs using run-time sanitizers.)

    It is not just RISCV - this advice applies to all the 64-bit
    architectures I tried on <https://godbolt.org>.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Tue Feb 20 16:37:39 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 16:17, Terje Mathisen wrote:
    Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran.  Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language.  Gcc adds stuff like __builtin_add_overflow,
    but this kind of thing really belongs in the core language.

    It seems to me that this is based on the ideas people in the old days
    had about integer overflows, and these ideas are also reflected in the architectures of old.

    Architectures of old _expected_ integer overflows and had
    mechanisms in the languages to test for them.

    COBOL:
        ADD 1 TO TALLY ON OVERFLOW ...

    BPL:
        IF OVERFLOW ...


    Those ideas are that integer overflows do not happen and that a

    Can't say that I've known a programmer who thought that way.


    And this is reflected in more recent architectures: Many architectures >>>> since about 1970 have a flags register with carry and overflow bits,

    Architectures in the 1960's had a flags register with an overflow bit.


    x86 has had an 'O' (Overflow) flags bit since the very beginning, along
    with JO and JNO for Jump on Overflow and Jump if Not Overflow.


    Many processors had something similar. But I think they fell out of
    fashion for 64-bit RISC, as flag registers are a bottleneck for OOO and superscaling, overflow is a lot less common for 64-bit arithmetic, and
    people were not really using the flag except for implementation of
    64-bit arithmetic.

    ARM often has two encodings for the math instructions - one that sets the flags and one
    that doesn't.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Tue Feb 20 16:22:31 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    x86 has had an 'O' (Overflow) flags bit since the very beginning

    What is "x86"?

    The 8086 architecture was introduced in 1978, so it's not from the
    1960s. And yes, it's already part of the modern wave of architectures
    which support, e.g., add-with-carry. For the 8086, this is probably
    due to it being rooted in 8-bit-microprocessors, where add-with-carry
    was necessary to have additions beyond 8 bits.

    PS.INTO was removed in AMD64, I don't remember exactly what the opcode
    was repurposed for?

    AFAICT it has not been used yet. My guess is that the AMD64 designers wanted to vacate a little-used single-byte opcode that can be readily
    replaced with a JO to an appropriate target.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Tue Feb 20 17:28:45 2024
    On 20/02/2024 16:17, Terje Mathisen wrote:
    Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran.  Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language.  Gcc adds stuff like __builtin_add_overflow,
    but this kind of thing really belongs in the core language.

    It seems to me that this is based on the ideas people in the old days
    had about integer overflows, and these ideas are also reflected in the
    architectures of old.

    Architectures of old _expected_ integer overflows and had
    mechanisms in the languages to test for them.

    COBOL:
        ADD 1 TO TALLY ON OVERFLOW ...

    BPL:
        IF OVERFLOW ...


    Those ideas are that integer overflows do not happen and that a

    Can't say that I've known a programmer who thought that way.


    And this is reflected in more recent architectures: Many architectures
    since about 1970 have a flags register with carry and overflow bits,

    Architectures in the 1960's had a flags register with an overflow bit.


    x86 has had an 'O' (Overflow) flags bit since the very beginning, along
    with JO and JNO for Jump on Overflow and Jump if Not Overflow.


    Many processors had something similar. But I think they fell out of
    fashion for 64-bit RISC, as flag registers are a bottleneck for OOO and superscaling, overflow is a lot less common for 64-bit arithmetic, and
    people were not really using the flag except for implementation of
    64-bit arithmetic.

    Not only that, these cpus also had a dedicated single-byte opcode INTO
    (hex 0xCE) to allow you to implement exception-style overflow handling
    with very little impact on the mainline program, just emit that INTO
    opcode directly after any program sequence where the compiler believed
    that an overflow which should be handled might happen.

    Terje
    PS. INTO was removed in AMD64; I don't remember exactly what the opcode
    was repurposed for?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Feb 20 16:39:16 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    It seems to me that this is based on the ideas people in the old days
    had about integer overflows, and these ideas are also reflected in the architectures of old.

    Architectures of old _expected_ integer overflows and had
    mechanisms in the languages to test for them.

    IIRC S/360 has two modes of operation: One where, on signed addition,
    overflow traps, and one where it sets some flag; and the flag-setting
    is not as consistent as say the NZCV flags on modern architectures;
    instead, there are two bits that can mean anything at all, depending
    on the instruction that sets them. In any case, if you use a program
    that checks for overflows, then you either have to change the mode to non-trapping before the addition and change it back afterwards, or all
    signed overflows that are not checked explicitly are ignored.

    Supposedly other architectures have instructions that trap on signed
    integer overflow; if you cannot disable that feature on them, you
    cannot use overflow-checking programs; if you can disable that
    feature, you have the same problem as the S/360.

    In which case the question is: What is the result of such a silent
    overflowing addition/subtraction/multiplication? On 2s-complement architectures, the result likely is defined by modulo arithmetic, but
    what about ones-complement and sign-magnitude machines?

    Certainly the S/360 design indicates that the architect does not
    expect an overflow to happen in regular execution, and that
    overflow-checking in programs is not expected by the architect,
    either.

    Moreover, addition with carry-in was only added in ESA/390 in 1990.
    So they certainly did not expect multi-precision arithmetic or Bignums
    before then.

    COBOL:
    ADD 1 TO TALLY ON OVERFLOW ...

    BPL:
    IF OVERFLOW ...

    This BPL: <https://academic.oup.com/comjnl/article/25/3/289/369715>?

    Those ideas are that integer overflows do not happen and that a

    Can't say that I've known a programmer who thought that way.

    Just read into the discussions about the default treatment of signed
    integer overflow by, e.g. gcc, and clang. The line of argument goes
    like this: The C standard does not define the behaviour on signed
    overflow, therefore a C program must not have such overflows,
    therefore it is not just correct but also desirable for a compiler to
    assume that such overflows do not happen. I don't know whether people
    who argue that way are programmers, but at least they pose as such on
    the 'net.

    Architectures in the 1960's had a flags register with an overflow bit.

    S/360 certainly has no dedicated overflow bit, just multi-purpose bits
    that sometimes mean "overflow".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Tue Feb 20 17:24:55 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    It seems to me that this is based on the ideas people in the old days
    had about integer overflows, and these ideas are also reflected in the architectures of old.

    Architectures of old _expected_ integer overflows and had
    mechanisms in the languages to test for them.

    IIRC S/360 has two modes of operation: One where, on signed addition, overflow traps, and one where it sets some flag; and the flag-setting
    is not as consistent as say the NZCV flags on modern architectures;
    instead, there are two bits that can mean anything at all, depending
    on the instruction that sets them. In any case, if you use a program
    that checks for overflows, then you either have to change the mode to non-trapping before the addition and change it back afterwards, or all
    signed overflows that are not checked explicitly are ignored.

    The contemporaneous B3500 had a three-bit field for the 'flags'
    called COMS/OVF (Comparison Toggles (COMH/COML) and Overflow Toggle).

    The overflow toggle was sticky and only reset by the Branch on
    Overflow (OFL) instruction.

    There were no traps.


    Moreover, addition with carry-in was only added in ESA/390 in 1990.
    So they certainly did not expect multi-precision arithmetic or Bignums
    before then.

    The B3500 was BCD, with one to 100 digit operands. Effectively
    bignums. The optional floating point instruction set had a two
    digit exponent and a hundred digit fraction.



    COBOL:
    ADD 1 TO TALLY ON OVERFLOW ...

    BPL:
    IF OVERFLOW ...

    This BPL: <https://academic.oup.com/comjnl/article/25/3/289/369715>?

    No. Burroughs Programming Language.

    https://en.wikipedia.org/wiki/Burroughs_Medium_Systems

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Feb 20 17:32:59 2024
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 17/02/2024 19:58, Terje Mathisen wrote:

    int8_t sum(int len, int8_t data[])
    {
      int8_t s = 0;
      for (unsigned i = 0; i < len; i++) {

    Just a side remark: This loop can get very long for len < 0.

    Which is why len should have been declared as size_t. A negative
    array length is nonsensical.

    Not in all languages. It can be just a shorthand for a zero-sized
    array:

    $ cat array.f90
    program main
    real, dimension(1:-1) :: a
    print *,size(a)
    end program main
    $ gfortran array.f90 && ./a.out
    0
    $

    The argument is the same as for a DO loop like

    DO I=1,-1
    ...
    END DO

    which is also executed zero times (and not -3 times :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Tue Feb 20 17:37:09 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    It was a pure typo from my side. In Rust array indices are always of
    "size_t" type, you have to explicitely convert/cast anything else before
    you can use it in a lookup:

    opcodes[addr as usize]

    No arbitrary array indices, then?

    I sometimes find array bounds like

    a(-3:3)

    convenient, but it is not a killer for not using a language
    (I use both C and Perl :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Tue Feb 20 17:42:24 2024
    David Brown <david.brown@hesbynett.no> schrieb:
    On 20/02/2024 07:31, Thomas Koenig wrote:

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which hopefully
    will be considered in the next J3 meeting, it can be found at
    https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.


    Wouldn't it be better to forbid mixing of signedness?

    That is also in the proposal.

    I don't know
    Fortran, so that might be a silly question!

    Not at all - mixing signed and unsigned arithmetic is a source
    of headache for C, and I see no reason to impose the same sort of
    headache on Fortran (and I see that we are in agreement here,
    or you would not have written

    For my C programming, I like to have "gcc -Wconversion -Wsign-conversion -Wsign-compare" to catch unintended mixes of signedness.

    :-)

    Not sure if it will pass, though - unsigned ints have been brought
    up in the past, and rejected. Maybe, with the support of DIN, this
    has a better chance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Feb 20 17:43:18 2024
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Architectures in the 1960's had a flags register with an overflow bit.

    POWER still has a (sticky) overflow bit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Feb 20 18:27:40 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    Or maybe changing int from 32-bit to 64-bit would have caused
    as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
    day.

    In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
    architectures. Given the history of C, that's not surprising: BCPL
    and B have a single type, the machine word, and it eventually became
    C's int. You see this in "int" declarations being optional in various
    places. So code portable between 16-bit and 32-bit systems could not
    assume that int has a specific size (such as 32 bits), but if it
    assumed that sizeof(int) == sizeof(int *), that would port fine
    between 16-bit and 32-bit Unixes. There may have been C code that
    assumed that sizeof(int)==4, but why cater to this kind of code which
    did not even port to 16-bit systems?

    In any case, I32LP64 caused breakage for my code, and I expect that
    there was more code around with the assumption sizeof(int)==sizeof(int
    *) than with the assumption sizeof(int)==4. Of course, we worked
    around this misfeature of the C compilers on Digital OSF/1, but those
    who assumed sizeof(int)==4 would have adapted their code if the
    decision had been for ILP64.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Tue Feb 20 18:42:17 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 16:17, Terje Mathisen wrote:
    x86 has had an 'O' (Overflow) flags bit since the very beginning, along
    with JO and JNO for Jump on Overflow and Jump if Not Overflow.


    Many processors had something similar. But I think they fell out of
    fashion for 64-bit RISC,

    No, it didn't. All the RISCs that had flags registers for their
    32-bit architectures still have it for their 64-bit architectures.

    as flag registers are a bottleneck for OOO and
    superscaling

    No, it isn't, as demonstrated by the fact that architectures with
    flags registers (AMD64, ARM A64) handily outperform architectures
    without (but probably not because they have a flags register).
    Implementing a flags register in an OoO microarchitecture does require execution resources, however.


    overflow is a lot less common for 64-bit arithmetic, and
    people were not really using the flag except for implementation of
    64-bit arithmetic.

    That's nonsense. People use carry for implementing multi-precision
    arithmetic (e.g., for cryptography) and for Bignums, and they use
    overflow for implementing Bignums. And the significance of these
    features has increased over time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Tue Feb 20 17:47:37 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 13:00, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    We are in an era where long has higher performance than ints (except for cache footprint overheads.)

    C has been in that era since the bad I32LP64 decision of the people
    who did the first 64-bit Unix compilers in the early 1990s.

    I presume the main reason for this was the size and cost of memory at
    the time? Or do you know any other reason? Maybe some of the early
    64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
    faster at 16-bit.

    I know of no implementation of a 64-bit architecture where ALU operations
    (except maybe division, where present) are slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

    Guess 1: There was more software that depended on sizeof(int)==4 than
    software that depended on sizeof(int)==sizeof(char *).

    Guess 2: When benchmarketing without adapting the source code (as is
    usual), I32LP64 produced better numbers than ILP64 for some
    benchmarks, because arrays and other data structures with int elements
    are smaller and have better cache hit rates.

    My guess is that it was a mixture of 1 and 2, with 2 being the
    decisive factor. I have certainly seen a lot of writing about how
    64-bit (pointers) hurt performance, and it even led to the x32
    nonsense (which never went anywhere, not surprising to me). These
    days support for 32-bit applications is eliminated from ARM cores,
    another indication that the performance advantages of 32-bit pointers
    are minor.

    BTW, some people here have advocated the use of unsigned instead of
    int. Which of the two results in better code depends on the
    architecture. On AMD64 where the so-called 32-bit instructions
    perform a 32->64-bit zero-extension, unsigned is better. On RV64G
    where the so-called 32-bit instructions perform a 32->64-bit sign
    extension, signed int is better. But actually the best way is to use
    a full-width type like intptr_t or uintptr_t, which gives better
    results than either.

    I would suggest C "fast" types like int_fast32_t (or other "fast" sizes, picked to fit the range you need).

    Sure, and then the program might break when an array has more than 2^31 elements; or it might work on one platform and break on another one.

    By contrast, with (u)intptr_t, on modern architectures you use the
    type that's as wide as the GPRs. And I don't see a reason to use
    something else for a loop counter.
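
    A minimal sketch of the kind of loop meant here (function and names
    invented for illustration; assuming a 64-bit target where intptr_t is
    as wide as a GPR):

    #include <stdint.h>

    /* register-width loop counter: no per-iteration 32->64-bit
       sign or zero extension is needed for the index */
    int64_t sum_elems(const int32_t *a, intptr_t n)
    {
        int64_t s = 0;
        for (intptr_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }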

    For a type of which you store many in an array or other data
    structure, you probably prefer int32_t rather than int_fast32_t if 32
    bits is enough. So I don't see a reason for int_fast32_t etc.

    These adapt suitably for different
    targets. If you want to force the issue, then "int64_t" is IMHO clearer
    than "long long int" and does not give a strange impression where you
    are using a type aimed at pointer arithmetic for general integer arithmetic.

    Why do you bring up "long long int"? As for int64_t, that tends to be
    slow (if supported at all) on 32-bit platforms, and it is more than
    what is necessary for indexing arrays and for loop counters that are
    used for indexing into arrays.

    If you want fast local variables, use C's [u]int_fastN_t types. That's
    what they are for.

    I don't see a point in those types. What's wrong with (u)intptr_t IYO?

    Don't use "-fwrapv" unless you actually need it - in most
    code, if your arithmetic overflows, you have a mistake in your code, so letting the compiler assume that will not happen is a good thing.

    Thank you for giving a demonstration for Scott Lurndal. I assume that
    you claim to be a programmer.

    Anyway, if I have made a mistake in my code, why would let the
    compiler assume that I did not make a mistake be a good thing?

    I OTOH prefer if the compiler behaves consistently, so I use -fwrapv,
    and for good performance, I write the code appropriately (e.g., by
    using intptr_t instead of int).

    (And
    it lets you check for overflow bugs using run-time sanitizers.)

    If the compiler assumes that overflow does not happen, how do these "sanitizers" work?

    Anyway, I certainly have code that relies on modulo arithmetic.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Tue Feb 20 19:12:39 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Or maybe changing int from 32-bit to 64-bit would have caused
    as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
    day.

    In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
    architectures. Given the history of C, that's not surprising: BCPL
    and B have a single type, the machine word, and it eventually became
    C's int. You see this in "int" declarations being optional in various places. So code portable between 16-bit and 32-bit systems could not
    assume that int has a specific size (such as 32 bits), but if it
    assumed that sizeof(int) == sizeof(int *), that would port fine
    between 16-bit and 32-bit Unixes. There may have been C code that
    assumed that sizeof(int)==4, but why cater to this kind of code which
    did not even port to 16-bit systems?

    Most of the problems encountered when moving unix (System V)
    from 16-bit to 32-bit were more around missing typedefs for certain
    data types (e.g. uids, gids, pids, etc), so there was
    a lot of code that declared these as shorts, but the 32-bit
    kernels defined these as 32-bit (unsigned) values.

    That's when uid_t, pid_t, gid_t were added.

    Then there were the folks who used 'short [int]' instead of 'int'
    since they were the same size on the PDP-11.

    The Unix code ported relatively easily to I32LP64 because uintptr_t
    had been used extensively rather than assumptions about
    sizeof(int) == sizeof(int *).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Feb 20 19:31:20 2024
    Anton Ertl wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 16:17, Terje Mathisen wrote:
    x86 has had an 'O' (Overflow) flags bit since the very beginning, along
    with JO and JNO for Jump on Overflow and Jump if Not Overflow.


    Many processors had something similar. But I think they fell out of fashion for 64-bit RISC,

    No, it didn't. All the RISCs that had flags registers for their
    32-bit architectures still have it for their 64-bit architectures.

    as flag registers are a bottleneck for OOO and
    superscaling

    No, it isn't, as demonstrated by the fact that architectures with
    flags registers (AMD64, ARM A64) handily outperform architectures
    without (but probably not because they have a flags register).
    Implementing a flags register in an OoO microarchitecture does require execution resources, however.

    These implementations have several 200-man design teams and decades
    of architectural and µArchitectural understanding that upstart
    RISC designs cannot afford (until they <also> reach 100M chips sold
    per year--where you reach the required amount of cubic dollars).

    The fact that some designs with flags can perform at the top or near
    the top is only indicative that flags are "not that much" of an
    impediment to performance at the scale of current µprocessors
    (3-4 wide)

    overflow is a lot less common for 64-bit arithmetic, and
    people were not really using the flag except for implementation of
    64-bit arithmetic.

    That's nonsense. People use carry for implementing multi-precision arithmetic (e.g., for cryptography) and for Bignums, and they use
    overflow for implementing Bignums. And the significance of these
    features has increased over time.

    You can implement BigNums efficiently without a RAW serializing
    carry bit in some control register (see My 66000 CARRY
    instruction-modifier).

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Feb 20 19:44:57 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Architectures of old _expected_ integer overflows and had
    mechanisms in the languages to test for them.

    IIRC S/360 has two modes of operation: One where, on signed addition, overflow traps, and one where it sets some flag; and the flag-setting
    is not as consistent as say the NZCV flags on modern architectures;
    instead, there are two bits that can mean anything at all, depending
    on the instruction that sets them. In any case, if you use a program
    that checks for overflows, then you either have to change the mode to non-trapping before the addition and change it back afterwards, or all
    signed overflows that are not checked explicitly are ignored.

    Not quite. It had regular add and subtract (A, AR, S, SR) and logical
    (AL, ALR, SL, SLR). The former set the condition code as negative, zero, positive, or overflow, and interrupted if the overflow
    interrupt was enabled. The latter set the condition code as zero no
    carry, nonzero no carry, zero carry, or nonzero carry, and never
    overflowed. There weren't instructions to do add or subtract with
    carry but it was pretty easy to fake by doing a branch on no carry
    around an instruction to add or subtract 1.

    Multiplication was always signed and took two single length operands
    and produced a double length product. It couldn't overflow but you
    could with some pain check to see if the high word of the product had
    any significant bits.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Brian G. Lucas on Tue Feb 20 21:49:39 2024
    "Brian G. Lucas" <bagel99@gmail.com> writes:
    What I would like is a compiler flag that did "IFF when an int (or unsigned) ends up in a register, promote it to the 'fast' type".

    That used to be the point of C's promotion-to-int rule, until the
    I32LP64 mistake. Now, despite architectural workarounds like RISC-V's
    addw, we see the fallout of this mistake.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Tue Feb 20 21:54:55 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    The Unix code ported relatively easily to I32LP64 because uintptr_t
    had been used extensively rather than assumptions about
    sizeof(int) == sizeof(int *).

    I only heard about (u)intptr_t long after my first contact with
    I32LP64 in 1995. I don't think it existed at the time. Of course we
    defined our own (u)intptr_t-like types, but there are problems to this
    day, e.g., when I want to use printf on that type, which is an int on
    one platform and a long on another platform; I guess the solution is
    to always use %ld etc, and cast the integer data to be printed to
    long/unsigned long.
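
    A sketch of the cast-to-long workaround described above, plus the C99
    <inttypes.h> macro that became available later (the function name is
    invented):

    #include <inttypes.h>
    #include <stdio.h>

    void print_count(intptr_t x)
    {
        printf("%ld\n", (long)x);      /* cast to long, print with %ld   */
        printf("%" PRIdPTR "\n", x);   /* C99: format macro for intptr_t */
    }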

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Tue Feb 20 14:40:02 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [some incidental text removed]

    David Brown <david.brown@hesbynett.no> schrieb:

    On 17/02/2024 19:58, Terje Mathisen wrote:

    int8_t sum(int len, int8_t data[])
    {
        int8_t s = 0;
        for (unsigned i = 0; i < len; i++) {
            s += data[i];
        }
        return s;
    }

    will overflow if called with data = [127, 1, -2], right?

    No. In C, int8_t values will be promoted to "int" (which is always
    at least 16 bits, on any target) and the operation will therefore
    not overflow.

    Depending on len and the data...

    The code as written does not overflow regardless of the
    values in the data array or how many are processed.

    [...]

    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. [...]

    It isn't hard to write standard C code to determine whether a
    proposed addition or subtraction would overflow, and does so
    safely and reliably. It's a little bit tedious perhaps but not
    difficult. Checking code can be wrapped in an inline function
    and invoke whatever handling is desired, within reason.
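
    A minimal sketch of such a check for addition, in standard C, without
    a wider type and without performing the overflowing operation itself
    (the helper name is invented):

    #include <limits.h>

    /* returns nonzero iff a + b would overflow int */
    static inline int add_would_overflow(int a, int b)
    {
        if (b > 0) return a > INT_MAX - b;
        if (b < 0) return a < INT_MIN - b;
        return 0;
    }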

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Tue Feb 20 22:59:10 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The Unix code ported relatively easily to I32LP64 because uintptr_t
    had been used extensively rather than assumptions about
    sizeof(int) == sizeof(int *).

    I only heard about (u)intptr_t long after my first contact with
    I32LP64 in 1995. I don't think it existed at the time.

    Sorry, I meant ptrdiff_t, which was used for pointer math.

    uintptr_t came later.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Wed Feb 21 11:47:27 2024
    On 20/02/2024 18:47, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 13:00, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    We are in an era where long has higher performance than ints (except for cache footprint overheads.)

    C has been in that era since the bad I32LP64 decision of the people
    who did the first 64-bit Unix compilers in the early 1990s.

    I presume the main reason for this was the size and cost of memory at
    the time? Or do you know any other reason? Maybe some of the early
    64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
    faster at 16-bit.

    I know of no implementation of a 64-bit architecture where ALU operations (except maybe division, where present) are slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

    Guess 1: There was more software that depended on sizeof(int)==4 than software that depended on sizeof(int)==sizeof(char *).

    Guess 2: When benchmarketing without adapting the source code (as is
    usual), I32LP64 produced better numbers than ILP64 for some
    benchmarks, because arrays and other data structures with int elements
    are smaller and have better cache hit rates.

    My guess is that it was a mixture of 1 and 2, with 2 being the
    decisive factor.

    Sounds reasonable.

    Another possible reason is that it is very useful to have integer types
    with sizes 1, 2, 4 and 8 bytes. C doesn't have many standard integer
    types, so if "int" is 64-bit, you have "short" as either 16-bit and have
    no 32-bit type, or "short" is 32-bit and you have no 16-bit type. With
    32-bit "int", it's easy to have each size without having to add extended integer types or add new standard integer types (like "short short int"
    for 16-bit and "short int" for 32-bit).

    I have certainly seen a lot of writing about how
    64-bit (pointers) hurt performance, and it even led to the x32
    nonsense (which never went anywhere, not surprising to me). These
    days support for 32-bit applications is eliminated from ARM cores,
    another indication that the performance advantages of 32-bit pointers
    are minor.

    I saw benchmarks showing x32 being measurably faster, but it's not
    unlikely that the differences got less with more modern x86-64
    processors (with bigger caches), and it's simply not worth the effort
    having another set of libraries and compiler targets just to make some
    kinds of code marginally faster.

    And support for 32-bit has /not/ been "eliminated from ARM cores". It
    may have been eliminated from the latest AArch64 cores - I don't keep
    good track of these. But for every such core sold, there will be
    hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers
    and embedded systems. You might not be interested in anything that
    isn't running on modern 64-bit x86 or AArch64 systems (and that's
    absolutely fine - no one is interested in everything), but 32-bit is not
    going away any time soon. Even 8-bit has not gone away.


    BTW, some people here have advocated the use of unsigned instead of
    int. Which of the two results in better code depends on the
    architecture. On AMD64 where the so-called 32-bit instructions
    perform a 32->64-bit zero-extension, unsigned is better. On RV64G
    where the so-called 32-bit instructions perform a 32->64-bit sign
    extension, signed int is better. But actually the best way is to use
    a full-width type like intptr_t or uintptr_t, which gives better
    results than either.

    I would suggest C "fast" types like int_fast32_t (or other "fast" sizes,
    picked to fit the range you need).

    Sure, and then the program might break when an array has more than 2^31 elements; or it might work on one platform and break on another one.


    You need to pick the appropriate size for your data, as I said.

    By contrast, with (u)intptr_t, on modern architectures you use the
    type that's as wide as the GPRs. And I don't see a reason why to use something else for a loop counter.

    I like my types to say what they mean. "uintptr_t" says "this object
    holds addresses for converting pointer values back and forth with an
    integer type". "uint_fast64_t" says "this holds an unsigned integer
    with a range of at least 64 bits, as fast as the target can manage".
    And if you want a type that says "this can hold a value as big as the
    size of the biggest object on this target", "size_t" is the correct type.

    I expect that on most 64-bit platforms, uint_fast16_t, uint_fast32_t, uint_fast64_t, uintptr_t, size_t, and many other types are all 64-bit
    unsigned integers, and all are typedefs of unsigned long int or
    unsigned long long int. The same goes for the signed types.
    Nonetheless, it is good design to use appropriate type names for
    appropriate usage. This makes the code clearer, and increases portability.

    So "[u]intptr_t" is - IMHO - the wrong choice for anything other than
    dealing with pointers that are converted to an integer type.
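
    A small illustration of that naming discipline (declarations only; the
    variable names are invented):

    #include <stddef.h>
    #include <stdint.h>

    size_t        object_size;   /* size of an object, array extent        */
    uintptr_t     saved_addr;    /* a pointer round-tripped via an integer */
    uint_fast32_t loop_count;    /* >= 32-bit range, fastest width         */
    uint32_t      packed_field;  /* exactly 32 bits, e.g. in a large array */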


    For a type of which you store many in an array or other data
    structure, you probably prefer int32_t rather than int_fast32_t if 32
    bits is enough.

    Agreed.

    (You could even use "int_least32_t" if you wanted extreme portability,
    but I have difficulty imagining a platform where that would be anything
    other than int32_t. There are DSP ISAs that are still current which
    don't support int8_t or int16_t, but code for such devices is usually
    highly specialised for such devices, and portability is not an issue.)


    So I don't see a reason for int_fast32_t etc.

    Use it when you want a 32-bit range, and as fast as possible.


    These adapt suitably for different
    targets. If you want to force the issue, then "int64_t" is IMHO clearer
    than "long long int" and does not give a strange impression where you
    are using a type aimed at pointer arithmetic for general integer arithmetic.

    Why do you bring up "long long int"?

    If you want a standard integer type in C that has at least 64-bit, you
    use "long long int". "long int" is only specified as being at least
    32-bit. All other integer types, such as "intptr_t", "size_t" and
    "int64_t", are aliases for the standard integer types. (They could be "extended integer types", but no major C implementation has any of these.)

    As for int64_t, that tends to be
    slow (if supported at all) on 32-bit platforms, and it is more than
    what is necessary for indexing arrays and for loop counters that are
    used for indexing into arrays.


    And that is why it makes sense to use the "fast" types. If you need a
    16-bit range, use "int_fast16_t". It will be 64-bit on 64-bit systems,
    32-bit on 32-bit systems, and 16-bit on 16-bit and 8-bit systems -
    always supporting the range you need, as fast as possible.

    If you want fast local variables, use C's [u]int_fastN_t types. That's
    what they are for.

    I don't see a point in those types. What's wrong with (u)intptr_t IYO?


    I've answered that above.

    (I believe we are in close agreements about the facts - when different
    sizes are faster - but differ in our opinions of which type names to use.)

    Don't use "-fwrapv" unless you actually need it - in most
    code, if your arithmetic overflows, you have a mistake in your code, so
    letting the compiler assume that will not happen is a good thing.

    Thank you for giving a demonstration for Scott Lurndal. I assume that
    you claim to be a programmer.


    Sorry, that comment went over my head - I don't know what
    "demonstration" you are referring to.

    Anyway, if I have made a mistake in my code, why would let the
    compiler assume that I did not make a mistake be a good thing?

    If you have a mistake in your code, you want all the help you can get to
    find it and fix it - leaving integer overflow as UB means compilers can
    provide tools such as sanitizers to aid here, without having to break conformance with the language.

    And if you have /not/ made a mistake in your code, presumably you want
    the compiler to assume you have not made a mistake if that assumption
    lets it generate more efficient object code. (I realise there are times
    when minimising the consequences of possible mistakes is more important
    than object code efficiency, and there can be many other factors to take
    into account.)


    I OTOH prefer if the compiler behaves consistently, so I use -fwrapv,
    and for good performance, I write the code appropriately (e.g., by
    using intptr_t instead of int).

    OK.

    We have had discussions before about the pros and cons of C's UB and
    handling of signed integer overflow. We disagreed before, and I do not
    expect anything has changed or that either of us has new arguments to
    present. So it's probably best not to try to argue about whose
    preferences or opinions are "best" or have the most justifiable
    reasoning. But we can give facts.

    In the examples you have given, using "int" and "-fwrapv" (or "unsigned
    int", which is always wrapping) gives poorer code than either using a
    64-bit type (whatever it is called) or using "int" without "-fwrapv".

    And this could have been simpler if "int" had been 64-bit in the first
    place.


    (And
    it lets you check for overflow bugs using run-time sanitizers.)

    If the compiler assumes that overflow does not happen, how do these "sanitizers" work?

    A compiler assumes that the overflow does not happen in /correct/ code,
    used correctly. With optimisation options, it will assume that if there
    is garbage in, you don't care what kind of garbage comes out, and use
    that to give more efficient object code when things are running as
    intended. With debugging flags, you ask the compiler to tell you when
    it sees the garbage.
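
    For instance (a sketch, not from the thread; the sanitizer option is
    GCC/Clang's UBSan, and the function name is invented):

    /* compile with: gcc -O2 -fsanitize=signed-integer-overflow prog.c */
    #include <limits.h>

    int bump(int x)       /* call with x == INT_MAX to see the report */
    {
        return x + 1;     /* diagnosed at run time when it overflows  */
    }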

    But if overflow is defined (such as it is for unsigned arithmetic, or if "-fwrapv" is in effect), overflow is no longer an error that can be
    trapped at run-time - it is behaviour you may use and rely on in your code.


    Anyway, I certainly have code that relies on modulo arithmetic.


    Sure. So do I. (Not often, but it happens.) I use unsigned types when
    I need to do that, because that's how they are defined in the language.

    For comparison to C, look at the Zig language. (It's not a language I
    have used or know in detail, but I know a little about it.) Unsigned
    integer arithmetic overflow is UB in Zig, just like signed integer
    arithmetic overflow. There are standard options and block settings
    (roughly equivalent to pragmas) to control whether these give a run-time
    error, or are assumed never to happen (for optimisation purposes). And
    if you want to add "x" and "y" with wrapping, you use "x +% y" as an
    explicit choice of operator.

    That seems to me to be the right approach for an efficient high-level
    language.

    But whatever one thinks about C - and I doubt if there is anyone who has
    worked much with C who does not dislike and disagree with some of its
    design decisions - I think it is important to program in C the way C
    works. Compiler extensions, specific flags, relying on implementation-dependent behaviour, etc., can be acceptable, but should
    not be the first choice.

    If you want "int z = x + y;" with wrapping, write :

    int z = (unsigned) x + (unsigned) y;

    If you do it a lot, put it in an inline function (or macro). It is only
    if you are doing it regularly throughout your code, or if you are using
    code written by someone else who assumes signed arithmetic wraps, that "-fwrapv" is a good choice IMHO. And even then, I always put it in a
    pragma so that the code works even if someone uses different compiler flags.
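
    Such an inline function could look like this (a sketch; the name is
    invented, and converting the unsigned result back to int is
    implementation-defined, though it wraps on the usual compilers):

    static inline int wrapping_add(int x, int y)
    {
        /* unsigned arithmetic wraps by definition */
        return (int)((unsigned)x + (unsigned)y);
    }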

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Brian G. Lucas on Wed Feb 21 13:27:23 2024
    On 20/02/2024 20:18, Brian G. Lucas wrote:
    On 2/20/24 10:25, David Brown wrote:

    I would suggest C "fast" types like int_fast32_t (or other "fast"
    sizes, picked to fit the range you need).  These adapt suitably for
    different targets.  If you want to force the issue, then "int64_t" is
    IMHO clearer than "long long int" and does not give a strange
    impression where you are using a type aimed at pointer arithmetic for
    general integer arithmetic.

    What I would like is a compiler flag that did "IFF when an int (or
    unsigned)
    ends up in a register, promote it to the 'fast' type".  This would be great when compiling dusty C decks. (Was there ever C code on punched cards?)


    Well, that will happen to at least some extent (with an optimising
    compiler), at least as long as the answer is the same in the end. It
    can be done a bit more often with "int" rather than "unsigned int",
    precisely because you promised the compiler that your arithmetic won't
    overflow so it does not need to worry about that possibility.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Wed Feb 21 13:49:41 2024
    On 20/02/2024 19:42, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 16:17, Terje Mathisen wrote:
    x86 has had an 'O' (Overflow) flags bit since the very beginning, along
    with JO and JNO for Jump on Overflow and Jump if Not Overflow.


    Many processors had something similar. But I think they fell out of
    fashion for 64-bit RISC,

    No, it didn't. All the RISCs that had flags registers for their
    32-bit architectures still have it for their 64-bit architectures.


    I was thinking more in terms of /using/ these flags, rather than ISA
    support for them. ISAs would clearly have to keep the flag registers
    and the instructions that used them if they wanted to keep compatibility
    with 32-bit code.

    But I think it was fairly rare to use the "add 32-bit and update flags" instruction in 32-bit RISC systems (except for 64-bit arithmetic), and
    much rarer to use the "add 64-bit and update flags" version in 64-bit
    versions.

    as flag registers are a bottleneck for OOO and
    superscaling

    No, it isn't, as demonstrated by the fact that architectures with
    flags registers (AMD64, ARM A64) handily outperform architectures
    without (but probably not because they have a flags register).

    I think it would mainly be /despite/ having a flag register, rather than /because/ of it?

    Sometimes having flags for overflows, carries, etc., can be very handy.
    So having it in the ISA is useful. But I think you would normally want
    your code to avoid setting or reading flags.

    Implementing a flags register in an OoO microarchitecture does require execution resources, however.


    It would, I think, be particularly cumbersome to track several parallel
    actions that all act on the flag register, as it is logically a shared resource.

    Do you think it is an advantage for a RISC architecture to have a flags register compared to alternatives? Say we want to have a double-width
    addition so that "res_hi:res_lo = 0:reg_a + 0:reg_b". (I hope my
    pseudo-code is clear enough here.) With flags and an "add with carry" instruction you could have :

    carry = 0;
    carry, res_lo = reg_a + reg_b + carry
    carry, res_hi = 0 + 0 + carry

    Alternatively, you could have a double-register result, at the cost of
    having more complex register banks :

    res_hi:res_lo = reg_a + reg_b

    Or you could have an "add and take the high word" instruction and use
    two additions :

    res_hi = (reg_a + reg_b) >> N
    res_lo = reg_a + reg_b
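
    In C, the same carry propagation is often expressed with a compiler
    builtin (a sketch; __builtin_add_overflow is a GCC/Clang extension,
    not standard C, and the function name is invented):

    #include <stdint.h>

    /* 128-bit sum from two 64-bit halves; the builtin returns the carry */
    void add128(uint64_t a_lo, uint64_t a_hi, uint64_t b_lo, uint64_t b_hi,
                uint64_t *r_lo, uint64_t *r_hi)
    {
        uint64_t lo;
        unsigned carry = __builtin_add_overflow(a_lo, b_lo, &lo);
        *r_lo = lo;
        *r_hi = a_hi + b_hi + carry;
    }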



    overflow is a lot less common for 64-bit arithmetic, and
    people were not really using the flag except for implementation of
    64-bit arithmetic.

    That's nonsense. People use carry for implementing multi-precision arithmetic (e.g., for cryptography) and for Bignums, and they use
    overflow for implementing Bignums. And the significance of these
    features has increased over time.


    Fair enough, you do want carry (or an equivalent) for big number work.
    But I would still contend that the vast majority of integers and integer arithmetic used in code will fit within 32 bits, and the vast majority
    of those that don't, will fit within 64 bits. Once you go beyond that,
    you will need lots of bits (such as, as you say, cryptography).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Wed Feb 21 14:34:59 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 18:47, Anton Ertl wrote:

    And support for 32-bit has /not/ been "eliminated from ARM cores". It
    may have been eliminated from the latest AArch64 cores

    The ARMv8 architecture fully supports the A32 and T32 instruction sets.

    Implementations of the architecture can choose not to implement
    the A32 and T32 instruction sets. Some ARMv8 implementations
    (e.g. Cavium's) never implemented A32 or T32. Many (if not most) ARM implementations of ARMv8 implemented the A32/T32 instruction
    sets for EL0 (user-mode) only - I'm not aware of any that
    supported A32 at privileged exception levels (EL1, EL2 or EL3).

    Some of the more recent ARM neoverse cores support A32/T32 at EL0,
    and some of them don't. Cavium's cores were 64-bit only.


    good track of these. But for every such core sold, there will be
    hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers

    Indeed, and many ARMv8 SoCs include arm 32-bit M-series microcontrollers on-chip.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Wed Feb 21 18:23:16 2024
    On Wed, 21 Feb 2024 14:34:59 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 18:47, Anton Ertl wrote:

    And support for 32-bit has /not/ been "eliminated from ARM cores".
    It may have been eliminated from the latest AArch64 cores

    The ARMv8 architecture fully supports the A32 and T32 instruction
    sets.

    Implementations of the architecture can choose not to implement
    the A32 and T32 instruction sets. Some ARMv8 implementations
    (e.g. Cavium's) never implemented A32 or T32. Many (if not most) ARM implementations of ARMv8 implemented the A32/T32 instruction
    sets for EL0 (user-mode) only - I'm not aware of any that
    supported A32 at privileged exception levels (EL1, EL2 or EL3).


    W.r.t. Arm Inc. Cortex-A cores that's simply wrong.
    All 64-bit Cortex-A cores from the very first two (A53 and A57, 2012)
    and up to A75 support A32/T32 at all four exception levels. Of those,
    A53 and A55 are still produced and used in huge quantities.
    The first Cortex-A 64-bit core that supports aarch32 only at EL0 is
    Cortex-A76 (2018).

    Some of the more recent ARM neoverse cores support A32/T32 at EL0,
    and some of them don't. Cavium's cores were 64-bit only.


    good track of these. But for every such core sold, there will be
    hundreds (my guestimate) of 32-bit ARM cores sold in
    microcontrollers

    Indeed, and many ARMv8 SoCs include arm 32-bit M-series
    microcontrollers on-chip.

    Still, I don't think that the ratio of all ARM cores combined to cores
    in smartphone application processors is really hundreds. I'd say,
    something between 15 and 30.
    By now, some smartphone cores in current production, most notably the
    "LITTLE" Cortex-A510, still support T32/A32, but in a few years
    everything in this space will be aarch64-only.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Wed Feb 21 18:33:07 2024
    On Wed, 21 Feb 2024 13:27:23 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 20/02/2024 20:18, Brian G. Lucas wrote:
    On 2/20/24 10:25, David Brown wrote:

    I would suggest C "fast" types like int_fast32_t (or other "fast"
    sizes, picked to fit the range you need). These adapt suitably
    for different targets. If you want to force the issue, then
    "int64_t" is IMHO clearer than "long long int" and does not give a
    strange impression where you are using a type aimed at pointer
    arithmetic for general integer arithmetic.

    What I would like is a compiler flag that did "IFF when an int (or unsigned)
    ends up in a register, promote it to the 'fast' type". This would
    be great when compiling dusty C decks. (Was there ever C code on
    punched cards?)

    Well, that will happen to at least some extent (with an optimising compiler), at least as long as the answer is the same in the end. It
    can be done a bit more often with "int" rather than "unsigned int", precisely because you promised the compiler that your arithmetic
    won't overflow so it does not need to worry about that possibility.


    In case of array indices I'd replace your "a bit more" by "a lot
    more".
    If one wants top performance on 64-bit architectures then avoiding
    'unsigned int' indices is a very good idea. Hoping that the compiler will
    somehow figure out what you meant instead of doing what you wrote is
    naive.
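
    A sketch of where the difference shows up (function names invented;
    assuming an I32LP64 target):

    /* 'unsigned int' index: i + 1 must wrap modulo 2^32, so the compiler
       has to keep 32-bit index arithmetic and re-extend each iteration */
    void scale_u(double *dst, const double *src, unsigned int n)
    {
        for (unsigned int i = 0; i < n; i++)
            dst[i] = 2.0 * src[i + 1];
    }

    /* signed 'int' index: overflow is undefined, so the compiler may keep
       the index in a 64-bit register for the whole loop */
    void scale_s(double *dst, const double *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = 2.0 * src[i + 1];
    }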

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Wed Feb 21 16:53:08 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 21 Feb 2024 14:34:59 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 18:47, Anton Ertl wrote:

    And support for 32-bit has /not/ been "eliminated from ARM cores".
    It may have been eliminated from the latest AArch64 cores

    The ARMv8 architecture fully supports the A32 and T32 instruction
    sets.

    Implementations of the architecture can choose not to implement
    the A32 and T32 instruction sets. Some ARMv8 implementations
    (e.g. Cavium's) never implemented A32 or T32. Many (if not most) ARM
    implementations of ARMv8 implemented the A32/T32 instruction
    sets for EL0 (user-mode) only - I'm not aware of any that
    supported A32 at privileged exception levels (EL1, EL2 or EL3).


    W.r.t. Arm Inc. Cortex-A cores that's simply wrong.
    All 64-bit Cortex-A cores from the very first two (A53 and A57, 2012)
    and up to A75 support A32/T32 at all four exception levels. Of those,
    A53 and A55 are still produced and used in huge quantities.
    The first Cortex-A 64-bit core that supports aarch32 only at EL0 is Cortex-A76 (2018).

    I'm looking at it from a server-grade core (neoverse) perspective. Even
    with the Cortex-A cores, there wasn't much -demand- for
    aarch32 support at higher levels (except for Rasp Pi, which
    granted there are a considerable number of them).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Wed Feb 21 18:08:08 2024
    On 21/02/2024 17:23, Michael S wrote:
    On Wed, 21 Feb 2024 14:34:59 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    David Brown <david.brown@hesbynett.no> writes:

    good track of these. But for every such core sold, there will be
    hundreds (my guestimate) of 32-bit ARM cores sold in
    microcontrollers

    Indeed, and many ARMv8 SoCs include arm 32-bit M-series
    microcontrollers on-chip.

    Still, I don't think that the ratio of all ARM cores combined to cores
    in smartphone application processors is really hundreds. I'd say,
    something between 15 and 30.

    32-bit ARM Cortex-M cores are /everywhere/. Your smart phone probably
    has several of them for the cellular modem, the wireless interface, and
    other devices. Your TV will have them, your microwave, your keyboard,
    your games console controllers. But my guestimate is no more than a guestimate, and the ratio is perhaps less than it used to be as
    smartphones and other such things get more and more cores.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Wed Feb 21 19:09:16 2024
    On Wed, 21 Feb 2024 13:49:41 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 20/02/2024 19:42, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 16:17, Terje Mathisen wrote:
    x86 has had an 'O' (Overflow) flags bit since the very beginning,
    along with JO and JNO for Jump on Overflow and Jump if Not
    Overflow.

    Many processors had something similar. But I think they fell out
    of fashion for 64-bit RISC,

    No, it didn't. All the RISCs that had flags registers for their
    32-bit architectures still have it for their 64-bit architectures.


    I was thinking more in terms of /using/ these flags, rather than ISA
    support for them. ISAs would clearly have to keep the flag registers
    and the instructions that used them if they wanted to keep
    compatibility with 32-bit code.


    aarch64 is a completely new and incompatible instruction encoding.

    But I think it was fairly rare to use the "add 32-bit and update
    flags" instruction in 32-bit RSIC systems (except for 64-bit
    arithmetic), and much rarer to use the "add 64-bit and update flags"
    version in 64-bit versions.


    Of course, the main use of flags is conditional branching. That's true
    even on 16-bit.
    The 2nd common use is conditional move/selection.
    Other uses are just a bonus, insignificant in The Great Scheme of Things.
    However I don't think that "bonus" use of flags for bignum and similar
    is any rarer on 64-bit machines than on 32-bit.

    as flag registers are a bottleneck for OOO and
    superscaling

    No, it isn't, as demonstrated by the fact that architectures with
    flags registers (AMD64, ARM A64) handily outperform architectures
    without (but probably not because they have a flags register).

    I think it would mainly be /despite/ having a flag register, rather
    than /because/ of it?


    That's what Mitch thinks. But he has no proof.

    Sometimes having flags for overflows, carries, etc., can be very
    handy. So having it in the ISA is useful. But I think you would
    normally want your code to avoid setting or reading flags.


    On OoO, when you are setting flags almost all the time, you are
    effectively telling the engine that the flags results of the previous
    arithmetic instructions are don't-cares (DNC). In theory, that can be
    used to avoid the majority of flags updates in the PRF. I don't know
    whether such an optimization is actually done in real HW.
    Reading flags can't really be rare, because conditional branches
    are among the most common instructions in real-world code.

    Implementing a flags register in an OoO microarchitecture does
    require execution resources, however.


    It would, I think, be particularly cumbersome to track several
    parallel actions that all act on the flag register, as it is
    logically a shared resource.

    Do you think it is an advantage for a RISC architecture to have a
    flags register compared to alternatives? Say we want to have a
    double-width addition so that "res_hi:res_lo = 0:reg_a + 0:reg_b".
    (I hope my pseudo-code is clear enough here.) With flags and an "add
    with carry" instruction you could have :

    carry = 0;
    carry, res_lo = reg_a + reg_b + carry
    carry, res_hi = 0 + 0 + carry


    That's not how we do it.
    We just use normal add for the first instruction.

    Alternatively, you could have a double-register result, at the cost
    of having more complex register banks :

    res_hi:res_lo = reg_a + reg_b

    Or you could have an "add and take the high word" instruction and use
    two additions :

    res_hi = (reg_a + reg_b) >> N
    res_lo = reg_a + reg_b


    This variant does not scale to longer additions. Producing a carry, i.e. an instruction with effectively two outputs, is the smaller part of the advantage
    of a flags-based scheme. The bigger part is consuming a carry, i.e. having effectively three inputs. In order to see it, you have to think about the
    triple-width (or wider) case.
    It is the words in between the first and the last that are the most challenging for MIPS/Alpha/RISC-V, where they need 5 instructions vs 1 instruction on x86/Arm/SPARC.
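
    The middle-limb step, written out in plain C the way a flag-less ISA
    has to execute it (a sketch; roughly the five operations mentioned):

    #include <stdint.h>

    /* one middle limb of a multi-word add: s = a + b + carry_in,
       with carry_out recovered from unsigned comparisons */
    static uint64_t add_limb(uint64_t a, uint64_t b,
                             unsigned carry_in, unsigned *carry_out)
    {
        uint64_t t = a + b;        /* may wrap                */
        unsigned c1 = t < a;       /* carry out of first add  */
        uint64_t s = t + carry_in;
        unsigned c2 = s < t;       /* carry out of second add */
        *carry_out = c1 | c2;      /* at most one can be set  */
        return s;
    }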

    Still, as I said above, that's just a bonus rather than a major reason to
    have flags.




    overflow is a lot less common for 64-bit arithmetic, and
    people were not really using the flag except for implementation of
    64-bit arithmetic.

    That's nonsense. People use carry for implementing multi-precision arithmetic (e.g., for cryptography) and for Bignums, and they use
    overflow for implementing Bignums. And the significance of these
    features has increased over time.


    Fair enough, you do want carry (or an equivalent) for big number
    work. But I would still contend that the vast majority of integers
    and integer arithmetic used in code will fit within 32 bits, and the
    vast majority of those that don't, will fit within 64 bits. Once you
    go beyond that, you will need lots of bits (such as, as you say, cryptography).


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Wed Feb 21 18:10:25 2024
    On 21/02/2024 17:33, Michael S wrote:
    On Wed, 21 Feb 2024 13:27:23 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 20/02/2024 20:18, Brian G. Lucas wrote:
    On 2/20/24 10:25, David Brown wrote:

    I would suggest C "fast" types like int_fast32_t (or other "fast"
    sizes, picked to fit the range you need).  These adapt suitably
    for different targets.  If you want to force the issue, then
    "int64_t" is IMHO clearer than "long long int" and does not give a
    strange impression where you are using a type aimed at pointer
    arithmetic for general integer arithmetic.

    What I would like is a compiler flag that did "IFF when an int (or
    unsigned)
    ends up in a register, promote it to the 'fast' type".  This would
    be great when compiling dusty C decks. (Was there ever C code on
    punched cards?)

    Well, that will happen to at least some extent (with an optimising
    compiler), at least as long as the answer is the same in the end. It
    can be done a bit more often with "int" rather than "unsigned int",
    precisely because you promised the compiler that your arithmetic
    won't overflow so it does not need to worry about that possibility.


    In case of array indices I'd replace your "a bit more" by "a lot
    more".

    I haven't measured the real-world performance impact (I am more
    interested in performance on microcontroller cores). So I'll believe
    whatever you and the others here say on that!

    If one wants top performance on 64-bit architectures then avoiding
    'unsigned int' indices is a very good idea. Hoping that compiler will
    somehow figure out what you meant instead of doing what you wrote is a naivety.


    Indeed.

    Compilers do what you tell them. You just have to be accurate about
    what you say.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Wed Feb 21 17:30:31 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 21 Feb 2024 13:49:41 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On OoO, when you are setting flags almost all the time, you are
    effectively telling an engine that flags results of the previous
    arithmetic instructions are DNC. In theory, it can be used to avoid
    majority of updates of flags in PRF. I don't know whether such
    optimization is actually done in real HW.
    Reading flags can't really be rare, because conditional branches
    are among most common instructions in the real-world code.

    Code generators are pretty good at using the non-flag-setting
    arithmetic instructions when the flags don't matter, and using
    the flag-setting versions (e.g. ADD vs. ADDS) when needed for conditional branches or conditional moves.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Wed Feb 21 17:27:21 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 21/02/2024 17:33, Michael S wrote:
    On Wed, 21 Feb 2024 13:27:23 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 20/02/2024 20:18, Brian G. Lucas wrote:
    On 2/20/24 10:25, David Brown wrote:

    I would suggest C "fast" types like int_fast32_t (or other "fast"
    sizes, picked to fit the range you need).  These adapt suitably
    for different targets.  If you want to force the issue, then
    "int64_t" is IMHO clearer than "long long int" and does not give a
    strange impression where you are using a type aimed at pointer
    arithmetic for general integer arithmetic.

    What I would like is a compiler flag that did "IFF when an int (or
    unsigned)
    ends up in a register, promote it to the 'fast' type".  This would
    be great when compiling dusty C decks. (Was there ever C code on
    punched cards?)

    Well, that will happen to at least some extent (with an optimising
    compiler), at least as long as the answer is the same in the end. It
    can be done a bit more often with "int" rather than "unsigned int",
    precisely because you promised the compiler that your arithmetic
    won't overflow so it does not need to worry about that possibility.


    In case of array indices I'd replace your "a bit more" by "a lot
    more".

    I haven't measured the real-world performance impact (I am more
    interested in performance on microcontroller cores). So I'll believe whatever you and the others here say on that!

    If one wants top performance on 64-bit architectures then avoiding
    'unsigned int' indices is a very good idea. Hoping that compiler will
    somehow figure out what you meant instead of doing what you wrote is a
    naivety.


    Indeed.

    I hope you're not agreeing that unsigned array indices should be
    avoided - they should instead be preferred.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Wed Feb 21 18:31:18 2024
    Michael S wrote:
    On Wed, 21 Feb 2024 13:49:41 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 20/02/2024 19:42, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 16:17, Terje Mathisen wrote:
    x86 has had an 'O' (Overflow) flags bit since the very beginning,
    along with JO and JNO for Jump on Overflow and Jump if Not
    Overflow.

    Many processors had something similar. But I think they fell out
    of fashion for 64-bit RISC,

    No, it didn't. All the RISCs that had flags registers for their
    32-bit architectures still have it for their 64-bit architectures.


    I was thinking more in terms of /using/ these flags, rather than ISA
    support for them. ISAs would clearly have to keep the flag registers
    and the instructions that used them if they wanted to keep
    compatibility with 32-bit code.


    aarch64 is a completely new and incompatible instruction encoding.

    But I think it was fairly rare to use the "add 32-bit and update
    flags" instruction in 32-bit RSIC systems (except for 64-bit
    arithmetic), and much rarer to use the "add 64-bit and update flags"
    version in 64-bit versions.


    Of course, the main use of flags is conditional branching. That's true
    even on 16-bit.
    The 2nd common use is conditional move/selection.
    Other uses are just bonus, insignificant in The Great Scheme of Things. However I don't think that "bonus" use of flags for bignum and similar
    is any rarer on 64-bit machines than on 32-bit.

    as flag registers are a bottleneck for OOO and
    superscaling

    No, it isn't, as demonstrated by the fact that architectures with
    flags registers (AMD64, ARM A64) handily outperform architectures
    without (but probably not because they have a flags register).

    I think it would mainly be /despite/ having a flag register, rather
    than /because/ of it?


    That's what Mitch thinks. But he has no proof.

    Sometimes having flags for overflows, carries, etc., can be very
    handy. So having it in the ISA is useful. But I think you would
    normally want your code to avoid setting or reading flags.


    On OoO, when you are setting flags almost all the time, you are
    effectively telling an engine that flags results of the previous
    arithmetic instructions are DNC. In theory, it can be used to avoid
    majority of updates of flags in PRF. I don't know whether such
    optimization is actually done in real HW.
    Reading flags can't really be rare, because conditional branches
    are among most common instructions in the real-world code.

    Implementing a flags register in an OoO microarchitecture does
    require execution resources, however.


    It would, I think, be particularly cumbersome to track several
    parallel actions that all act on the flag register, as it is
    logically a shared resource.

    Do you think it is an advantage for a RISC architecture to have a
    flags register compared to alternatives? Say we want to have a
    double-width addition so that "res_hi:res_lo = 0:reg_a + 0:reg_b".
    (I hope my pseudo-code is clear enough here.) With flags and an "add
    with carry" instruction you could have :

    carry = 0;
    carry, res_lo = reg_a + reg_b + carry
    carry, res_hi = 0 + 0 + carry


    That's not how we do it.
    We just use normal add for the first instruction.

    Alternatively, you could have a double-register result, at the cost
    of having more complex register banks :

    res_hi:res_lo = reg_a + reg_b

    Or you could have an "add and take the high word" instruction and use
    two additions :

    res_hi = (reg_a + reg_b) >> N
    res_lo = reg_a + reg_b


    This variant does not scale to longer additions. Producing a carry, i.e. an instruction with effectively two outputs, is the smaller part of the advantage
    of a flags-based scheme. The bigger part is consuming a carry, i.e. having effectively three inputs. In order to see it, you have to think about the
    triple-width (or wider) case.
    It is the words in between the first and the last that are the most challenging for MIPS/Alpha/RISC-V, where they need 5 instructions vs 1 instruction on x86/Arm/SPARC.

    I agree with this. Afair, on the Itanium you had two separate ADD
    opcodes, ADD0 and ADD1, where the first (which was aliased to regular
    ADD?) just added the two inputs, while the second addeded the inputs,
    plus one, i.e. to be used only when the previous round generated an
    outgoing carry.

    Afair, the resulting code would also use predicated operations,
    effectively doing both ADD0 and ADD1 at the same time, and letting the
    previous carry out select between them.

    s0 = add0(a[0],b[0]);
    sum[0] = s0;
    p0 = s0 < a[0];
    s1 = p0 ? add1(a[1],b[1]) : add0(a[1],b[1]);
    sum[1] = s1;
    p1 = p0 ? s1 <= a[1] : s1 < a[1];

    That final line shows how the intermediate words cause extra
    complications: You have to use two different comparison operations to
    generate the carry predicate for the next stage, so the
    predicate-generating instructions must themselves be predicated.


    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Thu Feb 22 17:08:43 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. [...]

    It isn't hard to write standard C code to determine whether a
    proposed addition or subtraction would overflow, and does so
    safely and reliably.

    Also efficiently and without resorting to implementation-
    defined or undefined behavior (and without needing a bigger
    type)?


    It's a little bit tedious perhaps but not
    difficult. Checking code can be wrapped in an inline function
    and invoke whatever handling is desired, within reason.

    Maybe you could share such code?

    The next question would be how to do the same for multiplication....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Fri Feb 23 18:34:08 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations (except maybe division where present) are slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

    A guess: people did not want sizeof(int) != sizeof(float). float
    is certainly faster than double.

    It would also have broken Fortran, where storage association rules mean
    that both REAL and INTEGER have to have the same size, and DOUBLE
    PRECISION twice that. Breaking that would have invalidated just
    about every large scientific program at the time.

    Cray got away with 64-bit REAL and 128-bit DOUBLE PRECISION because
    they were the fastest anyway, but anybody else making that choice
    would have been laughed right out of the market.

    So, backward compatibility, your favorite topic.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sat Feb 24 10:21:00 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations
    (except maybe division where present) are slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

    A guess: people did not want sizeof(int) != sizeof(float).

    I assume that you mean that people wanted sizeof(int)==sizeof(float).
    Why would they want that? That certainly did not hold on the PDP-11
    and many other 16-bit systems where sizeof(int)==2 and
    sizeof(float)==4.

    float
    is certainly faster than double.

    On the 21064 or MIPS R4000 (the first 64-bit systems after those by
    Cray)? I am pretty sure that FP addition, subtraction and
    multiplication have the same speed on these CPUs in binary32 and
    binary64.

    It would also have broken Fortran, where storage association rules mean
    that both REAL and INTEGER have to have the same size, and DOUBLE
    PRECISION twice that. Breaking that would have invalidated just
    about every large scientific program at the time.

    C compilers choose their types according to their rules, and Fortran
    chooses its types according to its rules. I don't see what C's int
    type has to do with Fortran's INTEGER type. And if the rules you
    specify above mean that Fortran on the PDP-11 has a 4-byte INTEGER
    type, there is already precedent for int having a different size from
    INTEGER.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Sat Feb 24 11:39:48 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 21 Feb 2024 13:49:41 +0100
    David Brown <david.brown@hesbynett.no> wrote:
    [...]
    However I don't think that "bonus" use of flags for bignum and similar
    is any rarer on 64-bit machines than on 32-bit.

    I think Bignums are more common on 64-bit machines. The popularity of languages with Bignums is now higher than it was before
    general-purpose computing switched from 32-bit to 64-bit 10-30 years
    ago. Also, these languages are more popular on general-purpose
    computers (64-bit for at least two years now) than on the embedded
    computers where 32-bit processors still prevail.

    I also think that general-purpose computers do more cryptography (and multi-precision arithmetic for that) than embedded computers.

    Sometimes having flags for overflows, carries, etc., can be very
    handy. So having it in the ISA is useful. But I think you would
    normally want your code to avoid setting or reading flags.


    On OoO, when you are setting flags almost all the time, you are
    effectively telling the engine that the flag results of the previous
    arithmetic instructions are don't-cares. In theory, that can be used to
    avoid the majority of flag updates in the PRF. I don't know whether such
    an optimization is actually done in real HW.

    In real hardware, AMD64 CPUs have as many physical flag registers as
    physical integer registers (e.g., 280 on Golden Cove); maybe they are
    actually part of the same register, but they would still need separate
    tracking hardware (one for C, one for O, one for NZP), so there is no particular reason to have them be part of the same register.

    On ARM cores the number of physical flags registers is roughly 1/3 of
    the number of physical integer registers (46 vs. 147 on A710, 39
    vs. 120 on Neoverse N1/A76 <https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>), but I guess that this is due to being able to suppress flags updates
    rather than ignoring those that are overwritten without being read.

    That assumes that all of A32, T32, and A64 have the ability to
    suppress flags updates. Do they?

    It would, I think, be particularly cumbersome to track several
    parallel actions that all act on the flag register, as it is
    logically a shared resource.

    That's no problem for modern register renaming units. They rename the
    single flags register just as readily as an integer register, with the
    same logical register getting multiple physical register numbers per
    cycle.

    [multi-word addition/subtraction]
    Still, as I said above, that's just a bonus rather than a major reason to
    have flags.

    The 88000 and Power architectures have one mechanism for producing
    comparison results, and a different mechanism for add-with-carry (with
    carry-in and carry-out optional in both architectures). This shows
    that multi-word addition and subtraction is not just a bonus, but a
    major reason for an architectural mechanism.

    However, I think that a separate flags register has its disadvantages
    (e.g., the need for separate tracking resources), and that adding
    carry and overflow to general-purpose registers is preferable.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sat Feb 24 15:13:42 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:


    On ARM cores the number of physical flags registers is roughly 1/3 of
    the number of physical integer registers (46 vs. 147 on A710, 39
    vs. 120 on Neoverse N1/A76 <https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>), but I guess that this is due to being able to suppress flags updates
    rather than ignoring those that are overwritten without being read.

    That assumes that all of A32, T32, and A64 have the ability to
    suppress flags updates. Do they?

    A32, T32 and A64 have a bit in the instruction word that
    specifies whether the flags should be updated. T16
    only updates flags when the instruction is in an if-then (IT)
    block.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Anton Ertl on Sat Feb 24 17:51:56 2024
    On 2024-02-24 13:39, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 21 Feb 2024 13:49:41 +0100 David Brown
    <david.brown@hesbynett.no> wrote:
    [...]
    However I don't think that "bonus" use of flags for bignum and
    similar is any rarer on 64-bit machines than on 32-bit.

    I think Bignums are more common on 64-bit machines.


    [...]


    I also think that general-purpose computers do more cryptography
    (and multi-precision arithmetic for that) than embedded computers.


    Hm. That may depend on whether you are comparing absolute numbers or
    proportions of workloads. Cryptography is very important for many
    embedded systems (telecom, automatic teller machines, point-of-sale
    terminals, etc.). Last I looked (several years ago), there were several
    8051-based chips (small 8-bit processors) on the market with dedicated
    on-chip HW accelerators for cryptography.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sat Feb 24 17:21:25 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:


    On ARM cores the number of physical flags registers is roughly 1/3 of
    the number of physical integer registers (46 vs. 147 on A710, 39
    vs. 120 on Neoverse N1/A76 <https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>), but I guess that this is due to being able to suppress flags updates
    rather than ignoring those that are overwritten without being read.

    That assumes that all of A32, T32, and A64 have the ability to
    suppress flags updates. Do they?

    A32, T32 and A64 have a bit in the instruction word that
    specifies whether the flags should be updated. T16
    only updates flags when the instruction is in an if-then (IT)
    block.

    What is T16? Google does not give me anything appropriate for "ARM
    T16"? Do you mean the 16-bit-encoded instructions in T32?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Niklas Holsti on Sat Feb 24 17:00:48 2024
    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    I also think that general-purpose computers do more cryptography
    (and multi-precision arithmetic for that) than embedded computers.


    Hm. That may depend on whether you are comparing absolute numbers or proportions of workloads.

    I am thinking about the proportions of workloads.

    Cryptography is very important for many
    embedded systems (telecom, automatic teller machines, point-of-sale terminals, etc.). Last I looked (several years ago), there were several 8051-based chips (small 8-bit processors) on the market with dedicated on-chip HW accelerators for cryptography.

    And AMD64 and ARM A64 have acceleration instructions for symmetric
    cryptography (such as AES), too. Symmetric cryptography (with secret
    keys only) does not use multi-precision arithmetic, but asymmetric
    cryptography (with public and private keys) does. I guess that there
    is no special hardware for that because hardware that does it faster
    than software would be very expensive, and because the asymmetric part
    is only used at the start of a connection and when renewing the secret
    key for the symmetric stuff.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sat Feb 24 17:26:41 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 18:47, Anton Ertl wrote:

    And support for 32-bit has /not/ been "eliminated from ARM cores". It
    may have been eliminated from the latest AArch64 cores

    The ARMv8 architecture fully supports the A32 and T32 instruction sets.

    There is no single "ARMv8 architecture". ARMv8-M does not support A64.
    ARMv8-A and ARMv8-R do. They may also support A32 and T32, but then
    the A cores released since 2021 (X2 and following, A710 and following,
    A510 and following) are ARMv9-A, not ARMv8-A; and several of them
    support neither A32 nor T32.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sat Feb 24 20:22:16 2024
    On Sat, 24 Feb 2024 17:21:25 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:


    On ARM cores the number of physical flags registers is roughly 1/3
    of the number of physical integer registers (46 vs. 147 on A710, 39
    vs. 120 on Neoverse N1/A76 <https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>),
    but I guess that this is due to being able to suppress flags updates rather than ignoring those that are overwritten without being read.

    That assumes that all of A32, T32, and A64 have the ability to
    suppress flags updates. Do they?

    A32, T32 and A64 have a bit in the instruction word that
    specifies whether the flags should be updated. T16
    only updates flags when the instruction is in an if-then (IT)
    block.

    What is T16? Google does not give me anything appropriate for "ARM
    T16"? Do you mean the 16-bit-encoded instructions in T32?

    - anton

    Scott probably uses the name T16 for the pre-ARMv7 variant of Thumb.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Sat Feb 24 10:38:30 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    Terje Mathisen wrote:

    If we have a platform where the default integer size is 32 bits
    and long is 64 bits, then afaik the C promotion rules will use
    int as the accumulator size, right?

    Not necessarily:: accumulation rules allow the promotion of
    int->long inside a loop

    Yes.

    if the long is smashed back to int immediately after the loop
    terminates.

    Doing this may be a good idea, but the C standard doesn't require
    it. Once the horse is out of the barn, as far as the standard is
    concerned there nothing that forces an implementation to evidence
    any concern about getting the horse back inside.

    In particular, an implementation may continue holding a 32-bit int
    in a 64-bit memory for the rest of the program's execution, making
    use of all 64 bits in any subsequent operations, and not run afoul
    of what the standard requires.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Sat Feb 24 10:25:50 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    David Brown wrote:

    On 17/02/2024 19:58, Terje Mathisen wrote:

    Anton Ertl wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    On the third (i.e gripping) hand you could have a language like
    Java where it would be illegal to transform a temporarily
    trapping loop into one that would not trap and give the
    mathematically correct answer.

    What "temporarily trapping loop" and "mathematically correct
    answer"?

    If you are talking about integer arithmetic, the limited integers
    in Java have modulo semantics, i.e., they don't trap, and
    BigIntegers certainly don't trap.

    If you are talking about FP (like I did), by default FP addition
    does not trap in Java, and any mention of "mathematically
    correct" in connection with FP needs a lot of further
    elaboration.

    Sorry to be unclear:

    I haven't really been following this thread, but there's a few
    things here that stand out to me - at least as long as we are
    talking about C.

    I realized a bunch of messages ago that it was a bad idea to write
    (pseudo-)C to illustrate a general problem. :-(

    If we have a platform where the default integer size is 32 bits
    and long is 64 bits, then afaik the C promotion rules will use int
    as the accumulator size, right?

    For a signed type T, a C implementation is free to hold values of
    type T in any memory cell that is at least as wide as T, and in
    particular can hold a 32-bit int in a 64-bit register, if it so
    chooses. More specifically, an implementation may translate a
    statement like (assume x and y are 32-bit ints)

    x = x + y;

    using a 64-bit add, and store the full 64-bit result into a
    64-bit register that corresponds to x. Note that what causes the
    problem is not the store of a large result but the addition that
    overflows the nominal 32-bit type. As long as the results of the
    addition are in range, everything is fine; but once any of the
    additions overflows the nominal 32-bit range, all bets are off.

    What I was trying to illustrate was the principle that by having a
    wider accumulator you could aggregate a series of numbers, both
    positive and negative, and get the correct (in-range) result, even
    if the input happened to be arranged in such a way that it would
    temporarily overflow the target int type.

    I think it is much better to do it this way and then get a
    conversion size trap at the very end when/if the final sum is in
    fact too large for the result type.

    Absolutely, and for the accumulator choose the widest signed type
    available. At the tail end of the loop, the final value can (and
    should?) be ranged checked, because storing an out-of-range value
    into a smaller type is a "safe" operation in that what happens is implementation-defined behavior, so it's very likely nothing very
    bad will happen (at least not right away ;).
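
    A minimal C sketch of the approach described above (wide accumulator,
    range check only at the end); the names are mine, and the assumption is
    simply that long long is wide enough for the number of elements summed:

    #include <limits.h>
    #include <stddef.h>

    /* Sum 32-bit ints in a (typically 64-bit) long long accumulator.
       Intermediate sums may leave the int range without harm; only the
       final value is range checked before narrowing. */
    static int checked_sum(const int *data, size_t len, int *ok)
    {
        long long acc = 0;
        for (size_t i = 0; i < len; i++)
            acc += data[i];
        *ok = (acc >= INT_MIN && acc <= INT_MAX);
        return *ok ? (int)acc : 0;
    }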

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Sat Feb 24 17:37:26 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 19:42, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 16:17, Terje Mathisen wrote:
    x86 has had an 'O' (Overflow) flags bit since the very beginning, along with JO and JNO for Jump on Overflow and Jump if Not Overflow.


    Many processors had something similar. But I think they fell out of
    fashion for 64-bit RISC,

    No, it didn't. All the RISCs that had flags registers for their
    32-bit architectures still have it for their 64-bit architectures.


    I was thinking more in terms of /using/ these flags, rather than ISA
    support for them. ISAs would clearly have to keep the flag registers
    and the instructions that used them if they wanted to keep compatibility
    with 32-bit code.

    64-bit architectures of this century have separate 64-bit instruction
    sets on which you cannot run 32-bit code. The idea of having a 32-bit instruction set that's a subset of the 64-bit instruction set, and
    supporting both 32-bit programs and 64-bit programs without mode
    switch seems to have gone out of fashion; it used to be in fashion in
    the 1990s when MIPS, SPARC, and PowerPC introduced 64-bit extensions
    by just adding instructions to their 32-bit instruction set (and
    defining what the 32-bit instructions do with the additional bits in
    the GPRs). So if AMD and ARM had thought that they would be better
    off in the long term without flags, they could have designed AMD64 and
    A64 to work without flags.

    But I think it was fairly rare to use the "add 32-bit and update flags"
    instruction in 32-bit RISC systems (except for 64-bit arithmetic), and
    much rarer to use the "add 64-bit and update flags" version in 64-bit
    versions.

    On ARM A64's glibc-2.31:

    [a76:/usr/lib/aarch64-linux-gnu:99372] objdump -d libc-2.31.so |grep '\<add\>'|wc -l
    19330
    [a76:/usr/lib/aarch64-linux-gnu:99373] objdump -d libc-2.31.so |grep '\<adds\>'|wc -l
    277

    However, glibc is written in C. If you look at code for a language
    with Bignums, I expect that things look quite differently.

    No, it isn't, as demonstrated by the fact that architectures with
    flags registers (AMD64, ARM A64) handily outperform architectures
    without (but probably not because they have a flags register).

    I think it would mainly be /despite/ having a flag register, rather than /because/ of it?

    There is no evidence for "despite".

    Sometimes having flags for overflows, carries, etc., can be very handy.
    So having it in the ISA is useful. But I think you would normally want
    your code to avoid setting or reading flags.

    If each instruction can specify whether it sets flags or not, that
    costs one bit per instruction, which is also a cost.

    AMD64 does not spend that bit, and I see no evidence that its
    performance suffers from setting flags too often.

    It would, I think, be particularly cumbersome to track several parallel actions that all act on the flag register, as it is logically a shared resource.

    Not particularly. The renamer ensures that the instructions write to
    different physical flags registers, and that instructions that read
    flags read the right physical register.

    Do you think it is an advantage for a RISC architecture to have a flags register compared to alternatives?

    I think my alternative of putting carry and overflow in the result
    register has certain advantages.

    If you compare a RISC with a flags register (such as A64) to RV64GC,
    certainly the add-with-carry-in-carry-out is 1 instruction (with
    typically 1 cycle latency) in A64 and 5 (with typically 3 cycles
    latency) in RV64GC; and the overflow check in the fast path for a
    Bignum addition is 2 instructions longer than on A64.

    That's nonsense. People use carry for implementing multi-precision
    arithmetic (e.g., for cryptography) and for Bignums, and they use
    overflow for implementing Bignums. And the significance of these
    features has increased over time.


    Fair enough, you do want carry (or an equivalent) for big number work.
    But I would still contend that the vast majority of integers and integer arithmetic used in code will fit within 32 bits, and the vast majority
    of those that don't, will fit within 64 bits.

    Sure. But with Bignums, you have to check for overflow after every
    addition, subtraction, or multiplication. Actually, you need a check
    even with division: The result of dividing the smallest small Bignum
    by -1 does not fit in a small Bignum.
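
    As an illustration of the per-operation check this implies, here is a
    hedged C sketch of a small-Bignum addition fast path. It assumes the
    GCC/Clang __builtin_add_overflow extension, and bignum_add_big() is a
    hypothetical slow path that promotes to the multi-word representation:

    #include <stdint.h>

    int64_t bignum_add_big(int64_t a, int64_t b);  /* hypothetical slow path */

    /* Fast path: both operands fit in one machine word.  Every addition
       must be checked; on overflow, fall back to multi-precision. */
    static int64_t bignum_add_small(int64_t a, int64_t b)
    {
        int64_t sum;
        if (__builtin_add_overflow(a, b, &sum))    /* GCC/Clang builtin */
            return bignum_add_big(a, b);
        return sum;
    }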

    Once you go beyond that,
    you will need lots of bits (such as, as you say, cryptography).

    Yes, so you need multi-precision arithmetic on 64-bit systems just as
    on 32-bit systems.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sat Feb 24 22:29:01 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The Unix code ported relatively easily to I32LP64 because uintptr_t
    had been used extensively rather than assumptions about
    sizeof(int) == sizeof(int *).
    ...
    Sorry, I meant ptrdiff_t, which was used for pointer math.

    I have seen little code that uses ptrdiff_t; quite a bit that used
    size_t (the unsigned brother of ptrdiff_t). But my memory tells me
    that even size_t was not very widespread in 1995.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Sat Feb 24 14:43:46 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
        int s = 0;
        for (unsigned i = 0; i < len; i++) {
            s += data[i];
        }
        return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    But the return statement is where overflow (if any) is detected.

    The cast is superfluous because a conversion to int8_t will be
    done in any case, since the return type of the function is
    int8_t.

    Missing my point:: which was::

    The summation loop will not overflow, and overflow is detected at
    the smash from int to int8_t.

    So you wanted to make a point that's completely unrelated to what I
    was saying?

    In any case the conversion from type int to type int8_t that is
    done after loop is finished is not going to detect any overflow,
    regardless of whether it is done by an explicit cast or implicitly
    by the return statement. Converting an out-of-range value to a
    signed integer type is merely implementation-defined behavior. As
    of C99 the C standard allows a signal to be raised in such cases,
    but TTBOMK no implementations actually do that (and AFAICT neither
    clang nor gcc even have an option to do so).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Sat Feb 24 20:57:17 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 18:47, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 20/02/2024 13:00, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    Another possible reason is that it is very useful to have integer types
    with sizes 1, 2, 4 and 8 bytes. C doesn't have many standard integer
    types, so if "int" is 64-bit, you have "short" as either 16-bit and have
    no 32-bit type, or "short" is 32-bit and you have no 16-bit type. With 32-bit "int", it's easy to have each size without having to add extended integer types or add new standard integer types (like "short short int"
    for 16-bit and "short int" for 32-bit).

    short had been 16-bit on 16-bit machines and on 32-bit machines, so
    the right choice for 64-bit machines is to make it 16-bit, too. As
    for a 32-bit type, that would then obviously be "long short". The C
    compiler people had no qualms at adding "long long" when they wanted
    something bigger then 32 bits on 32-bit systems, so what should keep
    them from adding "long short"?

    I saw benchmarks showing x32 being measurably faster,

    Sure, it's measurably faster. That's obviously not sufficient for
    incurring the cost of x32.

    but it's not
    unlikely that the differences got less with more modern x86-64
    processors (with bigger caches)

    Doubtful. The L1 caches have not become bigger since the days of the
    K7 (1999) with its 64KB D and I caches (and the K7 actually did not
    support AMD64). There has been some growth in L2+L3 combined in
    recent years, but x32 already flopped earlier.

    and it's simply not worth the effort
    having another set of libraries and compiler targets just to make some
    kinds of code marginally faster.

    Exactly.

    And support for 32-bit has /not/ been "eliminated from ARM cores".

    Of course it has. E.g., the Cortex-X1 supports A32/T32, and its
    descendants Cortex-X2, X3, X4 don't. The Cortex-A710 supports
    A32/T32, it's successors A715 and A720 do not. Cortex-A510 supports
    A32/T32, A520 doesn't.

    It
    may have been eliminated from the latest AArch64 cores - I don't keep
    good track of these. But for every such core sold, there will be
    hundreds (my guestimate) of 32-bit ARM cores sold in microcontrollers
    and embedded systems.

    Irrelevant for the question at hand: Are the performance benefits of
    32-bit applications sufficient to pay for the cost of maintaining a
    32-bit software infrastructure on an otherwise 64-bit system? The
    answer is no.

    I would suggest C "fast" types like int_fast32_t (or other "fast" sizes, picked to fit the range you need).

    Sure, and then the program might break when an array has more than 2^31
    elements; or it might work on one platform and break on another one.


    You need to pick the appropriate size for your data, as I said.

    In general-purpose computing, you usually don't know that size. E.g.,
    for a sort routine, do you use int_fast32_t, int_fast8_t,
    int_fast16_t, or int_fast64_t for the array size?

    By contrast, with (u)intptr_t, on modern architectures you use the
    type that's as wide as the GPRs. And I don't see a reason to use
    something else for a loop counter.

    I like my types to say what they mean. "uintptr_t" says "this object
    holds addresses for converting pointer values back and forth with an
    integer type".

    Exactly. It's the unnamed type of BCPL and B, and the int of Unix C
    before the I32LP64 mistake.

    Nonetheless, it is good design to use appropriate type names for
    appropriate usage. This makes the code clearer, and increases portability.

    On the contrary, in the Gforth project we have had many more
    portability problems with C code with its integer type zoo than in the
    Forth code which just has single-cell (a machine word), double cell,
    and char as integer types. Likewise, Forth code from others tends to
    be pretty portable between 32-bit and 64-bit systems, even if the code
    has only been tested on one kind of system.

    As for int64_t, that tends to be
    slow (if supported at all) on 32-bit platforms, and it is more than
    what is necessary for indexing arrays and for loop counters that are
    used for indexing into arrays.


    And that is why it makes sense to use the "fast" types. If you need a
    16-bit range, use "int_fast16_t". It will be 64-bit on 64-bit systems, 32-bit on 32-bit systems, and 16-bit on 16-bit and 8-bit systems -
    always supporting the range you need, as fast as possible.

    That makes no sense. E.g., an in-memory sorting routine might well
    have to sort 100G elements or more on a suitably large (64-bit)
    machine. So according you one should use int_fast64_t. But that
    would be slow and unnecessarily large on a 32-bit system where you
    cannot hold and sort that many items anyway.

    Don't use "-fwrapv" unless you actually need it - in most
    code, if your arithmetic overflows, you have a mistake in your code, so
    letting the compiler assume that will not happen is a good thing.

    Thank you for giving a demonstration for Scott Lurndal. I assume that
    you claim to be a programmer.


    Sorry, that comment went over my head - I don't know what
    "demonstration" you are referring to.

    I wrote in <2024Feb20.091522@mips.complang.tuwien.ac.at>:
    |Those ideas are that integer overflows do not happen and that a
    |competent programmer proactively prevents them from happening, by
    |sizing the types accordingly, and checking the inputs.

    Scott Lurndal replied <SW2BN.153110$taff.74839@fx41.iad>:
    |Can't say that I've known a programmer who thought that way.

    For comparison to C, look at the Zig language. (It's not a language I
    have used or know in detail, but I know a little about it.) Unsigned
    integer arithmetic overflow is UB in Zig, just like signed integer
    arithmetic overflow. There are standard options and block settings
    (roughly equivalent to pragmas) to control whether these give a run-time error, or are assumed never to happen (for optimisation purposes). And
    if you want to add "x" and "y" with wrapping, you use "x +% y" as an
    explicit choice of operator.

    That seems to me to be the right approach for an efficient high-level language.

    I don't know about Zig, but for a language like C, I would prefer
    types that ask for trap-on-overflow arithmetic, or for modulo
    arithmetic instead of introducing an additional operator.

    The undefined option is just a bad idea: Wang et al <https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf> found
    that it gave a measurable speedup over -fwrapv in only one loop in
    SPECint2006, and that speedup is only due to the I32LP64 mistake.
    I.e., a competently designed dialect would not have that mistake, and
    there would be no measurable speedup from the undefined option, just
    the possibility of the compiler doing something unexpected.

    If you want "int z = x + y;" with wrapping, write :

    int z = (unsigned) x + (unsigned) y;

    I'll leave that to you, not just for the following reason:

    Once upon a time someone like you suggested using some casting
    approach for getting x-1>=x (with signed x) to work as intended. I
    tried it, and the result was that gcc-3.something still "optimized" it
    to false.

    And even then, I always put it in a
    pragma so that the code works even if someone uses different compiler flags.

    Yes, announcing such things in the source code is a good idea in
    principle. In practice newer versions of gcc tend to need more sanity
    flags (the default of newer gccs is more insanity), and the older
    versions do not understand the new flags and fail if you pass them.
    So we check every sanity flag in autoconf and use those that the gcc
    version used accepts. Doing that through pragmas filled in by
    configure does not seem to be any better than using flags through the
    Makefile.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Sat Feb 24 22:33:58 2024
    John Levine <johnl@taugh.com> writes:
    There weren't instructions to do add or subtract with
    carry but it was pretty easy to fake by doing a branch on no carry
    around an instruction to add or subtract 1.

    That is sufficient for double-word arithmetic, but not for multi-word arithmetic. ESA/390 adds addition-with-carry-in.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Sat Feb 24 18:19:43 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
        int s = 0;
        for (unsigned i = 0; i < len; i++) {
            s += data[i];
        }
        return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    I am normally writing Rust these days, where UB is far less common,
    but casts like this are mandatory.

    Oh. I didn't know that about Rust. Interesting.

    I'm always somewhat surprised when someone advocates using a cast
    for such things, and now more surprised to learn that Rust has
    chosen to impose using a cast as part of its language rules. I
    understand that it has the support of community sentiment, but
    even so it seems like a poor choice here. I'm not a big fan of
    the new attribute syntax, but a form like

    return [[narrow]] s;

    looks to be a better way of asking Rust to allow what is a
    normally disallowed conversion. By contrast, using a cast is
    overkill. There is unnecessary redundancy, by specifying a type
    in two places, and the risk that they might get out of sync. And
    on general principles requiring a cast violates good security
    principles. If someone needs access to a particular room in a
    building, we don't hand over a master key that opens every room
    in the building. If someone needs to read some documents that
    have classified materials, we don't give them an access code that
    lets them read any sensitive material regardless of whether it's
    relevant. Maybe Rust is different, but in C a cast allows any
    conversion that is possible in the language, even the unsafe
    ones. It just seems wrong to use the nuclear option of casting
    for every minor infringement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Feb 25 02:33:32 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:


    On ARM cores the number of physical flags registers is roughly 1/3
    of the number of physical integer registers (46 vs. 147 on A710, 39
    vs. 120 on Neoverse N1/A76 <https://chipsandcheese.com/2023/08/11/arms-cortex-a710-winning-by-default/>),
    but I guess that this is due to being able to suppress flags updates rather than ignoring those that are overwritten without being read.

    That assumes that all of A32, T32, and A64 have the ability to
    suppress flags updates. Do they?

    A32, T32 and A64 have a bit in the instruction word that
    specifies whether the flags should be updated. T16
    only updates flags when the instruction is in an if-then (IT)
    block.

    What is T16? Google does not give me anything appropriate for "ARM
    T16"? Do you mean the 16-bit-encoded instructions in T32?

    The 16-bit subset of T32, yes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Feb 25 02:42:43 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The Unix code ported relatively easily to I32LP64 because uintptr_t
    had been used extensively rather than assumptions about
    sizeof(int) == sizeof(int *).
    ...
    Sorry, I meant ptrdiff_t, which was used for pointer math.

    I have seen little code that uses ptrdiff_t; quite a bit that used
    size_t (the unsigned brother of ptrdiff_t). But my memory tells me
    that even size_t was not very widespread in 1995.

    Unixware, early 90's:

    $ find . -name '*.[ch]' -print | xargs grep size_t | wc -l
    3435
    $ find . -name '*.[ch]' -print | xargs grep ptrdiff_t | wc -l
    86

    memcpy was defined using size_t in the 1989 SVID Third Edition, Volume 1.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to It appears that Anton Ertl on Sun Feb 25 02:56:20 2024
    It appears that Anton Ertl <anton@mips.complang.tuwien.ac.at> said:
    John Levine <johnl@taugh.com> writes:
    There weren't instructions to do add or subtract with
    carry but it was pretty easy to fake by doing a branch on no carry
    around an instruction to add or subtract 1.

    That is sufficient for double-word arithmetic, but not for multi-word >arithmetic. ESA/390 adds addition-with-carry-in.

    You could do it but you're right, it was a little trickier than that since
    you had to check for the carry both on the addition and on the optional
    second one. The add with carry instructions made it a lot simpler.

    According to the IBM manuals, add with carry was added in z/Series but
    you could use it in S/390 mode on a z machine, so I guess someone
    really wanted it for some existing code.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Opus on Sat Feb 24 20:58:28 2024
    Opus <ifonly@youknew.org> writes:

    On 18/02/2024 05:01, Tim Rentsch wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
        int s = 0;
        for (unsigned i = 0; i < len; i++) {
            s += data[i];
        }
        return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    But the return statement is where overflow (if any) is detected.

    The cast is superfluous because a conversion to int8_t will be
    done in any case, since the return type of the function is
    int8_t.

    Of course the conversion will be done implicitly. C converts almost
    anything implicitly. Not that this is its greatest feature.

    The explicit cast is still useful: 1/ to express intent (it shows that
    the potential loss of data is intentional) and then 2/ to avoid
    compiler warnings (if you enable -Wconversion, which I usually
    recommend) or warning from any serious static analyzer too (which I
    highly recommend using too).

    Using a cast is a poor way to express intent:

    return s; // narrow accumulated value on return

    is better.

    Using a cast to prevent a compiler warning is an awful
    convention. See also my reply to Terje's message about
    Rust. Compiler writers should be ashamed for promulgating
    such a moronic idea.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Scott Lurndal on Sat Feb 24 21:05:49 2024
    scott@slp53.sl.home (Scott Lurndal) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
        int s = 0;
        for (unsigned i = 0; i < len; i++) {
            s += data[i];
        }
        return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    But the return statement is where overflow (if any) is detected.

    The cast is superfluous because a conversion to int8_t will be
    done in any case, since the return type of the function is
    int8_t.

    I suspect most experienced C programmers know that.

    I expect most C programmers, even most experienced C programmers,
    do not know the rules for conversions of return values. How many
    readers here in comp.arch do you think know those rules? How
    confident are you that you can state them, without having to look
    them up? I know I was surprised when I first discovered that
    return statements do not convert values the way I expected.

    Yet, the 'superfluous' cast is also documentation that the
    programmer _intended_ that the return value would be narrowed.

    Using a cast is a poor way to express such an intent, as I have
    explained in other postings.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Sat Feb 24 21:17:13 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    scott@slp53.sl.home (Scott Lurndal) writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    scott@slp53.sl.home (Scott Lurndal) writes:

    The Unix code ported relatively easily to I32LP64 because uintptr_t
    had been used extensively rather than assumptions about
    sizeof(int) == sizeof(int *).

    ...

    Sorry, I meant ptrdiff_t, which was used for pointer math.

    I have seen little code that uses ptrdiff_t; quite a bit that used
    size_t (the unsigned brother of ptrdiff_t). But my memory tells me
    that even size_t was not very widespread in 1995.

    In 1995 a problem with both size_t and ptrdiff_t is that there
    were no corresponding length modifiers for those types in
    printf() format conversions (corrected in C99).
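
    For reference, the length modifiers added in C99 are 'z' for size_t and
    't' for ptrdiff_t; a minimal example (mine, not from the thread):

    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        size_t    n = sizeof(long);
        ptrdiff_t d = 0;
        printf("n = %zu, d = %td\n", n, d);  /* C99: %zu for size_t, %td for ptrdiff_t */
        return 0;
    }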

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Sat Feb 24 22:05:35 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. [...]

    It isn't hard to write standard C code to determine whether a
    proposed addition or subtraction would overflow, and does so
    safely and reliably.

    Also efficiently and without resorting to implementation-
    defined or undefined behavior (and without needing a bigger
    type)?

    Heavens to Betsy! Are you impugning the quality and excellence
    of my code? Of *my* code? I can only hope that you are suitably
    chagrined and contrite. ;)

    It's a little bit tedious perhaps but not
    difficult. Checking code can be wrapped in an inline function
    and invoke whatever handling is desired, within reason.

    Maybe you could share such code?

    Rather than do that I will explain.

    An addition overflows if the two operands have the same sign and
    the sign of an operand is the opposite of the sign of the sum
    (taken mod the width of the operands). Convert the signed
    operands to their unsigned counterparts, and form the sum of the
    unsigned values. The sign is just the high-order bit in each
    case. Thus the overflow condition can be detected with a few
    bitwise xors and ands.

    Subtraction is similar except now overflow can occur only when
    the operands have different signs and the sign of the sum is
    the opposite of the sign of the first operand.

    The above description works for two's complement hardware where
    unsigned types have the same width as their corresponding signed
    types. I think for most people that's all they need. The three
    other possibilities are all doable with minor adjustments, and
    code appropriate to each particular implementation can be
    selected using C preprocessor conditional, as for example

    #if UINT_MAX > INT_MAX && INT_MIN == -INT_MAX - 1
    // this case is the one outlined above

    #elif UINT_MAX > INT_MAX && INT_MIN == -INT_MAX

    #elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX - 1

    #elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX

    Does that all make sense?
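
    A hedged C sketch of the addition case just described, for the
    two's-complement case above (my code, using only unsigned arithmetic,
    so nothing here is undefined behavior):

    #include <limits.h>

    /* Returns nonzero if a + b would overflow int: the operands have the
       same sign and the sign of the wrapped sum differs from that sign. */
    static int add_would_overflow(int a, int b)
    {
        unsigned ua = (unsigned) a;
        unsigned ub = (unsigned) b;
        unsigned us = ua + ub;   /* sum mod 2^N, always well defined */
        unsigned sign = 1u << (sizeof(int) * CHAR_BIT - 1);
        return (~(ua ^ ub) & (ua ^ us) & sign) != 0;
    }

    Subtraction is analogous, with the opposite test on the operand signs.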


    The next question would be how to do the same for multiplication....

    Multiplication is a whole other ball game. First we need to
    consider only the widest types, because multiplication of narrower
    types can be carried out in a wider type and the resulting product value
    checked. Off the top of my head, for the widest types I would
    try converting to float or double, do a floating-point multiply,
    and do some trivial accepts and trivial rejects based on the
    exponent of the result. Any remaining cases would need more
    care, but probably (we hope!) there aren't many of those and they
    don't happen very often. So for what it's worth there is my
    first idea. Second idea is to compute a double-width product,
    or at least part of one, using standard multiple-precision
    arithmetic, and speed compare against the floating-point method.
    I better stop now or the ideas will probably get worse rather
    than better. :/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Sun Feb 25 08:40:32 2024
    John Levine <johnl@taugh.com> writes:
    According to the IBM manuals, add with carry was added in z/Series but
    you could use it in S/390 mode on a z machine, so I guess someone
    really wanted it for some existing code.

    Interestingly,
    <https://en.wikibooks.org/wiki/360_Assembly/360_Instructions> lists
    ALC, ALCR, SLB and SLBR as belonging to the 390 instructions, and only
    ALCG, ALCGR, SLBG and SLBGR (the 64-bit variants) as Z instructions.

    If they added ALC, ALCR, SLB and SLBR only in Z (but in the S/390
    mode), that is counterevidence for the claim that add-with-carry is
    less important for 64-bit systems than for 32-bit systems.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Tim Rentsch on Sun Feb 25 08:03:12 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    Second idea is to compute a double-width product,
    or at least part of one, using standard multiple-precision
    arithmetic, and speed compare against the floating-point method.

    What "standard multiple-precision arithmetic" is there in C? I am not
    aware of any.

    If you have widening multiplication in the language, things are
    trivial. I'll use Forth because it has widening multiplication:

    For the unsigned case:

    ( u1 u2 ) um* if ... ( handle the overflow case ) ... then

    For the signed case:

    ( n1 n2 ) m* >r s>d r> <> if ... ( handle the overflow case ) ... then

    S>D ( n -- d ) sign-extends a single-cell to a double-cell.

    For those of you who are Forth-illiterate, here's how it might look in
    C, with (U)Word and (U)Doubleword integer types and some assumptions
    about things that the standardized subset of C does not define:

    For the unsigned case:

    UDoubleword ud = u1 * (UDoubleword)u2;
    if ((ud>>(8*sizeof(UWord))) != 0) {
    ... /* handle the overflow case */ ...
    }

    For the signed case:

    Doubleword d = n1 * (Doubleword)n2;
    Word dlo = d; /* dlo contains the low-order bits of d */
    if (d != (Doubleword)dlo) {
    ... /* handle the overflow case */ ...
    }

    I'll leave it to the advocates of the standardized subset of C to
    rewrite this in a way that can be executed in a strictly conforming
    program.
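
    For the unsigned case, one strictly conforming alternative that needs
    neither a wider type nor implementation-defined behavior is to test
    against a division first (my sketch, not code from the thread):

    #include <limits.h>

    /* Returns nonzero if u1 * u2 would exceed UINT_MAX. */
    static int umul_would_overflow(unsigned u1, unsigned u2)
    {
        return u2 != 0 && u1 > UINT_MAX / u2;
    }

    The division makes it slower than the widening-multiply versions above,
    which is rather the point being made.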

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun Feb 25 10:40:35 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    I know no implementation of a 64-bit architecture where ALU operations
    (except maybe division where present) are slower in 64 bits than in 32
    bits. I would have chosen ILP64 at the time, so I can only guess at
    their reasons:

    A guess: people did not want sizeof(int) != sizeof(float).

    I assume that you mean that people wanted sizeof(int)==sizeof(float).
    Why would they want that? That certainly did not hold on the PDP-11
    and many other 16-bit systems where sizeof(int)==2 and
    sizeof(float)==4.

    float
    is certainly faster than double.

    On the 21064 or MIPS R4000 (the first 64-bit systems after those by
    Cray)? I am pretty sure that FP addition, subtraction and
    multiplication have the same speed on these CPUs in binary32 and
    binary64.

    Cache size and memory bandwidth also play a role...

    When you're doing huge vector-matrix multiplications to solve
    large sets of equations, memory bandwidth is usually the bottleneck.

    If you can get away with 32-bit reals, you do it - it is a factor of
    two, after all.

    These days, people are actually trying to do preconditioning with 16-bit
    floats to gain another factor of two.

    And nowadays, with SIMD, the advantage of shorter data types is even
    more pronounced.


    It would also have broken Fortran, where storage association rules mean
    that both REAL and INTEGER have to have the same size, and DOUBLE
    PRECISION twice that. Breaking that would have invalidated just
    about every large scientific program at the time.

    C compilers choose their types according to their rules, and Fortran
    chooses its types according to its rules. I don't see what C's int
    type has to do with Fortran's INTEGER type.

    C does not exist in a vacuum, especially if it is the systems
    programming language for a system that Fortran is supposed to run
    on, and run on well.

    Two examples: not being able to call BLAS subroutines from C
    would have made scientific C people unhappy, and not being able
    to call C functions via the de-facto standard established by Bell's
    Fortran 77 compiler and later f2c would have made a lot of Fortran
    people unhappy.

    And if the rules you
    specify above mean that Fortran on the PDP-11 has a 4-byte INTEGER
    type, there is already precedent for int having a different size from INTEGER.

    And that was suboptimal, but it did not make Fortran unusable by
    requiring a 128-bit DOUBLE PRECISION like your suggestion would.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Tim Rentsch on Sun Feb 25 19:05:39 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    scott@slp53.sl.home (Scott Lurndal) writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    scott@slp53.sl.home (Scott Lurndal) writes:

    The Unix code ported relatively easily to I32LP64 because uintptr_t
    had been used extensively rather than assumptions about
    sizeof(int) == sizeof(int *).

    ...

    Sorry, I meant ptrdiff_t, which was used for pointer math.

    I have seen little code that uses ptrdiff_t; quite a bit that used
    size_t (the unsigned brother of ptrdiff_t). But my memory tells me
    that even size_t was not very widespread in 1995.

    In 1995 a problem with both size_t and ptrdiff_t is that there
    were no corresponding length modifiers for those types in
    printf() format conversions (corrected in C99).

    Calling it a "problem" is overstating the case. It was
    straightforward enough, if not completely portable, to
    use the appropriate number of 'l' modifiers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Feb 25 19:18:13 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    According to the IBM manuals, add with carry was added in z/Series but
    you could use it in S/390 mode on a z machine, so I guess someone
    really wanted it for some existing code.

    Interestingly,
    <https://en.wikibooks.org/wiki/360_Assembly/360_Instructions> lists
    ALC, ALCR, SLB and SLBR as belonging to the 390 instructions, and only
    ALCG, ALCGR, SLBG and SLBGR (the 64-bit variants) as Z instructions.

    These details are better found from the original references. Here's
    the S/390 reference: https://publibfp.dhe.ibm.com/epubs/pdf/dz9ar008.pdf

    And here's a link to the zSeries reference: https://www.ibm.com/support/pages/zarchitecture-principles-operation
    You need an IBM account to download it, but signing up is easy.

    What the Wikibooks page says is right, with the detail that the 390
    instructions only exist in 390 mode on a z. Those "G" instructions
    have 64 bit operands and make no sense in 390 mode.

    If they added ALC, ALCR, SLB and SLBR only in Z (but in the S/390
    mode), that is counterevidence for the claim that add-with-carry is
    less important for 64-bit systems than for 32-bit systems.

    Not necessarily. From S/360->370->390 most of the new instructions
    were to deal with the address expansion kludges, the I/O system, and
    IEEE floating point, with only a handful of general instructions like
    CHECKSUM to speed up TCP/IP. In z/Series they made a complete u-turn
    and added gazillions of new instructions. Some were to add 64 bit
    versions of 32 bit instructions, some to fill well known gaps like
    relative branches and longer offsets in memory references, but there
    is a whole lot of stuff that seems to make some customer's workload a
    little faster, such as a gzip accelerator and the fixed point decimal
    vector facility.

    z/Series uses what they call Millicode, microcode that uses the
    hardware implemented part of the same instruction set, so the cost of
    adding lots of new instructions is now low. They only backported a few
    instructions into S/390, which suggests someone really wanted the
    carry stuff.



    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Feb 27 11:01:02 2024
    Tim Rentsch wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. [...]

    It isn't hard to write standard C code to determine whether a
    proposed addition or subtraction would overflow, and does so
    safely and reliably.

    Also efficiently and without resorting to implementation-
    defined or undefined behavior (and without needing a bigger
    type)?

    Heavens to Betsy! Are you impugning the quality and excellence
    of my code? Of *my* code? I can only hope that you are suitably
    chagrined and contrite. ;)

    It's a little bit tedious perhaps but not
    difficult. Checking code can be wrapped in an inline function
    and invoke whatever handling is desired, within reason.

    Maybe you could share such code?

    Rather than do that I will explain.

    An addition overflows if the two operands have the same sign and
    the sign of an operand is the opposite of the sign of the sum
    (taken mod the width of the operands). Convert the signed
    operands to their unsigned counterparts, and form the sum of the
    unsigned values. The sign is just the high-order bit in each
    case. Thus the overflow condition can be detected with a few
    bitwise xors and ands.

    Subtraction is similar except now overflow can occur only when
    the operands have different signs and the sign of the sum is
    the opposite of the sign of the first operand.

    The above description works for two's complement hardware where
    unsigned types have the same width as their corresponding signed
    types. I think for most people that's all they need. The three
    other possibilities are all doable with minor adjustments, and
    code appropriate to each particular implementation can be
    selected using C preprocessor conditional, as for example

    #if UINT_MAX > INT_MAX && INT_MIN == -INT_MAX - 1
    // this case is the one outlined above

    #elif UINT_MAX > INT_MAX && INT_MIN == -INT_MAX

    #elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX - 1

    #elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX

    Does that all make sense?
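    (Spelled out as a sketch in C, for the two's complement case described
    above; this is not Tim's actual code, just one way to write it down.)

    #include <limits.h>

    /* nonzero iff a + b would overflow; assumes two's complement and that
       unsigned int has the same width as int */
    static int add_would_overflow(int a, int b)
    {
        unsigned ua = (unsigned)a, ub = (unsigned)b;
        unsigned usum = ua + ub;              /* the sum, taken mod 2^width */
        return (~(ua ^ ub) & (ua ^ usum)) >> (sizeof(int) * CHAR_BIT - 1);
    }

    /* nonzero iff a - b would overflow */
    static int sub_would_overflow(int a, int b)
    {
        unsigned ua = (unsigned)a, ub = (unsigned)b;
        unsigned udiff = ua - ub;
        return ((ua ^ ub) & (ua ^ udiff)) >> (sizeof(int) * CHAR_BIT - 1);
    }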


    The next question would be how to do the same for multiplication....

    Multiplication is a whole other ball game. First we need to
    consider only the widest types, because multiplication of narrower
    types can be carried out in a wider type and the resulting product
    value checked. Off the top of my head, for the widest types I would
    try converting to float or double, do a floating-point multiply,
    and do some trivial accepts and trivial rejects based on the
    exponent of the result. Any remaining cases would need more
    care, but probably (we hope!) there aren't many of those and they
    don't happen very often. So for what it's worth there is my
    first idea. Second idea is to compute a double-width product,
    or at least part of one, using standard multiple-precision
    arithmetic, and speed compare against the floating-point method.
    I better stop now or the ideas will probably get worse rather
    than better. :/

    Your floating point method is pretty bad, imho, since it can give you
    both false negatives and false positives, with no way to know for sure,
    except doing it all over again.

    If I really had to write a 64x64->128 MUL, with no widening MUL or MULH
    which returns the high half, then I would punt and do it using 32-bit
    parts (all variables are u64):

    p3 = (a >> 32) * (b >> 32);
    p2 = (a & 0xffffffff) * (b >> 32);
    p1 = (a >> 32) * (b & 0xffffffff);
    p0 = (a & 0xffffffff) * (b & 0xffffffff);

    // Middle sum, can give a 1 carry into high half
    p12 = (p0 >> 32) + (p1 & 0xffffffff) + (p2 & 0xffffffff);

    prod = (p0 & 0xffffffff) + (p12 << 32); // Final low word result

    prod_hi = p3 + (p2 >> 32) + (p1 >> 32) + (p12 >> 32);

    if (prod_hi != 0) overflow();

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Feb 27 10:35:11 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    I am normally writing Rust these days, where UB is far less common,
    but casts like this are mandatory.

    Oh. I didn't know that about Rust. Interesting.

    I'm always somewhat surprised when someone advocates using a cast
    for such things, and now more surprised to learn that Rust has
    chosen to impose using a cast as part of its language rules. I

    I am not _sure_ but I believe Rust will in fact verify that all such
    casts are in fact legal, i.e. the data will fit in the target container.

    This is of course the total opposite of the C "nuclear option", and much
    more like other languages that try to be secure by default.

    understand that it has the support of community sentiment, but
    even so it seems like a poor choice here. I'm not a big fan of
    the new attribute syntax, but a form like

    return [[narrow]] s;

    looks to be a better way of asking Rust to allow what is a
    normally disallowed conversion. By contrast, using a cast is
    overkill. There is unnecessary redundancy, by specifying a type
    in two places, and the risk that they might get out of sync. And
    on general principles requiring a cast violates good security
    principles. If someone needs access to a particular room in a
    building, we don't hand over a master key that opens every room
    in the building. If someone needs to read some documents that
    have classified materials, we don't give them an access code that
    lets them read any sensitive material regardless of whether it's
    relevant. Maybe Rust is different, but in C a cast allows any
    conversion that is possible in the language, even the unsafe
    ones. It just seems wrong to use the nuclear option of casting
    for every minor infringement.

    I agree, if Rust did it like C, then it would be very unsafe indeed.

    I have not checked the generated asm, but I believe that if I write code
    like this:

    // x:u64
    x = 0x1234567890abcdef;
    let y:u8 = (x & 255) as u8;

    the compiler will see the mask and realize that the conversion is safe,
    so no need to interpose a

    cmp x,256
    jae trp_conversion

    idiom.

    OTOH, I have seen C compilers that insist on such a test at the end of a
    fully saturated switch statement, even when the mask in front should
    prove that no other values are possible.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Terje Mathisen on Tue Feb 27 13:50:43 2024
    Terje Mathisen wrote:
    Tim Rentsch wrote:
    I'm always somewhat surprised when someone advocates using a cast
    for such things, and now more surprised to learn that Rust has
    chosen to impose using a cast as part of its language rules.  I

    I am not _sure_ but I believe Rust will in fact verify that all such
    casts are in fact legal, i.e. the data will fit in the target container.

    This is of course the total opposite of the C "nuclear option", and much more like other languages that try to be secure by default.

    Just to make sure I spoke to our resident Rust guru, and he told me I
    was wrong:

    Rust does have conversion operators/functions for downsizing variables,
    and they come with full validity checking, but using "s as u8" as I
    suggested will generate exactly the same code as a C "(uint8_t) s"
    idiom, i.e. no verification and no safety checks.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Tue Feb 27 15:07:22 2024
    On Tue, 27 Feb 2024 13:50:43 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Terje Mathisen wrote:
    Tim Rentsch wrote:
    I'm always somewhat surprised when someone advocates using a cast
    for such things, and now more surprised to learn that Rust has
    chosen to impose using a cast as part of its language rules. I

    I am not _sure_ but I believe Rust will in fact verify that all
    such casts are in fact legal, i.e. the data will fit in the target container.

    This is of course the total opposite of the C "nuclear option", and
    much more like other languages that try to be secure by default.

    Just to make sure I spoke to our resident Rust guru, and he told me I
    was wrong:

    Rust does have conversion operators/functions for downsizing
    variables, and they come with full validity checking, but using "s as
    u8" as I suggested will generate exactly the same code as a C
    "(uint8_t) s" idiom, i.e. no verification and no safety checks.

    Terje


    Pretty much in the spirit of Ada's Unchecked_Conversion construct, but
    with a less striking visual hint of doing something unusual and
    potentially dangerous.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Wed Feb 28 01:28:36 2024
    Michael S wrote:

    On Tue, 27 Feb 2024 13:50:43 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Terje Mathisen wrote:
    Tim Rentsch wrote:
    I'm always somewhat surprised when someone advocates using a cast
    for such things, and now more surprised to learn that Rust has
    chosen to impose using a cast as part of its language rules.  I

    I am not _sure_ but I believe Rust will in fact verify that all
    such casts are in fact legal, i.e. the data will fit in the target
    container.

    This is of course the total opposite of the C "nuclear option", and
    much more like other languages that try to be secure by default.

    Just to make sure I spoke to our resident Rust guru, and he told me I
    was wrong:

    Rust does have conversion operators/functions for downsizing
    variables, and they come with full validity checking, but using "s as
    u8" as I suggested will generate exactly the same code as a C
    "(uint8_t) s" idiom, i.e. no verification and no safety checks.

    Terje


    Pretty much in the spirit of Ada's Unchecked_Conversion construct, but
    with a less striking visual hint of doing something unusual and
    potentially dangerous.

    No more dangerous than::

    if( c >= 'A' and c <= 'Z' ) c -= 'A'-'a';

    or

    if( table[c] & CAPS ) c -='A'-'a';

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Feb 28 15:01:19 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Michael S wrote:

    On Tue, 27 Feb 2024 13:50:43 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Terje Mathisen wrote:
    Tim Rentsch wrote:
    I'm always somewhat surprised when someone advocates using a cast
    for such things, and now more surprised to learn that Rust has
    chosen to impose using a cast as part of its language rules.  I

    I am not _sure_ but I believe Rust will in fact verify that all
    such casts are in fact legal, i.e. the data will fit in the target
    container.

    This is of course the total opposite of the C "nuclear option", and
    much more like other languages that try to be secure by default.

    Just to make sure I spoke to our resident Rust guru, and he told me I
    was wrong:

    Rust does have conversion operators/functions for downsizing
    variables, and they come with full validity checking, but using "s as
    u8" as I suggested will generate exactly the same code as a C
    "(uint8_t) s" idiom, i.e. no verification and no safety checks.

    Terje


    Pretty much in the spirit of Ada's Unchecked_Conversion construct, but
    with a less striking visual hint of doing something unusual and
    potentially dangerous.

    No more dangerous than::

    if( c >= 'A' and c <= 'Z' ) c -= 'A'-'a';

    If your character set is EBCDIC, this isn't perfect, but since
    the gaps are generally unassigned, may not cause problems in
    practice.

    For EBCDIC, it was sufficient to clear or set bit<6> to change case.


    or

    if( table[c] & CAPS ) c -='A'-'a';

    Safer than the former.
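    (A sketch of the bit<6> trick mentioned above; in EBCDIC 'a' is 0x81
    and 'A' is 0xC1, so the two cases differ only in the 0x40 bit. Only
    valid when c is already known to be a letter.)

    unsigned char ebcdic_toupper(unsigned char c) { return c | 0x40u; }
    unsigned char ebcdic_tolower(unsigned char c) { return (unsigned char)(c & ~0x40u); }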

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Scott Lurndal on Mon Mar 11 07:54:07 2024
    scott@slp53.sl.home (Scott Lurndal) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    scott@slp53.sl.home (Scott Lurndal) writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    scott@slp53.sl.home (Scott Lurndal) writes:

    The Unix code ported relatively easily to I32LP64 because
    uintptr_t had been used extensively rather than assumptions
    about
    sizeof(int) == sizeof(int *).

    ...

    Sorry, I meant ptrdiff_t, which was used for pointer math.

    I have seen little code that uses ptrdiff_t; quite a bit that
    used size_t (the unsigned brother of ptrdiff_t). But my memory
    tells me that even size_t was not very widespread in 1995.

    In 1995 a problem with both size_t and ptrdiff_t is that there

    Calling it a "problem" is overstating the case. It was
    straightforward enough, if not completely portable, to
    use the appropriate number of 'l' modifiers.

    Whether it is called a problem or not, the lack of support from
    printf() was mentioned upthread (by OP?), and that's why I pointed it
    out. The point is that not having the appropriate length modifiers
    in C90 makes the code clumsy and the coding inconvenient. Focusing
    on what word is used is a red herring.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Mon Mar 11 08:00:59 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Second idea is to compute a double-width product,
    or at least part of one, using standard multiple-precision
    arithmetic, and speed compare against the floating-point method.

    What "standard multiple-precision arithmetic" is there in C? I am
    not aware of any.

    I didn't say there is. So what else might I have meant by that
    phrase?

    If you have widening multiplication in the language, things are
    trivial. [...]

    Sure. If things were different they wouldn't be the same.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Mon Mar 11 08:10:29 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [...]

    int8_t sum(int len, int8_t data[])
    {
    int s = 0;
    for (unsigned i = 0; i < len; i++) {
    s += data[i];
    }
    return (int8_t) s;
    }

    The cast in the return statement is superfluous.

    I am normally writing Rust these days, where UB is far less common,
    but casts like this are mandatory.

    Oh. I didn't know that about Rust. Interesting.

    I'm always somewhat surprised when someone advocates using a cast
    for such things, and now more surprised to learn that Rust has
    chosen to impose using a cast as part of its language rules. I

    I am not _sure_ but I believe Rust will in fact verify that all such
    casts are in fact legal, i.e. the data will fit in the target
    container.

    This is of course the total opposite of the C "nuclear option", and
    much more like other languages that try to be secure by default.

    understand that it has the support of community sentiment, but
    even so it seems like a poor choice here. I'm not a big fan of
    the new attribute syntax, but a form like

    return [[narrow]] s;

    looks to be a better way of asking Rust to allow what is a
    normally disallowed conversion. By contrast, using a cast is
    overkill. There is unnecessary redundancy, by specifying a type
    in two places, and the risk that they might get out of sync. And
    on general principles requiring a cast violates good security
    principles. If someone needs access to a particular room in a
    building, we don't hand over a master key that opens every room
    in the building. If someone needs to read some documents that
    have classified materials, we don't give them an access code that
    lets them read any sensitive material regardless of whether it's
    relevant. Maybe Rust is different, but in C a cast allows any
    conversion that is possible in the language, even the unsafe
    ones. It just seems wrong to use the nuclear option of casting
    for every minor infringement.

    I agree, if Rust did it like C, then it would be very unsafe indeed.

    I have not checked the generated asm, but I believe that if I write
    code like this:

    // x:u64
    x = 0x1234567890abcdef;
    let y:u8 = (x & 255) as u8;

    the compiler will see the mask and realize that the conversion is
    safe, so no need to interpose a

    cmp x,256
    jae trp_conversion

    idiom.

    Sounds like you and I are on the same page here.

    OTOH, I have seen C compilers that insist on such a test at the
    end of a fully saturated switch statement, even when the mask in
    front should prove that no other values are possible.

    Yeah, what's up with that? Even worse, when each of the branches
    has a return statement, sometimes there is a warning saying the
    end of the function can be reached without returning a value.
    That really annoys me.
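    (The pattern in question, as a sketch: the mask proves that only 0..3
    can reach the switch and every case returns, yet some compilers still
    want a return after it, or warn that control can fall off the end.)

    const char *quadrant(unsigned q)
    {
        switch (q & 3u) {
        case 0: return "NE";
        case 1: return "NW";
        case 2: return "SW";
        case 3: return "SE";
        }
        return "";   /* unreachable, but demanded by some compilers */
    }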

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Mon Mar 11 08:11:05 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Terje Mathisen wrote:

    Tim Rentsch wrote:

    I'm always somewhat surprised when someone advocates using a cast
    for such things, and now more surprised to learn that Rust has
    chosen to impose using a cast as part of its language rules. I

    I am not _sure_ but I believe Rust will in fact verify that all such
    casts are in fact legal, i.e. the data will fit in the target
    container.

    This is of course the total opposite of the C "nuclear option", and
    much more like other languages that try to be secure by default.

    Just to make sure I spoke to our resident Rust guru, and he told me I
    was wrong:

    Rust does have conversion operators/functions for downsizing
    variables, and they come with full validity checking, but using "s as
    u8" as I suggested will generate exactly the same code as a C
    "(uint8_t) s" idiom, i.e. no verification and no safety checks.

    Perfect. ;)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Mon Mar 11 09:02:56 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. [...]

    It isn't hard to write standard C code to determine whether a
    proposed addition or subtraction would overflow, and does so
    safely and reliably.

    Also efficiently and without resorting to implementation-
    defined or undefined behavior (and without needing a bigger
    type)?

    Heavens to Betsy! Are you impugning the quality and excellence
    of my code? Of *my* code? I can only hope that you are suitably
    chagrined and contrite. ;)

    It's a little bit tedious perhaps but not
    difficult. Checking code can be wrapped in an inline function
    and invoke whatever handling is desired, within reason.

    Maybe you could share such code?

    Rather than do that I will explain.

    An addition overflows if the two operands have the same sign and
    the sign of an operand is the opposite of the sign of the sum
    (taken mod the width of the operands). Convert the signed
    operands to their unsigned counterparts, and form the sum of the
    unsigned values. The sign is just the high-order bit in each
    case. Thus the overflow condition can be detected with a few
    bitwise xors and ands.

    Subtraction is similar except now overflow can occur only when
    the operands have different signs and the sign of the sum is
    the opposite of the sign of the first operand.

    The above description works for two's complement hardware where
    unsigned types have the same width as their corresponding signed
    types. I think for most people that's all they need. The three
    other possibilities are all doable with minor adjustments, and
    code appropriate to each particular implementation can be
    selected using C preprocessor conditional, as for example

    #if UINT_MAX > INT_MAX && INT_MIN == -INT_MAX - 1
    // this case is the one outlined above

    #elif UINT_MAX > INT_MAX && INT_MIN == -INT_MAX

    #elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX - 1

    #elif UINT_MAX == INT_MAX && INT_MIN == -INT_MAX

    Does that all make sense?


    The next question would be how to do the same for multiplication....

    Multiplication is a whole other ball game. First we need to
    consider only the widest types, because multiplication of narrower
    types can be carried out in a wider type and the resulting product
    value checked. Off the top of my head, for the widest types I would
    try converting to float or double, do a floating-point multiply,
    and do some trivial accepts and trivial rejects based on the
    exponent of the result. Any remaining cases would need more
    care, but probably (we hope!) there aren't many of those and they
    don't happen very often. So for what it's worth there is my
    first idea. Second idea is to compute a double-width product,
    or at least part of one, using standard multiple-precision
    arithmetic, and speed compare against the floating-point method.
    I better stop now or the ideas will probably get worse rather
    than better. :/

    Your floating point method is pretty bad, imho, since it can give
    you both false negatives and false positives, with no way to know
    for sure, except doing it all over again.

    I think you misunderstood what I was suggesting. The initial
    tests using floating point don't produce any false positives or
    false negatives. They may give a lot of "not sure" cases, but
    none of the other cases is ever wrong. It is only the "not sure"
    cases that need further investigation. If the FP multiplication
    is done using long double (10 byte IEEE), I'm pretty sure the
    results are essentially perfect with respect to a 64x64 multiply;
    that is, there are very few "not sure" cases, perhaps even zero.

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]

    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Mon Mar 11 20:10:15 2024
    On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 20/02/2024 07:31, Thomas Koenig wrote:

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which hopefully
    will be considered in the next J3 meeting, it can be found at
    https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.

    (Testing a new news server, my old one was decommissioned...)

    I was quite delighted, but also a little bit surprised that the
    proposal (somewhat modified) actually passed.

    Now, what's left is the people who do not want modular arithmetic,
    for a reason that I am unable to fathom. I guess they don't like multiplicative hashes...

    What do they like?
    To declare unsigned overflow UB? Or implementation defined? Or
    trapping?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Thomas Koenig on Mon Mar 11 18:01:31 2024
    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 20/02/2024 07:31, Thomas Koenig wrote:

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which hopefully
    will be considered in the next J3 meeting, it can be found at
    https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.

    (Testing a new news server, my old one was decommissioned...)

    I was quite delighted, but also a little bit surprised that the
    proposal (somewhat modified) actually passed.

    Now, what's left is the people who do not want modular arithmetic,
    for a reason that I am unable to fathom. I guess they don't like multiplicative hashes...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Mon Mar 11 18:19:19 2024
    On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 20/02/2024 07:31, Thomas Koenig wrote:

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which hopefully
    will be considered in the next J3 meeting, it can be found at
    https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.

    (Testing a new news server, my old one was decommissioned...)

    I was quite delighted, but also a little bit surprised that the
    proposal (somewhat modified) actually passed.

    Now, what's left is the people who do not want modular arithmetic,
    for a reason that I am unable to fathom. I guess they don't like
    multiplicative hashes...

    What do they like?
    To declare unsigned overflow UB? Or implementation defined? Or
    trapping?

    Illegal, hence an implementation would be free to trap or start
    World War III (with a bit of an expectation that compilers would
    trap when supplied with the right options).

    My expectation is different: It would then be treated like signed
    overflow, which is also illegal in Fortran. So, everybody will
    implement it as if it were modular 2^n anyway, plus start optimizing
    on the assumption that overflow cannot happen.

    And, since in Fortran, arrays can start at arbitrary lower bounds
    (and array can have a lower bound of -42 and an upper bound of -21,
    for example), the use of unsigned integers for array indices is
    somewhat less than in programming languages such as C or (I believe)
    Rust where they always start at zero.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Mon Mar 11 20:38:44 2024
    On Mon, 11 Mar 2024 18:19:19 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 20/02/2024 07:31, Thomas Koenig wrote:

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which
    hopefully will be considered in the next J3 meeting, it can be
    found at https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.

    (Testing a new news server, my old one was decommissioned...)

    I was quite delighted, but also a little bit surprised that the
    proposal (somewhat modified) actually passed.

    Now, what's left is the people who do not want modular arithmetic,
    for a reason that I am unable to fathom. I guess they don't like
    multiplicative hashes...

    What do they like?
    To declare unsigned overflow UB? Or implementation defined? Or
    trapping?

    Illegal, hence an implementation would be free to trap or start
    World War III (with a bit of an expectation that compilers would
    trap when supplied with the right options).


    So, speaking in C Standard language, UB.


    My expectation is different: It would then be treated like signed
    overflow, which is also illegal in Fortran. So, everybody will
    implement it as if it were modular 2^n anyway, plus start optimizing
    on the assumption that overflow cannot happen.


    Yes, I'd expect the same.

    And, since in Fortran, arrays can start at arbitrary lower bounds
    (and array can have a lower bound of -42 and an upper bound of -21,
    for example), the use of unsigned integers for array indices is
    somewhat less than in programming languages such as C or (I believe)
    Rust where they always start at zero.

    As discussed here just recently, there are good reasons to avoid
    'unsigned' array indices in performance-oriented programs running under
    IL32P64 or I32LP64 C environments. Everything else is preferable -
    int, ptrdiff_t, size_t. Now, opinions on which of the 3 is most
    preferable tend to vary.
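    (An illustration of the kind of issue meant here, as a sketch; the
    generated code of course depends on the compiler. Under I32LP64 a
    32-bit unsigned index must wrap modulo 2^32, so i + off has to be
    computed mod 2^32 and zero-extended on every iteration, whereas with
    int, whose overflow is UB, or a 64-bit size_t the compiler is free to
    use a 64-bit induction variable or fold the offset into the pointer.)

    void gather(double *dst, const double *src, unsigned n, unsigned off)
    {
        for (unsigned i = 0; i < n; i++)
            dst[i] = src[i + off];   /* i + off is defined to wrap mod 2^32 */
    }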

    What is the size of Fortran's default UNSIGNED ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Mon Mar 11 20:30:38 2024
    On 11/03/2024 19:10, Michael S wrote:
    On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 20/02/2024 07:31, Thomas Koenig wrote:

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which hopefully
    will be considered in the next J3 meeting, it can be found at
    https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.

    (Testing a new news server, my old one was decommissioned...)

    I was quite delighted, but also a little bit surprised that the
    proposal (somewhat modified) actually passed.

    Now, what's left is the people who do not want modular arithmetic,
    for a reason that I am unable to fathom. I guess they don't like
    multiplicative hashes...

    What do they like?
    To declare unsigned overflow UB? Or implementation defined? Or
    trapping?


    Speaking for myself only, I'd like a choice. For the most part, I
    consider overflow (signed or unsigned) to indicate an error in the code.
    The two appropriate actions then are a run-time error, or that the
    compiler assumes that such overflow does not happen for the purposes of optimisation. (Whether the choice of handling here is determined by the
    source code, or compiler options, or a combination of both, is another
    matter of choice.) And for some unusual code, I want overflow (signed
    or unsigned) to be defined behaviour - either wrapping, or with
    additional information about overflows or carries, or perhaps more
    special cases such as saturation.

    Basically, I think it is a mistake for a language to pick one kind of
    treatment and pretend that this is /the/ correct handling of overflow. Different circumstances could call for a variety of different behaviours.

    If I had to pick only one possible choice of treatment, then I suppose
    it would be wrapping - because sometimes you really need that in code,
    while UB based optimisation or run-time checking is just "nice to have"
    rather than essential. Standard C having UB for signed overflow and
    wrapping for unsigned overflow is a reasonable compromise when keeping
    things simple, but I'd prefer a choice.

    (Similarly, I'd prefer a choice regarding another favourite source of UB
    and complaints in C - that of type-based alias analysis. I'd sometimes
    like 16-bit, 32-bit and 64-bit types that could alias anything, and I'd sometimes like 8-bit types that did /not/ alias anything.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Mon Mar 11 21:29:02 2024
    On 2024-02-25, Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. [...]

    It isn't hard to write standard C code to determine whether a
    proposed addition or subtraction would overflow, and does so
    safely and reliably.

    Also efficiently and without resorting to implementation-
    defined or undefined behavior (and without needing a bigger
    type)?

    [...]

    Heavens to Betsy! Are you impugning the quality and excellence
    of my code? Of *my* code? I can only hope that you are suitably
    chagrined and contrite. ;)

    It's a little bit tedious perhaps but not
    difficult. Checking code can be wrapped in an inline function
    and invoke whatever handling is desired, within reason.

    Maybe you could share such code?

    Rather than do that I will explain.

    An addition overflows if the two operands have the same sign and
    the sign of an operand is the opposite of the sign of the sum
    (taken mod the width of the operands). Convert the signed
    operands to their unsigned counterparts, and form the sum of the
    unsigned values. The sign is just the high-order bit in each
    case. Thus the overflow condition can be detected with a few
    bitwise xors and ands.

    Subtraction is similar except now overflow can occur only when
    the operands have different signs and the sign of the sum is
    the opposite of the sign of the first operand.

    The above description works for two's complement hardware where
    unsigned types have the same width as their corresponding signed
    types. I think for most people that's all they need. The three
    other possibilities are all doable with minor adjustments, and
    code appropriate to each particular implementation can be
    selected using C preprocessor conditional, as for example

    ...

    but that's implementation-defined behavior, correct?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Mon Mar 11 21:43:19 2024
    On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 11 Mar 2024 18:19:19 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 20/02/2024 07:31, Thomas Koenig wrote:

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which
    hopefully will be considered in the next J3 meeting, it can be
    found at https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.

    (Testing a new news server, my old one was decommissioned...)

    I was quite delighted, but also a little bit surprised that the
    proposal (somewhat modified) actually passed.

    Now, what's left is the people who do not want modular arithmetic,
    for a reason that I am unable to fathom. I guess they don't like
    multiplicative hashes...

    What do they like?
    To declare unsigned overflow UB? Or implementation defined? Or
    trapping?

    Illegal, hence an implementation would be free to trap or start
    World War III (with a bit of an expectation that compilers would
    trap when supplied with the right options).


    So, speaking in C Standard language, UB.

    Yes, that would be the translation. In Fortran terms, it would
    violate a "shall" directive.



    My expectation is different: It would then be treated like signed
    overflow, which is also illegal in Fortran. So, everybody will
    implement it as if it were modular 2^n anyway, plus start optimizing
    on the assumption that overflow cannot happen.


    Yes, I'd expect the same.

    And, since in Fortran, arrays can start at arbitrary lower bounds
    (and array can have a lower bound of -42 and an upper bound of -21,
    for example), the use of unsigned integers for array indices is
    somewhat less than in programming languages such as C or (I believe)
    Rust where they always start at zero.

    As discussed here just recently, there are good reasons to avoid
    'unsigned' array indices in performance-oriented programs running under IL32P64 or I32LP64 C environments. Everything else is preferable -
    int, ptrdiff_t, size_t. Now, opinions on which of the 3 is most
    preferable, tend to vary.

    What is the size of Fortran's default UNSIGNED ?

    It is not yet in the language; a paper has been passed by J3,
    but it needs to be put to WG5, and WG5 has to agree that J3 should
    put it into the standard proper for Fortran 202y (202x just
    came out as Fortran 2023).

    But if it does go in, it is likely that it will have the same
    size as INTEGER, which is usually 32 bits.

    However, what I did put in the paper (and what the subsequent
    revision by a J3 subcommittee left in) is a prohibition against
    using unsigneds in a DO loop. The reason is semantics of
    negative strides.

    Currently, in Fortran, the number of iterations of the loop

    do i=m1,m2,m3
    ...
    end do

    is (m2-m1+m3)/m3 unless that value is negative, in which case it
    is zero (m3 defaults to 1 if it is not present).

    So,

    do i=1,3,-1

    will be executed zero times, as will

    do i=3,1

    Translating that into arithmetic with unsigned integers makes
    little sense, how many times should

    do i=1,3,4294967295

    be executed?
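    (To make the question concrete, here is the trip-count rule written
    out in C as a sketch, next to the same formula evaluated in 32-bit
    modular arithmetic.)

    #include <stdio.h>
    #include <stdint.h>

    /* Fortran's rule: max(0, (m2 - m1 + m3) / m3) in signed arithmetic */
    static int64_t trips_signed(int64_t m1, int64_t m2, int64_t m3)
    {
        int64_t t = (m2 - m1 + m3) / m3;
        return t < 0 ? 0 : t;
    }

    /* the same formula carried out modulo 2^32 */
    static uint32_t trips_u32(uint32_t m1, uint32_t m2, uint32_t m3)
    {
        return (m2 - m1 + m3) / m3;
    }

    int main(void)
    {
        printf("%lld\n", (long long)trips_signed(1, 3, -1));  /* 0 trips */
        /* 3 - 1 + 4294967295 wraps to 1, and 1 / 4294967295 is 0 trips,
           while evaluating the same expression without wraparound gives 1 */
        printf("%u\n", trips_u32(1u, 3u, 4294967295u));       /* 0 trips */
        return 0;
    }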

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Mon Mar 11 23:34:19 2024
    Thomas Koenig wrote:

    On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 11 Mar 2024 18:19:19 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    On 2024-03-11, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 11 Mar 2024 18:01:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Thomas Koenig <tkoenig@netcologne.de> schrieb:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 20/02/2024 07:31, Thomas Koenig wrote:

    Even further on the side: I wrote up a proposal for finally
    introducing a wrapping UNSIGNED type to Fortran, which
    hopefully will be considered in the next J3 meeting, it can be
    found at https://j3-fortran.org/doc/year/24/24-102.txt .

    In this proposal, I intended to forbid UNSIGNED variables in
    DO loops, especially for this sort of reason.

    (Testing a new news server, my old one was decommissioned...)

    I was quite delighted, but also a little bit surprised that the
    proposal (somewhat modified) actually passed.

    Now, what's left is the people who do not want modular arithmetic,
    for a reason that I am unable to fathom. I guess they don't like
    multiplicative hashes...

    What do they like?
    To declare unsigned overflow UB? Or implementation defined? Or
    trapping?

    Illegal, hence an implementation would be free to trap or start
    World War III (with a bit of an expectation that compilers would
    trap when supplied with the right options).


    So, speaking in C Standard language, UB.

    Yes, that would be the translation. In Fortran terms, it would
    violate a "shall" directive.



    My expectation is different: It would then be treated like signed
    overflow, which is also illegal in Fortran. So, everybody will
    implement it as if it were modular 2^n anyway, plus start optimizing
    on the assumption that overflow cannot happen.


    Yes, I'd expect the same.

    And, since in Fortran, arrays can start at arbitrary lower bounds
    (and array can have a lower bound of -42 and an upper bound of -21,
    for example), the use of unsigned integers for array indices is
    somewhat less than in programming languages such as C or (I believe)
    Rust where they always start at zero.

    As discussed here just recently, there are good reasons to avoid
    'unsigned' array indices in performance-oriented programs running under
    IL32P64 or I32LP64 C environments. Everything else is preferable -
    int, ptrdiff_t, size_t. Now, opinions on which of the 3 is most
    preferable, tend to vary.

    What is the size of Fortran's default UNSIGNED ?

    It is not yet in the language; a paper has been passed by J3,
    but it needs to be put to WG5, and WG5 has to agree that J3 should
    put it into the standard proper for Fortran 202y (202x just
    came out as Fortran 2023).

    But if it does go in, it is likely that it will have the same
    size as INTEGER, which is usually 32 bits.

    However, what I did put in the paper (and what the subsequent
    revision by a J3 subcommittee left in) is a prohibition against
    using unsigneds in a DO loop. The reason is semantics of
    negative strides.

    Currently, in Fortran, the number of iterations of the loop

    do i=m1,m2,m3
    ....
    end do

    is (m2-m1+m3)/m3 unless that value is negative, in which case it
    is zero (m3 defaults to 1 if it is not present).

    So,

    do i=1,3,-1

    will be executed zero times, as will

    do i=3,1

    Translating that into arithmetic with unsigned integers makes
    little sense, how many times should

    do i=1,3,4294967295

    be executed?

    3-1+4294967295 = 4294967297 // (m2-m1+m3)

    4294967297 / 4294967295 = 1.0000000004656612874161594750863

    So the loop should be executed one time. {{And yes I know 4294967297 ==
    0x1,0000,0001}} What would you expect on a 36-bit machine (2s-complement)
    where 4294967295 is representable naturally ??

    Do i = 1 incrementing by 4294967295 until i > 3 should be executed
    once. Certainly 1 is <= 3 in any numeric system, so it should be
    executed at least once. Certainly 1+4294967295 > 3 in any numeric
    system claiming to be algebraic, and additionally when registers are
    larger than 32 bits. So the loop should not be executed more than once.

    This happens naturally on a 64-bit machine, and on 64-bit machines
    which do not have word-width calculation instructions. If you use the
    word-width instructions you enter into UB or IB behavior.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Mar 12 07:03:24 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    However, what I did put in the paper (and what the subsequent
    revision by a J3 subcommittee left in) is a prohibition against
    using unsigneds in a DO loop. The reason is semantics of
    negative strides.

    Currently, in Fortran, the number of iterations of the loop

    do i=m1,m2,m3
    ....
    end do

    is (m2-m1+m3)/m3 unless that value is negative, in which case it
    is zero (m3 defaults to 1 if it is not present).

    So,

    do i=1,3,-1

    will be executed zero times, as will

    do i=3,1

    Translating that into arithmetic with unsigned integers makes
    little sense, how many times should

    do i=1,3,4294967295

    be executed?

    3-1+4294967295 = 4294967297 // (m2-m1+m3)

    4294967297 / 4294967295 = 1.0000000004656612874161594750863

    So the loop should be executed one time. {{And yes I know 4294967297 == 0x1,0000,0001}} What would you expect on a 36-bit machine (2s-complement) where 4294967295 is representable naturally ??

    Correct (of course).

    The same result would be expected for

    do i=1u,3u,-1u

    (assuming a u suffix for unsigned numbers).

    The problem is that this violates a Fortran basic assumption since
    FORTRAN 77, which is that DO loops can be zero-trip.

    This is a can of worms that I would like to leave unopened.

    Same goes for array slices. Even assuming that no negative
    indices are used, the slice a(1:3:-1) is zero-sized in Fortran,
    as is a(3:1) .

    For a(1u:3u:-1u) the same logic that you outlined above would apply,
    making it a slice with one element.

    Not going there :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Tue Mar 12 08:32:52 2024
    On 11/03/2024 22:29, Thomas Koenig wrote:
    On 2024-02-25, Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    The above description works for two's complement hardware where
    unsigned types have the same width as their corresponding signed
    types. I think for most people that's all they need. The three
    other possibilities are all doable with minor adjustments, and
    code appropriate to each particular implementation can be
    selected using C preprocessor conditional, as for example

    ...

    but that's implementation-defined behavior, correct?

    It is, as far as I understand it. (Tim knows these things better than I
    do, so if he finds something in the standards to contradict me, he's
    probably right.)

    For a signed integer type T in C, you have N value bits, P padding bits,
    and a single sign bit. The N value bits can hold values between 0 and
    2^N - 1. The sign bit can either indicate a negative value (sign and magnitude), -2^N (two's complement), or -(2^N - 1) (ones' complement).
    There will also be a corresponding unsigned type, which has at least N
    value bits and takes the same number of bytes, and has no sign bit. It
    can still have padding bits (except for unsigned char, which has no
    padding). But all non-negative integers that can be represented in the
    signed type have exactly the same significant bits in both the signed
    and unsigned types.

    So an implementation can have a 32-bit two's complement "int", and use
    that for unsigned types too, treating the MSB as a padding bit for
    unsigned usage. (Of course, doing so would be inefficient in use on
    most cpus - you'd have to keep masking off the padding bit when using
    the value, or somehow guarantee that it is never non-zero.)

    In such an implementation, converting an int to an unsigned int would
    mask off the top bit (fully defined by the standard), and converting
    from an unsigned int it to a signed int would leave it unchanged (implementation dependent in the standard). That means (int)(unsigned
    int)(-1) would be positive 0x7fffffff, not -1.


    Real-world systems, of course, all use two's complement types with no
    padding, and Tim's description will work fine.

    (And in C23, only two's complement representations will be allowed. I
    can't remember if padding bits are still allowed, however. And of
    course signed integer overflow is still UB.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Tue Mar 12 07:58:05 2024
    Michael S <already5chosen@yahoo.com> writes:
    As discussed here just recently, there are good reasons to avoid
    'unsigned' array indices in performance-oriented programs running under
    IL32P64 or I32LP64 C environments. Everything else is preferable -
    int, ptrdiff_t, size_t.

    If Fortran makes unsigned overflow illegal, Fortran compilers can
    perform the same shenanigans for unsigned that C compilers do for
    signed integers; so if signed int really is preferable because of
    these shenanigans, unsigned with the same shenanigans would be
    preferable, too.

    In general, I think that undersized ints (as in I32LP64 or IL32P64)
    are always at a disadvantage; some architectures (particularly ARM
    A64) go to extra lengths to compensate for these disadvantages, but I
    don't think that these measures eliminates it completely, and the
    additional effort in the architectures could have gone to better uses.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Mar 12 10:52:56 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]

    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    Signed mul is just a special case of unsigned mul, right?

    I.e. in case of a signed widening mul, you'd first extract the signs,
    convert the inputs to unsigned, then do the unsigned widening mul,
    before finally restoring the sign as the XOR of the input signs?

    There is a small gotcha if either of the inputs are of the 0x80000000
    form, i.e. MININT, but the naive iabs() conversion will do the right
    thing by leaving the input unchanged.

    At the other end there cannot be any issues since restoring a negative
    output sign cannot overflow/fail.
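    (A sketch of that recipe in C, written for 32x32->64 so it stays
    self-contained; the function name is made up. The MININT gotcha is
    handled by doing the negation in the unsigned type.)

    #include <stdint.h>

    /* signed widening multiply built from an unsigned one: strip the
       signs, multiply the magnitudes, reapply the XOR of the signs */
    static int64_t smul32_widen(int32_t a, int32_t b)
    {
        uint32_t ua = (a < 0) ? -(uint32_t)a : (uint32_t)a;  /* ok for INT32_MIN */
        uint32_t ub = (b < 0) ? -(uint32_t)b : (uint32_t)b;
        uint64_t up = (uint64_t)ua * ub;
        return ((a < 0) != (b < 0)) ? -(int64_t)up : (int64_t)up;
    }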

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Mar 12 17:13:38 2024
    Terje Mathisen wrote:

    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]

    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    Signed mul is just a special case of unsigned mul, right?

    The low order N bits are all the same while the higher order N bits
    are different; where N is operand size.

    I.e. in case of a signed widening mul, you'd first extract the signs,
    convert the inputs to unsigned, then do the unsigned widening mul,
    before finally restoring the sign as the XOR of the input signs?

    SW may have to do that, HW does not.

    There is a small gotcha if either of the inputs are of the 0x80000000
    form, i.e. MININT, but the naive iabs() conversion will do the right
    thing by leaving the input unchanged.

    At the other end there cannot be any issues since restoring a negative
    output sign cannot overflow/fail.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Tue Mar 12 11:03:35 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]

    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    Signed mul is just a special case of unsigned mul, right?

    I.e. in case of a signed widening mul, you'd first extract the signs,
    convert the inputs to unsigned, then do the unsigned widening mul,
    before finally restoring the sign as the XOR of the input signs?

    There is a small gotcha if either of the inputs are of the 0x80000000
    form, i.e. MININT, but the naive iabs() conversion will do the right
    thing by leaving the input unchanged.

    At the other end there cannot be any issues since restoring a negative
    output sign cannot overflow/fail.

    It isn't quite that simple. Some of what you describe has a risk
    of running afoul of implementation-defined behavior or undefined
    behavior (as for example abs( INT_MIN )). I'm pretty sure it's
    possible to avoid those pitfalls, but it requires a fair amount
    of care and careful thinking.

    Note that my goal is only to avoid the possibility of undefined
    behavior that comes from signed overflow. My approach is to safely
    determine whether the signed multiplication would overflow, and if
    it wouldn't then simply use signed arithmetic to get the result.
    I use unsigned types to determine the safety, and if it's safe then
    using signed types to get a result. For the current problem I don't
    care about widening, except as it might help to determine safety.
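    (One way to realize that plan, as a sketch; this is not Tim's code,
    and it pre-checks with a division on the magnitudes rather than
    forming a wide product.)

    #include <stdint.h>

    /* nonzero iff a * b would overflow int64_t; if it returns zero, the
       signed multiplication itself can then be done directly */
    static int smul64_would_overflow(int64_t a, int64_t b)
    {
        if (a == 0 || b == 0)
            return 0;
        uint64_t ua = (a < 0) ? -(uint64_t)a : (uint64_t)a;
        uint64_t ub = (b < 0) ? -(uint64_t)b : (uint64_t)b;
        /* a negative product may be one larger in magnitude than INT64_MAX */
        uint64_t limit = ((a < 0) != (b < 0)) ? (uint64_t)INT64_MAX + 1
                                              : (uint64_t)INT64_MAX;
        return ua > limit / ub;
    }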

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Tue Mar 12 18:38:20 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Michael S <already5chosen@yahoo.com> writes:
    As discussed here just recently, there are good reasons to avoid
    'unsigned' array indices in performance-oriented programs running under
    IL32P64 or I32LP64 C environments. Everything else is preferable -
    int, ptrdiff_t, size_t.

    If Fortran makes unsigned overflow illegal, Fortran compilers can
    perform the same shenanigans for unsigned that C compilers do for
    signed integers; so if signed int really is preferable because of
    these shenanigans, unsigned with the same shenanigans would be
    preferable, too.

    One problem is that, without 2^n modulo, something like a
    multiplicative hash would be illegal.

    People would do it anyway, ignoring the prohibition, because it
    is so useful, and subsequent hilarity will ensue.
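
    For instance, a Fibonacci-style multiplicative hash leans directly on
    the multiply wrapping modulo 2^32 (the constant is 2^32 divided by the
    golden ratio; the function name and shift parameter are just for
    illustration):

    #include <stdint.h>

    /* bits must be in 1..32; the multiply is expected to wrap mod 2^32 */
    static uint32_t fib_hash32(uint32_t x, unsigned bits)
    {
      return (uint32_t)(x * 2654435769u) >> (32 - bits);
    }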

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Tue Mar 12 11:51:21 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    On 2024-02-25, Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    Signed integer overflow is undefined behavior in C and prohibited
    in Fortran. Yet, there is no straightforward, standard-compliant
    way to check for signed overflow (and handle this appropriately)
    in either language. [...]

    It isn't hard to write standard C code to determine whether a
    proposed addition or subtraction would overflow, and does so
    safely and reliably.

    Also efficiently and without resorting to implementation-
    defined or undefined behavior (and without needing a bigger
    type)?

    [...]

    Heavens to Betsy! Are you impugning the quality and excellence
    of my code? Of *my* code? I can only hope that you are suitably
    chagrined and contrite. ;)

    It's a little bit tedious perhaps but not
    difficult. Checking code can be wrapped in an inline function
    and invoke whatever handling is desired, within reason.

    Maybe you could share such code?

    Rather that do that I will explain.

    An addition overflows if the two operands have the same sign and
    the sign of an operand is the opposite of the sign of the sum
    (taken mod the width of the operands). Convert the signed
    operands to their unsigned counterparts, and form the sum of the
    unsigned values. The sign is just the high-order bit in each
    case. Thus the overflow condition can be detected with a few
    bitwise xors and ands.

    Subtraction is similar except now overflow can occur only when
    the operands have different signs and the sign of the sum is
    the opposite of the sign of the first operand.

    The above description works for two's complement hardware where
    unsigned types have the same width as their corresponding signed
    types. I think for most people that's all they need. The three
    other possibilities are all doable with minor adjustments, and
    code appropriate to each particular implementation can be
    selected using C preprocessor conditional, as for example

    ...

    but that's implementation-defined behavior, correct?

    There is implementation-dependent behavior but there isn't any implementation-defined behavior. The result has to depend on the implementation because different implementations can imply
    different results, as for example whether the representation for
    signed integers uses two's complement or ones' complement.
    Roughly speaking the distinction is whether code is relying on an implementation choice other than the choice assumed. There is
    nothing wrong, for example, with code that holds the value of the
    character constant 'a' in a variable, as long as the code makes
    sure that there are no wrong assumptions about what specific
    value that is (as for example the wrong assumption that the
    expression c + "A" - "a" can be used to change a letter from
    lower case to upper case). The C standard doesn't clearly
    differentiate behavior /of the implementation/ and behavior /of
    the program/. I took your question to mean, Does the code resort
    to implementation-defined behavior so as to rely on an unreliable
    assumption, ie, the kind that can go wrong if a different implementation-defined choice is made? The answer is that the
    code does not rely on any such assumption. So strictly speaking
    the code does /involve/ implementation-defined choices (as indeed
    essentially all programs do). But it does not /depend/ on implementation-defined choices in any way that risks changing the
    correctness of its results.
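
    For concreteness, a sketch of the two's-complement case described
    earlier in this exchange (assuming unsigned int and int have the same
    width; the function names are only illustrative):

    #include <limits.h>
    #include <stdbool.h>

    static bool add_would_overflow(int a, int b)
    {
      unsigned ua = (unsigned)a, ub = (unsigned)b, us = ua + ub;
      unsigned sign = ~(UINT_MAX >> 1);         /* the sign-bit position */
      /* same sign in, and the sum's sign differs from it */
      return (~(ua ^ ub) & (ua ^ us) & sign) != 0;
    }

    static bool sub_would_overflow(int a, int b)
    {
      unsigned ua = (unsigned)a, ub = (unsigned)b, ud = ua - ub;
      unsigned sign = ~(UINT_MAX >> 1);
      /* different signs in, and the difference's sign differs from a's */
      return ((ua ^ ub) & (ua ^ ud) & sign) != 0;
    }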

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Mar 12 19:08:18 2024
    Thomas Koenig wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Michael S <already5chosen@yahoo.com> writes:
    As discussed here just recently, there are good reasons to avoid
    'unsigned' array indices in performance-oriented programs running under
    IL32P64 or I32LP64 C environments. Everything else is preferable -
    int, ptrdiff_t, size_t.

    If Fortran makes unsigned overflow illegal, Fortran compilers can
    perform the same shenanigans for unsigned that C compilers do for
    signed integers; so if signed int really is preferable because of
    these shenanigans, unsigned with the same shenanigans would be
    preferable, too.

    One problem is that, without 2^n modulo, something like a
    multiplicative hash would be illegal.

    In HW we can reverse the bit order of the fields at zero cost making
    hashes that "whiten" the data better.

    People would do it anyway, ignoring the prohibition, because it
    is so useful, and subsequent hilarity will ensue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Tue Mar 12 19:07:01 2024
    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]

    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    Signed mul is just a special case of unsigned mul, right?

    I.e. in case of a signed widening mul, you'd first extract the signs,
    convert the inputs to unsigned, then do the unsigned widening mul,
    before finally restoring the sign as the XOR of the input signs?

    There is a small gotcha if either of the inputs are of the 0x80000000
    form, i.e. MININT, but the naive iabs() conversion will do the right
    thing by leaving the input unchanged.

    At the other end there cannot be any issues since restoring a negative
    output sign cannot overflow/fail.

    It isn't quite that simple. Some of what you describe has a risk
    of running afoul of implementation-defined behavior or undefined
    behavior (as for example abs( INT_MIN )). I'm pretty sure it's
    possible to avoid those pitfalls, but it requires a fair amount
    of care and careful thinking.

    It would be supremely nice if we could go back in time before
    computers and reserve an integer encoding that represents the
    value of "there is no value here" and mandate if upon integer
    arithmetic.

    Note that my goal is only to avoid the possibility of undefined
    behavior that comes from signed overflow. My approach is to safely
    determine whether the signed multiplication would overflow, and if
    it wouldn't then simply use signed arithmetic to get the result.

    Double-width multiplication cannot overflow: n×n bits gives a 2n-bit
    product, and then ignoring the top n bits gives you your
    non-overflowing multiply.

    I use unsigned types to determine the safety, and if it's safe then
    use signed types to get a result. For the current problem I don't
    care about widening, except as it might help to determine safety.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Tue Mar 12 19:05:44 2024
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]
    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    Signed mul is just a special case of unsigned mul, right?

    I.e. in case of a signed widening mul, you'd first extract the signs,
    convert the inputs to unsigned, then do the unsigned widening mul,
    before finally restoring the sign as the XOR of the input signs?

    In Gforth we use:

    DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
      DCell res;

      res = UD2D(ummul (a, b));
      if (a < 0)
        res.hi -= b;
      if (b < 0)
        res.hi -= a;
      return res;
    }

    I have this technique from Andrew Haley. It relies on twos-complement representation.
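
    For anyone without Gforth's Cell/DCell machinery, the same fixup can be
    sketched with GCC/Clang's unsigned __int128 (a compiler extension, not
    standard C; the casts of the negative inputs rely on two's-complement
    wraparound exactly as above):

    #include <stdint.h>

    static unsigned __int128 smul128(int64_t a, int64_t b)
    {
      unsigned __int128 res = (unsigned __int128)(uint64_t)a * (uint64_t)b;
      if (a < 0) res -= (unsigned __int128)(uint64_t)b << 64;  /* res.hi -= b */
      if (b < 0) res -= (unsigned __int128)(uint64_t)a << 64;  /* res.hi -= a */
      return res;      /* bit pattern of the signed 128-bit product a*b */
    }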

    - anton

    Yeah, that's what Alpha does with UMULH.
    I'm still trying to figure out why it works.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Tue Mar 12 22:23:36 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]

    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    Signed mul is just a special case of unsigned mul, right?

    I.e. in case of a signed widening mul, you'd first extract the signs,
    convert the inputs to unsigned, then do the unsigned widening mul,
    before finally restoring the sign as the XOR of the input signs?

    In Gforth we use:

    DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
      DCell res;

      res = UD2D(ummul (a, b));
      if (a < 0)
        res.hi -= b;
      if (b < 0)
        res.hi -= a;
      return res;
    }

    I have this technique from Andrew Haley. It relies on twos-complement representation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Tue Mar 12 20:12:20 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]

    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    Signed mul is just a special case of unsigned mul, right?

    I.e. in case of a signed widening mul, you'd first extract the signs,
    convert the inputs to unsigned, then do the unsigned widening mul,
    before finally restoring the sign as the XOR of the input signs?

    There is a small gotcha if either of the inputs are of the 0x80000000
    form, i.e. MININT, but the naive iabs() conversion will do the right
    thing by leaving the input unchanged.

    At the other end there cannot be any issues since restoring a negative
    output sign cannot overflow/fail.

    It isn't quite that simple. Some of what you describe has a risk
    of running afoul of implementation-defined behavior or undefined
    behavior (as for example abs( INT_MIN )). I'm pretty sure it's
    possible to avoid those pitfalls, but it requires a fair amount
    of care and careful thinking.

    It would be supremely nice if we could go back in time before
    computers and reserve an integer encoding that represents the
    value of "there is no value here" and mandate if upon integer
    arithmetic.

    ISO C allows such an encoding, even for two's complement.

    Sadly it appears that the latest C standard will be taking
    away that allowance.

    Note that my goal is only to avoid the possibility of undefined
    behavior that comes from signed overflow. My approach is to safely
    determine whether the signed multiplication would overflow, and if
    it wouldn't then simply use signed arithmetic to get the result.

    Double-width multiplication cannot overflow: n×n bits gives a 2n-bit
    product, and then ignoring the top n bits gives you your
    non-overflowing multiply.

    C does not guarantee that. The point of the exercise is to
    write code assuming nothing more than what the C standard
    mandates.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to EricP on Tue Mar 12 20:23:32 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    Anton Ertl wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]

    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    Signed mul is just a special case of unsigned mul, right?

    I.e. in case of a signed widening mul, you'd first extract the
    signs, convert the inputs to unsigned, then do the unsigned
    widening mul, before finally restoring the sign as the XOR of the
    input signs?

    In Gforth we use:

    DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
      DCell res;

      res = UD2D(ummul (a, b));
      if (a < 0)
        res.hi -= b;
      if (b < 0)
        res.hi -= a;
      return res;
    }

    I have this technique from Andrew Haley. It relies on twos-complement
    representation.

    Yeah, that's what Alpha does with UMULH.
    I'm still trying to figure out why it works.

    It works because a sign bit works like a value bit
    with a weight of -2**(N-1), where N is the width of
    the memory holding the signed value. So instead
    of subtracting 2**(N-1) * b, assuming a is negative,
    we have instead added 2**(N-1) * b, so we need to
    subtract 2 * 2**(N-1) * b, or 2**N * b, which means
    subtracting b from the high order word of the result.
    And of course similarly for when b is negative.

    (Note that the above holds for two's complement, but
    not for ones' complement or signed magnitude.)
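
    A quick 8-bit check of that, assuming two's complement: take a = -3
    (bit pattern 0xFD, i.e. 253 unsigned) and b = 5. The unsigned product
    is 253*5 = 1265 = 0x04F1, in which the sign bit of a contributed
    +2**7 * 5 instead of -2**7 * 5. Subtracting 2 * 2**7 * 5 = 2**8 * 5,
    i.e. subtracting b from the high byte, gives 0x04F1 - 0x0500 = -15,
    which is indeed a*b.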

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Wed Mar 13 17:09:37 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
      DCell res;

      res = UD2D(ummul (a, b));
      if (a < 0)
        res.hi -= b;
      if (b < 0)
        res.hi -= a;
      return res;
    }

    I have this technique from Andrew Haley. It relies on twos-complement
    representation.

    - anton

    Yeah, that's what Alpha does with UMULH.
    I'm still trying to figure out why it works.

    Let's consider the case where a>=0 and b<0, and cells are 64 bits. ua
    is a interpreted as unsigned cell, and ub is b interpreted as unsigned
    cell. The following computations are in Z (the unlimited
    integers). For the case under consideration:

    ua=a
    ub=b+2^64

    res = ua*ub = a*(b+2^64)= a*b + a*2^64

    So,

    a*b = res - a*2^64

    The other cases are similar.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Mar 13 18:58:09 2024
    Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    Thomas Koenig wrote:

    However, what I did put in the paper (and what the subsequent
    revision by a J3 subcommittee left in) is a prohibition against
    using unsigneds in a DO loop. The reason is semantics of
    negative strides.

    Currently, in Fortran, the number of iterations of the loop

    do i=m1,m2,m3
    ....
    end do

    is (m2-m1+m3)/m3 unless that value is negative, in which case it
    is zero (m3 defaults to 1 if it is not present).

    So,

    do i=1,3,-1

    will be executed zero times, as will

    do i=3,1

    Translating that into arithmetic with unsigned integers makes
    little sense, how many times should

    do i=1,3,4294967295

    be executed?

    3-1+4294967295 = 4294967297 // (m2-m1+m3)

    4294967297 / 4294967295 = 1.0000000004656612874161594750863

    So the loop should be executed one time. {{And yes I know 4294967297 ==
    0x1,0000,0001}} What would you expect on a 36-bit machine (2s-complement)
    where 4294967295 is representable naturally ??

    Correct (of course).

    It seems to me that the problem is not using unsigned integers as
    DO LOOP indexes, the problem is there is no compiler error message
    from "4294967295 does not fit in integer container". Bringing the
    problem to the programmer. THEN everybody is free (under the above)
    to implement unsigned DO LOOPs.

    The same result would be expected for

    do i=1u,3u,-1u

    (assuming a u suffix for unsigned numbers).

    The problem is that this violates a Fortran basic assumption since
    FORTRAN 77, which is that DO loops can be zero-trip.

    (m2-m1+m3)/m3

    (3-1+(-1))/-1 = -1 and the loop should not be taken at all.
    BUT
    -1u = 4294967295
    Therefore:
    (3u-1u+(-1u))/-1u =
    (3-1+4294967295)/4294967295 = 1.0000000004656612874161594750863 again.

    Once again this is a problem only when a constant integer value
    cannot be precisely represented--and deserves a warning/error
    message instead of a complete ban.

    This is a can of worms that I would like to leave unopened.

    I understand why, I just think the fickle finger of fate should
    point at the constant rather than the type.

    Same goes for array slices. Even assuming that no negative
    indices are used, the slice a(1:3:-1) is zero-sized in Fortran,
    as is a(3:1) .

    For a(1u:3u:-1u) the same logic that you outlined above would apply,
    making it a slice with one element.

    Not going there :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Fri Mar 15 10:33:21 2024
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    If I really had to write a 64x64->128 MUL, with no widening MUL or
    MULH which returns the high half, then I would punt and do it using
    32-bit parts (all variables are u64): [...]

    I wrote some code along the same lines. A difference is you
    are considering unsigned multiplication, and I am considering
    signed multiplication.

    Signed mul is just a special case of unsigned mul, right?

    I.e. in case of a signed widening mul, you'd first extract the signs,
    convert the inputs to unsigned, then do the unsigned widening mul,
    before finally restoring the sign as the XOR of the input signs?

    In Gforth we use:

    DCell mmul (Cell a, Cell b) /* signed multiply, mixed precision */ {
      DCell res;

      res = UD2D(ummul (a, b));
      if (a < 0)
        res.hi -= b;
      if (b < 0)
        res.hi -= a;
      return res;
    }

    I have this technique from Andrew Haley. It relies on twos-complement representation.

    Nice!

    Subtracting out the contribution from having treated the sign bit as a
    positive part of the (unsigned) multiplication.

    Here you can probably schedule the fixup to happen in parallel with the
    actual multiplication:

    ;; inputs in r9 & r10, result in rdx:rax, rbx & rcx as scratch

    mov rax,r9 ;; All these can start in the first cycle
    mul r10
    mov rbx,r9 ;; The MOV can be handled by the renamer
    sar r9,63
    mov rcx,r10 ;; Ditto
    sar r10,63

    and rbx,r10 ;; Second set of ops: a masked by sign(b)
    and rcx,r9 ;; b masked by sign(a)

    add rbx,rcx ;; Third cycle

    sub rdx,rbx ;; Do a single adjustment as soon as the MUL finishes
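
    In C the same branch-free fixup might be sketched like this (assuming,
    as the usual compilers do, that >> on a negative int64_t shifts in
    copies of the sign bit; strictly that is implementation-defined):

    #include <stdint.h>

    /* hi is the upper half of the unsigned 64x64->128 product of a and b */
    static uint64_t fixup_high_half(uint64_t hi, int64_t a, int64_t b)
    {
      uint64_t mask_a = (uint64_t)(a >> 63);   /* all ones iff a < 0 */
      uint64_t mask_b = (uint64_t)(b >> 63);   /* all ones iff b < 0 */
      return hi - ((uint64_t)b & mask_a) - ((uint64_t)a & mask_b);
    }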

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Fri Mar 15 17:07:19 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Here you can probably schedule the fixup to happen in parallel with the actual multiplication:

    ;; inputs in r9 & r10, result in rdx:rax, rbx & rcx as scratch

    mov rax,r9 ;; All these can start in the first cycle
    mul r10
    mov rbx,r9 ;; The MOV can be handled by the renamer
    sar r9,63
    mov rcx,r10 ;; Ditto
    sar r10,63

    and rbx,r10 ;; Second set of ops: a masked by sign(b)
    and rcx,r9 ;; b masked by sign(a)

    add rbx,rcx ;; Third cycle

    sub rdx,rbx ;; Do a single adjustment as soon as the MUL finishes

    Of course on AMD64 you could just use imul instead.

    RISC-V also supports signed as well as unsigned (and also
    signed*unsigned) multiplication, and I think that's also the case for
    ARM A64. But on Alpha this technique would be useful.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)