• Re: My 66000 and High word facility

    From MitchAlsup1@21:1/5 to Brett on Sat Aug 10 18:49:35 2024
    On Sat, 10 Aug 2024 18:17:54 +0000, Brett wrote:


My 66000 should look into the z/Architecture High Word Facility, as that
would give you roughly 65% more registers. You have the opcode
space, and it is another nice boost for some customers.

The article posted by Andy Glew was lukewarm at best. Now, while
IBM has figured out that 16 GPRs are insufficient, there is scant
data that 32 are insufficient {witness how few RISCs went with
bigger files}.

Since My 66000 is a 64-bit architecture with a modicum of support for
8-bit, 16-bit, and 32-bit stuff, and since 32 true GPRs seem to be
enough (per compiler output), I think I will pass.

Due to ready access to constants, My 66000 with only 32 actual
registers performs as well as RISC-V does with 32I+32F in most codes,
so there does not seem to be an insufficient number of registers. I
even have ASM examples where RISC-V runs out of registers where My
66000 does not !! Not wasting registers to hold big immediates,
big displacements, or big addresses goes a long way toward thinning out
the register count necessities.

In My 66000 one can utilize all 32 registers, with none reserved for
{linking, splicing, GOT access, ...}. These "effective constants"
become actual constants, meaning one does not have to consume a
register to gain access through that constant address value.

    IBM supports Linux, so the compiler support should exist. X86 solved the aliasing issue with finer tracking.

    Neither of which would worry me.


    Thanks,
    Brett

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to All on Sat Aug 10 18:17:54 2024
My 66000 should look into the z/Architecture High Word Facility, as that
would give you roughly 65% more registers. You have the opcode space,
and it is another nice boost for some customers.

    IBM supports Linux, so the compiler support should exist. X86 solved the aliasing issue with finer tracking.

    Thanks,
    Brett

  • From MitchAlsup1@21:1/5 to Brett on Sat Aug 10 21:12:12 2024
    On Sat, 10 Aug 2024 18:17:54 +0000, Brett wrote:


My 66000 should look into the z/Architecture High Word Facility, as that
would give you roughly 65% more registers. You have the opcode
space, and it is another nice boost for some customers.

    IBM supports Linux, so the compiler support should exist. X86 solved the aliasing issue with finer tracking.

x86 (the x is lower case) solved the problem at great cost to various
implementations. The AMD *doze family could not perform forwarding of
these lower or upper portions of registers, and its performance suffered.
The high/low stuff makes it very difficult to do forwarding when the
clock cycle is less than 14 gates per cycle.
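The forwarding cost described here comes from the merge that a partial-register write implies; in C terms (a sketch of the dataflow, not any particular core's hardware):

```c
#include <stdint.h>

/* Writing x86's AH (bits 8..15 of RAX) is a read-modify-write of the
 * whole 64-bit register: the old value must be read, masked, and merged
 * with the new byte. That merge is the extra work a forwarding network
 * has to squeeze into the cycle, which hurts at ~14 gates per clock. */
static inline uint64_t write_ah(uint64_t rax, uint8_t ah) {
    return (rax & ~(uint64_t)0xFF00) | ((uint64_t)ah << 8);
}
```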

After x86 grew out of its 8-register-only enclave and went with 16
(later 32) GPRs, register pressure went down markedly.


  • From Brett@21:1/5 to mitchalsup@aol.com on Sun Aug 11 00:46:09 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 10 Aug 2024 18:17:54 +0000, Brett wrote:


My 66000 should look into the z/Architecture High Word Facility, as that
would give you roughly 65% more registers. You have the opcode
space, and it is another nice boost for some customers.

The article posted by Andy Glew was lukewarm at best. Now, while
IBM has figured out that 16 GPRs are insufficient, there is scant
data that 32 are insufficient {witness how few RISCs went with
bigger files}.

Since My 66000 is a 64-bit architecture with a modicum of support for
8-bit, 16-bit, and 32-bit stuff, and since 32 true GPRs seem to be
enough (per compiler output), I think I will pass.

Due to ready access to constants, My 66000 with only 32 actual
registers performs as well as RISC-V does with 32I+32F in most codes,
so there does not seem to be an insufficient number of registers. I
even have ASM examples where RISC-V runs out of registers where My
66000 does not !! Not wasting registers to hold big immediates,
big displacements, or big addresses goes a long way toward thinning out
the register count necessities.

In My 66000 one can utilize all 32 registers, with none reserved for
{linking, splicing, GOT access, ...}. These "effective constants"
become actual constants, meaning one does not have to consume a
register to gain access through that constant address value.

    These are excellent points and need to be in your marketing information.

Compilers love unrolling loops because it saves an instruction, which for a short loop could mean 10% faster. Point out that your code allows more unrolling, and the performance that brings.

I don’t know if you are in the 14-gate-delay market that makes high
registers a fail. I can’t find Andy Glew’s article on z/Architecture, but that arch has limited opcode space that imposes constraints you don’t face.

High registers would mostly be a marketing vaporware extension for you: see if
anyone cares, and put them on a list for when a market for that extension
pops up.

The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from. You would be happy to have control of a market that big. Point customers at a compiler configured for
64 registers and say that, with high registers and inline constants, that is
what they could expect for code generation.

If there is demand for high registers you will probably just spin a CPU
arch with more registers, but that will never happen if you never ask. This
is the definition of vaporware: a free market survey. You can even add
more registers as an incompatible extension; in fact you should.

    IBM supports Linux, so the compiler support should exist. X86 solved the
    aliasing issue with finer tracking.

    Neither of which would worry me.


    Thanks,
    Brett


  • From Thomas Koenig@21:1/5 to Brett on Sun Aug 11 08:33:47 2024
    Brett <ggtgp@yahoo.com> schrieb:

Compilers love unrolling loops because it saves an instruction, which for a short loop could mean 10% faster. Point out that your code allows more unrolling, and the performance that brings.

    If you want to look at what the compiler for My 66000 does, it
    can be found at https://github.com/bagel99/llvm-my66000 .
    Installation is a bit cumbersome, but manageable.

Speaking as somebody who neither designed the ISA nor wrote
the compiler port: the Virtual Vector Method makes unrolling
vectorized loops unprofitable; all you "gain" from unrolling those
is increased register pressure and code size. Having constants
in the instruction stream also reduces register pressure.
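The register-pressure cost of unrolling shows up even in a hand-unrolled reduction (a generic C sketch, not output of the My 66000 compiler): the 4x version keeps four partial sums live where the rolled loop keeps one.

```c
/* Rolled reduction: one live accumulator. */
double sum_rolled(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* 4x-unrolled reduction: four live accumulators (four more registers),
 * in exchange for fewer branches and a shorter dependence chain.
 * Assumes n is a multiple of 4, for brevity. */
double sum_unrolled4(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```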

    In the beginning, I had my doubts that 32 general registers which
    are also used for floating point are enough, but looking at
    generated code convinced me.

Unrolling in the presence of VVM is not that easy. Non-vectorizable
loops can still be profitable to unroll, as can outer loops.
But when working with an existing compiler which has assumptions
about currently available architectures baked in, this is quite
difficult.

  • From Anton Ertl@21:1/5 to Brett on Sun Aug 11 14:33:33 2024
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    The additional registers obviously did not give these architectures a
    decisive advantage.

When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
GPRs, but they decided not to. Apparently the benefits do not
outweigh the disadvantages.

    Where is your 4% number coming from?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Brett@21:1/5 to Anton Ertl on Sun Aug 11 17:48:21 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    All antiques no longer available.

    The additional registers obviously did not give these architectures a decisive advantage.

    When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
    GPRs, but they decided not to. Apparently the benefits do not
    outweigh the disadvantages.

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

    I used to be able to find better sources, but Google is full of garbage
    now.

  • From Niklas Holsti@21:1/5 to Anton Ertl on Sun Aug 11 20:53:42 2024
    On 2024-08-11 17:33, Anton Ertl wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    SPARC also has 32 separate floating-point registers, not windowed.

  • From Brett@21:1/5 to BGB on Mon Aug 12 02:23:00 2024
    BGB <cr88192@gmail.com> wrote:
    On 8/11/2024 9:33 AM, Anton Ertl wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    The additional registers obviously did not give these architectures a
    decisive advantage.

    When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
    GPRs, but they decided not to. Apparently the benefits do not
    outweigh the disadvantages.


    In my experience:
    For most normal code, the advantage of 64 GPRs is minimal;
    But, there is some code, where it does have an advantage.
    Mostly involving big loops with lots of variables.


    Sometimes, it is preferable to be able to map functions entirely to registers, and 64 does increase the probability of being able to do so (though, neither achieves 100% of functions; and functions which map
    entirely to GPRs with 32 will not see an advantage with 64).

Well, and to some extent the compiler needs to be selective about which functions it allows to use all of the registers, since in some cases a situation can come up where saving/restoring more registers in the prolog/epilog can cost more than the associated register spills.


Another benefit of 64 registers is more inlining, removing calls.

    A call can cause a significant amount of garbage code all around that call,
    as it splits your function and burns registers that would otherwise get
    used.

I can understand the reluctance to go to 6-bit register specifiers; it
burns up your opcode space and makes encoding everything more difficult.
But today that is an unserviced market which will get customers to give you
a look. Put out some vaporware and see what customers say.


But, I have noted that 32 GPRs can get clogged up pretty quickly when
using them for FP-SIMD and similar (if working with 128-bit vectors as register pairs), or otherwise when working with 128-bit data as pairs.

    Similarly, one can't fit a 4x4 matrix multiply entirely in 32 GPRs, but
    can in 64 GPRs. Where it takes 8 registers to hold a 4x4 Binary32
    matrix, and 16 registers to perform a matrix-transpose, ...

    Granted, arguably, doing a matrix-multiply directly in registers using
    SIMD ops is a bit niche (traditional option being to use scalar
    operations and fetch numbers from memory using "for()" loops, but this
    is slower). Most of the programs don't need fast MatMult though.
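BGB's register arithmetic (16 Binary32 values = 8 paired 64-bit registers per matrix, so two sources plus a product already want 24) refers to keeping the whole of something like the following in registers; this reference C version is the scalar fallback he mentions:

```c
/* Plain scalar 4x4 single-precision matrix multiply, row-major.
 * Each matrix is 16 floats = 8 register pairs if held as 64-bit GPR
 * pairs, which is why two sources plus the destination crowd a
 * 32-GPR file but fit comfortably in 64 GPRs. */
void mat4_mul(float c[16], const float a[16], const float b[16]) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = s;
        }
    }
}
```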



    Annoyingly, it has led to my ISA fragmenting into two variants:
    Baseline: Primarily 32 GPR, 16/32/64/96 encoding;
    Supports R32..R63 for only a subset of the ISA for 32-bit ops.
    For ops outside this subset, needs 64-bit encodings in these cases.
    XG2: Supports R32..R63 everywhere, but loses 16-bit ops.
    By itself, would be easier to decode than Baseline,
    as it drops a bunch of wonky edge cases.
    Though, some cases were dropped from Baseline when XG2 was added.
"Op40x2" was dropped as it was hairy and became mostly moot.

    Then, a common subset exists known as Fix32, which can be decoded in
    both Baseline and XG2 Mode, but only has access to R0..R31.


    Well, and a 3rd sub-variant:
    XG2RV: Uses XG2's encodings but RISC-V's register space.
    R0..R31 are X0..X31;
    R32..R63 are F0..F31.

Arguably the main use-case for XG2RV mode is for ASM blobs intended to be
called natively from RISC-V mode; but...

    It is debatable whether such an operating mode actually makes sense, and
    it might have made more sense to simply fake it in the ASM parser:
    ADD R24, R25, R26 //Uses BJX2 register numbering.
    ADD X14, X15, X16 //Uses RISC-V register remapping.

    Likely, as a sub-mode of either Baseline or XG2 Mode.
    Since, the register remapping scheme is known as part of the ISA spec,
    it could be done in the assembler.

    It is possible that XG2RV mode may eventually be dropped due to "lack of relevance".


    Well, and similarly any ABI thunks would need to be done in Baseline or
    XG2 mode, since neither RV mode nor XG2RV Mode has access to all the registers used for argument passing in BJX2.
    In this case, RISC-V mode only has ~ 26 GPRs (the remaining 6, X0..X5,
    being SPRs or CRs). In the RV modes R0/R4/R5/R14 are inaccessible.


Well, and likewise one wants to limit the number of inter-ISA branches,
as the branch-predictor can't predict these, and they need a full
pipeline flush (a few extra cycles are needed to make sure the L1 I$ is fetching in the correct mode). Technically also the L1 I$ needs to flush
any cache-lines which were fetched in a different mode (the I$ uses
internal tag-bits to figure out things like instruction length and bundling, and to try to help with superscalar in RV mode, *; mostly for timing/latency reasons, ...).


*: The way the BJX2 core deals with superscalar is essentially to
pretend as if RV64 had WEX flag bits, which can be synthesized partly
when fetching cache lines (putting some of the latency in the I$ Miss handling, rather than during instruction-fetch). In the ID stage, it
sees the longer PC step and infers that two instructions are being
decoded as superscalar.

    ...


    Where is your 4% number coming from?



I guess it could make sense, arguably, to come up with test cases
to get a quantitative measurement of the effect of 64 GPRs for programs which can make effective use of them...

    Would be kind of a pain to test as 64 GPR programs couldn't run on a
    kernel built in 32 GPR mode, but TKRA-GL runs most of its backend in kernel-space (and is the main thing in my case that seems to benefit
    from 64 GPRs).

    But, technically, a 32 GPR kernel couldn't run RISC-V programs either.


    So, would likely need to switch GLQuake and similar over to baseline
    mode (and probably messing with "timedemo").




    Checking, as-is, timedemo results for "demo1" are "969 frames 150.5
    seconds 6.4 fps", but this is with my experimental FP8U HDR mode (would
    be faster with RGB555 LDR), at 50 MHz.

    GLQuake, LDR RGB555 mode: "969 frames 119.0 seconds 8.1 fps".

    But, yeah, both are with builds that use 64 GPRs.


    Software Quake: "969 frames 147.4 seconds 6.6 fps"
    Software Quake (RV64G): "969 frames 157.3 seconds 6.2 fps"

    Not going to bother with GLQuake in RISC-V mode, would likely take a painfully long time.

    Well, decided to run this test anyways:
    "969 frames 687.3 seconds 1.4 fps"


    IOW: TKRA-GL runs horribly bad in RV64G mode (and not much can be done
    to make it fast within the limits of RV64G). Though, this is with it
    running GL entirely in RV64 mode (it might fare better as a userland application where the GL backend is running in kernel space in BJX2 mode).

    Though, much of this is likely due more to RV64G's lack of SIMD and
    similar, rather than due to having fewer GPRs.

  • From Terje Mathisen@21:1/5 to Brett on Mon Aug 12 08:22:11 2024
    Brett wrote:
    BGB <cr88192@gmail.com> wrote:
    On 8/11/2024 9:33 AM, Anton Ertl wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    The additional registers obviously did not give these architectures a
    decisive advantage.

    When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
    GPRs, but they decided not to. Apparently the benefits do not
    outweigh the disadvantages.


    In my experience:
    For most normal code, the advantage of 64 GPRs is minimal;
    But, there is some code, where it does have an advantage.
    Mostly involving big loops with lots of variables.


    Sometimes, it is preferable to be able to map functions entirely to
    registers, and 64 does increase the probability of being able to do so
    (though, neither achieves 100% of functions; and functions which map
    entirely to GPRs with 32 will not see an advantage with 64).

    Well, and to some extent the compiler needs to be selective about which
    functions it allows to use all of the registers, since in some cases a
    situation can come up where the saving/restoring more registers in the
    prolog/epilog can cost more than the associated register spills.


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around that call, as it splits your function and burns registers that would otherwise get
    used.

    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.
    But today that is an unserviced market which will get customers to give you
    a look. Put out some vapor ware and see what customers say.

The solution (?) has always looked obvious to me: some form of Huffman
encoding of register specifiers, so that the most common ones (bottom 16
or 32) require just a small amount of space (as today), and then either
a prefix or a suffix provides extra bits when you want to use those
higher register numbers. Mitch's CARRY sets up a single extra register
for a set of operations; a WIDE prefix could contain two extra register
bits for four registers over the next 2 or 3 instructions.
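One way to model the cost side of this idea (my own toy model, not any real encoding): charge the usual 5 bits for r0..r31 and one extra escape bit per specifier for r32..r63, then compare against a flat 6-bit scheme that taxes every specifier.

```c
/* Toy cost model for prefix-escaped register specifiers: low registers
 * keep their 5-bit encoding; high registers pay one extra prefix bit.
 * A flat 6-bit scheme pays the extra bit on every specifier. */
int prefix_bits(const int *regs, int n) {
    int bits = 0;
    for (int i = 0; i < n; i++)
        bits += (regs[i] < 32) ? 5 : 6;   /* +1 escape bit for r32..r63 */
    return bits;
}

int flat6_bits(int n) {
    return 6 * n;
}
```

Regular code that never touches the high registers pays nothing over today's 5-bit fields, which is the "zero cost for regular code" property Terje is after.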

    As long as this doesn't make the decoder a speed limiter, it would be
    zero cost for regular code and still quite cheap except for increasing
    code size by 33-50% for the inner loops of algorithms that need 64 or
    even 128 regs.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Brett on Mon Aug 12 06:29:36 2024
    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    All antiques no longer available.

    SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:

    |Fujitsu will also discontinue their SPARC production [...] end-of-sale
    |in 2029, of UNIX servers and a year later for their mainframe.

    No word of when Oracle will discontinue (or has discontinued) sales,
    but both companies introduced their last SPARC CPUs in 2017.

    In any case, my point still stands: these architectures were
    available, and the large number of registers failed to give them a
    decisive advantage. Maybe it even gave them a decisive disadvantage:
    AMD29K and IA-64 never had OoO implementations, and SPARC got them
    only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
    Power and Alpha switched in 1998 (POWER3, 21264).

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

Interesting. I only skimmed the paper, but I read a lot about
inlining and interprocedural register allocation. SPARC's register
windows and AMD29K's and IA-64's register stacks were intended to be
useful for that, but somehow the other architectures did not suffer a big-enough disadvantage to make them adopt one of these concepts, and
that's despite register windows/stacks working even for indirect calls
(e.g., method calls in the general case), where interprocedural
register allocation or inlining don't help.

    It seems to me that with OoO the cycle cost of spilling and refilling
    on call boundaries was lowered: the spills can be delayed until the
    computation is complete, and the refills can start early because the
    stack pointer tends to be available early.

And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
even if the called function is short, the spilling and refilling
around it (if any) does not increase the latency of the value that's
spilled and refilled. But that consideration is only relevant for
Intel APX; ARM A64 and RISC-V went for 32 registers several years
before zero-cycle store-to-load forwarding was implemented.

    One other optimization that they use the additional registers for is
    "register promotion", i.e., putting values from memory into registers
    for a while (if absence of aliasing can be proven). One interesting
    aspect here is that register promotion with 64 or 256 registers (RP-64
    and RP-256) is usually not much better (if better at all) than
    register promotion with 32 registers (RP-32); see Figure 1. So
    register promotion does not make a strong case for more registers,
    either, at least in this paper.
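The transformation the paper calls register promotion, written out by hand in C (a sketch; the paper's RP-32/RP-64 passes do this automatically once aliasing is disproven):

```c
/* Register promotion by hand: because 'restrict' promises that *total
 * does not alias a[], the accumulator can legally live in a register
 * for the whole loop instead of being loaded and stored every
 * iteration. */
void accumulate(long *restrict total, const long *restrict a, int n) {
    long r = *total;        /* promote the memory cell into a register */
    for (int i = 0; i < n; i++)
        r += a[i];
    *total = r;             /* write it back exactly once */
}
```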

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Aug 12 17:36:30 2024
    On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    All antiques no longer available.

    SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:

    |Fujitsu will also discontinue their SPARC production [...] end-of-sale
    |in 2029, of UNIX servers and a year later for their mainframe.

    No word of when Oracle will discontinue (or has discontinued) sales,
    but both companies introduced their last SPARC CPUs in 2017.

    In any case, my point still stands: these architectures were
    available, and the large number of registers failed to give them a
    decisive advantage. Maybe it even gave them a decisive disadvantage:
    AMD29K and IA-64 never had OoO implementations, and SPARC got them
    only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
    Power and Alpha switched in 1998 (POWER3, 21264).

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

    Interesting. I only skimmed the paper, but I read a lot about
    inlining and interprocedural register allocation. SPARCs register
    windows and AMD29K's and IA-64's register stacks were intended to be
    useful for that, but somehow the other architectures did not suffer a big-enough disadvantage to make them adopt one of these concepts, and
    that's despite register windows/stacks working even for indirect calls
    (e.g., method calls in the general case), where interprocedural
    register allocation or inlining don't help.

    It seems to me that with OoO the cycle cost of spilling and refilling
    on call boundaries was lowered: the spills can be delayed until the computation is complete, and the refills can start early because the
    stack pointer tends to be available early.

    And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
    even if the called function is short, the spilling and refilling
    around it (if any) does not increase the latency of the value that's
    spilled and refilled. But that consideration is only relevant for
Intel APX; ARM A64 and RISC-V went for 32 registers several years
    before zero-cycle store-to-load-forwarding was implemented.

    One other optimization that they use the additional registers for is "register promotion", i.e., putting values from memory into registers
    for a while (if absence of aliasing can be proven). One interesting
    aspect here is that register promotion with 64 or 256 registers (RP-64
    and RP-256) is usually not much better (if better at all) than
    register promotion with 32 registers (RP-32); see Figure 1. So
    register promotion does not make a strong case for more registers,
    either, at least in this paper.

With full access to constants, there is even less need to promote
addresses or immediates into registers, as you can simply poof up
any constant you want, whenever you want one.


  • From MitchAlsup1@21:1/5 to BGB on Mon Aug 12 20:12:59 2024
    On Mon, 12 Aug 2024 19:27:22 +0000, BGB wrote:

    On 8/12/2024 12:36 PM, MitchAlsup1 wrote:
    On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Brett <ggtgp@yahoo.com> writes:
    The lack of CPU’s with 64 registers is what makes for a market,
    that 4%
    that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
files to make good use of them.

    All antiques no longer available.

    SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:

    |Fujitsu will also discontinue their SPARC production [...] end-of-sale
    |in 2029, of UNIX servers and a year later for their mainframe.

    No word of when Oracle will discontinue (or has discontinued) sales,
    but both companies introduced their last SPARC CPUs in 2017.

    In any case, my point still stands: these architectures were
    available, and the large number of registers failed to give them a
    decisive advantage.  Maybe it even gave them a decisive disadvantage:
    AMD29K and IA-64 never had OoO implementations, and SPARC got them
    only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
    Power and Alpha switched in 1998 (POWER3, 21264).

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

    Interesting.  I only skimmed the paper, but I read a lot about
    inlining and interprocedural register allocation.  SPARCs register
    windows and AMD29K's and IA-64's register stacks were intended to be
    useful for that, but somehow the other architectures did not suffer a
    big-enough disadvantage to make them adopt one of these concepts, and
    that's despite register windows/stacks working even for indirect calls
    (e.g., method calls in the general case), where interprocedural
    register allocation or inlining don't help.

    It seems to me that with OoO the cycle cost of spilling and refilling
    on call boundaries was lowered: the spills can be delayed until the
    computation is complete, and the refills can start early because the
    stack pointer tends to be available early.

    And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
    even if the called function is short, the spilling and refilling
    around it (if any) does not increase the latency of the value that's
    spilled and refilled.  But that consideration is only relevant for
Intel APX; ARM A64 and RISC-V went for 32 registers several years
    before zero-cycle store-to-load-forwarding was implemented.

    One other optimization that they use the additional registers for is
    "register promotion", i.e., putting values from memory into registers
    for a while (if absence of aliasing can be proven).  One interesting
    aspect here is that register promotion with 64 or 256 registers (RP-64
    and RP-256) is usually not much better (if better at all) than
    register promotion with 32 registers (RP-32); see Figure 1.  So
    register promotion does not make a strong case for more registers,
    either, at least in this paper.

    With full access to constants, there is even less need to promote
    addresses or immediates into registers, as you can simply poof up
    any value you want.


    There are tradeoffs still, if constants need space to encode...

    Inline is still better than a memory load, granted.

    It may make sense to consolidate multiple uses of a value into a
    register rather than encoding it as an immediate each time.

    See polpak:: r8_erf()


    r8_erf: ; @r8_erf
    ; %bb.0:
    fabs r2,r1
    fcmp r3,r2,#0x3EF00000
    bngt r3,.LBB141_5
    ; %bb.1:
    fcmp r3,r2,#4
    bngt r3,.LBB141_6
    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E
    bnlt r3,.LBB141_7
    ; %bb.3:
    fmul r3,r1,r1
    fdiv r3,#1,r3
    mov r4,#0x3F90B4FB18B485C7
    fmac r4,r3,r4,#0x3FD38A78B9F065F6
    fadd r5,r3,#0x40048C54508800DB
    fmac r4,r3,r4,#0x3FD70FE40E2425B8
    fmac r5,r3,r5,#0x3FFDF79D6855F0AD
    fmac r4,r3,r4,#0x3FC0199D980A842F
    fmac r5,r3,r5,#0x3FE0E4993E122C39
    fmac r4,r3,r4,#0x3F9078448CD6C5B5
    fmac r5,r3,r5,#0x3FAEFC42917D7DE7
    fmac r4,r3,r4,#0x3F4595FD0D71E33C
    fmul r4,r3,r4
    fmac r3,r3,r5,#0x3F632147A014BAD1
    fdiv r3,r4,r3
    fadd r3,#0x3FE20DD750429B6D,-r3
    fdiv r3,r3,r2
    br .LBB141_4
    LBB141_5:
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E
    sra r2,r2,#8,#1
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322
    fmac r3,r2,r3,#0x400949FB3ED443E9
    fadd r4,r2,#0x403799EE342FB2DE
    fmac r3,r2,r3,#0x405C774E4D365DA3
    fmac r4,r2,r4,#0x406E80C9D57E55B8
    fmac r3,r2,r3,#0x407797C38897528B
    fmac r4,r2,r4,#0x40940A77529CADC8
    fmac r3,r2,r3,#0x40A912C1535D121A
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD
    fdiv r2,r1,r2
    mov r1,r2
    ret
    LBB141_6:
    mov r3,#0x3E571E703C5F5815
    fmac r3,r2,r3,#0x3FE20DD508EB103E
    fadd r4,r2,#0x402F7D66F486DED5
    fmac r3,r2,r3,#0x4021C42C35B8BC02
    fmac r4,r2,r4,#0x405D6C69B0FFCDE7
    fmac r3,r2,r3,#0x405087A0D1C420D0
    fmac r4,r2,r4,#0x4080C972E588749E
    fmac r3,r2,r3,#0x4072AA2986ABA462
    fmac r4,r2,r4,#0x4099558EECA29D27
    fmac r3,r2,r3,#0x408B8F9E262B9FA3
    fmac r4,r2,r4,#0x40A9B599356D1202
    fmac r3,r2,r3,#0x409AC030C15DC8D7
    fmac r4,r2,r4,#0x40B10A9E7CB10E86
    fmac r3,r2,r3,#0x40A0062821236F6B
    fmac r4,r2,r4,#0x40AADEBC3FC90DBD
    fmac r3,r2,r3,#0x4093395B7FD2FC8E
    fmac r4,r2,r4,#0x4093395B7FD35F61
    fdiv r3,r3,r4
    LBB141_4:
    fmul r4,r2,#16
    fmul r4,r4,#0x3D800000
    rnd r4,r4,#5
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4
    fmul r2,r2,-r5
    fexp r2,r2
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000
    fadd r2,r2,#0x3F000000
    pdlt r1,T
    fadd r2,#0,-r2
    mov r1,r2
    ret
    LBB141_7:
    fcmp r1,r1,#0
    sra r1,r1,#8,#1
    cvtsd r2,#-1
    cvtsd r3,#1
    mux r2,r1,r3,r2
    mov r1,r2
    ret

    All of the constants are used once!

    RISC-V takes 240 instructions and uses 342 words of
    memory {.text, .data, .rodata}

    My 66000 takes 85 instructions and uses 169 words of
    memory {.text, .data, .rodata}

  • From MitchAlsup1@21:1/5 to BGB on Mon Aug 12 22:35:14 2024
    On Mon, 12 Aug 2024 20:58:45 +0000, BGB wrote:

    On 8/12/2024 3:12 PM, MitchAlsup1 wrote:

    See polpak:: r8_erf()


    r8_erf:                                 ; @r8_erf
    <snip>

    All of the constants are used once!

    RISC-V takes 240 instructions and uses 342 words of
    memory {.text, .data, .rodata}

    My 66000 takes 85 instructions and uses 169 words of
    memory {.text, .data, .rodata}


    FWIW:
    FADD Rm, Imm64f, Rn //XG2 Only
    FADD Rm, Imm56f, Rn //

    And:
    FMUL Rm, Imm64f, Rn //XG2 Only
    FMUL Rm, Imm56f, Rn //


    Why don't you download polpak, compile it, and state how many
    instructions it takes and how many words of storage it takes ??

  • From MitchAlsup1@21:1/5 to BGB on Tue Aug 13 01:23:12 2024
    On Tue, 13 Aug 2024 0:34:55 +0000, BGB wrote:

    On 8/12/2024 5:35 PM, MitchAlsup1 wrote:
    On Mon, 12 Aug 2024 20:58:45 +0000, BGB wrote:

    On 8/12/2024 3:12 PM, MitchAlsup1 wrote:

    See polpak:: r8_erf()


    r8_erf:                                 ; @r8_erf
    <snip>

    Why don't you download polpak, compile it, and state how many
    instructions it takes and how many words of storage it takes ??

    Found what I assume you are talking about.

    Needed to add "polpak_test.c" as otherwise BGBCC lacks a main and prunes everything;
    Also needed to hack over some compiler holes related to "_Complex
    double" to get it to build;
    Also needed to stub over some library functions that were added in C99
    but missing in my C library.

    I only ask for r8_erf()

    <snip>

    As for "r8_erf()":

    <===

    r8_erf:
    <snip>

    I count 283 instructions compared to my 85, including the 104
    instructions it takes your compiler to get to the 1st instruction in
    My 66000 code !!

    In the middle I see much the same problems as RISC-V has:: while
    you have the ability to poof constants, you can't use them without
    wasting registers, and in general an inefficient FP instruction set
    {no FMAC, no sign control on operands, no transcendental
    instructions; though at least your FP compare-branches are not as
    poor as RISC-V's}.

    It is true that LLVM can unroll loops, and when the loop is
    consuming only constants Brian's compiler just emits the polynomial
    directly with nary a LD or ST, just constants as operands; whereas
    your compiler poofs constants into existence rather than forwarding
    them directly into execution. Every poof costs you an instruction;
    mine just costs instruction space, not pipeline delay.

    I think this demonstrates my point perfectly--universal constants inside
    a RISC instruction set is a BIG WIN.

    It also illustrates the fact that a RISC ISA needs a good compiler.

  • From MitchAlsup1@21:1/5 to BGB on Tue Aug 13 17:24:30 2024
    On Tue, 13 Aug 2024 3:50:04 +0000, BGB wrote:

    On 8/12/2024 8:23 PM, MitchAlsup1 wrote:
    On Tue, 13 Aug 2024 0:34:55 +0000, BGB wrote:

    On 8/12/2024 5:35 PM, MitchAlsup1 wrote:
    On Mon, 12 Aug 2024 20:58:45 +0000, BGB wrote:

    On 8/12/2024 3:12 PM, MitchAlsup1 wrote:

    See polpak:: r8_erf()


    r8_erf:                                 ; @r8_erf
    <snip>

    Why don't you download polpak, compile it, and state how many
    instructions it takes and how many words of storage it takes ??

    Found what I assume you are talking about.

    Needed to add "polpak_test.c" as otherwise BGBCC lacks a main and
    prunes everything;
    Also needed to hack over some compiler holes related to "_Complex
    double" to get it to build;
    Also needed to stub over some library functions that were added in C99
    but missing in my C library.

    I only ask for r8_erf()

    <snip>

    As for "r8_erf()":

    <===

    r8_erf:
    <snip>

    I count 283 instructions compared to my 85, including the 104
    instructions it takes your compiler to get to the 1st instruction in
    My 66000 code !!


    Yeah, this is a compiler issue...

    Why not sit down and code it in ASM to see what your ISA can really do?
    Feel free to use My 66000 code as an example.

    It might have been less if the code was like:
    static const double somearr[8]={ ... };

    But, this would still have used memory loads.
    Getting the constants into expressions would likely require using
    #define or similar...

    This is admittedly more how I would have imagined performance-oriented
    code to be written. Not so much with dynamically initialized arrays.

    That particular piece of code was originally written in FORTRAN,
    probably late 1960s or early 1970s, then ported to C a while back.

    <snip>

    But, as I will note, even with this general level of lackluster
    code generation, I have still been managing to often beat RV64G
    performance...

    Anybody claiming RISC-V has a good ISA should have their degree revoked.

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Aug 13 19:21:04 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Anybody claiming RISC-V has a good ISA should have their degree revoked.

    Interesting datapoint: In the GhostWrite paper, they say that 84.03%
    of the RISC-V instruction space are taken up.

    I could probably gather the same statistic for My 66000...

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Aug 13 20:41:10 2024
    On Tue, 13 Aug 2024 19:21:04 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Anybody claiming RISC-V has a good ISA should have their degree revoked.

    Interesting datapoint: In the GhostWrite paper, they say that 84.03%
    of the RISC-V instruction space are taken up.

    I could probably gather the same statistic for My 66000...

    Most of my groups have a bit under ½ of their space left.

    Major:: 22 of 64 left
    Mem:::: 32 of 64 left
    2-OP::: 33 of 64 left
    3-OP::: 4 of 8 left
    1-OP::: 56 of 64 left
    misc::: 9 of 16 left

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Aug 14 13:15:07 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 13 Aug 2024 19:21:04 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Anybody claiming RISC-V has a good ISA should have their degree revoked.

    Interesting datapoint: In the GhostWrite paper, they say that 84.03%
    of the RISC-V instruction space are taken up.

    I could probably gather the same statistic for My 66000...

    Most of my groups have a bit under ½ of their space left.

    Major:: 22 of 64 left
    Mem:::: 32 of 64 left
    2-OP::: 33 of 64 left
    3-OP::: 4 of 8 left
    1-OP::: 56 of 64 left
    misc::: 9 of 16 left

    Yep, but there are also gaps in there.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Aug 14 16:59:58 2024
    On Wed, 14 Aug 2024 13:15:07 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 13 Aug 2024 19:21:04 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Anybody claiming RISC-V has a good ISA should have their degree revoked.
    Interesting datapoint: In the GhostWrite paper, they say that 84.03%
    of the RISC-V instruction space are taken up.

    I could probably gather the same statistic for My 66000...

    Most of my groups have a bit under ½ of their space left.

    Major:: 22 of 64 left

    I forgot to mention that 6 of the taken are permanently reserved
    to prevent jumping into code and having anything execute,
    independent of whether --E is permitted. So only 36 of 64 are in
    use.

    Mem:::: 32 of 64 left
    2-OP::: 33 of 64 left
    3-OP::: 4 of 8 left
    1-OP::: 56 of 64 left
    misc::: 9 of 16 left

    Yep, but there are also gaps in there.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Aug 14 22:06:46 2024
    On Sun, 11 Aug 2024 14:33:33 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    The lack of CPUs with 64 registers is what makes for a market; that
    4% that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    SPARCs FPGA through UltraSPARC used 1 full cycle to access the
    windowed register file while MIPS, 88K, and early Alphas used 1/2
    cycle. So the SPARC architecture saddled them with an inherent
    disadvantage....

    AMD29K: IIRC a 128-register stack and 64 additional registers

    Similar issues.

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    Don't know for certain, but I would expect the same as above.

    The additional registers obviously did not give these architectures a decisive advantage.

    Captain Obvious strikes again

    Oh, and BTW, that 1/2 cycle of delay getting started should have cost
    ~5% IPC. But SAPRC never achieved high clock frequencies, nor did IA-64.

    When ARM designed A64, when the RISC-V people designed RISC-V, and
    when Intel designed APX, each of them had the opportinity to go for 64
    GPRs, but they decided not to. Apparently the benefits do not
    outweigh the disadvantages.

    Where is your 4% number coming from?

    - anton

  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Aug 14 22:26:27 2024
    On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

    Interesting. I only skimmed the paper, but I read a lot about
    inlining and interprocedural register allocation. SPARCs register
    windows and AMD29K's and IA-64's register stacks were intended to be
    useful for that, but somehow the other architectures did not suffer a big-enough disadvantage to make them adopt one of these concepts, and
    that's despite register windows/stacks working even for indirect calls
    (e.g., method calls in the general case), where interprocedural
    register allocation or inlining don't help.

    The problem of register-windows is when "you miss the cache",
    first you have to take the exception,
    then you have to blindly push an IN or pull an OUT with no knowledge
    of how many registers are in use (or several of them),
    then you have to return from the exception.

    So, you have two exception control transfers, and a blind copy of
    fixed sized data, loss of a few TLB entries, and loss of a few
    cache lines of data+instructions.

    Whereas MIPS, 88k, Alpha, RISC-V always "hit in the cache" so to
    speak.

    There was an old paper that stated the MIPS team had an optimizing
    compiler up and running, while the SPARC team bet on HW to compensate
    for their lack. History has chosen the non-SPARC path.

  • From MitchAlsup1@21:1/5 to Brett on Wed Aug 14 22:19:32 2024
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around
    that call, as it splits your function and burns registers that would
    otherwise get used.

    What I see around calls is MOV instructions grabbing arguments from
    the preserved registers and putting return values into the proper
    preserved register. Inlining does get rid of these MOVs, but what
    else ??

    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.

    I am on record as stating the proper number of bits in an
    instruction-specifier is 34-bits. This is after designing Mc88K ISA,
    doing 3 generations of SPARC chips, 7 years of x86-64, and Samsung
    GPU (and my own efforts). Making the registers 6-bits would increase
    that count to 36-bits.

    34-bits comes from having enough Entropy to encode what needs
    encoding and making careful data-driven choices on "what to put in
    and what to leave out" and finding a clever means to access
    vectorization and multi-precision calculations. Without both of
    those, 36 would likely be the best option for the 32-register
    variants.

  • From MitchAlsup1@21:1/5 to BGB on Wed Aug 14 22:43:07 2024
    On Wed, 14 Aug 2024 10:15:58 +0000, BGB wrote:

    On 8/13/2024 12:24 PM, MitchAlsup1 wrote:

    Assuming I use all of the ISA features that currently exist:

    r8_erf: ; @r8_erf
    MOV R4, R1
    FABS R1,R2
    FCMPGT 0x3780, R2 //Half
    BF .LBB141_5

    FCMPGT 0x4400, R2 //Half
    BF .LBB141_6

    FCMPGE 0x403A8B020C49BA5E, R2
    BT .LBB141_7

    FMUL R1, R1, R3
    FLDCH 0x3C00, R2
    FDIV R2, R3, R3
    MOV 0x3F90B4FB18B485C7, R4
    MOV 0x3FD38A78B9F065F6, R16
    FMAC R3, R16, R4, R4
    FADD R3, 0x40048C54508800DB, R5

    MOV 0x3FD70FE40E2425B8, R16
    FMAC R3, R16, R4, R4

    MOV 0x3FFDF79D6855F0AD, R16
    FMAC R3, R16, R5, R5

    MOV 0x3FC0199D980A842F, R16
    FMAC R3, R16, R4, R4
    MOV 0x3FE0E4993E122C39, R16
    FMAC R3, R16, R5, R5
    MOV 0x3F9078448CD6C5B5, R16
    FMAC R3, R16, R4, R4
    MOV 0x3FAEFC42917D7DE7, R16
    FMAC R3, R16, R5, R5
    MOV 0x3F4595FD0D71E33C, R16
    FMAC R3, R16, R4, R4

    FMUL R4,R3,R4
    MOV 0x3F632147A014BAD1, R16
    FMAC R5, R3, R16, R3
    FDIV R4, R3, R3
    FNEG R3, R3
    FADD R3, 0x3FE20DD750429B6D, R3
    FDIV R3, R2, R3
    BRA .LBB141_4
    LBB141_5:
    FMUL R1, R1, R3
    MOV 0, R4
    FCMPGT 0x3C9FFE5AB7E8AD5E, R2
    CSELT R3, R4, R2
    MOV 0x3FC7C7905A31C322, R3

    MOV 0x400949FB3ED443E9, R16
    fmac R2, R16, R3, R3
    FADD R2,#0x403799EE342FB2DE, R4

    MOV 0x405C774E4D365DA3, R16
    FMAC R2, R16, R3, R3
    MOV 0x406E80C9D57E55B8, R16
    FMAC R2, R16, R4, R4

    MOV 0x407797C38897528B, R16
    FMAC R2, R16, R3, R3
    MOV 0x40940A77529CADC8, R16
    FMAC R2, R16, R4, R4
    MOV 0x40A912C1535D121A, R16
    FMAC R2, R16, R3, R3

    FMUL R3, R1, R1
    MOV 0x40A63879423B87AD, R16
    FMAC R2, R16, R4, R2
    FDIV R1, R2, R2
    RTS

    LBB141_6:
    MOV 0x3E571E703C5F5815, R3
    fmac r3,r2,r3,#0x3FE20DD508EB103E
    fadd r4,r2,#0x402F7D66F486DED5
    fmac r3,r2,r3,#0x4021C42C35B8BC02
    fmac r4,r2,r4,#0x405D6C69B0FFCDE7
    fmac r3,r2,r3,#0x405087A0D1C420D0
    fmac r4,r2,r4,#0x4080C972E588749E
    fmac r3,r2,r3,#0x4072AA2986ABA462
    fmac r4,r2,r4,#0x4099558EECA29D27
    fmac r3,r2,r3,#0x408B8F9E262B9FA3
    fmac r4,r2,r4,#0x40A9B599356D1202
    fmac r3,r2,r3,#0x409AC030C15DC8D7
    fmac r4,r2,r4,#0x40B10A9E7CB10E86
    fmac r3,r2,r3,#0x40A0062821236F6B
    fmac r4,r2,r4,#0x40AADEBC3FC90DBD
    fmac r3,r2,r3,#0x4093395B7FD2FC8E
    fmac r4,r2,r4,#0x4093395B7FD35F61
    fdiv r3,r3,r4
    LBB141_4:
    FMUL R2, 0x40300000, R4
    FMUL R4, 0x3FB00000, R4
    FSTCI R4, R4
    FLDCI R4, R4
    FNEG R4, R6
    fadd R2, R6, R5
    fadd R2, R4, R2
    fmul R4, R6, R4
    fexp r4,r4 //?

    fmul R2,R7, R2
    fexp r2,r2
    fmul R4, R2, R2
    FNEG R2, R2
    fmac r2,r2,r3,#0x3F000000
    fadd r2,r2,#0x3F000000
    pdlt r1,T //?
    fadd r2,#0,-r2
    RTS
    LBB141_7:
    FLDCH 0xBC00, R2
    FLDCH 0x3C00, R3
    FCMPGT 0, R1
    CSELT R2,R3,R2
    RTS

    Not bad: I count 101 instructions and 183 words of memory.
    {{I checked nothing}}

  • From Brett@21:1/5 to mitchalsup@aol.com on Thu Aug 15 00:36:57 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around
    that call, as it splits your function and burns registers that would
    otherwise get used.

    What I see around calls is MOV instructions grabbing arguments from
    the preserved registers and putting return values into the proper
    preserved register. Inlining does get rid of these MOVs, but what
    else ??

    For middling functions, I spent my time optimizing heavy code, the 10% that matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count; with twice the registers the
    compiler will generate far larger leaf calls with less call depth,
    which removes more of those MOVs.

    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.

    I am on record as stating the proper number of bits in an
    instruction-specifier is 34-bits. This is after designing Mc88K ISA,
    doing 3 generations of SPARC chips, 7 years of x86-64, and Samsung
    GPU (and my own efforts). Making the registers 6-bits would increase
    that count to 36-bits.

    My 66000 hurts less with 6-bit specifiers, as more constant bits get
    moved to extension words, which is almost free by most metrics.

    Only My 66000 can reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    All the customer will see is more registers, more performance, on top of
    all your other substantial improvements.

    34-bits comes from having enough Entropy to encode what needs
    encoding and making careful data-driven choices on "what to put in
    and what to leave out" and finding a clever means to access
    vectorization and multi-precision calculations. Without both of
    those, 36 would likely be the best option for the 32-register
    variants.

  • From Brett@21:1/5 to Brett on Thu Aug 15 00:54:15 2024
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around
    that call, as it splits your function and burns registers that would
    otherwise get used.

    What I see around calls is MOV instructions grabbing arguments from the
    preserved registers and putting return values into the proper preserved
    register. Inlining does get rid of these MOVs, but what else ??

    For middling functions, I spent my time optimizing heavy code, the 10% that matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count; with twice the registers the
    compiler will generate far larger leaf calls with less call depth,
    which removes more of those MOVs.

    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.

    I am on record as stating the proper number of bits in an
    instruction-specifier is 34-bits. This is after designing Mc88K ISA,
    doing 3 generations of SPARC chips, 7 years of x86-64, and Samsung
    GPU (and my own efforts). Making the registers 6-bits would increase
    that count to 36-bits.

    My 66000 hurts less with 6-bit specifiers, as more constant bits get
    moved to extension words, which is almost free by most metrics.

    Only My 66000 can reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    All the customer will see is more registers, more performance, on top of
    all your other substantial improvements.

    34-bits comes from having enough Entropy to encode what needs
    encoding and making careful data-driven choices on "what to put in
    and what to leave out" and finding a clever means to access
    vectorization and multi-precision calculations. Without both of
    those, 36 would likely be the best option for the 32-register
    variants.

  • From Stephen Fuld@21:1/5 to Brett on Wed Aug 14 22:21:28 2024
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around
    that call, as it splits your function and burns registers that would
    otherwise get used.

    What I see around calls is MOV instructions grabbing arguments from
    the preserved registers and putting return values into the proper
    preserved register. Inlining does get rid of these MOVs, but what
    else ??

    For middling functions, I spent my time optimizing heavy code, the
    10% that matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count; with twice the registers the
    compiler will generate far larger leaf calls with less call depth,
    which removes more of those MOVs.

    I can understand the reluctance to go to 6 bit register specifiers,
    it burns up your opcode space and makes encoding everything more
    difficult.
    I am on record as stating the proper number of bits in an
    instruction-specifier is 34-bits. This is after designing Mc88K ISA,
    doing 3 generations of SPARC chips, 7 years of x86-64, and Samsung
    GPU (and my own efforts). Making the registers 6-bits would increase
    that count to 36-bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

    Only My 66000 can reasonably be able to implement 6-bits register
    specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Also longer context switch times, as more registers to save/restore.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Aug 15 08:45:30 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 11 Aug 2024 14:33:33 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    The lack of CPUs with 64 registers is what makes for a market; that
    4% that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    SPARCs FPGA through UltraSPARC used 1 full cycle to access the
    windowed register file while MIPS, 88K, and early Alphas used 1/2
    cycle.

    Maybe.  Obviously did not prevent them from having ALU instructions
    with one-cycle latency and loads with 2-cycle latency in the early
    implementations, just like MIPS R2000.  And the clock rate of the
    SPARC MB86900 (14.28MHz) is not worse than the clock rate of the MIPS
    R2000 (8.3, 12.5, and 15MHz grades), and that despite having the
    interlocks that MIPS were so proud of not having.

    Oh, and BTW, that 1/2 cycle of delay getting started should have cost
    ~5% IPC. But SAPRC never achieved high clock frequencies, nor did IA-64.

    As mentioned above, the clock rate was competitive with the early
    MIPS. If we look at more recent times, the in-order UltraSPARC IV+
    (90nm) achieved 2100MHz in 2007; Intel sold 3GHz 65nm Core 2 Duo E6850
    at the time, so the UltraSPARC IV+ was not that far off. This
    undermines my theory that in-order designs have problems achieving
    high clock rates.

    Going for OoO implementations, the Fujitsu SPARC64 V+ (90nm) was
    shipped in 2004 with 1.89GHz and in 2006 with 2.16GHz.  AMD shipped
    the 2.2GHz Athlon 64 3500+ (90nm) in 2004 and a 2.4GHz 90nm version in
    2006, so the SPARC64 V+ was not far off.

    Fujitsu continued their line until the 4.25GHz SPARC64 XII in 2017.
    For comparison: AMD released the Ryzen 1800X in 2017 and that
    supposedly can turbo up to 4GHz (but when I just measured it (with 1
core loaded), it achieved <3.7GHz). Intel sold the Core i7-8700K
    starting on Oct 5, 2017, which achieved 4.7GHz.

    Oracle released the 5000MHz SPARC M8 in 2017.

    Maybe SAPCR (sic!) did not achieve high clock rates, but SPARC did.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Thu Aug 15 17:05:48 2024
    On Thu, 15 Aug 2024 08:45:30 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 11 Aug 2024 14:33:33 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    The lack of CPU’s with 64 registers is what makes for a market,
    that 4% that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    SPARCs FPGA through UltraSPARC used 1 full cycle to access the
windowed register file while MIPS, 88K, and early Alphas used 1/2
    cycle.

Maybe. Obviously that did not prevent them from having ALU instructions
with one-cycle latency and loads with two-cycle latency in the early implementations, just like the MIPS R2000. And the clock rate of the
    SPARC MB86900 (14.28MHz) is not worse than the clock rate of the MIPS
    R2000 (8.3, 12.5, and 15MHz grades), and that despite having the
    interlocks that MIPS were so proud of not having.

    Oh, and BTW, that 1/2 cycle of delay getting started should have cost
    ~5% IPC. But SAPRC never achieved high clock frequencies nor dis
    IA-64.

As mentioned above, the clock rate was competitive with the early
    MIPS. If we look at more recent times, the in-order UltraSPARC IV+
    (90nm) achieved 2100MHz in 2007; Intel sold 3GHz 65nm Core 2 Duo E6850
    at the time, so the UltraSPARC IV+ was not that far off.

    Even more so 12 years earlier:
UltraSPARC - 200 MHz
    PPro - 200 MHz
    R10K - 195 MHz
    PA-RISC 8000 - 180 MHz, but few months later and much pricier

    This
    undermines my theory that in-order designs have problems achieving
    high clock rates.


    POWER6 (same year) is much heavier blow to your theory.


    Going for OoO implementations, the Fujitsu SPARC64 V+ (90nm) was
shipped in 2004 with 1.89GHz and in 2006 with 2.16GHz. AMD shipped
    the 2.2GHz Athlon 64 3500+ (90nm) in 2004 and a 2.4GHz 90nm version in
    2006, so the SPARC64 V+ was not far off.

    Fujitsu continued their line until the 4.25GHz SPARC64 XII in 2017.
    For comparison: AMD released the Ryzen 1800X in 2017 and that
    supposedly can turbo up to 4GHz (but when I just measured it (with 1
core loaded), it achieved <3.7GHz). Intel sold the Core i7-8700K
    starting on Oct 5, 2017, which achieved 4.7GHz.

    Oracle released the 5000MHz SPARC M8 in 2017.

    Maybe SAPCR (sic!) did not achieve high clock rates, but SPARC did.

    - anton

    Was not Mitch himself involved in design of hyperSPARC that eventually
    reached very respectable clock frequency?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Michael S on Thu Aug 15 10:14:21 2024
    On 8/15/2024 9:33 AM, Michael S wrote:
    On Wed, 14 Aug 2024 22:06:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    SPARCs FPGA <snip>

    F - Fujitsu (?)
    P - ???
    G - gate
    A - array

    Half right. Field Programmable Gate Array. I.E. a "gate array" that
    can be programmed in the field, as opposed to the factory.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Aug 15 19:33:05 2024
    On Wed, 14 Aug 2024 22:06:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    SPARCs FPGA <snip>

    F - Fujitsu (?)
    P - ???
    G - gate
    A - array

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Thu Aug 15 18:04:08 2024
    On Thu, 15 Aug 2024 14:05:48 +0000, Michael S wrote:

    On Thu, 15 Aug 2024 08:45:30 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    Was not Mitch himself involved in design of hyperSPARC that eventually reached very respectable clock frequency?

    We got HyperSPARC up to 200 MHz and had a 250 MHz version in debug.

    This was 20%-25% slower that the competition on the "average" SPARC
    workload, but for some reason the "wall street traders" bought ship-
    loads of them as they were somewhat faster than SuperSPARC or
    UltraSPARC on that kind of workload--where milliseconds faster
    means millions of dollars.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Stephen Fuld on Thu Aug 15 23:54:42 2024
    On Thu, 15 Aug 2024 10:14:21 -0700
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 8/15/2024 9:33 AM, Michael S wrote:
    On Wed, 14 Aug 2024 22:06:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    SPARCs FPGA <snip>

    F - Fujitsu (?)
    P - ???
    G - gate
    A - array

    Half right. Field Programmable Gate Array. I.E. a "gate array" that
    can be programmed in the field, as opposed to the factory.




    Don't you think that if I am asking then I have reasons to think that
    Mitch didn't mean "Field Programmable" ?

    BTW, logic (HDL) design of FPGA-based embedded systems is part of what
I have been doing for a living for the last 25 years.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Thu Aug 15 21:10:54 2024
    On Thu, 15 Aug 2024 20:54:42 +0000, Michael S wrote:

    On Thu, 15 Aug 2024 10:14:21 -0700
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 8/15/2024 9:33 AM, Michael S wrote:
    On Wed, 14 Aug 2024 22:06:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    SPARCs FPGA <snip>

    F - Fujitsu (?)
    P - ???
    G - gate
    A - array

    Half right. Field Programmable Gate Array. I.E. a "gate array" that
    can be programmed in the field, as opposed to the factory.




    Don't you think that if I am asking then I have reasons to think that
    Mitch didn't mean "Field Programmable" ?

    I could have been misremembering the ASIC SPARC instead of the FPGA
    SPARC.

    BTW, logic (HDL) design of FPGA-based embedded systems is part of what
I have been doing for a living for the last 25 years.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Stephen Fuld on Fri Aug 16 04:30:54 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

A call can cause a significant amount of garbage code all around that call,
as it splits your function and burns registers that would otherwise get used.

What I see around calls is MOV instructions grabbing arguments from the preserved registers and putting return values into the proper preserved register. Inlining does get rid of these MOVs, but what else ??

For middling functions, I spent my time optimizing heavy code, the 10% that matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

Inlining is limited by register count; with twice the registers the
compiler will generate far larger leaf calls with less call depth, which removes more of those MOVs.

I can understand the reluctance to go to 6-bit register specifiers; it burns up your opcode space and makes encoding everything more difficult.

I am on record as stating the proper number of bits in an instruction-specifier is 34 bits. This is after designing the Mc88K ISA, doing 3 generations of SPARC chips, 7 years of x86-64, and the Samsung GPU (and my own efforts). Making the registers 6 bits would increase that count to 36 bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

Only My 66000 could reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Actually due to the removal of MOVs and reloads the code density may be basically the same.

    Also longer context switch times, as more registers to save/restore.

The save should be free, as the load from RAM is so slow.
    If the context is time critical it should be written to use the registers
    that are reloaded first, first. In which case the code could start doing
    work in the same amount of time regardless of register count. (I doubt the
    CPU design is actually that smart, or that the people that program the interrupts are.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Brett on Fri Aug 16 18:33:36 2024
    Brett <ggtgp@yahoo.com> wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

A call can cause a significant amount of garbage code all around that call,
as it splits your function and burns registers that would otherwise get used.

What I see around calls is MOV instructions grabbing arguments from the preserved registers and putting return values into the proper preserved register. Inlining does get rid of these MOVs, but what else ??

    For middling functions, I spent my time optimizing heavy code, the 10% that
    matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count, with twice the registers the
compiler will generate far larger leaf calls with less call depth, which removes more of those MOVs.

I can understand the reluctance to go to 6-bit register specifiers; it burns up your opcode space and makes encoding everything more difficult.

I am on record as stating the proper number of bits in an instruction-specifier is 34 bits. This is after designing the Mc88K ISA, doing 3 generations of SPARC chips, 7 years of x86-64, and the Samsung GPU (and my own efforts). Making the registers 6 bits would increase that count to 36 bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

Only My 66000 could reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Actually due to the removal of MOVs and reloads the code density may be basically the same.

    Also longer context switch times, as more registers to save/restore.

The save should be free, as the load from RAM is so slow.
    If the context is time critical it should be written to use the registers that are reloaded first, first. In which case the code could start doing
    work in the same amount of time regardless of register count. (I doubt the CPU design is actually that smart, or that the people that program the interrupts are.)

    When I wrote that I was thinking of visible registers, rename messes that
    up…

    But an interrupt does not need a full register set state to start up, so my comment is valid after all.

    One might need to change how one writes interrupt code, have not done that much, and it was 20 years ago.

    Brett

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Fri Aug 16 18:50:05 2024
    On Fri, 16 Aug 2024 4:30:54 +0000, Brett wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

A call can cause a significant amount of garbage code all around that call,
as it splits your function and burns registers that would otherwise get used.

What I see around calls is MOV instructions grabbing arguments from the preserved registers and putting return values into the proper preserved register. Inlining does get rid of these MOVs, but what else ??

For middling functions, I spent my time optimizing heavy code, the 10% that
    matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count, with twice the registers the
compiler will generate far larger leaf calls with less call depth, which removes more of those MOVs.

I can understand the reluctance to go to 6-bit register specifiers; it burns up your opcode space and makes encoding everything more difficult.

I am on record as stating the proper number of bits in an instruction-specifier is 34 bits. This is after designing the Mc88K ISA, doing 3 generations of SPARC chips, 7 years of x86-64, and the Samsung GPU (and my own efforts). Making the registers 6 bits would increase that count to 36 bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

Only My 66000 could reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Actually due to the removal of MOVs and reloads the code density may be basically the same.

    Anytime one removes more "MOVs and saves and restore" instructions
    than the called subroutine contains within the prologue and epilogue
    bounds, the subroutine should be inlined.

    Also longer context switch times, as more registers to save/restore.

The save should be free, as the load from RAM is so slow.

    When HW is doing the saves, the saves can be performed while
    waiting for the first instruction to arrive and for the first
    registers to arrive. Thus, done in HW, the saves are essentially
    free.

    If the context is time critical it should be written to use the
    registers that are reloaded first, first. In which case the code
    could start doing work in the same amount of time regardless of
    register count. (I doubt the CPU design is actually that smart,
    or that the people that program the interrupts are.)

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    For example a 1-wide machine with a 4-ported register file,
    generally operated as 3R1W can be switched to 4R or 4W for
    epilogue or prologue uses respectively. Simulation indicates
    this gets rid of 47% of the cycles spent in prologue and
    epilogue (combined compared to a sequence of stores and loads)
    Simulation also indicates that 42% of the power is saved--
    mainly from Tag and TLB non-access cycles.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sat Aug 17 00:24:24 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 16 Aug 2024 4:30:54 +0000, Brett wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


Another benefit of 64 registers is more inlining removing calls.
A call can cause a significant amount of garbage code all around that call,
as it splits your function and burns registers that would otherwise get used.

What I see around calls is MOV instructions grabbing arguments from the preserved registers and putting return values into the proper preserved register. Inlining does get rid of these MOVs, but what else ??

For middling functions, I spent my time optimizing heavy code, the 10% that
    matters.

The first half of a big function will have some state that has to be reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count, with twice the registers the
compiler will generate far larger leaf calls with less call depth, which removes more of those MOVs.

I can understand the reluctance to go to 6-bit register specifiers; it burns up your opcode space and makes encoding everything more difficult.

I am on record as stating the proper number of bits in an instruction-specifier is 34 bits. This is after designing the Mc88K ISA, doing 3 generations of SPARC chips, 7 years of x86-64, and the Samsung GPU (and my own efforts). Making the registers 6 bits would increase that count to 36 bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

Only My 66000 could reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Actually due to the removal of MOVs and reloads the code density may be
    basically the same.

    Anytime one removes more "MOVs and saves and restore" instructions
    than the called subroutine contains within the prologue and epilogue
    bounds, the subroutine should be inlined.

    Also longer context switch times, as more registers to save/restore.

The save should be free, as the load from RAM is so slow.

    When HW is doing the saves, the saves can be performed while
    waiting for the first instruction to arrive and for the first
    registers to arrive. Thus, done in HW, the saves are essentially
    free.

    If the context is time critical it should be written to use the
    registers that are reloaded first, first. In which case the code
    could start doing work in the same amount of time regardless of
    register count. (I doubt the CPU design is actually that smart,
    or that the people that program the interrupts are.)

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.
    But has anyone told the software guys?
    Of course convincing programmers to RTFM is futile. ;(

If so, this is the first I have heard that more registers are not bad for interrupt response time.

    So we are back to finding any downsides for 64 registers in My 66000.

    Lack of actual significant benefits is irrelevant, as all the programers
    are 100% convinced that it will help some of their code. ;)

    For example a 1-wide machine with a 4-ported register file,
    generally operated as 3R1W can be switched to 4R or 4W for
    epilogue or prologue uses respectively. Simulation indicates
    this gets rid of 47% of the cycles spent in prologue and
    epilogue (combined compared to a sequence of stores and loads)
    Simulation also indicates that 42% of the power is saved--
    mainly from Tag and TLB non-access cycles.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Brett on Sat Aug 17 07:44:49 2024
    Brett <ggtgp@yahoo.com> schrieb:
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    Anytime one removes more "MOVs and saves and restore" instructions
    than the called subroutine contains within the prologue and epilogue
    bounds, the subroutine should be inlined.

    In principle, yes.

    You can either use C++ headers, which result in huge compilation
    times, or you can use LTO. LTO, if done right, is a huge time-eater
(I was looking for an English translation of "Zeitgrab", literally
    "time grave" or "time tomb", this was the best I could come up with).

    [...]

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.
    But has anyone told the software guys?

    Software guys generally work with high-level languages where this is irrelevant, except for...

    Of course convincing programmers to RTFM is futile. ;(

    ...people writing operating systems or drivers, and they better
    read the docs for the architecture they are working on.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding space. Not sure if you have Mitch's document, but having
    one more bit per register would reduce the 16-bit data in the
    offset to 14 (no way you can expand that by a factor of four),
    would require eight instead of one major opcodes for the three-
register instructions, and the four-register instructions like FMA...
    you get the picture.
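The offset arithmetic can be sketched numerically. This is a toy bit-budget model: the 32-bit instruction word, the 6-bit major opcode, and the two-register base+offset load/store format are my assumptions about the encoding, not quotes from Mitch's document.

```python
# Toy bit budget for a base+offset load/store in a fixed 32-bit word.
WORD_BITS = 32
MAJOR_OPCODE_BITS = 6   # assumed major-opcode width

def offset_bits(reg_field_bits, num_reg_fields=2):
    """Bits left for the immediate offset after opcode and registers."""
    return WORD_BITS - MAJOR_OPCODE_BITS - num_reg_fields * reg_field_bits

print(offset_bits(5))   # 16 -> the current 16-bit displacement
print(offset_bits(6))   # 14 -> what 6-bit (64-register) specifiers leave
```

Each register field widened by one bit takes a bit from the offset, so two fields cost the two bits that turn 16 into 14.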

    This would not matter if we were still living in a 36-bit world,
    but the days of the IBM 704, the PDP-10 or the UNIVAC 1100 have
    passed, except for emulation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Brett on Sat Aug 17 07:29:34 2024
    Brett <ggtgp@yahoo.com> schrieb:

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding space.

    Do you have Mitch's ISA document? Memory access instructions
    would be restricted to 14 bit offsets, standard three-register
    arithmetic would use eight instead of one major opcode, and FMA
    and friends...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sat Aug 17 20:08:55 2024
    On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.

    The Instructions and the compiler's use of them were co-developed.

    But has anyone told the software guys?

    Use HLLs and you don't have to.

    Of course convincing programmers to RTFM is futile. ;(

    Done with Instructions in HW one has to convince exactly two
    people; GCC code generator and LLVM code generator.

If so, this is the first I have heard that more registers are not bad for interrupt response time.

    They are also bad for pipeline stage times.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding
    pipeline staging
    context switch times

    For example, My 66000 current encoding has room for 8 instructions
    in the FMAC category (4 in use) with 6-bit register specifiers
    I would need 4 major OpCodes instead of 1.
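A rough way to see where the factor of 4 could come from; the field widths and the split between minor- and major-opcode bits here are my assumptions, not a description of the actual FMAC format.

```python
# A four-operand FMAC names 4 registers; going from 5- to 6-bit
# specifiers consumes 4 extra bits of a fixed 32-bit word.
extra_bits = 4 * (6 - 5)
# If 2 of those bits can be reclaimed from the minor-opcode field
# (an assumption), the other 2 must come from major-opcode space:
majors_needed = 2 ** (extra_bits - 2)
print(majors_needed)  # 4 -> four major opcodes instead of one
```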

    For your 98%-ile source code, 32-registers is plenty.

    Lack of actual significant benefits is irrelevant, as all the programers
    are 100% convinced that it will help some of their code. ;)

    For example a 1-wide machine with a 4-ported register file,
    generally operated as 3R1W can be switched to 4R or 4W for
    epilogue or prologue uses respectively. Simulation indicates
    this gets rid of 47% of the cycles spent in prologue and
    epilogue (combined compared to a sequence of stores and loads)
    Simulation also indicates that 42% of the power is saved--
    mainly from Tag and TLB non-access cycles.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sat Aug 17 20:57:43 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.

    The Instructions and the compiler's use of them were co-developed.

    But has anyone told the software guys?

    Use HLLs and you don't have to.


    I looked at interrupts in your manual and it did not say how many registers were full of garbage leaking information because they were not saved or restored to make interrupts faster. ;)


    Of course convincing programmers to RTFM is futile. ;(

    Done with Instructions in HW one has to convince exactly two
    people; GCC code generator and LLVM code generator.

    If so this is the first I have heard that more registers is not bad for
    interrupt response time.

    They are also bad for pipeline stage times.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding

    Admittedly painful, extremely so.

    pipeline staging

    A longer pipeline is slower to start up, but gets work done faster.
    Is this what you mean?

    context switch times

    Task swapping time is way down in the noise. It’s reloading the L1 and L2 cache that swamps the time. 64 registers is nothing compared to 32k or megabytes.

    For example, My 66000 current encoding has room for 8 instructions
    in the FMAC category (4 in use) with 6-bit register specifiers
    I would need 4 major OpCodes instead of 1.

    For your 98%-ile source code, 32-registers is plenty.

    Lack of actual significant benefits is irrelevant, as all the programers
    are 100% convinced that it will help some of their code. ;)

    For example a 1-wide machine with a 4-ported register file,
    generally operated as 3R1W can be switched to 4R or 4W for
    epilogue or prologue uses respectively. Simulation indicates
    this gets rid of 47% of the cycles spent in prologue and
    epilogue (combined compared to a sequence of stores and loads)
    Simulation also indicates that 42% of the power is saved--
    mainly from Tag and TLB non-access cycles.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Thomas Koenig on Sat Aug 17 20:40:55 2024
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Brett <ggtgp@yahoo.com> schrieb:
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    Anytime one removes more "MOVs and saves and restore" instructions
    than the called subroutine contains within the prologue and epilogue
    bounds, the subroutine should be inlined.

    In principle, yes.

    You can either use C++ headers, which result in huge compilation
    times, or you can use LTO. LTO, if done right, is a huge time-eater
(I was looking for an English translation of "Zeitgrab", literally
    "time grave" or "time tomb", this was the best I could come up with).

    [...]

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.
    But has anyone told the software guys?

    Software guys generally work with high-level languages where this is irrelevant, except for...

    Of course convincing programmers to RTFM is futile. ;(

    ...people writing operating systems or drivers, and they better
    read the docs for the architecture they are working on.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding space. Not sure if you have Mitch's document,

    Section 4.1 Instruction Template, Figure 25, page 33-179

    but having
    one more bit per register would reduce the 16-bit data in the
    offset to 14 (no way you can expand that by a factor of four),

14 is plenty; you can actually do 12 and pack those instructions in with
shifts, which have a pair of 6-bit fields, width and offset. This would
expand some constants, but you make it back in shorter code with fewer MOVs
and more performance.

    would require eight instead of one major opcodes for the three-
    register instructions,

    Mitch gloats about how many major opcodes he has free, in his 7 bit opcode
    he has the greater part of a bit free, so we are a good part of the way
    there.

    Conceptually some of the modifier bits move into the opcode space, not as
    clean but you have to squeeze those bits hard. One can come up with a few patterns that are not hard to decode, and spread across several instruction types.

    and the four-register instructions like FMA...

    Trying to wave a red flag in front of Mitch. ;)

    This is a pain point.
I would sacrifice most or all of XCOM6, the predicate instructions.

    Does it fit or does one look at extended opcodes for FMA.

    This would not matter if we were still living in a 36-bit world,
    but the days of the IBM 704, the PDP-10 or the UNIVAC 1100 have
    passed, except for emulation.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Brett on Sat Aug 17 22:05:03 2024
    Brett <ggtgp@yahoo.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Brett <ggtgp@yahoo.com> schrieb:

    Software guys generally work with high-level languages where this is
    irrelevant, except for...

    Of course convincing programmers to RTFM is futile. ;(

    ...people writing operating systems or drivers, and they better
    read the docs for the architecture they are working on.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding space. Not sure if you have Mitch's document,

    Section 4.1 Instruction Template, Figure 25, page 33-179

    but having
    one more bit per register would reduce the 16-bit data in the
    offset to 14 (no way you can expand that by a factor of four),

    14 is plenty,

    16 is better.

    you can actually do 12 and pack those instructions in with
    shifts, which have a pair of 6-bit fields, width and offset. This would
    expand some constants, but you make it back in shorter code with fewer MOVs
    and more performance.

    Hmm... I am not convinced.

    Do you have code to back up your claims?

    would require eight instead of one major opcodes for the three-
    register instructions,

    Mitch gloats about how many major opcodes he has free, in his 7 bit opcode

    It's 6 for the major opcode, actually.

    he has the greater part of a bit free, so we are a good part of the way there.

    That sentence no parse.

    Conceptually some of the modifier bits move into the opcode space, not as clean but you have to squeeze those bits hard

    It is a very fine point of semantics whether the modifier bits are part
    of the opcode space or not. I happen to think that they are,
    they are just in a (somewhat) different place and spelled a bit
    differently, but it does not really matter how you look at it -
    you need the bits to encode them.

    One can come up with a few
    patterns that are not hard to decode, and spread across several instruction types.

    So, go right ahead. Find an encoding that a) encompasses all of
    Mitch's functionality, b) has six bits for registers everywhere,
    and c) does not drive the assembler writer crazy (that's me,
    for Mitch's design) or hardware designer bonkers (where Mitch has
    the experience).

    Let's start with the... BB1 instruction, which branches on bit
    set in a register, so it needs a major opcode, a bit number, a
    register number and a displacement. How do you propose to do that?
    Shave one bit off the displacement?


    and the four-register instructions like FMA...

    Trying to wave a red flag in front of Mitch. ;)

    I just happen to like FMA :-)

    Of course, it might be possible to code FMA like AVX does, with
    only three registers - 18 bits for three registers, plus two bits
    for which one of them gets smashed for the result.
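    The "smashed operand" scheme can be sketched in miniature. This is an
    illustrative Python model (register file as a list, function name and
    selector convention invented here), not any real ISA encoding:

```python
# Destructive 3-register FMA: three register fields name the operands,
# and a 2-bit selector says which of them is overwritten by the result.

def fma_destructive(regs, ra, rb, rc, dest_sel):
    """Compute regs[ra]*regs[rb] + regs[rc]; dest_sel in {0,1,2}
    picks which of (ra, rb, rc) receives the result."""
    result = regs[ra] * regs[rb] + regs[rc]
    regs[(ra, rb, rc)[dest_sel]] = result
    return regs

# r2 = r0*r1 + r2, i.e. the accumulator is smashed
r = fma_destructive([2.0, 3.0, 1.0], 0, 1, 2, dest_sel=2)
assert r == [2.0, 3.0, 7.0]
```

    The compiler-unfriendliness Mitch alludes to below is visible even in
    the sketch: whichever operand is smashed must be copied first if its
    old value is still live.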

    But - just making offhand suggestions won't cut it. You will
    have to think about the layout of the instructions, how everything
    fits in, and needing one to four more bits per instruction
    can be accommodated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sat Aug 17 22:15:17 2024
    On Sat, 17 Aug 2024 20:57:43 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
    SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.

    The Instructions and the compiler's use of them were co-developed.

    But has anyone told the software guys?

    Use HLLs and you don't have to.


    I looked at interrupts in your manual and it did not say how many
    registers were full of garbage leaking information because they were
    not saved or restored to make interrupts faster. ;)

    When an ISR[13] returns from handling its exception it has a register
    file filled with stuff useful to future runnings of ISR[13].

    When ISR[13] gains control to handle another interrupt it has a file
    filled with what it was filled with the last time it ran--all 30 of
    them--while registers R0..R1 contain information about the current
    interrupt to be serviced.
    SP points at its stack
    FP points at its frame or is another register containing whatever it
    contained the previous time
    R29..R2 contain the value it had the previous time it ran



    Of course convincing programmers to RTFM is futile. ;(

    Done with Instructions in HW one has to convince exactly two
    people; GCC code generator and LLVM code generator.

    If so this is the first I have heard that more registers is not bad for
    interrupt response time.

    They are also bad for pipeline stage times.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding

    Admittedly painful, extremely so.

    pipeline staging

    A longer pipeline is slower to start up, but gets work done faster.
    Is this what you mean?

    No, I mean the feedback loops take more cycles so apparent latency
    is greater.

    context switch times

    Task swapping time is way down in the noise. It’s reloading the L1 and
    L2 cache that swamps the time. 64 registers is nothing compared to 32k
    or megabytes.

    While it is under 1% of all cycles, current x86s take 1,000 cycles
    application to application and 10,000 cycles hypervisor to hypervisor.

    I want both of these down in the 20-cycle range.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Aug 17 23:03:34 2024
    On Sat, 17 Aug 2024 22:05:03 +0000, Thomas Koenig wrote:

    Brett <ggtgp@yahoo.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Conceptually some of the modifier bits move into the opcode space, not
    as clean but you have to squeeze those bits hard

    It is very fine point of semantics if the modifier bits are part
    of the opcode space or not. I happen to think that they are,
    they are just in a (somehwat) different place and spelled a bit
    differently, but it does not really matter how you look at it -
    you need the bits to encode them.

    To me, an instruction has 3 components:: Operands, Routing, and
    calculation. We mainly consider the calculation (ADD) to be the
    instruction and fuzz over what is operands and how does one
    route them to places of calculation. My 66000 ISA directly
    annotates the operands and the routing. This is what the
    modifier bits do; they tell how to interpret the register
    specifiers (Rn or #n), (Rn or -Rn) and when to substitute
    another word or doubleword in the instruction stream as an
    operand directly.

    This does not add gates of delay to Operand routing because
    all of the constant stuff is overlapped with the comparison
    of register specifiers with pipeline result specifiers to
    determine forwarding. Constants forward in the network prior
    to register results preventing any added delay.

    One can come up with a few patterns that are not hard to
    decode, and spread across several instruction types.

    So, go right ahead. Find an encoding that a) encompasses all of
    Mitch's functionality, b) has six bits for registers everywhere,
    and c) does not drive the assembler writer crazy (that's me,
    for Mitch's design) or hardware designer bonkers (where Mitch has
    the experience).

    Consider, for example, memory reference address modes for 1
    instruction::
    LDSB Rd,[Rp,disp16]
    LDSB Rd,[IP,disp16]
    and
    LDSB Rd,[Rp,Ri<<s]
    LDSB Rd,[Rp,0]
    LDSB Rd,[IP,Ri<<s]
    LDSB Rd,[Rp,,disp32]
    LDSB Rd,[Rp,Ri<<s,disp32]
    LDSB Rd,[IP,,disp32]
    LDSB Rd,[IP,Ri<<s,disp32]
    LDSB Rd,[Rp,,disp64]
    LDSB Rd,[Rp,Ri<<s,disp64]
    LDSB Rd,[IP,,disp64]
    LDSB Rd,[IP,Ri<<s,disp64]

    I use 2 instructions here::
    1) a major OpCode with 16-bit immediate
    R0 in the Rb position is a proxy for IP
    2) a major OpCode and a MEME OpCode with 5-bits of Modifiers.
    R0 in Rb position remains a proxy for IP
    R0 in Ri position is a proxy for #0.
    3) I still have 1-bit left over to denote participation in ATOMIC
    events.
    you get all sizes and signs of Load-Locked
    you get up to 8 LLs
    you can use as many Store-Conditionals as you need
    all interested 3rd parties see memory before or after the event
    and nothing in between.

    Using 6-bit registers I would be down by 3-bits causing all sorts of
    memory reference grief--leading to other compromises in ISA design
    elsewhere.

    Based on the code I read out of Brian's compiler: there is no particular
    need for 64-registers. I am already using only 72% of the instructions
    {72% average, 70% geomean, 69% harmonic mean} that RISC-V requires
    {same compiler, same optimizations, just different code generators}.

    One can argue that having 64-bit displacements is not-all-that-necessary.
    But how does one take dusty deck FORTRAN FEM programs and allow the
    common blocks to grow bigger than 4GBs ?? This is the easiest way
    to port code written 5 decades ago to use the sizes of memory they
    need to run those "Great Big" FEM models today.

    Let's start with the... BB1 instruction, which branches on bit
    set in a register, so it needs a major opcode, a bit number, a
    register number and a displacement. How do you propose to do that?
    Shave one bit off the displacement?

    Then proceed to Branch on Condition:: along with the standard::
    EQ0, NE0, GT0, GE0, LT0, LE0 conditions one gets with other encodings,
    I also get FEQ0, FNE0, FGT0, FGE0, FLT0, FLE0, DEQ0, DNE0, DGT0,
    DGE0, DLT0, DLE0 along with Interference, SVC, SVR, and RET.
    {And I left out the unordered float/double comparisons, above.}
    1-instruction due mostly to NOT having condition codes.


    and the four-register instructions like FMA...

    I prefer 3-operand 1-result instead of 4-register. 4-register could
    mean 1 operand and 3 results, so the term lacks decent specificity.
    35 years ago I used 3-register to describe Mc88100 and I regret
    that now.

    I prefer FMAC instead of FMA--in hindsight I should have made it
    FMAC and DMAC, but alas... I use FMAC to cover all 4 of::

    x = y * z + q
    x = y * -z + q
    x = y * z - q
    x = y * -z - q
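    The four sign variants can be modeled in a few lines. This is a plain
    Python sketch (a real FMAC performs the multiply and add as one
    operation with a single rounding; that detail is not modeled here):

```python
# The four FMAC sign variants, expressed as ordinary arithmetic.

def fmac(y, z, q, negate_z=False, negate_q=False):
    """x = y * (±z) ± q, covering all four sign combinations."""
    zz = -z if negate_z else z
    qq = -q if negate_q else q
    return y * zz + qq

assert fmac(2.0, 3.0, 1.0) == 7.0                              # y*z + q
assert fmac(2.0, 3.0, 1.0, negate_z=True) == -5.0              # y*-z + q
assert fmac(2.0, 3.0, 1.0, negate_q=True) == 5.0               # y*z - q
assert fmac(2.0, 3.0, 1.0, negate_z=True, negate_q=True) == -7.0
```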

    Trying to wave a red flag in front of Mitch. ;)

    I just happen to like FMA :-)

    Of course, it might be possible to code FMA like AVX does, with
    only three registers - 18 bits for three registers, plus two bits
    for which one of them gets smashed for the result.

    Why do I get the feeling the compiler guys would not like this ??

    But - just making offhand suggestions won't cut it. You will
    have to think about the layout of the instructions, how everything
    fits in, and needing one to four more bits per instruction
    can be accommodated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sun Aug 18 02:39:04 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 22:05:03 +0000, Thomas Koenig wrote:

    Brett <ggtgp@yahoo.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Conceptually some of the modifier bits move into the opcode space, not
    as clean but you have to squeeze those bits hard

    It is a very fine point of semantics whether the modifier bits are part
    of the opcode space or not. I happen to think that they are,
    they are just in a (somewhat) different place and spelled a bit
    differently, but it does not really matter how you look at it -
    you need the bits to encode them.

    To me, an instruction has 3 components:: Operands, Routing, and
    calculation. We mainly consider the calculation (ADD) to be the
    instruction and fuzz over what is operands and how does one
    route them to places of calculation. My 66000 ISA directly
    annotates the operands and the routing. This is what the
    modifier bits do; they tell how to interpret the register
    specifiers (Rn or #n), (Rn or -Rn) and when to substitute
    another word or doubleword in the instruction stream as an
    operand directly.

    This does not add gates of delay to Operand routing because
    all of the constant stuff is overlapped with the comparison
    of register specifiers with pipeline result specifiers to
    determine forwarding. Constants forward in the network prior
    to register results preventing any added delay.

    One can come up with a few patterns that are not hard to
    decode, and spread across several instruction types.

    So, go right ahead. Find an encoding that a) encompasses all of
    Mitch's functionality, b) has six bits for registers everywhere,
    and c) does not drive the assembler writer crazy (that's me,
    for Mitch's design) or hardware designer bonkers (where Mitch has
    the experience).

    Consider, for example, memory reference address modes for 1
    instruction::
    LDSB Rd,[Rp,disp16]
    LDSB Rd,[IP,disp16]
    and
    LDSB Rd,[Rp,Ri<<s]
    LDSB Rd,[Rp,0]
    LDSB Rd,[IP,Ri<<s]
    LDSB Rd,[Rp,,disp32]
    LDSB Rd,[Rp,Ri<<s,disp32]
    LDSB Rd,[IP,,disp32]
    LDSB Rd,[IP,Ri<<s,disp32]
    LDSB Rd,[Rp,,disp64]
    LDSB Rd,[Rp,Ri<<s,disp64]
    LDSB Rd,[IP,,disp64]
    LDSB Rd,[IP,Ri<<s,disp64]

    I use 2 instructions here::
    1) a major OpCode with 16-bit immediate
    R0 in the Rb position is a proxy for IP
    2) a major OpCode and a MEME OpCode with 5-bits of Modifiers.
    R0 in Rb position remains a proxy for IP
    R0 in Ri position is a proxy for #0.
    3) I still have 1-bit left over to denote participation in ATOMIC
    events.
    you get all sizes and signs of Load-Locked
    you get up to 8 LLs
    you can use as many Store-Conditionals as you need
    all interested 3rd parties see memory before or after the event
    and nothing in between.

    Using 6-bit registers I would be down by 3-bits causing all sorts of
    memory reference grief--leading to other compromises in ISA design
    elsewhere.

    Based on the code I read out of Brian's compiler: there is no particular
    need for 64-registers. I am already using only 72% of the instructions
    {72% average, 70% geomean, 69% harmonic mean} that RISC-V requires
    {same compiler, same optimizations, just different code generators}.

    One can argue that having 64-bit displacements is not-all-that-necessary.
    But how does one take dusty deck FORTRAN FEM programs and allow the
    common blocks to grow bigger than 4GBs ?? This is the easiest way
    to port code written 5 decades ago to use the sizes of memory they
    need to run those "Great Big" FEM models today.

    Let's start with the... BB1 instruction, which branches on bit
    set in a register, so it needs a major opcode, a bit number, a
    register number and a displacement. How do you propose to do that?
    Shave one bit off the displacement?

    Then proceed to Branch on Condition:: along with the standard::
    EQ0, NE0, GT0, GE0, LT0, LE0 conditions one gets with other encodings,
    I also get FEQ0, FNE0, FGT0, FGE0, FLT0, FLE0, DEQ0, DNE0, DGT0,
    DGE0, DLT0, DLE0 along with Interference, SVC, SVR, and RET.
    {And I left out the unordered float/double comparisons, above.}
    1-instruction due mostly to NOT having condition codes.


    and the four-register instructions like FMA...

    I prefer 3-operand 1-result instead of 4-register. 4-register could
    mean 1 operand and 3 results, so the term lacks decent specificity.
    35 years ago I used 3-register to describe Mc88100 and I regret
    that now.

    I prefer FMAC instead of FMA--in hindsight I should have made it
    FMAC and DMAC, but alas... I use FMAC to cover all 4 of::

    x = y * z + q
    x = y * -z + q
    x = y * z - q
    x = y * -z - q

    Trying to wave a red flag in front of Mitch. ;)

    I just happen to like FMA :-)

    Of course, it might be possible to code FMA like AVX does, with
    only three registers - 18 bits for three registers, plus two bits
    for which one of them gets smashed for the result.

    Why do I get the feeling the compiler guys would not like this ??

    But - just making offhand suggestions won't cut it. You will
    have to think about the layout of the instructions, how everything
    fits in, and needing one to four more bits per instruction
    can be accommodated.


    Yes I know and agree that you have a beautiful instruction set layout.
    And a 64 register variant would be butt ugly, but x86 won. Thumb 2 won
    over ARM32, which was better. Thumb 2 almost never happened because
    management hated it.

    I know my fellow programmers: give them a 64 register variant and they
    will make the stupid choice, like me, 80% of the time. ;)

    Ask the customers what they want, and don’t be surprised when they pick
    the stupid option. If it gets you a sale you would have lost, just
    count the money and be happy.

    I don’t expect you to do any work on 64 registers, just add a vapor
    ware option and put it on ice for a few years. Let boredom and demand
    kick in, maybe it will just die like most vapor ware.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sun Aug 18 02:16:04 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 20:57:43 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
    SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.

    The Instructions and the compiler's use of them were co-developed.

    But has anyone told the software guys?

    Use HLLs and you don't have to.


    I looked at interrupts in your manual and it did not say how many
    registers
    were full of garbage leaking information because they were not saved or
    restored to make interrupts faster. ;)

    When an ISR[13] returns from handling its exception it has a register
    file filled with stuff useful to future runnings of ISR[13].

    When ISR[13] gains control to handle another interrupt it has a file
    filled with what it was filled with the last time it ran--all 30 of
    them--while registers R0..R1 contain information about the current
    interrupt to be serviced.
    SP points at its stack
    FP points at its frame or is another register containing whatever it
    contained the previous time
    R29..R2 contain the value it had the previous time it ran

    I don’t remember the PlayStation using all registers in an interrupt, it
    was only a few lines of code and 8 registers was fine. This would only save
    you 3 cycles and is probably not worth the potential hassles.

    I have heard programmers complaining that interrupt response was too
    slow and so they had to add a second toy CPU just to handle interrupts.
    Probably people that made the mistake of upgrading to x86.

    Have wondered if having a scratchpad for interrupt code (and critical
    data) would solve those problems, as memory can be 150 cycles away,
    plus you can have 40 pending reads queued ahead of you. Makes servicing
    interrupts in a timely manner difficult, even if you are not touching
    much memory.

    Of course convincing programmers to RTFM is futile. ;(

    Done with Instructions in HW one has to convince exactly two
    people; GCC code generator and LLVM code generator.

    If so this is the first I have heard that more registers is not bad for
    interrupt response time.

    They are also bad for pipeline stage times.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding

    Admittedly painful, extremely so.

    pipeline staging

    A longer pipeline is slower to start up, but gets work done faster.
    Is this what you mean?

    No, I mean the feedback loops take more cycles so apparent latency
    is greater.

    context switch times

    Task swapping time is way down in the noise. It’s reloading the L1 and
    L2 cache that swamps the time. 64 registers is nothing compared to 32k
    or megabytes.

    While it is under 1% of all cycles, current x86s take 1,000 cycles application to application and 10,000 cycles hypervisor to hypervisor.

    I want both of these down in the 20-cycle range.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sun Aug 18 06:34:34 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Based on the code I read out of Brian's compiler: there is no particular
    need for 64-registers. I am already using only 72% of the instructions
    {72% average, 70% geomean, 69% harmonic mean} that RISC-V requires
    {same compiler, same optimizations, just different code generators}.

    That's true - the code is usually expressed as a very straightforward translation of the original code, at least for C.

    Register pressure will increase for unrolling of outer loops,
    for languages which use dope vectors (aka array descriptors),
    and for more aggressive inlining.

    Consider an argument passed as an assumed-shape array in
    Fortran.

    subroutine foo(a)
    real, dimension(:,:) :: a

    where the array assumes the shape from the caller,
    two-dimensional in this case.

    For passing such an array, we need a base pointer and
    information about

    - the lower bound
    - the upper bound
    - the stride

    along each dimension, so it is 7 quantities in this case.
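    A minimal model of such a descriptor ("dope vector"), with invented
    field names (actual compiler layouts differ in detail), shows where
    the 7 quantities go: one base pointer plus (lower bound, upper bound,
    stride) for each of the two dimensions.

```python
# Hypothetical rank-2 array descriptor: 1 base + 3 values per dimension.
from dataclasses import dataclass

@dataclass
class Dim:
    lower: int   # lower bound
    upper: int   # upper bound
    stride: int  # byte stride along this dimension

@dataclass
class Descriptor:
    base: int          # base address, modeled as a plain integer
    dims: tuple        # one Dim per rank

    def element_addr(self, *idx):
        # address = base + sum((i - lower) * stride) over the dimensions
        return self.base + sum((i - d.lower) * d.stride
                               for i, d in zip(idx, self.dims))

# 4x3 array of 8-byte reals, column-major: second-dim stride = 4*8 = 32
d = Descriptor(base=1000, dims=(Dim(1, 4, 8), Dim(1, 3, 32)))
assert d.element_addr(1, 1) == 1000
assert d.element_addr(2, 1) == 1008
assert d.element_addr(1, 2) == 1032
```

    Keeping several such descriptors live at once is exactly where the
    extra register pressure Thomas mentions comes from.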


    One can argue that having 64-bit displacements is not-all-that-necessary.
    But how does one take dusty deck FORTRAN FEM programs and allow the
    common blocks to grow bigger than 4GBs ?? This is the easiest way
    to port code written 5 decades ago to use the sizes of memory they
    need to run those "Great Big" FEM models today.

    That is certainly one reason. Another is being able to have
    a "huge" model with code > 2GB without too much effort.
    Programs _are_ getting bigger...

    Of course, it might be possible to code FMA like AVX does, with
    only three registers - 18 bits for three registers, plus two bits
    for which one of them gets smashed for the result.

    Why do I get the feeling the compiler guys would not like this ??

    Because they won't? :-) It is certainly more straightforward
    this way.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sun Aug 18 22:03:01 2024
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    <snip>

    High registers is mostly a marketing vapor ware extension for you; see
    if anyone cares and put them on a list for when a market for that
    extension pops up.

    The lack of CPUs with 64 registers is what makes for a market; that 4%
    that could benefit have no options to pick from. You would be happy to
    have control of a market that big. Point customers at a compiler
    configured for 64 registers and say that with high registers and inline
    constants that is what they could expect for code generation.
    I agree with the lead in, and disagree with where you took it.

    Let us postulate that having 64 registers is a 10% win (overstating
    the size of its win by 2.5×) but that 98% of subroutines don't need
    64 registers. So, 98% gains nothing and 2% gains 10%

    0.98*1.0 + 0.02*1.1 = 1.002
    or
    0.2% gain.
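    The weighted-gain arithmetic can be checked mechanically; this is a
    sketch of the calculation above, nothing more:

```python
# 98% of subroutines gain nothing, 2% gain 10% -> 0.2% overall.
overall = 0.98 * 1.0 + 0.02 * 1.1
assert abs(overall - 1.002) < 1e-12
print(f"overall speedup: {overall:.3f}")  # 1.002
```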

    If there is demand for high registers you will probably just spin a CPU
    arch with more registers, but that will never happen if you never ask.

    The thing is that once you go down the GBOoO route, your lack of
    registers "namable in ASM" ceases to be a performance degrader. With
    renaming one can have R7 in use 40 times in a 100-instruction-deep
    execution window.

    This
    is the definition of vapor ware, a free market survey. You can even add
    more registers as an incompatible extension; in fact you should.

    I will leave stuff like this to you.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Aug 19 12:05:22 2024
    Task swapping time is way down in the noise. It’s reloading the L1 and L2 cache that swamps the time. 64 registers is nothing compared to 32k or megabytes.

    Depends on the kind of swap. If you're thinking of time-sharing
    preemption, then indeed context switch time is not important.

    But when considering communication between processes, then very fast
    context switch times allow for finer grain divisions, like
    micro-kernels.

    Historically, these things have never really materialized, admittedly.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Mon Aug 19 18:22:27 2024
    On Mon, 19 Aug 2024 16:05:22 +0000, Stefan Monnier wrote:

    Task swapping time is way down in the noise. It’s reloading the L1 and
    L2 cache that swamps the time. 64 registers is nothing compared to 32k
    or megabytes.

    Depends on the kind of swap. If you're thinking of time-sharing
    preemption, then indeed context switch time is not important.

    But when considering communication between processes, then very fast
    context switch times allow for finer grain divisions, like
    micro-kernels.

    MicroKernels failed due to the excessive overhead of context switching.
    Whether it was control delivery delay, TLB reloads, Cache reloads,
    register file loads and stores, ... it doesn't really matter as each
    delay adds up. When there is too much delay the system is sluggish
    and unacceptable in-the-large.

    Historically, these things have never really materialized, admittedly.

    Pigs don't win the 100 yard dash at the Olympics, either.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Mon Aug 19 18:34:05 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Pigs don't win the 100 yard dash at the Olympics, either.

    Cheetahs would, but that would be cheeting.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Mon Aug 19 18:52:39 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    <snip>

    High registers is mostly a marketing vapor ware extension for you, see if
    anyone cares and put them on a list for when a market for that extension
    pops up.

    The lack of CPUs with 64 registers is what makes for a market; that 4%
    that could benefit have no options to pick from. You would be happy to
    have control of a market that big. Point customers at a compiler
    configured for 64 registers and say that with high registers and inline
    constants that is what they could expect for code generation.

    I agree with the lead in, and disagree with where you took it.

    Let us postulate that having 64 registers is a 10% win (overstating
    the size of its win by 2.5×) but that 98% of subroutines don't need
    64 registers. So, 98% gains nothing and 2% gains 10%

    0.98*1.0 + 0.02*1.1 = 1.002
    or
    0.2% gain.

    I agree with this, but you have 4% of the market where more registers gives
    a much larger speedup. You would be glad to have that much market share.

    If there is demand for high registers you will probably just spin a CPU
    arch with more registers, but that will never happen if you never ask.

    The thing is that once you go down the GBOoO route, your lack of
    registers "namable in ASM" ceases to be a performance degrader. With
    renaming one can have R7 in use 40 times in a 100-instruction-deep
    execution window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16, of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    This
    is the definition of vapor ware, a free market survey. You can even add
    more registers as an incompatible extension, in fact you should.

    I will leave stuff like this to you.

    I do agree that high registers to double your register count is far cleaner
    for the instruction set than going to 64 separate registers. You have much
    of high register implemented anyway if you support integer vector
    operations in the integer register file like MIPS, or have a unified
    register file, be it visible or not.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned due to the bits being not quite
    available. So close, yet so far. I could not make it work.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Mon Aug 19 19:31:54 2024
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:


    The thing is that once you go down the GBOoO route, your lack of
    registers "namable in ASM" ceases to be a performance degrader. With
    renaming one can have R7 in use 40 times in a 100-instruction-deep
    execution window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16, of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers. Do not underestimate this phenomenon. The
    gain from 16 to 32 registers is only 3%-ish, so one would estimate that
    22 registers would have already gained 1/2 of all of what is possible.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned due to the bits being not quite
    available. So close, yet so far. I could not make it work.

    We remain hobbled by the definition of Byte containing exactly 8-bits.
    It is this which drives the 16-bit and 32-bit instruction sizes; and
    it is this which drives the sizes of constants used by the instruction
    stream.

    64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
    But we must all face facts::
    a) Little Endian Won
    b) 8-bit Bytes Won
    c) longer operands are composed of multiple bytes mostly powers of 2.
    d) otherwise it is merely an academic exercise.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Mon Aug 19 23:35:54 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:


    The thing is that once you go down the GBOoO route, your lack of
    registers "namable in ASM" ceases to be a performance degrader. With
    renaming one can have R7 in use 40 times in a 100-instruction-deep
    execution window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16, of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers. Do not underestimate this phenomenon. The
    gain from 16 to 32 registers is only 3%-ish, so one would estimate that
    22 registers would have already gained 1/2 of all of what is possible.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned due to the bits being not quite
    available. So close, yet so far. I could not make it work.

    We remain hobbled by the definition of Byte containing exactly 8-bits.
    It is this which drives the 16-bit and 32-bit instruction sizes; and
    it is this which drives the sizes of constants used by the instruction stream.

    64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
    But we must all face facts::
    a) Little Endian Won
    b) 8-bit Bytes Won
    c) longer operands are composed of multiple bytes mostly powers of 2.
    d) otherwise it is merely an academic exercise.


    If you pack 7 instructions into 8 long words, that gives each instruction
    an extra nibble: 4 bits.
    You can do lots of four-operand dual operations, which may win back the
    code density lost while improving performance.

    3 instructions packed into 4 longs gives 64 registers plus four-operand dual instructions.

  • From MitchAlsup1@21:1/5 to Brett on Tue Aug 20 00:12:44 2024
    On Mon, 19 Aug 2024 23:35:54 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:


    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16; of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers. Do not underestimate this phenomenon. The
    gain from 16 to 32 registers is only about 3%, so one would estimate that
    22 registers would already have gained half of all that is possible.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned because the bits were not quite
    available. So close, yet so far. I could not make it work.

    We remain hobbled by the definition of Byte containing exactly 8-bits.
    It is this which drives the 16-bit and 32-bit instruction sizes; and
    it is this which drives the sizes of constants used by the instruction
    stream.

    64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
    But we must all face facts::
    a) Little Endian Won
    b) 8-bit Bytes Won
    c) longer operands are composed of multiple bytes mostly powers of 2.
    d) otherwise it is merely an academic exercise.


    If you pack 7 instructions into 8 long words, that gives each instruction
    an extra nibble: 4 bits.
    You can do lots of four operand dual operations, which may get you back
    the code density lost, while improving performance.

    Given 36-bit containers--how do you add 32 or 64-bit constants ??
    throw 36-bits at the 32-bit needs case and 72-bits at the 64-bit
    needs case ?!?

    3 instructions packed in 4 longs gives 64 registers plus four operand
    dual instructions.

    {{ note 3 instructions in 4 longs is 85.3-bits per instruction::
    I suspect you mean 3 instructions in 4 words which is 42.6-bits
    per instruction far more than is needed. You get 14 instructions
    of 36-bits in 512-bits (a cache line)}}
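    The container arithmetic in the note above is easy to verify (bit widths only; "long" means 64 bits and "word" 32 bits, as used in the thread):

```python
# Bits available per instruction under the packings discussed.
bits_per_inst_4_longs = 4 * 64 / 3   # 3 instructions in 4 longs -> ~85.3 bits
bits_per_inst_4_words = 4 * 32 / 3   # 3 instructions in 4 words -> ~42.7 bits
insts_per_cache_line  = 512 // 36    # 36-bit instructions in a 512-bit line -> 14

print(round(bits_per_inst_4_longs, 1),
      round(bits_per_inst_4_words, 1),
      insts_per_cache_line)
```

    (Mitch's 42.6 is the same 128/3 figure, truncated rather than rounded.)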

    Why don't you give it a try !?!

    But notice, you are starting out with a much larger instruction--
    how are you going to "profitably" utilize all those bits from
    source code of typical imperative languages ??

    whereas my 32-bit instructions don't violate the RISC tenets.
    I end up needing only 72% the number of instructions RISC-V needs
    (a near 40% pipelined instruction advantage).

  • From Brett@21:1/5 to mitchalsup@aol.com on Tue Aug 20 03:50:36 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 19 Aug 2024 23:35:54 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:


    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16; of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers. Do not underestimate this phenomenon. The
    gain from 16 to 32 registers is only about 3%, so one would estimate that
    22 registers would already have gained half of all that is possible.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned because the bits were not quite
    available. So close, yet so far. I could not make it work.

    We remain hobbled by the definition of Byte containing exactly 8-bits.
    It is this which drives the 16-bit and 32-bit instruction sizes; and
    it is this which drives the sizes of constants used by the instruction
    stream.

    64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
    But we must all face facts::
    a) Little Endian Won
    b) 8-bit Bytes Won
    c) longer operands are composed of multiple bytes mostly powers of 2.
    d) otherwise it is merely an academic exercise.


    If you pack 7 instructions into 8 long words, that gives each instruction
    an extra nibble: 4 bits.
    You can do lots of four operand dual operations, which may get you back
    the code density lost, while improving performance.

    Given 36-bit containers--how do you add 32 or 64-bit constants ??
    throw 36-bits at the 32-bit needs case and 72-bits at the 64-bit
    needs case ?!?

    The four extra bits are four extra bits, or they could be a scale/shift
    amount, though that could stall an instruction that crossed a cache line.
    Same for 72 bits, but with more opcode flags, say extract/insert. It is
    arguable that such fields are data and not opcode, assuming the data does
    not impose a gate delay of concern the way an opcode would.

    3 instructions packed in 4 longs gives 64 registers plus four operand
    dual instructions.

    {{ note 3 instructions in 4 longs is 85.3-bits per instruction::
    I suspect you mean 3 instructions in 4 words which is 42.6-bits
    per instruction far more than is needed. You get 14 instructions
    of 36-bits in 512-bits (a cache line)}}

    10 more bits gives you a register plus a second operation.
    Add from memory and LEA being the classic examples.
    Though I am more in the line of just combining general operations.
    A more general load pair, three way add etc.

    If you can find enough combining potential then code density will not
    suffer. And for reasonably clocked devices the code will be faster. A
    three-way add only adds a few gates, which is why you allow negate on all
    sources; it's cheap and faster than two operations.

    This is what I call Post RISC.
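    A minimal sketch of the fused operation described above (the name `add3` and its interface are illustrative, not from any real ISA): a three-way add with a negate control on each source, so one instruction does the work of two chained two-input adds.

```python
def add3(a, b, c, neg=(False, False, False)):
    """Three-way add with an optional negate on each source operand.
    neg[i] selects whether source i is negated before the sum."""
    return sum(-x if n else x for x, n in zip((a, b, c), neg))
```

    For example, `add3(5, 2, 1, neg=(False, True, True))` computes 5 - 2 - 1 in one operation.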

    Why don't you give it a try !?!

    Yes it works.

    But notice, you are starting out with a much larger instruction--
    how are you going to "profitably" utilize all those bits from
    source code of typical imperative languages ??

    whereas my 32-bit instructions don't violate the RISC tenets.
    I end up needing only 72% the number of instructions RISC-V needs
    (a near 40% pipelined instruction advantage).


  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Aug 20 07:01:49 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16

    And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
    was first developed for an in-order microarchitecture) and are now
    going to 32 GPRs with APX (no in-order excuse here). And IIRC the
    announcement of APX says something about 10% fewer memory accesses or
    somesuch.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers.

    Your feeling is strong (as shown by your repeatedly ignoring the counterevidence), but wrong:

    LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
    equivalent to 17 registers on a load/store architecture:

    Let's call the 17th register r16:

    On a load-store architecture you replace "LD-OP dest,src" with:

    ld r16=src
    op dest,dest,r16

    On a load-store architecture you replace "LD-OP-ST dest,src" with:

    ld r16=dest
    op r16,r16,src
    st dest=r16

    For a VAX-like three-memory-argument instruction you need two extra
    registers, r16 and r17:

    "mem1 = mem2 op mem3" becomes:

    ld r16=mem2
    ld r17=mem3
    op r16,r16,r17
    st mem1=r16
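    As a sanity check, the expansions above can be mimicked with a toy register/memory model (a sketch only; `regs`/`mem` are plain dicts, and `r16`/`r17` play the scratch registers of the argument):

```python
def ld_op(regs, mem, dest, src, op):
    """LD-OP dest,src on a load/store machine: ld r16=src ; op dest,dest,r16"""
    regs['r16'] = mem[src]
    regs[dest] = op(regs[dest], regs['r16'])

def ld_op_st(regs, mem, dest, src, op):
    """LD-OP-ST dest,src: ld r16=dest ; op r16,r16,src ; st dest=r16"""
    regs['r16'] = mem[dest]
    regs['r16'] = op(regs['r16'], regs[src])
    mem[dest] = regs['r16']

def mem3(regs, mem, dst, src1, src2, op):
    """VAX-like mem1 = mem2 op mem3, needing two scratch registers."""
    regs['r16'] = mem[src1]
    regs['r17'] = mem[src2]
    regs['r16'] = op(regs['r16'], regs['r17'])  # result lands in r16
    mem[dst] = regs['r16']
```

    Note that the final store in the three-memory-argument case must write back the op result (`r16`), not the second loaded operand.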

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Aug 20 11:59:31 2024
    On Mon, 19 Aug 2024 18:22:27 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 19 Aug 2024 16:05:22 +0000, Stefan Monnier wrote:

    Task swapping time is way down in the noise. It’s reloading the L1
    and L2
    cache that swamps the time. 64 registers is nothing compared to
    32k or megabytes.

    Depends on the kind of swap. If you're thinking of time-sharing preemption, then indeed context switch time is not important.

    But when considering communication between processes, then very fast context switch times allow for finer grain divisions, like
    micro-kernels.

    MicroKernels failed due to the excessive overhead of context
    switching. Whether it was control delivery delay, TLB reloads, cache
    reloads, register file loads and stores, ... it doesn't really matter,
    as each delay adds up. When there is too much delay the system is
    sluggish and unacceptable in the large.


    I don't believe that the failure of uKernels to take over the world of
    OSes is related to the factors you mentioned.
    They failed because, relative to a monolithic kernel, they are a less
    convenient way to structure the OS software. Various parts of the OS are
    more dependent on each other logically, esp. in a read-only manner, than
    proponents of uKernels admit. Every change takes more developer time and
    touches more places in the code than with a monolithic kernel.

    Historically, these things have never really materialized,
    admittedly.

    Pigs don't win the 100 yard dash at the Olympics, either.


    Stefan

  • From Stefan Monnier@21:1/5 to All on Tue Aug 20 09:40:11 2024
    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.
    But today that is an unserviced market which will get customers to give you
    a look. Put out some vapor ware and see what customers say.

    If the issue is only the encoding, then presumably, Mitch could go the
    route of a prefix instruction (like his PRED instruction or the
    instruction he uses to do wide shifts/adds/...).


    Stefan

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Aug 20 16:40:06 2024
    On Tue, 20 Aug 2024 7:01:49 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16

    And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
    was first developed for an in-order microarchitecture) and are now
    going to 32 GPRs with APX (no in-order excuse here). And IIRC the announcement of APX says something about 10% fewer memory accesses or somesuch.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers.

    Your feeling is strong (as shown by your repeatedly ignoring the counterevidence), but wrong:

    LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
    equivalent to 17 registers on a load/store architecture:

    Let's call the 17th register r16:

    On a load-store architecture you replace "LD-OP dest,src" with:

    ld r16=src
    op dest,dest,r16

    On a load-store architecture you replace "LD-OP-ST dest,src" with:

    ld r16=dest
    op r16,r16,src
    st dest=r16

    For a VAX-like three-memory-argument instruction you need two extra registers, r16 and r17:

    "mem1 = mem2 op mem3" becomes:

    ld r16=mem2
    ld r17=mem3
    op r16,r16,r17
    st mem1=r16

    - anton


    That is not what I am talking about::

    i = i + 1;
    as
    ADD [&i],#1

    1 instruction = 1 add, 1 LD and 1 ST. And

    i = i + j;
    as
    ADD Ri,[&j]

    In neither case is an extra register needed, and you may have
    several of these in a local sequence of code. ...
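    The trade can be put as a back-of-envelope count (the register name `r16` below is illustrative; memory traffic is identical in both styles, one load and one store per statement):

```python
# "i = i + 1" on the two ISA styles; memory traffic (1 LD + 1 ST) is the same.
mem_op_insts     = 1   # ADD [&i],#1
load_store_insts = 3   # ld r16=i ; add r16,r16,#1 ; st i=r16

# Each such read-modify-write statement saves 2 fetched/decoded
# instructions on the memory-operand ISA.
saved_per_statement = load_store_insts - mem_op_insts
print(saved_per_statement)  # 2
```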

  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Aug 20 20:40:50 2024
    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

  • From EricP@21:1/5 to All on Tue Aug 20 14:18:25 2024
    MitchAlsup1 wrote:
    On Tue, 20 Aug 2024 7:01:49 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With
    renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16

    And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
    was first developed for an in-order microarchitecture) and are now
    going to 32 GPRs with APX (no in-order excuse here). And IIRC the
    announcement of APX says something about 10% fewer memory accesses or
    somesuch.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers.

    Your feeling is strong (as shown by your repeatedly ignoring the
    counterevidence), but wrong:

    LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
    equivalent to 17 registers on a load/store architecture:

    Let's call the 17th register r16:

    On a load-store architecture you replace "LD-OP dest,src" with:

    ld r16=src
    op dest,dest,r16

    On a load-store architecture you replace "LD-OP-ST dest,src" with:

    ld r16=dest
    op r16,r16,src
    st dest=r16

    For a VAX-like three-memory-argument instruction you need two extra
    registers, r16 and r17:

    "mem1 = mem2 op mem3" becomes:

    ld r16=mem2
    ld r17=mem3
    op r16,r16,r17
    st mem1=r17

    - anton


    That is not what I am talking about::

    i = i + 1;
    as
    ADD [&i],#1

    1 instruction = 1 add, 1 LD and 1 ST. And

    i = i + j;
    as
    ADD Ri,[&j]

    In neither case is an extra register needed, and you may have
    several of these in a local sequence of code. ...

    On an in-order pipeline you need someplace to stash the temp value.
    If you want, call it a special in-flight pseudo-register that only exists
    for forwarding, it is still an identifier for a value that is outside
    the architectural register set.

    I think it might need two registers if you can have two such instructions
    in the pipeline back-to-back as there could be multiple temp values
    in-flight at once

    ADD [&i],#1
    ADD [&j],#1

    could have &i doing its store while &j is doing its load.

    On OoO, if the reservation stations are valueless, you need a real
    physical register to stash the temp value as there is no guarantee
    the OP part of the uOp will launch just when the LD part finishes
    doing its thing and forwards the value.

  • From MitchAlsup1@21:1/5 to Michael S on Tue Aug 20 20:59:28 2024
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a non-LD-OP
    machine would need to break even on the instruction count.

  • From MitchAlsup1@21:1/5 to EricP on Tue Aug 20 21:05:41 2024
    On Tue, 20 Aug 2024 18:18:25 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Tue, 20 Aug 2024 7:01:49 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With
    renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16

    And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
    was first developed for an in-order microarchitecture) and are now
    going to 32 GPRs with APX (no in-order excuse here). And IIRC the
    announcement of APX says something about 10% fewer memory accesses or
    somesuch.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers.

    Your feeling is strong (as shown by your repeatedly ignoring the
    counterevidence), but wrong:

    LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
    equivalent to 17 registers on a load/store architecture:

    Let's call the 17th register r16:

    On a load-store architecture you replace "LD-OP dest,src" with:

    ld r16=src
    op dest,dest,r16

    On a load-store architecture you replace "LD-OP-ST dest,src" with:

    ld r16=dest
    op r16,r16,src
    st dest=r16

    For a VAX-like three-memory-argument instruction you need two extra
    registers, r16 and r17:

    "mem1 = mem2 op mem3" becomes:

    ld r16=mem2
    ld r17=mem3
    op r16,r16,r17
    st mem1=r16

    - anton


    That is not what I am talking about::

    i = i + 1;
    as
    ADD [&i],#1

    1 instruction = 1 add, 1 LD and 1 ST. And

    i = i + j;
    as
    ADD Ri,[&j]

    In neither case is an extra register needed, and you may have
    several of these in a local sequence of code. ...

    On an in-order pipeline you need someplace to stash the temp value.
    If you want, call it a special in-flight pseudo-register that only
    exists for forwarding, it is still an identifier for a value that
    is outside the architectural register set.

    The LD-OP-ST machine would have this built into the pipeline--
    such that nobody has to name the carrier of the value down the
    pipeline.

    I think it might need two registers if you can have two such
    instructions in the pipeline back-to-back as there could be
    multiple temp values in-flight at once

    ADD [&i],#1
    ADD [&j],#1

    could have &i doing its store while &j is doing its load.

    On OoO, if the reservation stations are valueless, you need a real
    physical register to stash the temp value as there is no guarantee
    the OP part of the uOp will launch just when the LD part finishes
    doing its thing and forwards the value.

    In the LD-OP-ST microarchitecture there would be some buffer
    that carries the intermediate values through the execution
    window. And, Yes, you can build a LD-OP-ST reservation station
    (Athlon and Opteron did). It becomes easier if there is some
    buffer to carry the intermediate values {address, operand, result}

  • From Brett@21:1/5 to mitchalsup@aol.com on Tue Aug 20 23:08:03 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a non-LD-OP
    machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.
    But you have not built such, you built an improved RISC…

    I assume OP-ST has issues with the value getting stuck if the address is
    slow to resolve. With a register the value can just spill to the register backing file. And because of this you create a hidden register name for the value.

    You have information on how many hidden registers are in flight on average
    and worst case, so I believe your numbers.

    I have not looked to see if compilers generate LD-OP and OP-ST, at one
    point Intel was discouraging such code.

  • From MitchAlsup1@21:1/5 to Brett on Wed Aug 21 01:40:10 2024
    On Tue, 20 Aug 2024 23:08:03 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a non-LD-OP
    machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.

    If you put cache write at or after register file write in the
    pipeline, LD-OP-ST basically falls out for free and you can
    move the intermediate values from whence they are produced
    to where they are consumed with forwarding.

    But you have not built such, you built an improved RISC…

    I spent 7 years doing x86-64.....so much for not having.....

    It was that episode that cemented me on the value of [Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
    and LD-OP-STs. Then I took that and made a better RISC ISA.
    That RISC ISA did not have LD-OP-STs because of OpCode
    encoding reasons not from pipelining reasons.

    I assume OP-ST has issues with the value getting stuck if the address is
    slow to resolve. With a register the value can just spill to the
    register backing file. And because of this you create a hidden register
    name for the value.

    Athlon and Opteron had value capturing reservation stations.
    K9 had value-free RSs. It caused little headache because
    while we did not give it a named physical register, we did
    give it a physical register for the intermediates. SW can only
    read/write named PRs getting the name from logical to physical
    register renaming.

    You have information on how many hidden registers are in flight on
    average and worst case, so I believe your numbers.

    I have not looked to see if compilers generate LD-OP and OP-ST, at one
    point Intel was discouraging such code.

    Partially because AMD performed "relatively" better on LD-OPs and
    LD-OP-STs than Intel at that time. Where "relatively" means
    significantly above the noise level but "not all that much".

  • From Brett@21:1/5 to mitchalsup@aol.com on Wed Aug 21 05:13:41 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 23:08:03 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a non-LD-OP
    machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.

    If you put cache write at or after register file write in the
    pipeline, LD-OP-ST basically falls out for free and you can
    move the intermediate values from whence they are produced
    to where they are consumed with forwarding.

    LD-OP-ST mostly only fits if it is add to memory.

    42 bit opcodes work, you only need one in four RISC opcodes to merge to
    LD-OP or OP-ST for code density to be the same, and generally you will do better.

    The two leftover bits can be ignored, or be a template indicator, so you
    can pack in a LD-OP-ST, or 31 bit RISC ops.

    Or go heads and tails packing.

    But you have not built such, you built an improved RISC…

    I spent 7 years doing x86-64.....so much for not having.....

    It was that episode that cemented me on the value of [Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
    and LD-OP-STs. Then I took that and made a better RISC ISA.
    That RISC ISA did not have LD-OP-STs because of OpCode
    encoding reasons not from pipelining reasons.

    I assume OP-ST has issues with the value getting stuck if the address is
    slow to resolve. With a register the value can just spill to the
    register backing file. And because of this you create a hidden register
    name for the value.

    Athlon and Opteron had value capturing reservation stations.
    K9 had value-free RSs. It caused little headache because
    while we did not give it a named physical register, we did
    give it a physical register for the intermediates. SW can only
    read/write named PRs getting the name from logical to physical
    register renaming.

    You have information on how many hidden registers are in flight on
    average and worst case, so I believe your numbers.

    I have not looked to see if compilers generate LD-OP and OP-ST, at one
    point Intel was discouraging such code.

    Partially because AMD performed "relatively" better on LD-OPs and
    LD-OP-STs than Intel at that time. Where "relatively" means
    significantly above the noise level but "not all that much".


  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Aug 21 12:00:47 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 20 Aug 2024 18:18:25 +0000, EricP wrote:
    On OoO, if the reservation stations are valueless, you need a real
    physical register to stash the temp value as there is no guarantee
    the OP part of the uOp will launch just when the LD part finishes
    doing its thing and forwards the value.

    In the LD-OP-ST microarchitecture there would be some buffer
    that carries the intermediate values through the execution
    window. And, Yes, you can build a LD-OP-ST reservation station
    (Athlon and Opteron did).

    All the material I have seen is that AMD has a load-store ROP, but the
    op in between is in a separate functional unit, with a separate
    scheduler entry; and I expect that the load-store ROP occupies the
    load/store scheduler(s) twice: once for the load part, once for the
    store part. There is also something about macroops that can be
    load-op-stores, but from what I have read, when it comes to execution,
    they are split into ROPs.

    If you have more details that contradict the information published up
    to now, please let us know more about them.

    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Aug 21 10:13:12 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    Latency is not the issue in modern high-performance AMD64 cores, which
    have zero-cycle store-to-load forwarding <http://www.complang.tuwien.ac.at/anton/memdep/>.

    And yet, putting variables in registers gives a significant speedup:
    On a Rocket Lake, numbers are times in seconds:

    sieve bubble matrix fib fft
    0.075 0.070 0.036 0.049 0.017 TOS in reg, RP in reg, IP in reg
    0.100 0.149 0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem

    In the first line, I used gforth-fast and tried to disable all
    optimizations except those that keep certain variables in registers:

    gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs

    I could not reduce the static superinstructions below 31 and still get
    a result; I will have to investigate why, but that probably does not
    make that much of a difference for several of these benchmarks.

    In the second line I used gforth, an engine that keeps the top of
    stack in memory, the return-stack pointer in memory, stores IP to
    memory after every change, and does not use static superinstructions,
    all for better identifying where an error happened.

    In the examples cited, the lack of register allocation triples
    the instruction count due to the lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a
    non-LD-OP machine would need to break even on the instruction count.

    What makes you think that instruction count is particularly relevant?
    Yes, you may save some decoding resources if you use LD-OP-ST on an
    architecture that supports it, but you first had to invest in a more
    complex decoder. And in the OoO engine the difference may be gone (at
    least on Intel CPUs).

    Consider the Forth program

    : squared dup * ;

    This results in the following code sequences for the two engines
    mentioned above:

    dup 1->1 dup 0->0
    mov $50[r13],r15
    add rbx,$08 add r15,$08
    mov $00[r13],r8 mov rax,[r14]
    sub r13,$08 sub r14,$08
    mov [r14],rax
    * 1->1 * 0->0
    mov $50[r13],r15
    add rbx,$08 add r15,$08
    mov rax,$08[r14]
    imul r8,$08[r13] imul rax,[r14]
    add r13,$08 add r14,$08
    mov [r14],rax
    ;s 1->1 ;s 0->0
    mov $50[r13],r15
    mov rax,$58[r13]
    mov rbx,[r14] mov r10,[rax]
    add r14,$08 add rax,$08
    mov $58[r13],rax
    mov r15,r10
    mov rax,[rbx] mov rcx,[r15]
    jmp rax jmp rcx

    TOS=r8, RP=r14, IP=rbx TOS=[r14], RP=$58[r13], IP=r15/$50[r13]

    The registers are allocated differently in the two engines; for the
    three things where the memory/register allocation differed, I have
    shown the allocation.
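    The difference between the two allocation schemes can be sketched in C.
    This is a hypothetical, much-simplified model of "dup" in the two engine
    styles above; the function names and stack layout are illustrative, not
    Gforth's actual code:

```c
/* Hypothetical, much-simplified C model of the Forth word "dup" in the
 * two engine styles shown above.  The data stack grows downward, as in
 * the assembly listings.  Names are illustrative, not Gforth's. */

/* gforth-fast style: the top of stack is cached in a C variable that
 * the compiler keeps in a register; dup only stores the old copy. */
static long *dup_tos_in_reg(long *sp, long tos)
{
    *sp = tos;      /* spill a copy of the register-held TOS */
    return sp - 1;  /* one cell pushed; TOS itself stays in the register */
}

/* plain-gforth style: the top of stack lives in memory at *sp, so dup
 * needs a load and a store on every execution. */
static long *dup_tos_in_mem(long *sp)
{
    long v = *sp;   /* load TOS from memory */
    sp -= 1;
    *sp = v;        /* store the duplicate */
    return sp;
}
```

    Compiled at -O2, the first variant tends to become a store plus a
    pointer adjustment, while the second adds a load and a dependent store,
    mirroring the instruction-count difference in the listings.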

    One interesting case is the sequence

    7FA02A77133D: mov rax,$58[r13]
    7FA02A771341: mov r10,[rax]
    7FA02A771344: add rax,$08
    7FA02A771348: mov $58[r13],rax

    Sure you could use a load-op-store instruction for adding 8 to
    $58[r13], but the mov in 7FA02A771341 still needs the value in a
    register, so apparently gcc (which produced the code snippets for the individual Forth words above) decided that it's better to save
    execution resources rather than reduce the number of instructions (at
    a higher execution resource cost) by writing the code as

    mov rax,$58[r13]
    add $58[r13], $8
    mov r10,[rax]

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Brett on Wed Aug 21 12:09:58 2024
    Brett <ggtgp@yahoo.com> writes:
    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.

    Can you name one architecture that has OP-ST? For VAX instructions
    the sources can be in registers and the target in memory, so let's
    refine this: Can you name one architecture that does not have
    LD-OP-ST, but that has OP-ST? The S/360 and PDP-11 approach of having
    one memory operand that can be either a source (ld-op) or a source and
    target (ld-op-st) seems to have had many successors, in particular 8086/IA-32/AMD64.

    I have not looked to see if compilers generate LD-OP and OP-ST; at one
    point Intel was discouraging such code.

    There is no OP-ST in the AMD64 instruction set, but gcc certainly
    generates LD-OP and LD-OP-ST; for the latter:

    Code +
    5570F6F544D1: mov $50[r13],r15
    5570F6F544D5: add r15,$08
    5570F6F544D9: lea rax,$08[r14]
    5570F6F544DD: mov rdx,[r14]
    5570F6F544E0: add [rax],rdx
    5570F6F544E3: mov r14,rax
    5570F6F544E6: mov rcx,[r15]
    5570F6F544E9: jmp rcx

    The instruction at 5570F6F544E0 is a LD-OP-ST.
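    For comparison, a minimal C case that gcc commonly turns into such a
    load-op-store. The exact instruction selection depends on compiler
    version and flags, so treat the commented asm as an illustration rather
    than guaranteed output, and `add_to_cell` as an invented name:

```c
/* A read-modify-write of a memory cell.  At -O2 on AMD64, gcc commonly
 * compiles the body to a single load-op-store instruction such as
 * "addq %rsi, (%rdi)"; exact codegen depends on compiler and flags. */
void add_to_cell(long *cell, long n)
{
    *cell += n;
}
```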

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to Anton Ertl on Wed Aug 21 16:42:33 2024
    On Wed, 21 Aug 2024 12:00:47 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton

    AFAIK, on the Intel side, LD-OP-ST is decoded into 4 uOps that are
    immediately fused into 2 fused uOps. They travel through rename phase
    as 2 uOps. I am not sure if they are split back into 4 uOps before or
    after OoO schedulers, but would guess the former.

  • From Scott Lurndal@21:1/5 to Brett on Wed Aug 21 14:28:17 2024
    Brett <ggtgp@yahoo.com> writes:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to the lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a
    non-LD-OP machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    These are some of the most important operations. Even ARM64
    supports a small set of LD-OP-ST (atomic) operations. Most CPU
    implementations delegate them to the cache subsystem or
    to an I/O device (e.g. PCIe which supports atomic operations).
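    A hedged C sketch of how such an atomic LD-OP-ST is reached from source.
    The single-instruction LDADD form requires an ARMv8.1+/LSE target; the
    function name is invented for this example:

```c
#include <stdatomic.h>

/* Atomic read-modify-write.  Built for ARMv8.1+ with LSE
 * (e.g. -march=armv8.1-a), compilers can emit a single LDADD here --
 * an atomic LD-OP-ST that the core may delegate to the cache
 * subsystem.  On AMD64 the same source becomes "lock xadd".
 * Codegen depends on target and flags. */
long fetch_add_cell(_Atomic long *cell, long n)
{
    return atomic_fetch_add(cell, n);  /* returns the old value */
}
```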

  • From Anton Ertl@21:1/5 to Michael S on Wed Aug 21 15:28:05 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 21 Aug 2024 12:00:47 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton

    AFAIK, on the Intel side, LD-OP-ST is decoded into 4 uOps that are
    immediately fused into 2 fused uOps.

    Which 4 uops and 2 macroops are those? My guess is that ST is
    store-data and store-address uops, and ld and op are one uop each.

    They travel through rename phase
    as 2 uOps.

    Interesting. But yes, only two values are generated for physical
    registers: the result of the load and the result of the op. So I
    expect that the two store parts are tacked onto the op on the way
    through the renamer, and then that macroop is split into its parts on
    the way to the schedulers.

    I am not sure if they are split back into 4 uOps before or
    after OoO schedulers, but would guess the former.

    Golden Cove is depicted as having an op scheduler, a load scheduler
    and a store scheduler, so they have to split the ld-op-store into at
    least three parts for scheduling.

    Sunny Cove is depicted as having an op scheduler, a store data
    scheduler, and two AGU schedulers, which would again mean at least
    three parts, but this time with a different split.

    Both based on <https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/>
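    The splitting being described can be mocked up as a toy routing model.
    This is purely illustrative, not Intel's actual design; the uop kinds
    and scheduler names are invented for the sketch:

```c
/* Toy model of cracking one LD-OP-ST macroop for a Golden-Cove-style
 * back end with separate op, load and store schedulers.  Purely
 * illustrative; kinds and routing are invented for this sketch. */
enum uop_kind { UOP_LOAD, UOP_OP, UOP_STA, UOP_STD }; /* STA/STD = store addr/data */
enum scheduler { SCHED_OP, SCHED_LOAD, SCHED_STORE };

/* Crack "add [mem], reg" into its four constituent uops. */
static int crack_ld_op_st(enum uop_kind out[4])
{
    out[0] = UOP_LOAD;  /* read the memory operand */
    out[1] = UOP_OP;    /* the ALU part */
    out[2] = UOP_STA;   /* store address (reuses the load's address) */
    out[3] = UOP_STD;   /* store data */
    return 4;
}

/* Route each uop to a scheduler: three different schedulers are
 * touched, matching the "at least three parts" observation above. */
static enum scheduler route(enum uop_kind k)
{
    switch (k) {
    case UOP_LOAD: return SCHED_LOAD;
    case UOP_OP:   return SCHED_OP;
    default:       return SCHED_STORE;  /* both store halves */
    }
}
```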

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Stephen Fuld@21:1/5 to Anton Ertl on Wed Aug 21 08:49:10 2024
    On 8/21/2024 3:13 AM, Anton Ertl wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    Latency is not the issue in modern high-performance AMD64 cores, which
    have zero-cycle store-to-load forwarding <http://www.complang.tuwien.ac.at/anton/memdep/>.

    And yet, putting variables in registers gives a significant speedup:
    On a Rocket Lake, numbers are times in seconds:

    sieve bubble matrix fib fft
    0.075 0.070 0.036 0.049 0.017 TOS in reg, RP in reg, IP in reg
    0.100 0.149 0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem

    In the first line, I used gforth-fast and tried to disable all
    optimizations except those that keep certain variables in registers:

    gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs

    I could not reduce the static superinstructions below 31 and still get
    a result; I will have to investigate why, but that probably does not
    make that much of a difference for several of these benchmarks.

    In the second line I used gforth, an engine that keeps the top of
    stack in memory, the return-stack pointer in memory, stores IP to
    memory after every change, and does not use static superinstructions,
    all for better identifying where an error happened.

    In the examples cited, the lack of register allocation triples
    the instruction count due to the lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a
    non-LD-OP machine would need to break even on the instruction count.

    What makes you think that instruction count is particularly relevant?
    Yes, you may save some decoding resources if you use LD-OP-ST on an
    architecture that supports it, but you first had to invest in a more
    complex decoder. And in the OoO engine the difference may be gone (at
    least on Intel CPUs).

    There are also some savings in reduced I-cache usage (possibly leading
    to a higher I-cache hit rate), reduced memory bandwidth required for
    I-fetch, etc., though these may be modest at best.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Anton Ertl@21:1/5 to Stephen Fuld on Wed Aug 21 16:45:37 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    There are also some savings in reduced I-cache usage (possibly leading
    to a higher I-cache hit rate), reduced memory bandwidth required for
    I-fetch, etc., though these may be modest at best.

    Let's see how that works out. I am using the code size numbers
    from <2024Jan4.101941@mips.complang.tuwien.ac.at>:

    bash grep gzip
    595204 107636 46744 armhf 16 regs load/store 32-bit
    599832 101102 46898 riscv64 32 regs load/store 64-bit
    796501 144926 57729 amd64 16 regs ld-op ld-op-st 64-bit
    829776 134784 56868 arm64 32 regs load/store 64-bit
    853892 152068 61124 i386 8 regs ld-op ld-op-st 32-bit
    891128 158544 68500 armel 16 regs load/store 32-bit
    892688 168816 64664 s390x 16 regs ld-op ld-op-st 64-bit
    1020720 170736 71088 mips64el 32 regs load/store 64-bit
    1168104 194900 83332 ppc64el 32 regs load/store 64-bit

    So the least code size is from a load/store architecture with 16
    registers, followed (or preceded in the case of grep) by a load/store architecture with 32 registers. The instruction sets that have
    load-op and load-op-st instructions result in bigger code. The
    different sizes of armhf (ARMv7) and armel (ARMv4t-ARMv6t) show that
    there is more to code sizes than just the architecture.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Anton Ertl on Wed Aug 21 17:54:44 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    Latency is not the issue in modern high-performance AMD64 cores, which
    have zero-cycle store-to-load forwarding <http://www.complang.tuwien.ac.at/anton/memdep/>.

    And yet, putting variables in registers gives a significant speedup:
    On a Rocket Lake, numbers are times in seconds:

    sieve bubble matrix fib fft
    0.075 0.070 0.036 0.049 0.017 TOS in reg, RP in reg, IP in reg
    0.100 0.149 0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem

    In the first line, I used gforth-fast and tried to disable all
    optimizations except those that keep certain variables in registers:

    gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs

    I could not reduce the static superinstructions below 31 and still get
    a result; I will have to investigate why, but that probably does not
    make that much of a difference for several of these benchmarks.

    Fixed that, so now with

    gforth-fast --ss-states=1 --ss-number=0 --opt-ip-updates=0 onebench.fs

    sieve bubble matrix fib fft
    0.069 0.074 0.036 0.052 0.017 TOS in reg, RP in reg, IP in reg
    0.100 0.149 0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem

    Or on a Golden Cove:

    sieve bubble matrix fib fft
    0.059 0.059 0.024 0.047 0.020 TOS in reg, RP in reg, IP in reg
    0.108 0.156 0.065 0.098 0.037 TOS in mem, RP in mem, IP write-through to mem

    So even on these advanced cores with zero-cycle store-to-load
    forwarding it hurts quite a bit to keep variables in memory.
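    A small C analogue of the same register-vs-memory effect. Illustrative
    only: the `volatile` forces a store and a reload on every iteration,
    loosely mimicking an engine that keeps its state in memory; it is not
    how Gforth is built, and the function names are invented:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: the same reduction with the accumulator forced
 * through memory on each iteration (volatile => store + reload) versus
 * left in a register.  Loosely mimics keeping TOS/IP in memory. */
uint64_t sum_via_memory(const uint64_t *a, size_t n)
{
    volatile uint64_t acc = 0;   /* accumulator lives in memory */
    for (size_t i = 0; i < n; i++)
        acc += a[i];             /* load, add, store every iteration */
    return acc;
}

uint64_t sum_via_register(const uint64_t *a, size_t n)
{
    uint64_t acc = 0;            /* the compiler keeps this in a register */
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}
```

    Both functions compute the same sum; only the placement of the
    accumulator differs, so any timing gap between them comes from the
    extra store-to-load traffic.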

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Stephen Fuld@21:1/5 to Anton Ertl on Wed Aug 21 10:20:07 2024
    On 8/21/2024 9:45 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    There are also some savings in reduced I-cache usage (possibly leading
    to a higher I-cache hit rate), reduced memory bandwidth required for
    I-fetch, etc., though these may be modest at best.

    Let's see how that works out. I am using the code size numbers
    from <2024Jan4.101941@mips.complang.tuwien.ac.at>:

    bash grep gzip
    595204 107636 46744 armhf 16 regs load/store 32-bit
    599832 101102 46898 riscv64 32 regs load/store 64-bit
    796501 144926 57729 amd64 16 regs ld-op ld-op-st 64-bit
    829776 134784 56868 arm64 32 regs load/store 64-bit
    853892 152068 61124 i386 8 regs ld-op ld-op-st 32-bit
    891128 158544 68500 armel 16 regs load/store 32-bit
    892688 168816 64664 s390x 16 regs ld-op ld-op-st 64-bit
    1020720 170736 71088 mips64el 32 regs load/store 64-bit
    1168104 194900 83332 ppc64el 32 regs load/store 64-bit

    So the least code size is from a load/store architecture with 16
    registers, followed (or preceded in the case of grep) by a load/store architecture with 32 registers. The instruction sets that have
    load-op and load-op-st instructions result in bigger code.

    Interesting, thanks.


    The
    different sizes of armhf (ARMv7) and armel (ARMv4t-ARMv6t) show that
    there is more to code sizes than just the architecture.

    Certainly. It would take a more detailed analysis (which I am not
    capable of) to determine all the causes of the results you show.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Michael S@21:1/5 to mitchalsup@aol.com on Wed Aug 21 22:31:01 2024
    On Wed, 21 Aug 2024 19:13:55 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    The LD-OP-STs in Athlon and Opteron had a memory OpCode and a
    calculation OpCode, and were performed in such a way that the physical
    address of the LD was used for the ST when its time came. The
    calculation OpCode was executed by an ALU or the IMUL/DIV unit.


    Are you sure about IMUL/DIV?
    MUL and DIV instructions have no RMW form on x86/i386/AMD64.
    OTOH, shifts do.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Aug 21 19:13:55 2024
    On Wed, 21 Aug 2024 12:00:47 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 20 Aug 2024 18:18:25 +0000, EricP wrote:
    On OoO, if the reservation stations are valueless, you need a real
    physical register to stash the temp value as there is no guarantee
    the OP part of the uOp will launch just when the LD part finishes
    doing its thing and forwards the value.

    In the LD-OP-ST microarchitecture there would be some buffer
    that carries the intermediate values through the execution
    window. And, Yes, you can build a LD-OP-ST reservation station
    (Athlon and Opteron did).

    All the material I have seen is that AMD has a load-store ROP, but the
    op in between is in a separate functional unit, with a separate
    scheduler entry; and I expect that the load-store ROP occupies the
    load/store scheduler(s) twice: once for the load part, once for the
    store part.

    The LD-OP-STs in Athlon and Opteron had a memory OpCode and a calculation
    OpCode, and were performed in such a way that the physical address of
    the LD was used for the ST when its time came. The calculation OpCode
    was executed by an ALU or the IMUL/DIV unit.

    There is also something about macroops that can be load-op-stores, but from what I have read, when it comes to execution,
    they are split into ROPs.

    If you have more details that contradict the information published up
    to now, please let us know more about them.

    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton

  • From Michael S@21:1/5 to Anton Ertl on Wed Aug 21 22:46:12 2024
    On Wed, 21 Aug 2024 15:28:05 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 21 Aug 2024 12:00:47 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton

    AFAIK, on the Intel side, LD-OP-ST is decoded into 4 uOps that are
    immediately fused into 2 fused uOps.

    Which 4 uops and 2 macroops are those? My guess is that ST is
    store-data and store-address uops, and ld and op are one uop each.


    Most likely.

    They travel through rename phase
    as 2 uOps.

    Interesting. But yes, only two values are generated for physical
    registers: the result of the load and the result of the op. So I
    expect that the two store parts are tacked onto the op on the way
    through the renamer, and then that macroop is split into its parts on
    the way to the schedulers.

    I am not sure if they are split back into 4 uOps before or
    after OoO schedulers, but would guess the former.

    Golden Cove is depicted as having an op scheduler, a load scheduler
    and a store scheduler, so they have to split the ld-op-store into at
    least three parts for scheduling.

    Sunny Cove is depicted as having an op scheduler, a store data
    scheduler, and two AGU schedulers, which would again mean at least
    three parts, but this time with a different split.

    Both based on <https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/>

    - anton

    Unlike previous Intel cores, both Sunny Cove and Golden Cove have no
    universal AGUs. Each AGU is dedicated either to calculation of load
    addresses or to calculation of store addresses (2+2 on SuCo, 3+2 on
    GoCo).
    So, on these cores I see no way that fewer than 4 uOps can go to the
    schedulers. My uncertainty was about older PRF-based cores, i.e. SB
    through Skylake.

  • From Stephen Fuld@21:1/5 to All on Thu Aug 22 12:01:13 2024
    On 8/20/2024 6:40 PM, MitchAlsup1 wrote:


    snip


    I spent 7 years doing x86-64.....so much for not having.....

    It is that episode that cemented for me the value of
    [Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
    and LD-OP-STs. Then I took that and made a better RISC ISA.
    That RISC ISA did not have LD-OP-STs for OpCode encoding
    reasons, not pipelining reasons.


    I understand that providing LD-OP for all the operations would take a
    lot of opcode space. But I suspect that the utility of LD-OP varies
    with which operation is involved; e.g., there are probably more
    instances where a combined load and integer add would be useful than a
    combined load and floating-point divide. I suspect that determining
    the few most useful combinations wouldn't be too difficult.

    So the question. Does it make sense to use a few op codes to implement
    the most common LD-OPs?




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Brett@21:1/5 to Brett on Fri Aug 23 00:32:14 2024
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 23:08:03 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to the lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a
    non-LD-OP machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.

    If you put the cache write at or after the register-file write in the
    pipeline, LD-OP-ST basically falls out for free, and you can
    move the intermediate values from where they are produced
    to where they are consumed with forwarding.

    LD-OP-ST mostly only fits if it is add to memory.

    42-bit opcodes work: you only need one in four RISC opcodes to merge into an
    LD-OP or OP-ST for code density to break even, and generally you will do better.

    The two leftover bits can be ignored, or be a template indicator, so you
    can pack in a LD-OP-ST, or 31 bit RISC ops.

    When you use a packet to hold 3 LD-OP-ST or 4 RISC ops, I am not talking about two separate decoders.

    75% of the data format would be shared, which, yes, means one will be
    scattered, but that does not matter in the grand scheme of things. Two
    fully separate decoders would be far, far uglier.

    Or go heads and tails packing.

    But you have not built such; you built an improved RISC…

    I spent 7 years doing x86-64.....so much for not having.....

    It is that episode that cemented for me the value of
    [Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
    and LD-OP-STs. Then I took that and made a better RISC ISA.
    That RISC ISA did not have LD-OP-STs for OpCode encoding
    reasons, not pipelining reasons.

    I assume OP-ST has issues with the value getting stuck if the address is
    slow to resolve. With a register the value can just spill to the
    register backing file. And because of this you create a hidden register
    name for the value.

    Athlon and Opteron had value-capturing reservation stations.
    K9 had value-free RSs. It caused little headache because,
    while we did not give it a named physical register, we did
    give it a physical register for the intermediates. SW can only
    read/write named PRs, getting the name from logical-to-physical
    register renaming.

    You have information on how many hidden registers are in flight on
    average and worst case, so I believe your numbers.

    I have not looked to see if compilers generate LD-OP and OP-ST; at one
    point Intel was discouraging such code.

    Partially because AMD performed "relatively" better on LD-OPs and
    LD-OP-STs than Intel at that time. Where "relatively" means
    significantly above the noise level but "not all that much".
