• Re: Why I've Dropped In

    From MitchAlsup1@21:1/5 to David Chmelik on Thu May 22 17:42:14 2025
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:


    What is Concertina 2?

    Roughly speaking, it is a design that supports most of the
    non-power-of-2 data types {36-bit, 48-bit, 60-bit} along with the
    standard power-of-2 lengths {8, 16, 32, 64}.

    This creates "interesting" situations with respect to instruction
    formatting and to the constants required in support of those
    instructions; and interesting requirements in other areas of the ISA.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu May 22 18:03:34 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:


    What is Concertina 2?

    Roughly speaking, it is a design that supports most of the
    non-power-of-2 data types {36-bit, 48-bit, 60-bit} along with the
    standard power-of-2 lengths {8, 16, 32, 64}.

    Both sets are congruent to zero modulo 4. Therefore, the
    only proper solution becomes that modulo value, which amounts
    in this case to a 4-bit digit/nibble. Any size data type can
    be constructed from a variable number of nibbles up
    to some architectural max (e.g. 400 bits for a 100 nibble
    operand). The processor can treat them as binary or BCD
    depending on the requirements of the application (e.g. BCD
    fits COBOL well).
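
    As a minimal C sketch of the nibble idea (an illustration, not from
    any poster's ISA; the names here are invented), a decimal operand is
    just a count of 4-bit digits, so a 7-digit (28-bit) value packs so:

        #include <stdint.h>
        #include <stdio.h>

        /* Pack up to 16 decimal digits into 4-bit BCD nibbles,
           least-significant digit in the low nibble. */
        static uint64_t pack_bcd(const uint8_t *digits, int ndigits)
        {
            uint64_t v = 0;
            for (int i = ndigits - 1; i >= 0; i--)
                v = (v << 4) | (digits[i] & 0xF);
            return v;
        }

        int main(void)
        {
            uint8_t d[7] = {9, 0, 4, 2, 5, 3, 1};  /* 1352409, LSD first */
            /* In BCD the hex digits read as the decimal digits: 1352409 */
            printf("%llx\n", (unsigned long long)pack_bcd(d, 7));
            return 0;
        }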

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri May 23 12:37:38 2025
    On Thu, 22 May 2025 18:03:34 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:


    What is Concertina 2?

    Roughly speaking, it is a design that supports most of the
    non-power-of-2 data types {36-bit, 48-bit, 60-bit} along with the
    standard power-of-2 lengths {8, 16, 32, 64}.

    Both sets are congruent to zero modulo 4.

    Restricted because it does not support 28-bits, 40-bits, 44-bits,...

    Therefore, the
    only proper solution becomes that modulo value, which amounts
    in this case to a 4-bit digit/nibble. Any size data type can
    be constructed from a variable number of nibbles up
    to some architectural max (e.g. 400 bits for a 100 nibble
    operand). The processor can treat them as binary or BCD
    depending on the requirements of the application (e.g. BCD
    fits COBOL well).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri May 23 13:24:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 22 May 2025 18:03:34 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:


    What is Concertina 2?

    Roughly speaking, it is a design that supports most of the
    non-power-of-2 data types {36-bit, 48-bit, 60-bit} along with the
    standard power-of-2 lengths {8, 16, 32, 64}.

    Both sets are congruent to zero modulo 4.

    Restricted because it does not support 28-bits, 40-bits, 44-bits,...

    28/4 = 7 digits, 40/4 = 10 digits. Works just fine.

    The Burroughs medium systems loaded 10 digits at a time from
    memory when processing an operand.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Wed Jun 11 05:56:33 2025
    quadibloc <quadibloc@gmail.com> schrieb:

    Since the basis of the ISA is a RISC-like ISA,

    [...]

    3) Use only four base registers instead of eight.
    4) Use only three index registers instead of seven.
    5) Use only six index registers instead of seven, and use only four base registers instead of eight when indexing is used.

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Jun 11 16:37:27 2025
    On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:

    On 6/11/2025 12:56 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:

    Since the basis of the ISA is a RISC-like ISA,

    [...]

    3) Use only four base registers instead of eight.
    4) Use only three index registers instead of seven.
    5) Use only six index registers instead of seven, and use only four base registers instead of eight when indexing is used.

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.


    Agreed.

    Ideally, one has an ISA where nearly all registers are the same:
    No distinction between base/index/data registers;
    No distinction between integer and floating point registers;
    No distinction between general registers and SIMD registers;
    ...

    Agreed:: But most architectures get the FP registers wrong under that distinction, and apparently everyone gets the SIMD registers wrong.

    Maybe it should be stated:: There is one register file of k bits per
    register (where k = 32, 64, 128) and that there is no distinction between
    what kind of data can go in what register.

    Though, there are tradeoffs. For example, SPRs can be, by definition,
    not the same as GPRs. Say, if you have an SP or LR, almost by
    definition, you will not be using it as a GPR.

    Disagree:: One uses the SP as a base register "all the time",
    one uses LR as a JMP source "every subroutine return".
    Either is generally done using GPRs, and thus the problem
    is to guarantee that you don't have so many of them that
    you can't use them naturally in your ISA.

    So, if ZR/LR/SP/GP are "not GPR", this is fine.
    Pretty much everything else is best served by being a GPR or suitably
    GPR like.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Wed Jun 11 16:49:06 2025
    On Wed, 11 Jun 2025 14:12:04 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    And that is why you don't do it that way.

    We can all agree that [Rbase+Rindex<<scale+Displacement] is (IS) the
    proper way to abstract address-generation. The problem is how does
    one "get there". Another way to look at this is that the AGEN unit
    is built to do index scaling AND DISPlacement addition as its
    primitives. The rest is routing of operands to AGEN--and in this case
    we KNOW that DISP is a constant at DECODE time and can arrange its
    instruction-queueing appropriately.

    In my case I broke it into 2 sets of patterns::

    MEM Rd,[Rbase+DISP16]
    and
    MEM Rd,[Rbase+Rindex<<scale]

    both of which fit in 32-bits. With 6-bit Major OpCode, this eats
    up 3/8ths of the OpCode space (There are 2× as many LDs as STs)
    in the Major OpCode repository. THEN one finds a way to add
    DISP32 and DISP64 (or ABS64) constants to the second form.

    DISP16 covers 70% of memory references, base+index covers
    another 20%, so one needs [b+i<<s+DISP] only 5%-10% of the
    time. But every time you can use it, it saves executing another
    instruction (sometimes 2).
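
    A minimal C sketch of that AGEN primitive (illustrative only; the
    names are invented): the displacement is a decode-time constant, the
    index is scaled, and everything else is operand routing.

        #include <stdint.h>

        /* EA = Rbase + (Rindex << scale) + DISP. DISP is known at
           DECODE time; index scaling and displacement addition are
           the AGEN primitives. */
        static uint64_t agen(uint64_t rbase, uint64_t rindex,
                             unsigned scale, int64_t disp)
        {
            return rbase + (rindex << scale) + (uint64_t)disp;
        }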

    As I required 5 bits for the opcode to allow both loads and stores for several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    My LDs are content free (LDs don't care if they are loading
    integer, floating point, or SIMD data, ...).

    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those purposes.

    This is going to hurt register allocation.

    A standard RISC would not have an index register field, only a base
    register field, meaning array accesses would require multiple
    instructions.

    The 68000 only had base-index addressing with an 8-bit displacement;
    true base-index addressing with a normal displacement arrived in the
    68020, but the instructions using it took up 48 bits.

    I'll agree the 68000 architecture did have a serious mistake. It was
    CISC, so it didn't need to be RISC-like, but the special address
    registers should only have been used as base registers; the regular arithmetic registers should have been the ones used as index registers,
    since one has to do arithmetic to produce valid index values.

    The separate address registers would then have been useful, by allowing
    those of the eight (rather than 16 or 32) general registers that would
    have been used up holding static base register values to be freed up.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Wed Jun 11 17:05:13 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    What is the use case for having base and index register and a
    16-bit displacement?

    16 bits is (usually) large enough to address data relative to a stack
    or frame pointer. It is rarely needed to address members of a struct,
    whose offsets are usually much smaller.

    The use case for base + index exists, for things like

    for (i=0; i<n; i++)
    a[i] = b[i] + c[i]

    you only need four registers instead of six.

    If you want to have base + index + offset, it would probably be
    wise to restrict yourself to a smaller offset, or go big and
    allow a 32-bit offset.
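
    A C sketch of the register-count point (illustrative only): with
    base+index addressing, the loop above keeps the three array bases
    plus i live; without it, a compiler typically strength-reduces to
    three running pointers, which is where the extra registers go.

        /* Base+index form: a, b, c, and i suffice; each access is one
           load/store of the form [base + i<<3]. */
        void add_indexed(double *a, const double *b, const double *c,
                         long n)
        {
            for (long i = 0; i < n; i++)
                a[i] = b[i] + c[i];
        }

        /* Base-only form: strength-reduced to three running pointers,
           consuming extra registers for the bumped copies. */
        void add_bumped(double *a, const double *b, const double *c,
                        long n)
        {
            for (double *end = a + n; a < end; a++, b++, c++)
                *a = *b + *c;
        }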

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Wed Jun 11 16:51:29 2025
    BGB <cr88192@gmail.com> writes:
    For example, SPRs can be, by definition,
    not the same as GPRs. Say, if you have an SP or LR, almost by
    definition, you will not be using it as a GPR.

    So, if ZR/LR/SP/GP are "not GPR", this is fine.

    I assume you mean Zero register, link register, stack pointer, global
    pointer. On most register architectures (those with GPRs) all of them
    are addressed as GPRs in most instructions. Specifically:

    Zero register: The CISCs (S/360, PDP-11, VAX, IA-32, AMD64) don't have
    a zero register, but use immediate 0 instead. Most RISCs have a
    register (register 0 or 31) that is addressed like a GPR, but really
    is a special-purpose register: It reads as 0 and writing to it has no
    effect. Power has some instructions that treat register 0 as zero
    register and others that treat it as GPR.

    Link register: On some architectures there is a register that is a GPR
    as far as most instructions are concerned. But the call instruction
    with immediate (relative) target uses that register as implicit target
    for the return address. MIPS is an example of that. Power has LR as
    a special-purpose register.

    Stack pointer: That's just software-defined on many register
    architectures, i.e., one could change the ABI to use a different stack
    pointer, and the resulting code would have the same size and speed.
    An interesting case is RISC-V. In RV64G it's just software-defined,
    but the C (compressed) extension defines some instructions that
    provide smaller instructions for a specific assignment of SP to the
    GPRs; I expect that similar things happen for other compressed
    instruction set extensions.

    Global pointer: That's just software-defined on all register
    architectures I am aware of.

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to quadibloc on Wed Jun 11 17:33:35 2025
    quadibloc <quadibloc@gmail.com> writes:
    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base
    register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    The solution of RISC architectures has been to not have displacement
    and index registers at the same time (MIPS and its descendants do not
    have base+index addressing at all). The solution of CISC
    architectures has been to allow bigger instructions, and possibly
    different displacement sizes (e.g., 8 bits and 32 bits for IA-32 and
    AMD64).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Jun 11 19:08:06 2025
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    For example, SPRs can be, by definition,
    not the same as GPRs. Say, if you have an SP or LR, almost by
    definition, you will not be using it as a GPR.

    So, if ZR/LR/SP/GP are "not GPR", this is fine.

    I assume you mean Zero register, link register, stack pointer, global pointer. On most register architectures (those with GPRs) all of them
    are addressed as GPRs in most instructions. Specifically:

    Zero register: The CISCs (S/360, PDP-11, VAX, IA-32, AMD64) don't have
    a zero register, but use immediate 0 instead. Most RISCs have a
    register (register 0 or 31) that is addressed like a GPR, but really
    is a special-purpose register: It reads as 0 and writing to it has no
    effect. Power has some instructions that treat register 0 as zero
    register and others that treat it as GPR.

    My 66000 is a RISC architecture that does NOT have a zero register.
    Most instructions have the ability to use the 5-bit register
    specifier as a 5-bit immediate, and for these instructions #0
    signifies zero in both integer and floating point senses. #1
    signifies 0x0000000000000001 or 0x3FF0000000000000, ... so that
    one can do FADD R7,R19,#7 as a single 32-bit instruction word,
    saving instructions and code space. {Brian gets credit for this}

    Over on the memory side:: Rbase = 0 implies IP is the Base register;
    Rindex = 0 implies no indexing (but still having access to DISP32
    and DISP64 constants).

    Over on the call/return side:: When safe stack is in use, RETaddr
    goes on the top of CSP and R0 is not modified, but when safe stack
    is not in use, R0 <= RETaddr. The RET instruction, then, does the
    right thing based on the status of the Safe-Stack-in-use flag.
    CSP (call stack pointer) is used to hold RETaddr and preserved
    registers in a way the called program can neither read nor write,
    adding safety against actual attacks and bad programming.

    And then there is the CALX instruction--which is a LDD IP,[address]--
    which transfers control through a table in memory to an entry point
    in the table. Good for external linkage and method calls. An
    interesting point, here, is that this is only for CALL/RET and not
    for branching--thus, it can be predicted better than with typical
    jump-predict tables, because you are not predicting at the time of
    the JMP but at the time of the LDD; so, you can use the low-order
    bits (LOBs) of the address to help with the prediction.
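
    In C terms, CALX behaves like an indirect call through a table of
    entry points; a rough sketch (illustrative, not the actual encoding):

        /* Transfer control through a table in memory: one load of the
           entry point from [table + method*8], then enter it. The load
           is what gets predicted, not a later JMP. */
        typedef void (*entry_t)(void);

        static void call_through_table(const entry_t *table, int method)
        {
            table[method]();
        }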

    Link register: On some architectures there is a register that is a GPR
    as far as most instructions are concerned. But the call instruction
    with immediate (relative) target uses that register as implicit target
    for the return address. MIPS is an example of that. Power has LR as
    a special-purpose register.

    You could describe my use of safe-stack as putting LR in a
    "more-better" place than a GPR.

    Stack pointer: That's just software-defined on many register
    architectures, i.e., one could change the ABI to use a different stack pointer, and the resulting code would have the same size and speed.
    An interesting case is RISC-V. In RV64G it's just software-defined,
    but the C (compressed) extension defines some instructions that
    provide smaller instructions for a specific assignment of SP to the
    GPRs; I expect that similar things happen for other compressed
    instruction set extensions.

    My 66000 did something similar:: The ENTER and EXIT instructions
    use SP == R31 (or CSP) implicitly; values needing preservation are
    placed in memory the callee can neither LD nor ST. Other than ENTER,
    EXIT, and RET, SP could be any register.

    Interesting point:: the compiled code is not sensitive to the
    setting of the safe-stack flag--only the thread control regs are.
    The only pieces of SW that need cognition of safe-stack are
    longjmp() and the stack-walk-back used by TRY-THROW-CATCH.
    Both use the EXIT instruction in a "special" way to peel back
    layers on the stack.

    Global pointer: That's just software-defined on all register
    architectures I am aware of.

    In My 66000, it is simply an address constant. There is no rationale
    for consuming a register to get at something one can access with a
    longer address constant.

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to quadibloc on Wed Jun 11 14:56:56 2025
    quadibloc wrote:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    Plus 3 bits for the load/store operand type & size,
    plus 2 or 3 bits for the index scaling (I use 3).
    It all won't fit into a 32-bit fixed length instruction.

    A separate LEA Load Effective Address instruction to calculate rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an alternative.
    Then rDest is used as a base in the LD or ST.

    One or two constant prefix instruction(s) I mentioned before
    (6 bit opcode, 26 bit constant) could extend the immediate value
    to imm26<<13+imm13 = sign extended 39 bits,
    or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).
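
    The pasting arithmetic in C (a sketch of the widths only; function
    names are invented):

        #include <stdint.h>

        /* Sign-extend the low 'bits' of x. */
        static int64_t sext(uint64_t x, unsigned bits)
        {
            return (int64_t)(x << (64 - bits)) >> (64 - bits);
        }

        /* One prefix: imm26<<13 + imm13, sign-extended to 39 bits. */
        static int64_t paste1(uint32_t imm26, uint32_t imm13)
        {
            return sext(((uint64_t)imm26 << 13) + imm13, 39);
        }

        /* Two prefixes: 65 bits, truncated to 64 by the arithmetic. */
        static int64_t paste2(uint32_t hi26, uint32_t mid26,
                              uint32_t imm13)
        {
            return (int64_t)(((uint64_t)hi26 << 39) +
                             ((uint64_t)mid26 << 13) + imm13);
        }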

    As I required 5 bits for the opcode to allow both loads and stores for several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,
    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    So 3 bits for the data type for loads and stores, which if you
    put that in the opcode field uses up almost all your opcodes.
    So you take the data types out of the disp16 field and now your
    offset range is 13 bits +/- 4kB.

    And a constant prefix instruction can extend the disp13 field
    to 26+13=39 or 26+26+13=65(64) bits.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Wed Jun 11 19:16:29 2025
    On Wed, 11 Jun 2025 17:34:54 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 16:49:06 +0000, MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 14:12:04 +0000, quadibloc wrote:

    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.

    This is going to hurt register allocation.

    Yes. It will. Unfortunately.

    Basically, as should be apparent by now, my overriding goal in defining
    the Concertina II architecture - and its predecessor as well - was to
    make it "just as good", or at least "just _about_ as good", as both the
    68020 and the IBM System/360.

    This meant that I had to be able to fit base plus index plus
    displacement into 32 bits, since the System/360 did that, and I had to
    have 16-bit displacements because the 68020, and indeed x86 and most microprocessors did that.

    There is enough evidence that a 12-bit positive displacement (/360
    model) is insufficient for modern applications that I was surprised
    RISC-V went in that direction. EMBench has many subroutines with more
    than 4K of stack variables that cause RISC-V to emit a LUI just to set
    the 12th or 13th bit and perform the access. SPARC had enough problems
    with 13 bits that anyone with their ear to the rail should have heard
    the consternation.
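
    The cost in C terms (an illustrative sketch of the usual hi/lo
    constant split, not from the post): once a frame offset stops fitting
    in 12 signed bits, the high part must be materialized separately
    (the LUI).

        #include <stdint.h>
        #include <assert.h>

        /* Split a 32-bit offset into a LUI-style hi20 and a signed
           12-bit lo part, rounding so lo lands in [-2048, 2047]. */
        static void split_hi_lo(int32_t off, int32_t *hi20, int32_t *lo12)
        {
            *hi20 = (off + 0x800) >> 12;
            *lo12 = off - (*hi20 << 12);
        }

        int main(void)
        {
            int32_t hi, lo;
            split_hi_lo(0x1010, &hi, &lo);  /* 4KB+16 of frame: needs LUI */
            assert(hi == 1 && lo == 0x10);
            split_hi_lo(0x7f0, &hi, &lo);   /* small frame: hi == 0 */
            assert(hi == 0 && lo == 0x7f0);
            return 0;
        }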

    And I had to have register-to-register operate instructions that fit
    into only 16 bits. Because the System/360 had them, and indeed so do
    many microprocessors.

    Otherwise, my ISA would be clearly and demonstrably inferior. Where I couldn't attain a full match, I tried to be at least "almost" as good.
    So either my 16-bit operate instructions have to come in pairs, and have
    a very restricted set of operations, or they require the overhead of a
    block header. I couldn't attain the goal of matching the S/360
    completely, but at least I stayed close.

    So while having 32 registers like a RISC, I ended up having some
    purposes for which I could only use a set of eight registers. Not great,
    but it was the tradeoff that was left to me given the choice I made.

    So here it is - an ISA that offers RISC-like simplicity of decoding, but
    an instruction set that approaches CISC in code compactness - and which offers a choice of RISC, CISC, or VLIW programming styles. Which may
    lead to VLIW speed and efficiency on suitable implementations.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Wed Jun 11 21:17:46 2025
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:

    quadibloc wrote:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base
    register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    Plus 3 bits for the load/store operand type & size,
    plus 2 or 3 bits for the index scaling (I use 3).
    It all won't fit into a 32-bit fixed length instruction.

    A separate LEA Load Effective Address instruction to calculate rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an
    alternative.

    For my part, LEA is the other form of LDD (since the signed/unsigned
    notation is unused, as there is no register distinction between a
    signed 64-bit LD and an unsigned 64-bit LD).

    Then rDest is used as a base in the LD or ST.

    One or two constant prefix instruction(s) I mentioned before
    (6 bit opcode, 26 bit constant) could extend the immediate value
    to imm26<<13+imm13 = sign extended 39 bits,
    or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).

    My Mantra is to never use instructions to paste constants together.

    As I required 5 bits for the opcode to allow both loads and stores for
    several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,

    There are 8 if you want to detect overflow differently between
    signed and unsigned 64-bit values, but 99.44% of programs don't care.
    Which is why one "cooperates" with signedness in LDs, ignores
    signedness in STs, and does exception detection only in calculation
    instructions.

    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    So 3 bits for the data type for loads and stores,

    3-bits for LDs, 2-bits for STs.

    which if you
    put that in the opcode field uses up almost all your opcodes.

    With a Major OpCode size of 6-bits, the LDs + STs with DISP16
    use 3/8ths of the OpCode space, a far cry from "almost all";
    but this is under the premise of a machine where GPRs and FPRs
    coexist in the same file.

    By using 1 Major OpCode to access another 6-bit encoding space
    (called XOP1), one then has another 6-bit encoding space where
    the typical LDs and STs consume 3/8ths, leaving room to encode
    indexing, scaling, locking behavior, and ST #value,[address],
    which then avoids constant-pasting instructions and waste of
    registers.

    I also snuck in CALX--which is simply a LDD IP,[address]--saving
    either the LDD or the JMP, depending on how you look at it.

    So you take the data types out of the disp16 field and now your
    offset range is 13 bits +/- 4kB.

    The S.E.L. machines that did this only supported signed
    partial-word LDs (saving a bit)--probably not the best choice in
    today's analysis and language uses.

    Secondarily, one had LDbyte and LDsized instruction variants.

    And a constant prefix instruction can extend the disp13 field
    to 26+13=39 or 26+26+13=65(64) bits.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Wed Jun 11 21:26:02 2025
    On Wed, 11 Jun 2025 18:14:34 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 17:33:35 +0000, Anton Ertl wrote:
    quadibloc <quadibloc@gmail.com> writes:

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the
    base register, and 16 bits for the displacement, then there would only
    be one bit left for the opcode.

    The solution of RISC architectures has been to not have displacement
    and index registers at the same time (MIPS and its descendants do not
    have base+index addressing at all). The solution of CISC
    architectures has been to allow bigger instructions, and possibly
    different displacement sizes (e.g., 8 bits and 32 bits for IA-32 and
    AMD64).

    And what I've chosen is...

    - to have an architecture which superficially resembles RISC,

    yes

    - but which offers all the capabilities of CISC

    I chose only "some" of the CISC characteristics

    - and which tries to approach achieving the same code density as such
    classic CISC machines as the System/360

    I am getting 1.1× VAX instruction counts compared to MIPS* getting
    1.5× VAX instruction counts. (*) R3000 and most other RISCs

    - and in addition which offers VLIW features as well

    Which I never saw any purpose for. VLIW ties one to multiples of a
    particular width. We now have access to machines which are 1-wide,
    3-wide, 4-wide, 6-wide, 8-wide, and now 10-wide. There seems to be
    no least common multiple or greatest common divisor.

    To look like RISC, and yet to have the code density of CISC is to
    attempt to achieve two goals which seem to be in profound conflict with
    each other. So it shouldn't be surprising that in order to do this, I've
    had to break a few rules and sacrifice some elegance.

    As I've striven to achieve what seemed impossible - even if some may say
    I'm tilting at windmills, as no one really cares that much about code
    density any more

    I care--getting rid of instructions that just paste constants together
    is 1/4 of the way from MIPS's 1.5× VAX to My 66000's 1.1× VAX.

    The x86 memory references are another 1/4,
    the CISC ENTER and EXIT instructions are another 1/4,
    leaving 3 other things for the last 1/4 ...


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jun 11 21:35:43 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    Holding exclusive access to two cache lines at once
    as an atomic unit would complicate the coherency protocol,
    particularly with respect to deadlock prevention, isn't that so?

    That's one reason multiword atomics are generally required
    to be naturally aligned; to avoid crossing a cache-line or page boundary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Wed Jun 11 18:00:33 2025
    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    It had the advantage of meaningfully reusing some address modes
    rather than having to add new opcode formats:
    PC & autoincrement => immediate value (opcode data type gives size)
    PC & autoincrement deferred => absolute address of data
    PC & B/W/L relative => PC relative address of data
    PC & B/W/L relative deferred => PC relative address of address of data

    On the con side, for all PC-relative addressing the offset is
    relative to the incremented PC after *each* operand specifier.
    So two PC-rel references to the same location within a single
    instruction will have different offsets.
    (Not a big deal but yet another thing one has to deal with in
    Decode and carry with you in any uOps.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Wed Jun 11 23:01:04 2025
    On Wed, 11 Jun 2025 22:00:33 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    It had the advantage of meaningfully reusing some address modes
    rather than having to add new opcode formats:
    PC & autoincrement => immediate value (opcode data type gives size)
    PC & autoincrement deferred => absolute address of data
    PC & B/W/L relative => PC relative address of data
    PC & B/W/L relative deferred => PC relative address of address of data

    I can argue that there are other ways to encode each of the above
    without using "address modes".

    On the con side, for all PC-relative addressing the offset is
    relative to the incremented PC after *each* operand specifier.
    So two PC-rel references to the same location within a single
    instruction will have different offsets.

    This is exactly what made wide VAXs so hard to pipeline. At least
    when I use IP as a means to access something in my ISA, the IP used
    is the IP of the first word of the instruction {rather than a
    running copy of IP}.

    (Not a big deal but yet another thing one has to deal with in
    Decode and carry with you in any uOps.)

    It becomes quadratically harder as instruction width increases,
    cubically if the accessed operands have variable widths.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jun 11 23:13:00 2025
    On Wed, 11 Jun 2025 21:35:43 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    There are optional ways around those problems::
    {
    a) if you take a TLB fault on either access--just assume the
       atomic event fails and restart after the TLB is repaired.
    b) if the two cache lines are not both writeable--just assume the
       atomic event fails and restart after both lines have
       arrived in a writeable condition.
    }
    which simplify the problem space.

    But it is (IS) the critical problem to be solved--how does
    one appear to hold onto {not a cache line, but its} write permission
    in the face of uncertain delay (of related memory references)
    and data access through the cache hierarchy, and all the coherence
    traffic that transpires under this delay.

    Holding exclusive access to two cache lines at once
    as an atomic unit would complicate the coherency protocol,
    particularly with respect to deadlock prevention, isn't that so?

    a) you don't try, because
    b) you can't under all conditions.
    What you do do is see if both are writeable, and if so
    proceed to perform both; otherwise fail both.

    My 66000 cache coherence protocol has a NAK but this feature can
    only be used under very tight HW restrictions, and the feature
    is under control of thread priority (so higher priority wins
    any conflicts). The tight restrictions would take 1000-2000
    words to adequately explain the subtle nuances that must be
    avoided.

    That's one reason multiword atomics are generally required
    to be naturally aligned; to avoid crossing a cache-line or page
    boundary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to quadibloc on Thu Jun 12 06:30:31 2025
    quadibloc <quadibloc@gmail.com> writes:
    To look like RISC, and yet to have the code density of CISC

    I have repeatedly posted measurements of code sizes of several
    programs on various architectures, most recently in <2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
    posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.


    Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
    packages:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Data from <2025Mar3.174417@mips.complang.tuwien.ac.at> for NetBSD
    packages.

    bash grep xz
    710838 42236 m68k
    748354 159304 40930 vax
    829077 176836 42840 amd64
    855400 164188 aarch64
    877284 186924 48032 sparc
    882847 187203 49866 i386
    898532 179844 earmv7hf
    962128 205776 54704 powerpc
    1004864 192256 53632 sparc64
    1025136 51160 mips64eb
    1147664 232688 63456 alpha
    1172692 mipsel

    Unfortunately, Debian does not have m68k or vax ports (the
    architectures with the smallest code size on NetBSD) in the regular distribution, and NetBSD does not have ARM T32 (armhf) or RV64GC
    (riscv64) (the architectures with the smallest code size on Debian) in
    its regular distribution, so one cannot compare them directly.
    However, taking the bash numbers and computing the relations of these
    four architectures to AMD64, we get:

    0.747 armhf/amd64 (debian)
    0.753 riscv64/amd64 (debian)
    0.857 m68k/amd64 (NetBSD)
    0.903 vax/amd64 (NetBSD)
    1.000 amd64

    So it seems that if you want code density, the way to go is to
    implement compressed RISC instruction sets. And compile with -Os
    (while I have posted cases where it is counterproductive, it tends
    to rein in loop unrolling and inlining).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Jun 12 08:38:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 22:00:33 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    Actually, it seems to me that for the first RISC generation, it made
    the least sense. You could not afford the transistors to do special
    handling of the PC.

    Nowadays, you can afford it, but the question still is whether it is cost-effective. Looking at recent architectures, none has PC
    addressable like a GPR, so no, it does not look to be cost-effective
    to have the PC addressable like a GPR. AMD64 and ARM A64 have
    PC-relative addressing modes, while RISC-V does not.

    On the con side, for all PC-relative addressing the offset is
    relative to the incremented PC after *each* operand specifier.
    So two PC-rel references to the same location within a single
    instruction will have different offsets.

    This is exactly what made wide VAXs so hard to pipeline.

    I don't think that would cause a real problem for decoder designers
    these days. It might cost some additional transistors, though. This
    design choice in VAX was very likely due to the implementation choices (sequential decoding of instruction parts) they had in mind, and these
    days one would probably make a different choice even if one decided to
    design an otherwise VAX-style instruction set. How did the NS32k
    designers choose in this respect?

    That being said, how does the design choice to include PC-relative
    addressing in AMD64 and ARM A64 come out in the long run? When AMD64
    and ARM A64 were designed, the data was still delivered in the
    microinstruction in most microarchitectures, and in that context,
    PC-relative addressing does not cost extra; you just fill in the data
    from the start.

    But Intel switched to having separate rename registers in Sandy Bridge
    (around the time when ARM A64 appeared), and others did the same, so
    now there is no space in the microinstructions for including the value
    of the PC when the instruction was decoded. I guess that this value
    is stored in a physical register on decoding, and each use of
    PC-relative addressing reduces the amount of available physical
    registers from the time when the register renamer processes the
    instruction until the time when the instruction is processed by the
    ROB; can someone confirm this, or is it done in some other way?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Thu Jun 12 07:05:22 2025
    BGB <cr88192@gmail.com> writes:
    On 6/11/2025 11:51 AM, Anton Ertl wrote:
    Link register: On some architectures there is a register that is a GPR
    as far as most instructions are concerned. But the call instruction
    with immediate (relative) target uses that register as implicit target
    for the return address. MIPS is an example of that. Power has LR as
    a special-purpose register.


    It is GPR like, but in terms of role, I don't consider it as such.

    In RV64, in theory, JAL and JALR could use any register. But, the C
    ABI effectively limits this choice to X1.

    So what? The architecture does not. Also, given that static linking
    and maybe even whole-program optimization are on the rise, the forces
    that coerce you to use the ABI are getting smaller.

    Implicitly, the 'C' extension and some other (less standardized)
    extensions also tend to hard-code the assumption of X1 being the
    link register.

    Yes, the C extension is designed for minimizing the code size of
    common code, and assumes that the code follows the ABI. But nothing
    in the architecture forces you to use compressed instructions.
    Whenever it is more advantageous to use an instruction in a way that
    cannot be compressed, you just do it. E.g., if you do whole-program optimization, and you have functions

    A (called from 10 sites)
    B calls A (called from 2 sites, address not taken for indirect calling)
    C calls B

    Then one can use JAL or JALR with the target X1 to call A, and these instructions may be compressible. And one can use JAL or JALR with a
    different target to call B. The benefit of that is that B does not
    need to save and restore the address it returns to, eliminating the
    code needed for that, and the time needed to perform this saving and
    restoring.

    However, the architecture specification says that x1 and x5 are considered to be link registers for branch prediction purposes, so
    ideally one will use x5 as target for calls to B, and for further
    levels the question is if it is good enough to just use plain
    indirect-branch prediction for those calls, or if one invests into the
    saving and restoring in order to use the return-address stack for
    branch prediction.

    Well, it is more a case here of, "try to put something other than the
    stack pointer in SP and see how far you get with that".

    There are multiple levels of systems (ISA design, OS, ...)

    Neither the ISA nor the system call interfaces I have looked at would
    cause any problems if I use x2 (sp) on RISC-V for something else.
    Maybe in case of a handled signal the OS would write to a place
    pointed to by x2, but that requires 1) installing a signal handler and
    2) not using sigaltstack() to tell the OS where to write in such a
    case. Of course the signal handler (if any) will see x2 set to point
    to the alternative stack, but all the regular user code can use x2 for
    whatever purpose seems appropriate. Some programming languages are
    designed to work without stack (e.g., early Fortran), some to use
    multiple stacks (e.g., Forth).

    I am not saying they don't look like GPRs in the ISA, but rather that
    they aren't really GPRs in terms of roles or behavior; they are
    essentially SPRs that just so happen to live in the GPR numbering space.

    As far as the architecture is concerned, they are GPRs. Yes, an ABI
    specifies a special role for some of them, but the ABI is software,
    not architecture. E.g., in early MIPS ABIs (in particular, on
    Ultrix), there was no GP, in later MIPS ABIs, there was.

    One can nicely see the role of the ABI in Table 25.1 of <http://staff.ustc.edu.cn/~comparch/reference/riscv-spec%EF%BC%880305%EF%BC%89.pdf>
    (page 137); it has a column called "Register" (with names like "x1")
    and a column called "ABI name" (with names like "ra"). The caption
    says: "Assembler mnemonics for RISC-V integer and floating-point
    registers, and their role in the first standard calling convention."

    So the architects expect the architecture to live longer than this
    ABI.

    It might even be due to things as simple as "well, the OS kernel and
    program launcher assume that the stack is in X2, and system calls assume
    the stack is in X2, ...". You have little real choice but to put the
    stack in X2, and if you try putting something else there, and a system
    call or interrupt happens, ..., there is little to say that things won't
    "go sideways", so, not really a GPR.

    I don't know what OS you have in mind, but in any OS where there is a
    boundary between user space and system space, the system does not use
    what may be the user-space stack pointer for storing its data, not on
    system calls, and certainly not on interrupts. And when I last looked
    at Linux system calls, the actual system call interface (not the C
    wrapper around it) passed parameters to system calls in registers, not
    on the user-level stack.

    The only case where a stack pointer register may come into play is
    when the OS calls a signal handler, but I have not looked at the
    machine-level interface there, so I cannot say for sure. In any case,
    that does not affect all the code that is not a signal handler.

    Global Pointer is assumed as such by the ABI, and OS may care about it,
    so not really a GPR.

    Why should the OS kernel care about the global pointer of a user-level
    program?

    I decided to classify X5/TP as a GPR as its usage is roughly up to
    the discretion of the ABI and C runtime library (at least in RISC-V, there
    are no hard-coded ISA level assumptions about TP, nor does it cross into
    the OS kernel's realm of concern).

    Table 25.1 (mentioned above) gives tp as ABI name for x4, and t0 as
    ABI name for x5.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Thu Jun 12 09:12:59 2025
    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:

    quadibloc wrote:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base
    register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    Plus 3 bits for the load/store operand type & size,
    plus 2 or 3 bits for the index scaling (I use 3).
    It all won't fit into a 32-bit fixed length instruction.

    A separate LEA Load Effective Address instruction to calculate
    rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an
    alternative.

    For my part, LEA is the other form of LDD (since the signed/unsigned
    notation is unused, as there is no register distinction between a
    signed 64-bit LD and an unsigned 64-bit LD).

    LEA doesn't need the 3 bits for data type/size.
    We can allocate them to index scaling which does need them.

    If fp128 is to be (someday) supported then the index scaling must be
    at least 0..4 (> 2 bits). A scaling of 0..7 or 3 bits would support
    single instruction array index calculations up to fp128 octonions.
    The index scaling selects the octonion array element and the
    displacement selects a coefficient in it.

    Not that I have a use for octonions myself,
    just thinking of the kids out there.

    Then rDest is used as a base in the LD or ST.

    One or two constant prefix instruction(s) I mentioned before
    (6 bit opcode, 26 bit constant) could extend the immediate value
    to imm26<<13+imm13 = sign extended 39 bits,
    or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).

    I've had a look at the H&P graph of displacement % usage and see
    that there is no significant difference between 12 and 13 bits.

    As a second cut at the design, I'd make the immediate 12 bits.
    So the immediate constants are either 12, 26+12=38 or 26+26+12=64 bits.
    And that leaves 4 bits for function codes or data types.

    My Mantra is to never use instructions to paste constants together.

    John's scenario chose fixed length 32-bit instructions
    so I'm just playing the cards dealt.

    This allows an operation with up to 64 bits of immediate to be
    defined in just 12 bytes of instruction space (same as My 66000).
    It is spread over 3 instruction slots, but those CONST instructions are
    defined as fused in Decode, so it's similar to a variable length ISA
    in that it requires no extra execute clocks.

    As I required 5 bits for the opcode to allow both loads and stores for
    several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,

    There are 8 if you want to detect overflow differently between
    signed and unsigned 64-bit values, but 99.44% of programs don't care.
    Which is why one "cooperates" with signedness in LDs, ignores
    signedness in STs, and does exception detection only in calculation
    instructions.

    I have various instructions to check integer down-cast ranges too
    and fault on overflow. For checked languages, most overflow range
    checks require one extra instruction before the ST to a smaller type.
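
    A C sketch of that pattern (illustrative; a checked ISA would fault
    rather than return a flag):

        #include <stdint.h>

        /* Checked down-cast before a narrower store: one range check,
           then the ST of the smaller type. */
        static int store_i8_checked(int8_t *dst, int64_t v)
        {
            if (v < INT8_MIN || v > INT8_MAX)
                return 0;           /* the overflow fault in HW */
            *dst = (int8_t)v;
            return 1;
        }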

    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    I want double wide instructions for atomic swap and compare-and-swap.
    Those are restricted to naturally aligned addresses and trap if not.

    To support these I also need to be able to load and store wide values.
    The load and store register pair instructions accept any address but if
    you want an atomic guarantee then the address must be naturally aligned.
    If not naturally aligned then LD or ST could use 2 separate memory accesses.
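
    A sketch of that aligned-pair guarantee using the GCC/Clang __atomic
    builtins on a 16-byte-aligned __int128 (assumes a target with a
    double-wide atomic, e.g. -mcx16 on x86-64; otherwise the compiler
    falls back to libatomic):

        #include <stdint.h>
        #include <stdalign.h>

        /* A register pair held as one naturally aligned 16-byte unit. */
        typedef struct { alignas(16) unsigned __int128 v; } pair128;

        static void store_pair(pair128 *p, uint64_t lo, uint64_t hi)
        {
            unsigned __int128 v = ((unsigned __int128)hi << 64) | lo;
            __atomic_store_n(&p->v, v, __ATOMIC_SEQ_CST);
        }

        static int cas_pair(pair128 *p, unsigned __int128 *expected,
                            unsigned __int128 desired)
        {
            return __atomic_compare_exchange_n(&p->v, expected, desired,
                                               0, __ATOMIC_SEQ_CST,
                                               __ATOMIC_SEQ_CST);
        }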

    So 3 bits for the data type for loads and stores,

    3-bits for LDs, 2-bits for STs.

    For integers, the register pair makes ST sizes 1, 2, 4, 8, 16.
    I potentially have 5 float types from 1 to 16 bytes.
    If I include float register pairs (for complex numbers) then
    it could be load and store of 10 float data types.
    So 3 bits for data type.

    which if you
    put that in the opcode field uses up almost all your opcodes.

    With a Major OpCode size of 6-bits, the LDs + STs with DISP16
    use 3/8ths of the OpCode space, a far cry from "almost all";
    but this is under the premise of a machine where GPRs and FPRs
    coexist in the same file.

    John said he had a 5-bit opcode.
    He also said he wants separate integer and float register files
    so that means separate LD, ST and FLD, FST.

    For the data types I listed above, but NOT including the float pairs,
    it would use opcodes for 8 LD, 5 ST, 5 FLD, 5 FST = 23 of 32 opcodes.
    If I include some float pairs for complex fp32, fp64 and fp128 then
    that uses up 29 of 32 opcodes. And that is just for loads and stores.

    So yes, "almost all".

    Either it:
    (a) moves the type/size bits somewhere else (the offset field), as I did,
    (b) or drops support for some sizes and requires an extra sign or zero
    extend instruction to handle the others, as Alpha did.

    By using 1 Major OpCode to access another 6-bit encoding space
    (called XOP1), one then has another 6-bit encoding space where
    the typical LDs and STs consume 3/8ths, leaving room to encode
    indexing, scaling, locking behavior, and ST #value,[address],
    which then avoids constant-pasting instructions and waste of
    registers.

    But you have variable length instructions. I would too.

    John's premise assumes they are fixed 32-bits.
    I'm running *that* scenario forward to see if we can get a better
    result than the RISC-V folks got, where they need 6 instructions
    and 24 bytes to do a LD or ST with 64 bit offset.

    The CONST prefix instruction approach shows it can be done in
    3 instructions of 12 bytes which are fused in Decode so require
    no extra working register and no execute clocks for pasting.
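    (For reference, one plausible RV64 shape of that 6-instruction,
    24-byte sequence; the field splits are illustrative, and the
    sign-extension fix-ups a real assembler performs are glossed over:)

        lui   t0, 0x12345     # offset bits 63..44
        addi  t0, t0, 0x678   # offset bits 43..32
        slli  t0, t0, 32      # move the upper half into place
        lui   t1, 0x9abcd     # offset bits 31..12
        add   t0, t0, t1      # paste the halves together
        ld    a0, 0x2f0(t0)   # low 12 bits ride in the load itself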

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Jun 12 13:41:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 21:35:43 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    There are optional ways around those problems::
    {
    a) if you take a TLB fault on either access--just assume the
    . atomic event fails and restart after TLB is repaired.
    b) if both cache lines are not writeable--just assume the
    . atomic event fails and restart after both lines have
    . arrived in a writeable condition.
    }
    which simplify the problem space.

    Although it does require some other fallback to ensure
    fairness and prevent starvation. A big hammer like
    the x86 system bus lock, perhaps, if the atomic can't
    complete in some period of time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Thu Jun 12 09:38:31 2025
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).
    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    Holding exclusive access to two cache lines at once
    as an atomic unit would complicate the coherency protocol,
    particularly with respect to deadlock prevention, would it not?

    That's one reason multiword atomics are generally required
    to be naturally aligned; to avoid crossing a cache-line or page boundary.

    I don't need it to be atomic for any alignment.
    The spec for LD and ST register pair would say that IF the address is
    16-byte aligned THEN the operation is guaranteed to be done atomically.
    If the address is not aligned it may use two separate operations.
    This is the same guarantee as 2, 4 and 8 byte LD or ST.
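    (A C reference model of that contract; memcpy stands in for the bus
    transactions, so this documents only the access pattern, not real
    atomicity:)

        #include <stdint.h>
        #include <string.h>

        /* LD-pair model: 16-byte aligned means one 16-byte access
           (the guaranteed-atomic case); otherwise two independent
           8-byte accesses */
        void ld_pair(const void *p, uint64_t out[2])
        {
            if (((uintptr_t)p & 15) == 0) {
                memcpy(out, p, 16);
            } else {
                memcpy(&out[0], p, 8);
                memcpy(&out[1], (const char *)p + 8, 8);
            }
        }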

    As to whether the register pair is specified as one field with
    an implied increment or two separate fields, I have cases for both.
    Once I started adding double-wide operate instructions I found
    usages where assuming the register pairs were contiguous
    (eg only even numbered registers) was too constraining.
    It forces many extraneous MOV's to create the even numbered pairs.

    On the other hand, having register pairs specified by two fields
    quickly winds up with instructions that have 5 or 6 register fields
    (2 dest and 2+1 source, or 2 dest and 2+2 source).
    But this only affects Decode as the uOp formats require 6 fields.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Jun 12 13:43:59 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    quadibloc <quadibloc@gmail.com> writes:
    To look like RISC, and yet to have the code density of CISC

    I have repeatedly posted measurements of code sizes of several
    programs on various architectures, most recently in
    <2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
    posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.


    Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
    packages:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Seems to me that the text size is not the interesting
    metric here - rather the typical working set size is
    far more important.

    Take bash, for instance; in typical operation I
    would not expect it to use more than a small fraction of
    the total text.

    It may be that text size isn't a particularly
    good metric for judging instruction set effectiveness.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Thu Jun 12 08:44:14 2025
    On 6/12/2025 8:00 AM, quadibloc wrote:
    On Wed, 11 Jun 2025 17:05:13 +0000, Thomas Koenig wrote:

    What is the use case for having base and index register and a
    16-bit displacement?

    The IBM System/360 had a base and index register and a 12-bit
    displacement.

    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).


    Most microprocessors have a base register and a 16-bit displacement.

    So this lets my architecture be a superset of both of them.

    But your different ISA format, etc. means that it is not a true
    superset. That is, any S/360 program would have to be recompiled to run
    on your architecture. So it is only for some sort of "conceptual", but
    not actual compatibility that is only for assembler language programmers
    (and compiler writers).


    Great
    selling point,

    I suspect that the number of S/360 assembler programs being written
    these days is asymptotic to zero, so not so much.

    and thus I didn't think too hard about whether it is
    "needed", because an architecture that instead tries to only provide genuinely necessary capabilities...

    now forces programmers, used to other systems that were more generous in
    one way or another, to change their habits!

    That would presumably spoil sales or ruin the popularity of the ISA.

    I don't think so. And consider how much the other problems that you
    have been struggling with would become so much simpler if you eliminated
    the base registers and used those for bits in the instructions for other things.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jun 12 18:44:20 2025
    On Thu, 12 Jun 2025 8:38:06 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 22:00:33 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    Actually, it seems to me that for the first RISC generation, it made
    the least sense. You could not afford the transistors to do special
    handling of the PC.

    Instead, you built the increment loop around the IP itself. Basically,
    you have a 4-input multiplexer: 1 leg feeds the current IP back to the
    adder, which is then flopped into the IP register; the other 3 inputs are
    for branch displacement, interrupt vector, and JUMP register input.
    It is basically a degenerate ALU+forwarding path.
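    (As a C sketch, with illustrative names:)

        #include <stdint.h>

        enum ip_sel { SEQ, BRANCH, INTERRUPT, JUMP };

        /* the 4-input next-IP multiplexer: one leg is the sequential
           increment feedback, the other three are branch displacement,
           interrupt vector, and jump-register input */
        uint64_t next_ip(enum ip_sel sel, uint64_t ip, unsigned len,
                         int64_t disp, uint64_t vec, uint64_t jreg)
        {
            switch (sel) {
            case SEQ:       return ip + len;
            case BRANCH:    return ip + (uint64_t)disp;
            case INTERRUPT: return vec;
            default:        return jreg;
            }
        }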

    Nowadays, you can afford it, but the question still is whether it is cost-effective. Looking at recent architectures, none has PC
    addressable like a GPR, so no, it does not look to be cost-effective
    to have the PC addressable like a GPR. AMD64 and ARM A64 have
    PC-relative addressing modes, while RISC-V does not.

    Consider that in a 8-wide machine, IP gets added to 8 times per cycle,
    whereas no GPR has a property anything like that.

    On the con side, for all PC-relative addressing the offset is
    relative to the incremented PC after *each* operand specifier.
    So two PC-rel references to the same location within a single
    instruction will have different offsets.

    This is exactly what made wide VAXs so hard to pipeline.

    I don't think that would cause a real problem for decoder designers
    these days. It might cost some additional transistors, though.

    I disagree; the way VLE is implemented in My 66000 allows instruction
    boundary determination to be tree-ified. The way VAX (and PDP-11) did
    it does not allow tree-ification. My 66000 is quadratic whereas VAX
    is higher than cubic when you consider the large operand instructions.
    If you don't do the wide operand instructions, VAX is only a little
    harder than cubic.

    This
    design choice in VAX was very likely due to the implementation choices (sequential decoding of instruction parts) they had in mind, and these
    days one would probably make a different choice even if one decided to
    design an otherwise VAX-style instruction set. How did the NS32k
    designers choose in this respect?

    That being said, how does the design choice to include PC-relative
    addressing in AMD64 and ARM A64 come out in the long run? When AMD64
    and ARM A64 was designed, the data was still delivered in the microinstruction in most microarchitectures, and in that context,
    PC-relative addressing does not cost extra; you just fill in the data
    from the start.

    My 66000 also has this property, but also the property that any IP
    needed as an operand to any instruction is the virtual address of
    the instruction itself (not incremented); and is thus easy to synthesize
    in the DECODE pipeline.

    But Intel switched to having separate rename registers in Sandy Bridge
    (around the time when ARM A64 appeared), and others did the same, so
    now there is no space in the microinstructions for including the value
    of the PC when the instruction was decoded.

    K9 was going to unify x86-64, x87, and MMX/SSE into a single register
    file, too. These are decisions based on how the microarchitecture takes shape.
    In K9's case, the unified file was 1/2 the size of the 3 separate files.

    I guess that this value
    is stored in a physical register on decoding, and each use of
    PC-relative addressing reduces the amount of available physical
    registers from the time when the register renamer processes the
    instruction until the time when the instruction is processed by the
    ROB; can someone confirm this, or is it done in some other way?

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 12 18:55:49 2025
    On Thu, 12 Jun 2025 13:12:59 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:

    quadibloc wrote:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General-purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base
    register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    Plus 3 bits for the load/store operand type & size,
    plus 2 or 3 bits for the index scaling (I use 3).
    It all won't fit into a 32-bit fixed length instruction.

    A separate LEA Load Effective Address instruction to calculate
    rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an
    alternative.

    For my part, LEA is the other form of LDD (since the signed/unsigned
    notation is unused, as there is no distinction in the register between
    a signed 64-bit LD and an unsigned 64-bit LD).

    LEA doesn't need the 3 bits for data type/size.
    We can allocate them to index scaling which does need them.

    If fp128 is to be (someday) supported then the index scaling must be
    at least 0..4 (> 2 bits). A scaling of 0..7 or 3 bits would support
    single instruction array index calculations up to fp128 octonions.
    The index scaling selects the octonion array element and the
    displacement selects a coefficient in it.

    Not that I have a use for octonions myself,
    just thinking of the kids out there.
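    (The address arithmetic, as a C sketch; an fp128 octonion is
    8 coefficients of 16 bytes, i.e. a 128-byte element:)

        #include <stdint.h>

        /* rBase + rIndex<<scale + displacement: scale 7 selects the
           octonion array element, the displacement selects the
           coefficient within it */
        uintptr_t coeff_addr(uintptr_t base, uintptr_t i, unsigned coeff)
        {
            return base + (i << 7) + (uintptr_t)coeff * 16;
        }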

    An architecture is defined as much by what gets left out as by what
    gets left in. You have to draw the line somewhere.

    Then rDest is used as a base in the LD or ST.

    One or two constant prefix instruction(s) I mentioned before
    (6 bit opcode, 26 bit constant) could extend the immediate value
    to imm26<<13+imm13 = sign extended 39 bits,
    or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).

    I've had a look at the H&P graph of displacement % usage and see
    that there is no significant difference between 12 and 13 bits.

    As a second cut at the design, I'd make the immediate 12 bits.
    So the immediate constants are either 12, 26+12=38 or 26+26+12=64 bits.
    And that leaves 4 bits for function codes or data types.
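    (The pasting rule for the two-prefix case, written out as a C sketch;
    the function and parameter names are illustrative, and sign extension
    is taken from the topmost field present:)

        #include <stdint.h>

        /* 0, 1 or 2 CONST prefixes (imm26 each) over a 12-bit
           immediate: 12, 26+12 = 38, or 26+26+12 = 64 bits */
        int64_t fuse_imm(int32_t hi26, uint32_t mid26, uint32_t lo12)
        {
            return (int64_t)(((uint64_t)(int64_t)hi26 << 38)
                           | ((uint64_t)(mid26 & 0x3ffffffu) << 12)
                           |  (lo12 & 0xfffu));
        }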

    My Mantra is to never use instructions to paste constants together.

    John's scenario chose fixed length 32-bit instructions
    so I'm just playing the cards dealt.

    I suggest a new deck is in order.

    This allows an operation with up to 64 bits of immediate to be
    defined in just 12 bytes of instruction space (same as My 66000).
    It is spread over 3 instruction slots, but those CONST instructions are
    defined as fused in Decode, so it's similar to a variable-length ISA
    in that it requires no extra execute clocks.

    For that statement to be always true, you would have to have the
    final 12-bit constant on all FP calculation instructions.

    And I do not believe that you have addressed the universal placement
    idea of My 66000::

    FDIV R7,#3.141592653589216,R19

    at least for the non-commutative calculations.

    As I required 5 bits for the opcode to allow both loads and stores for
    several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,

    There are 8 if you want to detect overflow differently between
    signed and unsigned 64-bit values, but 99.44% of programs don't care.
    Which is why one "cooperates" with signedness in LDs, ignores
    signedness in STs, and does exception detection only in calculation
    instructions.

    I have various instructions to check integer down-cast ranges too,
    and fault on overflow. For checked languages, most overflow range
    checks require one extra instruction before the ST to a smaller type.

    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    I want double wide instructions for atomic swap and compare-and-swap.
    Those are restricted to naturally aligned addresses and trap if not.

    My point was:: 2 address atomics are more powerful than 2-wide single
    address atomics.

    To support these I also need to be able to load and store wide values.
    The load and store register pair instructions accept any address but if
    you want an atomic guarantee then the address must be naturally aligned.
    If not naturally aligned then LD or ST could use 2 separate memory
    accesses.

    So 3 bits for the data type for loads and stores,

    3-bits for LDs, 2-bits for STs.

    For integers, the register pair makes ST sizes 1, 2, 4, 8, 16.
    I potentially have 5 float types from 1 to 16 bytes.
    If I include float register pairs (for complex numbers) then
    it could be load and store of 10 float data types.
    So 3 bits for data type.

    Go ahead and shoot yourself in the foot.

    which if you
    put that in the opcode field uses up almost all your opcodes.

    With a Major OpCode size of 6-bits, the LDs + STs with DISP16
    uses 3/8ths of the OpCode space, a far cry from "almost all";
    but this is under the guise of a machine where GPRs and FPRs
    coexist in the same file.

    John said he had a 5-bit opcode.

    Which makes 3/8ths into 3/4ths, or from quite reasonable to completely unreasonable.

    He also said he wants separate integer and float register files
    so that means separate LD, ST and FLD, FST.

    I have found this to be a burden, not an enhancement.

    For the data types I listed above, but NOT including the float pairs,
    it would use opcodes for 8 LD, 5 ST, 5 FLD, 5 FST = 23 of 32 opcodes.
    If I include some float pairs for complex fp32, fp64 and fp128 then
    that uses up 29 of 32 opcodes. And that is just for loads and stores.

    So yes, "almost all".

    Either it:
    (a) moves the type/size bits somewhere else (the offset field), as I
    did,
    (b) or drops support for some sizes and requires an extra sign or zero
    extend instruction to handle the others, as Alpha did.

    By using 1 Major OpCode to access another 6-bit encoding space
    (called XOP1) one then has another 6-bit encoding space where
    the typical LDs and STs consume 3/8ths leaving room to encode
    indexing, scaling, locking behavior, and ST #value,[address]
    which then avoids constant pasting instructions and waste of
    registers.

    But you have variable length instructions. I would too.

    Everything necessary for decoding, determining operands, and routing
    operands to a function unit is contained in the first word of the
    VLE. Only constants follow this first word.

    John's premise assumes they are fixed 32-bits.
    I'm running *that* scenario forward to see if we can get a better
    result than the RISC-V folks got, where they need 6 instructions
    and 24 bytes to do a LD or ST with a 64-bit offset.

    Whereas My 66000 needs only 3 words and only 1 instruction.

    The CONST prefix instruction approach shows it can be done in
    3 instructions of 12 bytes which are fused in Decode so require
    no extra working register and no execute clocks for pasting.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 12 19:01:52 2025
    On Wed, 11 Jun 2025 19:37:03 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,
    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Even I have refused to contend with all of this, at least for my basic
    32-bit instruction set. Some exotic types that I do intend to support
    will just have to make do with 48-bit or longer instructions instead.

    But signed and unsigned integers aren't _quite_ the same as different
    types for load and store. I may have separate integer and floating
    registers, but I don't have separate signed and unsigned registers.

    But you DO HAVE signed and unsigned versions of LD {B,H,W} don't you ??

    Instead, I've followed the System/360. When it comes to load and store,
    for integers I have two additional operations - unsigned load and
    insert. But only for integers shorter than the register.

    What code is produced from::

    uint32_t function( uint32_t u )
    {
        int32_t i[99];
        return i[u];
    }

    a) are you going to signed-word-load the i array and then
    zero-extend at the word boundary, or
    b) do you propagate the unsignedness into the load of the i array so
    you can return the value directly
    ??
    {{Notice I am using 32-bit data in a 64-bit machine}}

    Load sign-extends. Unsigned Load zero-extends. Insert leaves the bits of the register above the loaded field untouched.
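    (Modeled in C for a 16-bit operand in a 64-bit register:)

        #include <stdint.h>

        int64_t  load16 (const int16_t  *p) { return *p; } /* sign extend */
        uint64_t uload16(const uint16_t *p) { return *p; } /* zero extend */

        /* Insert: replace only the low 16 bits; upper bits untouched */
        uint64_t insert16(uint64_t reg, const uint16_t *p)
        {
            return (reg & ~(uint64_t)0xffff) | *p;
        }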

    Since arithmetic is two's complement, there is only one add instruction,
    and there is only one store instruction, for each length. If we were
    really dealing with different types, we would need additional
    instructions of those kinds as well.

    For floats, I deal with fp32, fp48, fp64, and fp128 only as the primary floating-point types.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 12 19:19:18 2025
    On Thu, 12 Jun 2025 13:38:31 +0000, EricP wrote:

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).
    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    Holding exclusive access to two cache lines at once
    as an atomic unit would complicate the coherency protocol,
    particularly with respect to deadlock prevention, would it not?

    That's one reason multiword atomics are generally required
    to be naturally aligned; to avoid crossing a cache-line or page
    boundary.

    I don't need it to be atomic for any alignment.
    The spec for LD and ST register pair would say that IF the address is
    16-byte aligned THEN the operation is guaranteed to be done atomically.

    If the container does not cross a cache line boundary it can be
    performed atomically, and otherwise it cannot.

    If the address is not aligned it may use two separate operations.
    This is the same guarantee as 2, 4 and 8 byte LD or ST.

    As to whether the register pair is specified as one field with
    an implied increment or two separate fields, I have cases for both.
    Once I started adding double-wide operate instructions I found
    usages where assuming the register pairs were contiguous
    (eg only even numbered registers) was too constraining.

    Compiler people H A T E pairing {LoB = 0 and 1} and sharing
    {Rsecond = Rfirst+1}, they want to be able to allocate any
    value into any register without such constraints. After all
    register allocation is already NP, pairing and sharing moves
    the needle to NP-hard.

    It forces many extraneous MOV's to create the even numbered pairs.

    In my Samsung GPU I invented a DBLE instruction. Its only job was to
    supply register operands to another instruction which would then be
    performed double wide. This gets around all the pairing and sharing
    problems.

    On the other hand, having register pairs specified by two fields
    quickly winds up with instructions that have 5 or 6 register fields
    (2 dest and 2+1 source, or 2 dest and 2+2 source).

    DBLE simply supplies the extra register specification fields;
    obviating the problem in expressing the unit-of-work.

    But this only affects Decode as the uOp formats require 6 fields.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu Jun 12 19:13:10 2025
    On Wed, 11 Jun 2025 19:47:42 +0000, BGB wrote:

    On 6/11/2025 11:37 AM, MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:
    ------------------

    LR: Functionally, in most ways the same as a GPR, but is assigned a
    special role and is assumed to have that role. Pretty much no one uses
    it as a base register though, with the partial exception of potential
    JALR wonk.

    One can use JALR to call special subroutines that store multiple
    registers on the stack (or restore them later), wrapping prologue and
    epilogue into little subroutine calls that use a separate LR and thus
    have lower overhead than a full-blown call. Other than this use and
    some PDP-11-style co-routines, the explicit specification of LR is
    completely unnecessary.

    JALR X0, X1, 16 //not technically disallowed...

    If one uses the 'C' extension, assumptions about LR and SP are pretty
    solidly baked in to the ISA design.


    ZR: Always reads as 0, assignments are ignored; this behavior is very un-GPR-like.

    GP: Similar situation to LR, as it mostly looks like a GPR.
    In my CPU core and JX2VM, the high bits of GP were aliased to FPSR, so saving/restoring GP will also implicitly save/restore the dynamic
    rounding mode and similar (as opposed to proper RISC-V which has this
    stuff in a CSR).

    With universal constants, you get this register back.



    Though, this isn't practically too much different from using the HOB's
    of captured LR values to hold the CPU ISA mode and similar (which my
    newer X3VM retains, though I am still on the fence about the "put FPSR
    bits into HOBs of GP" thing).

    Does mean that either dynamic rounding mode is lost every time a GP
    reload is done (though, only for the callee), or that setting the
    rounding mode also needs to update the corresponding PBO GP pointer
    (which would effectively make it semi-global but tied to each PE image).

    The traditional assumption though was that dynamic rounding mode is
    fully global, and I had been trying to make it dynamically scoped.

    The modern interpretation is that the dynamic rounding mode can be set
    prior to any FP instruction. So, you better be able to set it rapidly
    and without pipeline drain, and you need to mark the downstream FP
    instructions as dependent on this.

    So, it may be that having FPSR as its own thing, and then explicitly
    saving/restoring FPSR in functions that modify the rounding mode, is
    a better option.

    RM is separate from FPSR in My 66000, and uniquely accessible.
    -----------------------
    Though, OTOH, Quake has stuff like:
    typedef float vec3_t[3];
    vec3_t v0, v1, v2;
    ...
    VectorCopy(v0, v1);
    Where VectorCopy is a macro that expands it out to something like, IIRC,
    do { v1[0]=v0[0]; v1[1]=v0[1]; v1[2]=v0[2]; } while(0);

    Where BGBCC will naively load each value, widen it to double, narrow it
    back to float, and store the result.

    Sounds like you should be working on the compiler instead of microarchitectures.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 12 19:24:36 2025
    On Thu, 12 Jun 2025 15:38:30 +0000, quadibloc wrote:

    I thought I saw a post in this thread which asked why I included VLIW capabilities in my ISA.
    Perhaps that post was deleted, or I saw it in another thread and misremembered.
    However, I thought it was worth a reply, in case anyone had forgotten
    what VLIW was "good for".

    Today's microprocessors achieve considerably improved performance
    through the use of Out-of-Order Execution. Compare the 486, which
    doesn't have it, to the Pentium II, which does have it. Intel's Atom
    processors originally did not have OoO in order to be small and
    inexpensive, but their low performance, plus smaller transistors making
    more complex chips more easily possible, led to even the Atom going OoO.

    OoO comes with a cost, though. It increases transistor costs
    considerably. Also, it comes with vulnerabilities like Spectre and
    Meltdown.

    VLIW, in the sense of the Itanium or the TMS 320C6000, offers the
    promise of achieving OoO level performance without the costs of OoO.

    Pick a VLIW that was as successful as x86 or ARM in the marketplace.

    This is because it lets the pipeline achieve high efficiency by directly indicating within the code itself when succeeding instructions may be executed in parallel, without requiring the computer to make the effort
    of determining when this is possible.

    That is the theory. But theory works better in theory than in practice.
    Itanic made a run at it, but ultimately failed, as OoO was found to be
    the better tool. Itanic held some leads for a while, while it had 2×
    the number of pins on the memory side and 2× the function units on the
    calculation side. When it had an equal number of pins, it was never
    ahead.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Thu Jun 12 19:55:06 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    Actually, it seems to me that for the first RISC generation, it made
    the least sense. You could not afford the transistors to do special
    handling of the PC.

    Nowadays, you can afford it, but the question still is whether it is cost-effective. Looking at recent architectures, none has PC
    addressable like a GPR, so no, it does not look to be cost-effective
    to have the PC addressable like a GPR. AMD64 and ARM A64 have
    PC-relative addressing modes, while RISC-V does not.

    Power has added this as an extended instruction with the v3.1
    version of their ISA (Power 10); you can now do loads and stores
    relative to the PC with a 34-bit signed offset.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Thu Jun 12 20:50:51 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    If fp128 is to be (someday) supported then the index scaling must be
    at least 0..4 (> 2 bits). A scaling of 0..7 or 3 bits would support
    single instruction array index calculations up to fp128 octonions.
    The index scaling selects the octonion array element and the
    displacement selects a coefficient in it.

    Index scaling is very nice to have when you add, let's say, the
    elements of a real array to the real part of a complex array -
    you only need one register for the index variable.

    For just doing

    for (i=0; i<n; i++)
        a[i] = b[i] + c[i];

    where a, b and c are all of the same type, you can use
    a non-scaled single index register and increment it by
    the size of the type.
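    (E.g., after strength reduction the loop needs one byte-offset
    induction variable; a sketch assuming double elements:)

        /* one unscaled index, bumped by the element size */
        void add_arrays(double *a, const double *b, const double *c,
                        long n)
        {
            long end = n * (long)sizeof(double);
            for (long off = 0; off != end; off += sizeof(double))
                *(double *)((char *)a + off) =
                    *(const double *)((const char *)b + off) +
                    *(const double *)((const char *)c + off);
        }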

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Jun 13 00:00:57 2025
    On Thu, 12 Jun 2025 21:30:39 +0000, BGB wrote:

    On 6/12/2025 2:13 PM, MitchAlsup1 wrote:
    ------------------------------

    GP: Similar situation to LR, as it mostly looks like a GPR.
    In my CPU core and JX2VM, the high bits of GP were aliased to FPSR, so
    saving/restoring GP will also implicitly save/restore the dynamic
    rounding mode and similar (as opposed to proper RISC-V which has this
    stuff in a CSR).

    With universal constants, you get this register back.


    Well, if using an ABI that either allows absolute addressing or PC-rel
    access to globals.

    It is the ISA that directly supports access to globals.


    The ABI designs I am using in BGBCC and TestKern use a global pointer
    for accessing globals, and allocate the storage for ".data"/".bss"
    separately from ".text". In this ABI design, the pointer is unavoidable.

    I want a system where .data and .bss can be > 1TB away from each other,
    so that .data grows when ld.so loads another dynamic library, and .bss
    grows for the same reasons.

    Does allow multiple process instances in a single address space with non-duplicated ".text" though (and is more friendly towards NOMMU
    operation).




    Though, this isn't practically too much different from using the HOB's
    of captured LR values to hold the CPU ISA mode and similar (which my
    newer X3VM retains, though I am still on the fence about the "put FPSR
    bits into HOBs of GP" thing).

    Does mean that either dynamic rounding mode is lost every time a GP
    reload is done (though, only for the callee), or that setting the
    rounding mode also needs to update the corresponding PBO GP pointer
    (which would effectively make it semi-global but tied to each PE image).
    The traditional assumption though was that dynamic rounding mode is
    fully global, and I had been trying to make it dynamically scoped.

    The modern interpretation is that the dynamic rounding mode can be set
    prior to any FP instruction. So, you better be able to set it rapidly
    and without pipeline drain, and you need to mark the downstream FP
    instructions as dependent on this.

    Errm, there is likely to be a delay here, otherwise one will get a stale rounding mode.

    RM is "just 3-bits" that get read from control register and piped
    through instruction queue to function unit. Think of the problem
    one would have if a hyperthreaded core had to stutter step through
    changing RM ...


    So, setting the rounding mode might be something like:
    MOV .L0, R14
    MOVTT GP, 0x8001, GP //Set to rounding mode 1, clear flag bits
    JMP R14 //use branch to flush pipeline
    .L0: //updated FPSR now ready
    FADDG R11, R12, R10 //FADD, dynamic mode

    Setting RM to a constant (known) value::

    HRW rd,RM,#imm3 // rd gets old value

    Or, use an encoding with an explicit (static) rounding mode:
    FADD R11, R12, 1, R10


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Thu Jun 12 20:36:07 2025
    On 6/12/2025 8:09 PM, quadibloc wrote:
    On Thu, 12 Jun 2025 15:44:14 +0000, Stephen Fuld wrote:

    On 6/12/2025 8:00 AM, quadibloc wrote:

    The IBM System/360 had a base and index register and a 12-bit
    displacement.

    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).

    I hadn't thought about it that way.

    It does make sense that on a timesharing system, virtual memory meant
    that different users would not have to share the same memory space, so programs wouldn't have to be relocatable.

    But if you drop base registers for that reason, suddenly you are forced
    to always use virtual memory.

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    snip


    Of course, then why did the 68020 support it, I could ask.

    Someone more familiar with the 68020 would have to answer that.

    But in any case, the answer to Thomas's original question is that there
    is no use case for it now, and the cost in instruction bits is too large
    to consider using them.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Fri Jun 13 06:03:02 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory-to-memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    You would only have to store it as an offset to a base pointer,
    so basically a "fat pointer" containing both base and index
    register. Of course, nobody did that.
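    (Purely hypothetically, such a fat pointer might have looked like:)

        #include <stdint.h>

        /* the base half names an OS-managed region, not a raw address,
           so a swap-in at a new location invalidates nothing */
        struct fat_ptr {
            uint16_t region;   /* which hidden base register / region */
            uint32_t offset;   /* displacement within that region */
        };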

    They really didn't think that one through.

    Which brings me to one of my favorite musings... how would a /360
    have looked with the benefit of things that could/should have been
    seen at the time? PC-relative branches with a 16-bit offset and
    ARM-style condition codes come to mind (introduced with the /390,
    I believe), as do binary floats (don't save those few gates).

    Also, just discussed: Throw out the base registers and put in
    memory operations with 16-bit offset.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Fri Jun 13 07:31:20 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    quadibloc <quadibloc@gmail.com> writes:
    To look like RISC, and yet to have the code density of CISC

    I have repeatedly posted measurements of code sizes of several
    programs on various architectures, most recently in
    <2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
    posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.


    Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
    packages:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Seems to me that the text size is not the interesting
    metric here - rather the typical working set size is
    far more important.

    Yes, something like it. But how do you measure it? And do you think
    that the text sizes of binaries for different architectures are not
    correlated to the working set sizes of these architectures?

    It may be that text size isn't a particularly
    good metric for judging instruction set effectiveness.

    Why would it not be a good predictor, and what would you use instead?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Fri Jun 13 07:03:07 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    Compiler people H A T E pairing {LoB = 0 and 1} and sharing
    {Rsecond = Rfirst+1}, they want to be able to allocate any
    value into any register without such constraints. After all
    register allocation is already NP, pairing and sharing moves
    the needle to NP-hard.

    Register allocation is NP-complete, and thus also NP-hard (every
    NP-complete problem is NP-hard).

    I don't know if the additional condition for NP-completeness does not
    hold for register allocation with pairing (I doubt it), but the reason
    why compiler people dislike it is not this practically irrelevant
    theoretical difference. We don't solve even NP-complete problems in
    compilers. Instead, we use heuristics to solve a different problem:
    produce a good, but not necessarily optimal register allocation;
    going for optimality would be NP-complete.

    The reason why compiler people dislike pairing is that it introduces
    another complication in an already complicated and bug-fraught part of
    the compiler. It gets especially complicated if you want to produce a
    good solution.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to quadibloc on Fri Jun 13 07:39:55 2025
    quadibloc <quadibloc@gmail.com> writes:
    OoO comes with a cost, though. It increases transistor costs
    considerably. Also, it comes with vulnerabilities like Spectre and
    Meltdown.

    No. AMD's OoO CPUs never had Meltdown AFAIK. As for Spectre, that
    can be fixed (not mitigated) at moderate hardware and performance cost
    with Invisible Speculation; I wrote an overview paper about Spectre
    and how to fix it <http://www.euroforth.org/ef23/papers/ertl.pdf>, but
    the actual Invisible Speculation research was done by others. It's
    just that the hardware designers don't want to; apparently the
    customers are not interested enough and prefer to pay the performance
    and software development cost of Spectre mitigations (or are too
    indifferent to care about it at all).

    VLIW, in the sense of the Itanium

    IA-64 is EPIC, not VLIW. And IA-64 gives you Spectre, too, in a way
    that cannot be fixed by Invisible Speculation, because the
    speculation is architectural, and there is no explicit "commit" that
    turns speculation into non-speculation.

    You can avoid Spectre in IA-64 by avoiding the use of the speculation
    features of the architecture (in particular control-speculative or data-speculative loads) or at least avoiding further loads based on
    the loaded data while that is still speculative (you probably need to
    avoid other things as well). That's the IA-64 equivalent of the
    Speculative Load Hardening mitigation which costs more than a factor
    of 2 in performance on OoO CPUs. I expect that it costs less
    performance on IA-64, but the end result will still be less
    performance on IA-64 than on OoO CPUs with the same transistor and
    power budget.

    This is because it lets the pipeline achieve high efficiency by
    directly indicating within the code itself when succeeding instructions
    may be executed in parallel, without requiring the computer to make the
    effort of determining when this is possible.

    Looking at the end result, IA-64 implementations consumed more
    transistors and more power than contemporary in-order CPUs that
    produced better SPECint results.

    For code that spends a lot of time in software-pipelinable loops
    (SPECfp), IA-64 looked competitive for a while, but SIMD reduces the
    overheads even more (there's a reason why Cray went for it and why
    Cray's customers went for his products), and the additional
    flexibility of EPIC apparently does not provide a benefit over SIMD in
    enough cases.

    Concerning the TMS 320C6000, that's designed exactly for the kinds of
    loops where EPIC and VLIW are competitive, but even for that, I have
    not heard anything about it in recent years (which may or may not mean something).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Fri Jun 13 13:14:29 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 6/12/2025 8:09 PM, quadibloc wrote:
    On Thu, 12 Jun 2025 15:44:14 +0000, Stephen Fuld wrote:

    On 6/12/2025 8:00 AM, quadibloc wrote:

    The IBM System/360 had a base and index register and a 12-bit
    displacement.

    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).

    I hadn't thought about it that way.

    It does make sense that on a timesharing system, virtual memory meant
    that different users would not have to share the same memory space, so
    programs wouldn't have to be relocatable.

    But if you drop base registers for that reason, suddenly you are forced
    to always use virtual memory.

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    Note that the B5500 had a form of virtual memory before the 360
    was released. The B6500 (1969) added paging.

    The B3500 operated as you describe, with a hidden base register;
    until 1983 when the architecture was enhanced to support
    eight hidden base registers (supporting 8 active regions).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Jun 13 14:48:20 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    quadibloc <quadibloc@gmail.com> writes:
    To look like RISC, and yet to have the code density of CISC

    I have repeatedly posted measurements of code sizes of several
    programs on various architectures, most recently in
    <2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
    posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.


    Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
    packages:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Seems to me that the text size is not the interesting
    metric here - rather the typical working set size is
    far more important.

    Yes, something like it. But how do you measure it? And do you think
    that the text sizes of binaries for different architectures are not
    correlated to the working set sizes of these architectures?

    Without data, it's all speculation. Given, however, that there
    doesn't seem to be a rush to replace x86 or arm64 with
    armhf or riscv64, I don't believe that the text size is
    particularly interesting to the general user. There's only
    a 17% difference between riscv and arm64, after all, and
    arm64 is far more mature.


    It may be that text size isn't a particularly
    good metric for judging instruction set effectiveness.

    Why would it not be a good predictor, and what would you use instead?

    I'm not convinced that "instruction set effectiveness" is a
    useful metric for modern systems. Having been involved with
    ARM64 from 2012 (including a stint on the technical advisory board),
    I've watched the architecture evolve over the last fourteen years
    into a rather complicated behemoth - due primarily to the evolving
    requirements of the customer base and the desire to target additional application classes. Whether or not the architecture supports
    in-line 64-bit constants with a simple instruction encoding doesn't
    seem particularly interesting to anyone other than compiler
    code generator developers.

    I can't say that 'text size' has been
    a major consideration for the general end-user community outside of
    a small subset of the embedded system development community. It
    has an impact on Icache, certainly, but can you quantify it
    vis-a-vis the other architectural trade-offs between the competing
    processor families?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Fri Jun 13 08:15:55 2025
    On 6/13/2025 4:52 AM, quadibloc wrote:
    On Thu, 12 Jun 2025 15:44:14 +0000, Stephen Fuld wrote:

    On 6/12/2025 8:00 AM, quadibloc wrote:
    On Wed, 11 Jun 2025 17:05:13 +0000, Thomas Koenig wrote:

    What is the use case for having base and index register and a
    16-bit displacement?

    The IBM System/360 had a base and index register and a 12-bit
    displacement.

    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a displacement field large enough to address all of memory in every instruction.

    No. If you wanted to address beyond the reach of the displacement
    field, you still had the index register. And remember that the need for
    that is reduced because you could have a 16-bit displacement by using
    the four bits freed up by eliminating the base register field.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 13 08:23:17 2025
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory-to-memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    Since the "base" pointer was known to the OS, if the program was swapped
    out and swapped back in to a different location in memory, the OS just
    changed the value in the base register to reflect the new location.
    This didn't work in the S/360 because the OS didn't know what
    register(s) the user program was using as base register(s) so it
    couldn't change the values in them if the program was to be relocated.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Fri Jun 13 17:10:09 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    However, some comp.arch regulars seem to consider it quite important,
    and they regularly make claims about the code density of various
    instruction sets. I have started measuring the text size of programs
    in order to provide empirical counterevidence to this wishful
    thinking. This apparently has made little impression on those making
    such claims, but maybe the rest of you will gain something from these
    data.

    One problem is that different architectures may make different
    decisions about such things as inlining, cloning and loop unrolling.
    While your numbers can be indicative, a comparison with -Os would
    give a better overview of achievable code size.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Fri Jun 13 15:38:43 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    Without data, it's all speculation. Given, however, that there
    doesn't seem to be a rush to replace x86 or arm64 with
    armhf or riscv64, I don't believe that the text size is
    particularly interesting to the general user.

    Probably not, but I don't think the reason is that "working set size"
    would produce significantly different results.

    However, apparently code size is important enough in some markets that
    ARM introduced not just Thumb, but Thumb2 (aka ARM T32), and MIPS
    followed with MIPS16 (repeating the Thumb mistake) and later MicroMIPS
    (which allows mixing 16-bit and 32-bit encodings); Power specified VLE
    (are there any implementations of that?); and RISC-V specified the C
    extension, which is implemented widely AFAICT.

    I'm not convinced that "instruction set effectiveness" is a
    useful metric for modern systems.

    One would have to define that first.

    As for code density (however measured), yes, I think that in the
    markets that ARM A64 was designed for, that's probably not a top
    consideration when selecting an instruction set.

    However, some comp.arch regulars seem to consider it quite important,
    and they regularly make claims about the code density of various
    instruction sets. I have started measuring the text size of programs
    in order to provide empirical counterevidence to this wishful
    thinking. This apparently has made little impression on those making
    such claims, but maybe the rest of you will gain something from these
    data.

    It
    has an impact on Icache, certainly, but can you quantify it

    One way of quantifying it would be to take (or simulate)
    implementations with the same I-cache organization (size, cache line
    length, associativity, replacement policy and lack of an uop cache)
    and measure the number of I-cache misses for running a specific
    program. Actually, simulation is better; running would include
    differences in prefetching, which are probably influenced by
    considerations other than code density.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jun 13 17:22:12 2025
    On Fri, 13 Jun 2025 7:39:55 +0000, Anton Ertl wrote:

    quadibloc <quadibloc@gmail.com> writes:
    OoO comes with a cost, though. It increases transistor costs
    considerably. Also, it comes with vulnerabilities like Spectre and Meltdown.

    No. AMD's OoO CPUs never had Meltdown AFAIK. As for Spectre, that
    can be fixed (not mitigated) at moderate hardware and performance cost
    with Invisible Speculation; I wrote an overview paper about Spectre
    and how to fix it <http://www.euroforth.org/ef23/papers/ertl.pdf>, but
    the actual Invisible Speculation research was done by others. It's
    just that the hardware designers don't want to; apparently the
    customers are not interested enough and prefer to pay the performance
    and software development cost of Spectre mitigations (or are too
    indifferent to care about it at all).

    VLIW, in the sense of the Itanium

    IA-64 is EPIC, not VLIW.

    IA-64 is an EPIC failure, as are all other VLIW-like architectures.

    And IA-64 gives you Spectre, too, in a way
    that cannot be fixed by Invisible Speculation, because the
    speculation is architectural, and there is no explicit "commit" that
    turns speculation into non-speculation.
    -------------------

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Fri Jun 13 17:40:44 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory to memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    Since the "base" pointer was known to the OS, if the program was swapped
    out and swapped back in to a different location in memory, the OS just changed the value in the base register to reflect the new location.

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    This didn't work in the S/360 because the OS didn't know what
    register(s) the user program was using as base register(s) so it
    couldn't change the values in them if the program was to be relocated.

    Even with that provision, it would not have worked.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Fri Jun 13 17:42:03 2025
    quadibloc <quadibloc@gmail.com> schrieb:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a displacement field large enough to address all of memory in every instruction.

    Registers do that, and you only need a single one.

    In fact, I think this was the primary reason, and using
    them to relocate code and data was a nice idea that came after.

    The literature says otherwise.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 13 10:57:13 2025
    On 6/13/2025 10:40 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory to memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    Since the "base" pointer was known to the OS, if the program was swapped
    out and swapped back in to a different location in memory, the OS just
    changed the value in the base register to reflect the new location.

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    Got it :-)


    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    Short answer. I don't know. Perhaps someone who does can provide the
    answer. I agree that a straight pointer would work; it can be computed
    in the calling routine, since that routine knows the base address of
    the common block. And since it is a passed argument, the compiler would
    know not to need a base register to reference it from the subroutine.
    But that doesn't negate the use of base registers in the more common
    case of typical code.



    This didn't work in the S/360 because the OS didn't know what
    register(s) the user program was using as base register(s) so it
    couldn't change the values in them if the program was to be relocated.

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 13 11:00:59 2025
    On 6/13/2025 10:42 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    Registers do that, and you only need a single one.

    Yup!

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Fri Jun 13 18:11:18 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared
    a common designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space. The other machines made it invisible from user space
    and added its contents to every memory access. This does not take
    up opcode space and allows swapping in and out of the whole process,
    which was a much better solution for the early 1960s. (Actually, I
    believe the UNIVAC had two, one for program and one for data).
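
    In code terms, the check those machines applied on every access is
    tiny (a C sketch of base-and-bounds, mine, with invented names):

    #include <stdint.h>
    #include <stdlib.h>

    /* hidden base+limit relocation, invisible to the user program */
    static uint32_t translate(uint32_t vaddr, uint32_t base, uint32_t limit)
    {
        if (vaddr >= limit)     /* outside the process's allocation */
            abort();            /* stand-in for a protection fault */
        return base + vaddr;    /* the OS can change base on swap-in */
    }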

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jun 13 18:37:29 2025
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory to memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    Since the "base" pointer was known to the OS, if the program was swapped
    out and swapped back in to a different location in memory, the OS just
    changed the value in the base register to reflect the new location.

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    The term "base register" is being used in different ways in this thread.

    a) We have the base register of base-and-bounds program relocation
    {that the user is not allowed to see or know}
    b) we have a pointing register (/360) that can have indexing and
    offsetting applied to form a virtual address.

    Since base-and-bounds slipped into history circa 1980 after even microprocessors got TLBs, I suggest we use the /360 terminology
    for address generation and some kind of MMU/TLB terminology for
    relocation and protection.

    Back to the question: What FOO should pass to BAR is a pointer to B
    if arguments are passed by reference, or the actual value of B if
    arguments are passed by copy-in-copy-out.

    The former:

    LDA R1,[IP,,#COMMON.COM.B-.]
    CALL BAR

    BAR:
    LDD R2,[R1]
    ADD R2,R2,#1
    STD R2,[R1]
    RET

    the latter:

    LDD R1,[IP,,#COMMON.COM.B-.]
    CALL BAR
    STD R1,[IP,,#COMMON.COM.B-.]

    BAR:
    ADD R1,R1,#1
    RET

    Both means work rather well in practice.
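
    In C terms, the two conventions look like this (a sketch of the same
    idea, mine rather than code from the post):

    /* pass by reference: BAR gets a pointer to B */
    void bar_ref(float *a)  { *a += 1.0f; }

    /* copy-in-copy-out: BAR gets the value, FOO stores the result */
    float bar_cico(float a) { return a + 1.0f; }

    void foo(float *com_b)
    {
        bar_ref(com_b);               /* the former */
        *com_b = bar_cico(*com_b);    /* the latter */
    }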

    This didn't work in the S/360 because the OS didn't know what
    register(s) the user program was using as base register(s) so it
    couldn't change the values in them if the program was to be relocated.

    Even with that provision, it would not have worked.

    Agreed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 13 11:55:51 2025
    On 6/13/2025 11:11 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space.

    Yes. That is the point I was making. Now I am lost as to why you said
    above "even with that provision, it would not have worked."


    The other machines made it invisible from user space
    and added its contents to every memory access. This does not take
    up opcode space and allows swapping in and out of the whole process,
    which was a much better solution for the early 1960s.

    You and I are in violent agreement! Only John seems to disagree.


    (Actually, I
    believe the UNIVAC had two, one for program and one for data).

    Correct. I didn't want to complicate the discussion. The separation of
    instructions from data allowed the OS to put them in different memory
    modules, allowing simultaneous access to the current instruction's
    operand and to the next instruction fetch, thus dramatically improving
    performance. (Remember, no cache.) And later follow-on systems such as
    the 1110 actually had two sets of Instruction and Data bases, and later
    machines went to 16 base registers. This is similar to the progression
    Scott talked about for the Burroughs medium systems.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jun 13 18:42:40 2025
    On Fri, 13 Jun 2025 18:11:18 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space.

    A base register not part of GPRs is a descriptor (or segment).
    And we don't want to go there.

    The other machines made it invisible from user space
    and added its contents to every memory access. This does not take
    up opcode space and allows swapping in and out of the whole process,

    It also fails when one has 137 different things to track with those descriptors.

    which was a much better solution for the early 1960s. (Actually, I
    believe the UNIVAC had two, one for program and one for data).

    Still insufficient for modern use cases.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Fri Jun 13 18:18:09 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space. The other machines made it invisible from user space
    and added its contents to every memory access. This does not take
    up opcode space and allows swapping in and out of the whole process,
    which was a much better solution for the early 1960s. (Actually, I
    believe the UNIVAC had two, one for program and one for data).

    The Burroughs systems had two 'states'. Control state
    (which today is called kernel or supervisor mode) was defined by a
    BASE register with the value 0. The MCP executed with
    BASE=0 and had access to all of memory (directly to the
    first 500KB, indirectly for the rest).

    Normal state was defined by a non-zero BASE register and
    privileged instructions would fault.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lars Poulsen@21:1/5 to Stephen Fuld on Fri Jun 13 19:49:58 2025
    On 2025-06-13, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    All I can say is that it worked in several other contemporaneous architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared
    a common designer (Seymour Cray).

    While Seymour Cray worked on the 1103, he was off to CDC long before the
    1108 was designed. I don't think the idea of multiprogramming and
    swapping (and hence the need for a base/limit register pair) had entered anyone's mind back in the days of the 1103.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Fri Jun 13 13:50:24 2025
    On 6/13/2025 12:50 PM, quadibloc wrote:
    On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
    On 6/13/2025 4:52 AM, quadibloc wrote:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    No.  If you wanted to address larger than the displacement field, you
    still had the index register.  And remember that the need for that is
    reduced because you could have a 16 bit displacement by using the four
    bits freed up by eliminating the base register field.

    Certainly, you can use the index register to address an area larger than
    the displacement field. Otherwise, RISC CPUs wouldn't work. However,
    then if you want to do an array access in that wider range, once more
    you need extra instructions to calculate the index value.

    And "reduced" is not the same as "completely eliminated", and so I fail
    to see how that makes base registers unnecessary.

    It's all a tradeoff. Yes, occasionally you need an extra instruction,
    but you gain four bits for a larger displacement (or something else if
    you want). And don't forget, you need the "extra" BALR instructions, or
    other ones to load the base register for every 4K chunk of data or instructions, and the loss of an otherwise available register.

    Everyone else who has evaluated the tradeoff chose not to use the extra register.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Jun 13 21:09:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:


    The term "base register" is being used in different ways in this thread.

    a) We have the base register of base-and-bounds program relocation
    {that the user is not allowed to see or know}
    b) we have a pointing register (/360) that can have indexing and
    offsetting applied to form a virtual address.

    Since base-and-bounds slipped into history circa 1980 after even

    Actually, closer to 2010, as the B3500 descendants were still
    running production then (some of the systems were 25+ years old
    when they were retired and replaced by dozens of windows server
    boxes).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Fri Jun 13 20:50:49 2025
    On Fri, 13 Jun 2025 20:12:07 +0000, quadibloc wrote:

    On Thu, 12 Jun 2025 19:01:52 +0000, MitchAlsup1 wrote:

    What code is produced from::

    uint32_t function( uint32_t u )
    {
        int32_t i[99];
        return i[u];
    }

    That wouldn't even compile. The array i is not initialized.

    However, I'll assume that this is a fragment of a larger program.

    You've stated that this is for a 64-bit machine.

    So it takes an index variable as an argument, and returns an element
    from an array.

    The array is declared as signed 32 bit integers, but the function
    returns an unsigned 32 bit integer.

    Well, the answer is that it doesn't matter if I use a "load" or an
    "unsigned load", since what the function returns is a pointer to a *32-bit-long* value in memory. Which the calling program will interpret
    as unsigned.

    No, the function returns (unsigned) i[u]; the value itself, not a
    pointer to it. The question is how CII deals with signed/unsigned
    mismatches expressly written into the code. Seems to me that having
    both signed and unsigned LDs, and a trifling of pattern recognition,
    solves the problem.
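
    A quick C illustration of the point (my sketch, assuming a
    two's-complement target): sign- and zero-extending loads of the same
    32-bit value agree in their low 32 bits, so a 32-bit result cannot
    tell them apart; only bits 63..32 of the 64-bit register differ.

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        int32_t  v  = -5;
        int64_t  sx = v;              /* what a signed LD produces */
        uint64_t zx = (uint32_t)v;    /* what an unsigned LD produces */
        /* the function's uint32_t result keeps only the low 32 bits */
        assert((uint32_t)sx == (uint32_t)zx);
        return 0;
    }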

    Or maybe the function return is in register zero. In that case, I will
    indeed generate a "load" rather than an "unsigned load" inside the
    program. The caller, however, will presumably extract the least
    significant bits of that register into a 32-bit long variable before
    use, so my "error" will not have disastrous consequences.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Fri Jun 13 22:01:54 2025
    On Fri, 13 Jun 2025 20:50:24 +0000, Stephen Fuld wrote:

    On 6/13/2025 12:50 PM, quadibloc wrote:
    On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
    On 6/13/2025 4:52 AM, quadibloc wrote:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    No.  If you wanted to address larger than the displacement field, you
    still had the index register.  And remember that the need for that is
    reduced because you could have a 16 bit displacement by using the four
    bits freed up by eliminating the base register field.

    Certainly, you can use the index register to address an area larger than
    the displacement field. Otherwise, RISC CPUs wouldn't work. However,
    then if you want to do an array access in that wider range, once more
    you need extra instructions to calculate the index value.

    And "reduced" is not the same as "completely eliminated", and so I fail
    to see how that makes base registers unnecessary.

    It's all a tradeoff. Yes, occasionally you need an extra instruction,
    but you gain four bits for a larger displacement (or something else if
    you want). And don't forget, you need the "extra" BALR instructions, or
    other ones to load the base register for every 4K chunk of data or instructions, and the loss of an otherwise available register.

    Everyone else who has evaluated the tradeoff chose not to use the extra register.

    Can you restate what you intended to mean in the last sentence/paragraph
    but use different words ??

    Certainly the /360 designers, the VAX designers, the x86 designers,
    and others looked at the problem and allowed
    [Rpointer+Rindex+displacement] addressing. So, it is not everyone.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Fri Jun 13 23:10:56 2025
    On 6/13/2025 3:01 PM, MitchAlsup1 wrote:
    On Fri, 13 Jun 2025 20:50:24 +0000, Stephen Fuld wrote:

    On 6/13/2025 12:50 PM, quadibloc wrote:
    On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
    On 6/13/2025 4:52 AM, quadibloc wrote:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    No.  If you wanted to address larger than the displacement field, you
    still had the index register.  And remember that the need for that is
    reduced because you could have a 16 bit displacement by using the four
    bits freed up by eliminating the base register field.

    Certainly, you can use the index register to address an area larger than
    the displacement field. Otherwise, RISC CPUs wouldn't work. However,
    then if you want to do an array access in that wider range, once more
    you need extra instructions to calculate the index value.

    And "reduced" is not the same as "completely eliminated", and so I fail
    to see how that makes base registers unnecessary.

    It's all a tradeoff.  Yes, occasionally you need an extra instruction,
    but you gain four bits for a larger displacement (or something else if
    you want). And don't forget, you need the "extra" BALR instructions, or
    other ones to load the base register for every 4K chunk of data or
    instructions, and the loss of an otherwise available register.

    Everyone else who has evaluated the tradeoff chose not to use the extra
    register.

    Can you restate what you intended to mean in the last sentence/paragraph
    but use different words ??

    Certainly the /360 designers, the VAX designers, the x86 designers, and
    others looked at the problem and allowed [Rpointer+Rindex+displacement]
    addressing. So, it is not everyone.

    OK. As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory. They
    chose the "visible" (to the program) base register instead of, say, the
    "invisible" base register, which was, as I have said, IMO a mistake.

    The VAX was in a different situation. Being a virtual memory design,
    they didn't need it for the reason that the S/360 did. I am not an
    expert, but ISTM that the VAX designers wanted to include almost
    anything in the ISA to close the "semantic gap", and certainly didn't
    feel constrained to keep instructions within 32 bits, so adding the 3
    input address calculation, with potentially large offsets seemed
    reasonable to them. For various reasons, this all proved not to be a
    good choice eventually.

    As for the X86, I freely confess to not knowing the constraints its
    designers were operating under, so I can't really comment.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Sat Jun 14 09:26:04 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why? I remember reading that the systems software people
    spent a lot of work on an overlay mechanism, so the thinking at IBM at
    the time was apparently not about keeping several programs in RAM at
    the same time, but about running one program at one time, and finding
    ways to make that program fit into available RAM.

    In any case, it's no problem to add a virtual-memory mechanism that is
    not visible to user-level, or maybe even kernel-level (does the
    original S/360 have that?) programs, whether it's paged virtual memory
    or a simple base+range mechanism.

    As for the X86, I freely confess to not knowing the constraints its
    designers were operating under, so I can't really comment.

    There is no X86.

    For the 8086 architecture, the effective addresses are reg, reg+const
    or reg+reg (with severe restrictions on the registers usable for that;
    the 8086 does not have GPRs).

    For IA-32, the addresses can be reg+reg*1/2/4/8+const, using any
    registers (i.e., IA-32 has GPRs); this addressing mode was probably
    inspired by the VAX, which was in full reign when IA-32 was designed
    (the 386 was released in 1985). IA-32 has both the segmentation
    mechanism inherited from the 80286 (and extended to 32 bit segments)
    and paging, so using the addressing modes for any form of virtual
    memory was not the intention for providing this addressing mode.
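
    For illustration (my example, not from the post): with a typical SysV
    compiler, a single IA-32/AMD64 memory operand folds base, scaled
    index, and displacement, so this access compiles to one load using a
    reg+reg*4+const effective address:

    #include <stdint.h>
    #include <stdio.h>

    int32_t elem(int32_t *a, long i)
    {
        return a[i + 3];    /* one load: mov eax,[rdi+rsi*4+12] on AMD64 */
    }

    int main(void)
    {
        int32_t a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        printf("%d\n", elem(a, 2));     /* prints 5 */
        return 0;
    }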

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to BGB on Sat Jun 14 10:45:57 2025
    BGB <cr88192@gmail.com> schrieb:
    On 6/12/2025 10:00 AM, quadibloc wrote:
    On Wed, 11 Jun 2025 17:05:13 +0000, Thomas Koenig wrote:

    What is the use case for having base and index register and a
    16-bit displacement?

    The IBM System/360 had a base and index register and a 12-bit
    displacement.
    Most microprocessors have a base register and a 16-bit displacement.


    Serious overkill...

    For [Rb+Disp] with a 32-bit encoding, 9 or 10 is sufficient, if scaled
    by the element size, 12 otherwise.

    Maybe for the code you like to compile, but a lot of software
    has other needs.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Jun 14 10:44:16 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why?

    I don't have a BibTeX entry like you usually do, but you can
    find "Architecture of the IBM System/360" by Amdahl, Blaauw and
    Brooks easily.

    A quote:

    # Now the question was: How much capacity was to be made directly
    # addressable, and how much addressable only via base registers? Some
    # early uses of base register techniques had been fairly unsuccessful,
    # principally because of awkward transitions between direct and
    # base addressing. It was decided to commit the system completely
    # to a base-register technique; the direct part of the address,
    # the displacement, was made so small (12 bits, or 4096 characters)
    # that direct addressing is a practical programming technique only
    # on very small models. This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    That they got wrong, egregiously so, as the example with passing
    a pointer to something from a COMMON block shows.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Sat Jun 14 10:48:01 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lars Poulsen@21:1/5 to Anton Ertl on Sat Jun 14 12:42:19 2025
    On 2025-06-14, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    IA-32 has both the segmentation
    mechanism inherited from the 80286 (and extended to 32 bit segments)
    and paging, so using the addressing modes for any form of virtual
    memory was not the intention for providing this addressing mode.

    I see this as clearly a case of wanting to have it both ways.
    The segmentation mechanism was never loved by anyone. The 8086
    rudiments were laughable. The 286 was a better try, but still
    a far cry from 1960s experiments. 386 fixed most of the holes,
    but by then, everybody just wanted IBM370-style paging in a big,
    beautiful flat memory space. But they needed to support a lot of
    legacy code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sat Jun 14 15:40:28 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why?

    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks

    # Now the question was: How much capacity was to be made directly
    # addressable, and how much addressable only via base registers? Some
    # early uses of base register techniques had been fairly unsuccessful,
    # principally because of awkward transitions between direct and
    # base addressing. It was decided to commit the system completely
    # to a base-register technique; the direct part of the address,
    # the displacement, was made so small (12 bits, or 4096 characters)
    # that direct addressing is a practical programming technique only
    # on very small models.

    Up to that point I was thinking of MIPS, Alpha, and RISC-V with their
    reg+const addressing, and I thought: Ok, these machines actually
    support absolute (aka direct) addressing by using the zero register as
    reg. But of course nobody ever uses the zero register for addressing,
    and absolute addressing is not used.

    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    That they got wrong, egregiously so, as the example with passing
    a pointer to something from a COMMON block shows.

    I missed or did not understand that example. What's the issue?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Jun 14 15:53:52 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    For the 8086 architecture, the effective addresses are reg, reg+const
    or reg+reg (with severe restrictions on the registers usable for that;
    the 8086 does not have GPRs).

    There is also absolute addressing on the 8086 and IA-32 (that
    encoding was repurposed for RIP-relative addressing on AMD64 IIRC).

    And I completely ignored the segment registers, which were intended
    to provide a virtual-memory/relocation mechanism, but in MS-DOS
    programs were used as a cumbersome way to access more than 64KB of
    memory. Only in stuff like PC/IX was it used as intended AFAICT.

    The tragedy continued with the 80286, which now supported protected
    segments with up to 64KB, well suited for the kind of usage in PC/IX
    (but apparently that never had a 286 port) and actually used in Xenix,
    but which was against the grain for the kind of usage in MS-DOS.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Sat Jun 14 09:45:23 2025
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON
    block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Robert Finch on Sat Jun 14 16:05:53 2025
    Robert Finch <robfi680@gmail.com> writes:
    I think the IA-64 has a lot of interesting features.

    Certainly. But it's interesting how OoO makes each of them
    unnecessary.

    It looks like a
    processor that was designed a while ago, before the current batch of
    superscalar machines became popular.

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.

    If the register file were used as a flat register file, instead
    of one that rotates the registers it might be simpler to use.

    Do the register rotation at the front end, and the OoO engine just
    sees flat register names. The register rename table will be big (and
    you will probably want to keep it with every branch or potentially
    trapping instruction), but Oracle eventually found a way to deal with
    that (but our measurements show that even the fastest SPARCs are still
    slow).
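
    A sketch of what "rotation at the front end" amounts to (mine, with
    made-up names, and only an approximation of the IA-64 register file):
    the renamer maps an architectural register through the rotating base
    before the usual flat lookup, so everything downstream sees flat
    names.

    /* map an IA-64-style architectural register to a flat name;
       rrb is the rotating-register base, rot_size the rotating region */
    unsigned flat_name(unsigned areg, unsigned rrb, unsigned rot_size)
    {
        if (areg < 32)                              /* static registers */
            return areg;
        return 32 + (areg - 32 + rrb) % rot_size;   /* rotated region */
    }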

    I have been working with m68k code recently. The issues with it become
    apparent when looking at the output of compiled code. A lot of
    memory-to-memory moves. I see that it has great code density, but I wonder how
    that correlates to performance, given all the memory ops. A RISC
    architecture may have 30% worse code density, but it might run 5x as fast.

    If you compare a MIPS R3000 to a 68020, possibly yes. If you compare
    an Onyx or M4 to a Zen5 or Lion Cove (CISC stand-ins for the 68K,
    which does not have a modern implementation), the code density is
    similar, and the performance, too.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Anton Ertl on Sat Jun 14 09:24:02 2025
    On 6/14/2025 8:40 AM, Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why?

    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks

    # Now the question was: How much capacity was to be made directly
    # addressable, and how much addressable only via base registers? Some
    # early uses of base register techniques had been fairly unsuccessful,
    # principally because of awkward transitions between direct and
    # base addressing. It was decided to commit the system completely
    # to a base-register technique; the direct part of the address,
    # the displacement, was made so small (12 bits, or 4096 characters)
    # that direct addressing is a practical programming technique only
    # on very small models.

    Up to that point I was thinking of MIPS, Alpha, and RISC-V with their reg+const addressing, and I thought: Ok, these machines actually
    support absolute (aka direct) addressing by using the zero register as
    reg. But of course nobody ever uses the zero register for addressing,
    and absolute addressing is not used.

    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register. The
    next thing was an assembler directive "Using" with the register number
    as an argument. This told the assembler to use that register, which now contained the address of the first "useful" instruction as the base
    register in future instructions. This allowed the OS to load the
    program to any address in real memory, thus to have more than one
    program resident in real memory at the same time and the CPU could
    switch among them. By the time virtual memory came along with the S/370
    (and OK, the 360/67) this was, of course, no longer needed, but it was
    kept for upward compatibility.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Sat Jun 14 16:49:14 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 8:40 AM, Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why?

    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks

    # Now the question was: How much capacity was to be made directly
    # addressable, and how much addressable only via base registers? Some
    # early uses of base register techniques had been fairly unsuccessful,
    # principally because of awkward transitions between direct and
    # base addressing. It was decided to commit the system completely
    # to a base-register technique; the direct part of the address,
    # the displacement, was made so small (12 bits, or 4096 characters)
    # that direct addressing is a practical programming technique only
    # on very small models.

    Up to that point I was thinking of MIPS, Alpha, and RISC-V with their
    reg+const addressing, and I thought: Ok, these machines actually
    support absolute (aka direct) addressing by using the zero register as
    reg. But of course nobody ever uses the zero register for addressing,
    and absolute addressing is not used.

    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register. The
    next thing was an assembler directive "Using" with the register number
    as an argument. This told the assembler to use that register, which now contained the address of the first "useful" instruction as the base
    register in future instructions.

    And woe betide the programmer who got this wrong: the assembler
    would then generate wrong offsets.

    From the S/360 assembler manual:

    The USING instruction indicates that one or more general
    registers are available for use as base registers. This
    instruction also states the base address values that the
    assembler may assume will be in the registers at object time.
    Note that a USING instruction does not load the registers
    specified. It is the programmer's responsibility to see that
    the specified base address values are placed into the
    registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Jun 14 16:56:24 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    That they got wrong, egregiously so, as the example with passing
    a pointer to something from a COMMON block shows.

    I missed or did not understand that example. What's the issue?

    <102hnqs$3hv4m$3@dont-email.me>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Sat Jun 14 16:50:09 2025
    On Sat, 14 Jun 2025 16:12:22 +0000, quadibloc wrote:

    On Sat, 14 Jun 2025 16:04:42 +0000, quadibloc wrote:

    so let's go to "Annie Get Your Gun" for the other song... "Anything You
    Can Do".

    Although, in my case, it's more like anything almost any other computer
    can do, Concertina II can do _almost_ as well, rather than better. Its
    level of versatility means that it loses a little in code density.

    So an all-out implementation would presumably have a lot of cache in
    addition to a lot of pins, to support a wide data path. Unfortunately,
    while chips can put their floating-point ALU to sleep during integer
    code, there's probably no practical way to put OoO circuitry to sleep
    during VLIW code, because it's too intimately tied into everything - but maybe one could have two control units sharing the same ALUs so that
    this could be managed.

    Special Note:: When a vVM loop is running, FETCH and DECODE are
    quiescent. The loaded Reservation Station fires off the instructions
    multiple times at possibly multiple lanes of width. FETCH-DECODE
    remains primed with the instructions that follow the loop, and is
    re-enabled when the loop terminates.

    So, while you are unlikely to de-power the integer section, you can
    depower FETCH-DECODE and save a bunch of power (~1/3rd).

    But then, today's microprocessors have thousands of pins, and yet they
    don't have enormously wide data paths. Apparently their control
    interfaces had to get way more complex than, say, what worked back in
    the Socket 7 days.

    GPUs have ~1024 "pins" in and another 1024 "pins" out PER shader core.
    If GPUs can afford this pin count, so can CPUs.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Sat Jun 14 17:00:08 2025
    On Sat, 14 Jun 2025 6:10:56 +0000, Stephen Fuld wrote:

    On 6/13/2025 3:01 PM, MitchAlsup1 wrote:
    On Fri, 13 Jun 2025 20:50:24 +0000, Stephen Fuld wrote:
    --------------------

    The VAX was in a different situation. Being a virtual memory design,
    they didn't need it for the reason that the S/360 did. I am not an
    expert, but ISTM that the VAX designers wanted to include almost
    anything in the ISA to close the "semantic gap", and certainly didn't
    feel constrained to keep instructions within 32 bits, so adding the 3
    input address calculation, with potentially large offsets seemed
    reasonable to them. For various reasons, this all proved not to be a
    good choice eventually.

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now
    while this makes the puny 16-register file seem larger,
    what VAX designers forgot is that each address mode was
    an instruction in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    As for the X86, I freely confess to not knowing the constraints its
    designers were operating under, so I can't really comment.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sat Jun 14 17:02:23 2025
    On Sat, 14 Jun 2025 9:26:04 +0000, Anton Ertl wrote:


    There is no X86.

    Certainly not with a capital X

    But they sell 100 M with a small x86 every year.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Sat Jun 14 16:39:08 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 6/14/2025 8:40 AM, Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks
    ...
    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register.

    That's not loading once and leaving it alone, but yes, that can work,
    too, as shown in modern dynamic-linking ABIs.

    The
    next thing was an assembler directive "Using" with the register number
    as an argument. This told the assembler to use that register, which now
    contained the address of the first "useful" instruction as the base
    register in future instructions. This allowed the OS to load the
    program to any address in real memory, thus to have more than one
    program resident in real memory at the same time and the CPU could
    switch among them. By the time virtual memory came along with the S/370
    (and OK, the 360/67) this was, of course no longer needed, but it was
    kept for upward compatibility.

    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked
    libraries were compiled for specific addresses, which required quite a
    bit of coordination, so they got rid of that (IIRC in the transition
    to libc5) and the libraries had to be position-independent, but the
    binaries still were fixed-address. Finally, one wanted to make life
    harder for attackers with address-space layout randomization (ASLR), so
    everything should become position-independent, and different pieces
    would start at random offsets relative to other pieces as well.

    So all these techniques got a new life, with ABIs on MIPS and Alpha
    where a global pointer is loaded from the link register at the start
    of a function and after each "far" call (at the very least for calls
    to a different dynamically linked library). The MIPS instruction set
    certainly was not designed for that kind of environment, and Alpha has
    no new features in that respect AFAICT, but on both such ABIs could be implemented; it took some additional instructions.

    AMD64 and ARM A64 were designed when these requirements were already
    there, and they added PC-relative addressing, which results in reduced instruction counts (but, as mentioned elsewhere, possibly increased
    hardware implementation headaches on modern cores).

    Anyway, even in modern paged virtual-memory architectures, we
    recreated the need for having several program pieces, each with position-independent code, with calls between them. And the ISAs look
    much more like S/360 than like ISAs with "hidden" base registers.
    IA-32 has the segment registers with could serve as semi-hidden base
    registers, but AMD64 de-emphasized segment registers; AFAIK they are
    only used for thread-local data these days, if at all.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Sat Jun 14 17:30:36 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 8:35:31 +0000, Robert Finch wrote:

    Packing and unpacking decimal floats can be done inexpensively and fast
    relative to the size and speed of the decimal float operations. For my own
    implementation I just unpack and repack for all ops and then registers
    do not need any more than 128 bits.

    I also unpack the hidden first bit on IEEE-754 floats.

    Do you have 65-bit registers, then?
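
    For reference, unpacking the hidden bit of a binary64 looks like this
    (a C sketch of the generic technique, not Robert's implementation);
    the explicit fraction becomes 53 bits, hence the question: sign plus
    11-bit exponent plus 53-bit fraction is 65 bits.

    #include <stdint.h>
    #include <string.h>

    struct unpacked { unsigned sign; unsigned exp; uint64_t frac; };

    static struct unpacked unpack(double d)
    {
        uint64_t bits;
        struct unpacked u;
        memcpy(&bits, &d, sizeof bits);   /* bit-copy, no aliasing UB */
        u.sign = (unsigned)(bits >> 63);
        u.exp  = (bits >> 52) & 0x7FF;
        u.frac = bits & ((1ULL << 52) - 1);
        if (u.exp != 0)                   /* normal numbers: */
            u.frac |= 1ULL << 52;         /* make the hidden 1 explicit */
        return u;
    }

    int main(void)
    {
        struct unpacked u = unpack(1.0);
        /* 1.0 = +1.0 * 2^0: biased exp 1023, hidden bit now explicit */
        return !(u.sign == 0 && u.exp == 1023 && u.frac == (1ULL << 52));
    }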

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Stephen Fuld on Sat Jun 14 13:56:02 2025
    Stephen Fuld wrote:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    If the program was relocated after the call to BAR but before using
    the reference to access argument A then it reads the wrong location.
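
    A small C analog of the FORTRAN above (names and types are my own
    illustration) makes that hazard concrete: FOO hands BAR a flat
    pointer into the COMMON block, and that pointer silently bakes in
    the load address that was current when it was taken.

        #include <stdio.h>

        /* COMMON /COM/ A,B rendered as a global block */
        struct { float a, b; } com = { 0.0f, 2.0f };

        static void bar(float *a)       /* SUBROUTINE BAR(A) */
        {
            *a = *a + 1.0f;             /* A = A + 1 */
        }

        static void foo(void)           /* CALL BAR(B) */
        {
            /* a flat pointer; were the program moved in memory between
               here and the store in bar(), the store would hit the old
               location */
            bar(&com.b);
        }

        int main(void)
        {
            foo();
            printf("%f\n", com.b);      /* prints 3.000000 */
            return 0;
        }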

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Sat Jun 14 18:51:44 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    Guaranteed, with a 12-bit offset?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Jun 14 18:39:42 2025
    On Sat, 14 Jun 2025 17:30:00 +0000, Thomas Koenig wrote:

    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
    On 6/13/2025 4:52 AM, quadibloc wrote:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    No. If you wanted to address larger than the displacement field, you
    still had the index register. And remember that the need for that is
    reduced because you could have a 16 bit displacement by using the four
    bits free'd up by eliminating the base register field.

    Certainly, you can use the index register to address an area larger than
    the displacement field. Otherwise, RISC CPUs wouldn't work. However,
    then if you want to do an array access in that wider range, once more
    you need extra instructions to calculate the index value.

    There is _nothing_ wrong with having base + (scaled) index register
    instructions. That is just 15 bits for three registers (assuming
    32 GPRs), which leaves ample space for opcodes, scaling and
    maybe, if you feel so inclined, a small offset.

    Clarifying:: 15 bits is 3×5 bits of Rdest, Rbase, and Rindex.
    and 2-bits of <constant> scale.

    There is _everything_ wrong with mandating a base register for
    every load/store operation, and trying to cram in a large offset
    as well.

    My 66000 mandates a base (i.e., pointing) register for each access.
    However, R0 is a proxy for IP, so while R0 cannot be a base (point)
    register, there is a point register nonetheless.

    Note:: address constants are optional when scaled index is in play.

    If you want to step through arrays, you can also use something
    like POWER's "load or store with update". ldu puts the effective
    address of the memory instruction into the address register,
    so you can use that with arbitrary step sizes.

    You CAN do this, but if your memory references have scaled indexing
    you generally SAVE looping (induction) ADD instructions.

    for( i = 0; i < max; i++ )
    doubleword[i] = word[i]+halfword[i];

    With scaled indexing::

    MOV Ri,#0
    top:
    LDH R3,[IP,Ri<<1,halfword-.]
    LDW R4,[IP,Ri<<2,word-.]
    ADD R5,R3,R4
    STD R5,[IP,Ri<<3,doubleword-.]
    ADD Ri,Ri,#1
    CMP Rt,Ri,Rmax
    BLT Rt,top

    without scaled indexing::

    MOV Ri,#0
    MOV Rt,Ri
    MOV Rs,Ri
    MOV Rq,Ri
    SL Rmax,Rmax,#3 // could be #1,#2,#3
    // depending on R{t,s,q}
    top:
    LDH R3,[IP,Rt,halfword-.]
    LDW R4,[IP,Rs,word-.]
    ADD R5,R3,R4
    STD R5,[IP,Rq,doubleword-.]
    ADD Rt,Rt,#2
    ADD Rs,Rs,#4
    ADD Rq,Rq,#8
    CMP Rw,Rq,Rmax // Rq is compared to Rmax<<3
    BLT Rw,top

    4 more setup instructions, 2 more loop ADD instructions, and 4 more
    registers in use--and that is without getting LUI or AUIPC for the
    address constants.
    And for the coup de grâce:

    With VVM::

    MOV Ri,#0
    VEC R6,{}
    top:
    LDH R3,[IP,Ri<<1,halfword-.]
    LDW R4,[IP,Ri<<2,word-.]
    ADD R5,R3,R4
    STD R5,[IP,Ri<<3,doubleword-.]
    LOOP LT,Ri,#1,Rmax

    One can perform ADD-CMP-BC in 2-3 gate delays longer than the ADD instruction--since a 3-input add is a 3-2 compressor (1 gate) longer
    than a 2-input ADD--the first ADD is real, the second ADD is a
    subtract of comparand--and you are looking (mainly) at carry out
    and sign bit to determine whether the loop continues or terminates.

    VEC {} is telling the HW that none of the loop registers is "live"
    out of the loop; so, in this case, R[3..5] need not be written
    in the loop or exiting the loop. SW programmed write-elision !!

    So, we have a 9 instruction loop being compared to a 5 instruction
    loop which does the same amount of work, but writes fewer registers.
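
    In C terms (a sketch of my own, with element types picked for
    illustration), the two code shapes above are one scaled induction
    variable versus three strength-reduced byte offsets:

        #include <stdint.h>
        #include <stddef.h>

        /* with scaled indexing: one induction variable, scaled inside
           the address mode (i<<1, i<<2, i<<3) */
        void sum_scaled(int64_t *dw, const int32_t *w,
                        const int16_t *hw, size_t max)
        {
            for (size_t i = 0; i < max; i++)
                dw[i] = (int64_t)w[i] + hw[i];
        }

        /* without scaled indexing: three separately stepped byte
           offsets, mirroring the extra ADDs in the second listing */
        void sum_unscaled(int64_t *dw, const int32_t *w,
                          const int16_t *hw, size_t max)
        {
            size_t t = 0, s = 0, q = 0, end = max << 3;
            while (q < end) {
                *(int64_t *)((char *)dw + q) =
                      (int64_t)*(const int32_t *)((const char *)w + s)
                    + *(const int16_t *)((const char *)hw + t);
                t += 2;         /* halfword step   */
                s += 4;         /* word step       */
                q += 8;         /* doubleword step */
            }
        }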

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to EricP on Sat Jun 14 12:23:59 2025
    On 6/14/2025 10:56 AM, EricP wrote:
    Stephen Fuld wrote:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

           SUBROUTINE FOO
           REAL A,B
           COMMON /COM/ A,B
           REAL C
           CALL BAR(B)
    C ....
           END

           SUBROUTINE BAR(A)
           REAL A
           A = A + 1
           END

    What should FOO pass to BAR?  A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why.  When the program is linked, the
    COMMON block is at some fixed displacement from the start of the
    program.  So the program can "compute" the real address of the data in
    common blocks from the address in its base register.

    If the program was relocated after the call to BAR but before using
    the reference to access argument A then it reads the wrong location.

    That is precisely my point. The mechanism that IBM chose effectively *prevents* program relocation. That is why I believe it was a mistake
    to choose that mechanism.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Sat Jun 14 20:06:11 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 11:51 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON
    block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    Guaranteed, with a 12-bit offset?

    First let me say that I may have misinterpreted your recent comments.
    The visible base register mechanism IBM chose prevents any relocation of
    the program once it is first loaded.

    Then we're in agreement. Good :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Sat Jun 14 12:33:21 2025
    On 6/14/2025 11:51 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON
    block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    Guaranteed, with a 12-bit offset?

    First let me say that I may have misinterpreted your recent comments.
    The visible base register mechanism IBM chose prevents any relocation of
    the program once it is first loaded.

    As for the common area, the program can compute, using normal 32 bit
    arithmetic registers, the starting address of the common block. It
    knows the starting address of where the program was loaded from the
    BALR instruction executed as the first instruction, to which it can add
    the offset of the common block from the program start as given by the
    linker. If it then puts that address in a register, subsequent
    references to at least the first 4K bytes of the block can be referenced
    using that register as the base. Blocks larger than 4K require either
    saving then changing the base register contents, or using another base register.
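
    As a toy model of that reach limit (plain C of my own, not S/360
    code): a 12-bit displacement covers 4 KiB past whatever a base
    register holds, so a field further into the block needs the base
    re-pointed first.

        #include <stdint.h>

        #define DISP_MAX 4096u          /* 12-bit displacement */

        /* effective address = base register + 12-bit displacement */
        static uint32_t ea(uint32_t base_reg, uint32_t disp)
        {
            return base_reg + (disp & (DISP_MAX - 1));
        }

        /* BALR leaves the program's load address in a register; the
           linker supplied COMMON's offset from the program start */
        static uint32_t common_base(uint32_t program_base,
                                    uint32_t common_offset)
        {
            return program_base + common_offset;  /* 32-bit arithmetic */
        }

        /* a field more than 4 KiB into the block: re-point a base
           register at the enclosing 4 KiB chunk, then address it */
        static uint32_t field_addr(uint32_t common_base_reg, uint32_t off)
        {
            uint32_t base = common_base_reg + (off & ~(DISP_MAX - 1));
            return ea(base, off & (DISP_MAX - 1));
        }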



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Sat Jun 14 14:37:39 2025
    On 6/14/2025 2:26 PM, quadibloc wrote:
    On Sat, 14 Jun 2025 19:23:59 +0000, Stephen Fuld wrote:

    That is precisely my point.  The mechanism that IBM chose effectively
    *prevents* program relocation.  That is why I believe it was a mistake
    to choose that mechanism.

    It prevents relocation of programs currently in use that are already in memory.

    Yes.

    It facilitates loading programs from object files on disk into any
    desired part of memory, which is the usual meaning of "program
    relocation" among System/360 programmers, perhaps because they had no
    other type of it available.

    OK. But note that they did have the concept of unloading an active
    program from memory, called Rollout/Rollin, but you had to roll the
    program back in to the same memory location that it was rolled out from.


    Implementing the 360 architecture with the addition of a base and bounds mechanism instead of full-blown virtual memory was perfectly possible. However, the System/360 was originally conceived as a computer for use
    in batch processing.

    But batch processing didn't mean only one program at a time loaded into
    memory. If only one program was loaded in memory at a time, you
    wouldn't need any base registers, as all addresses could be zero
    relative. Even IBM's two major operating systems, DOS and OS, supported
    multi-programming.


    Hence, TSS/360 was a kludge and ran slowly, and it
    took the 360/67 with special hardware to facilitate timesharing for IBM
    to have something that addressed that function effectively.

    True, but irrelevant.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Jun 15 01:07:27 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 6/14/2025 8:40 AM, Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks
    ...
    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register.

    That's not loading once and leaving it alone, but yes, that can work,
    too, as shown in modern dynamic-linking ABIs.

    The
    next thing was an assembler directive "Using" with the register number
    as an argument. This told the assembler to use that register, which now
    contained the address of the first "useful" instruction, as the base
    register in future instructions. This allowed the OS to load the
    program to any address in real memory, thus to have more than one
    program resident in real memory at the same time and the CPU could
    switch among them. By the time virtual memory came along with the S/370
    (and OK, the 360/67) this was, of course, no longer needed, but it was
    kept for upward compatibility.

    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address. Care was required to ensure that libraries which
    were used in the same application were statically linked at unique
    and non-overlapping addresses. Difficult when you only had half
    the 386 linear address space available.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Sat Jun 14 20:31:35 2025
    On 6/13/2025 11:42 AM, MitchAlsup1 wrote:
    On Fri, 13 Jun 2025 18:11:18 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures.  Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and follow-ons are another.  There may be others, perhaps
    some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space.

    A base register not part of GPRs is a descriptor (or segment).
    And we don't want to go there.

    The other machines made it invisible from user space
    and added its contents to every memory access.  This does not take
    up opcode space and allows swapping in and out of the whole process,

    It also fails when one has 137 different things to track with those descriptors.

    Fails isn't the correct word, but more awkward certainly is. I can't
    speak to the Burroughs machines (I am sure Scott can), but on the Univac
    1100 series it was a single instruction to change the base register to
    any other entry from a table of them that was set up at link time. The
    table could contain a lot of entries (it varied over time), but
    certainly many more than 137 (could be thousands).


    which was a much better solution for the early 1960s. (Actually, I
    believe the UNIVAC had two, one for program and one for data).

    Still insufficient for modern use cases.

    See above. The current (now emulated) systems can have up to 16
    descriptors currently active, with thousands an instruction or two away.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Anton Ertl on Sat Jun 14 20:51:34 2025
    On 6/14/2025 2:26 AM, Anton Ertl wrote:

    snip

    In any case, it's no problem to add a virtual-memory mechanism that is
    not visible to user-level, or maybe even kernel-level (does the
    original S/360 have that?) programs, whether it's paged virtual memory
    or a simple base+range mechanism.

    Virtual memory is no problem for you to say now, but this was the early
    1960s and no, S/360 didn't have anything like that. The CPU was
    implemented in

    https://en.wikipedia.org/wiki/Solid_Logic_Technology

    and the memory was real cores. Paging would have just been too much to
    ask at the time. One of the big changes from S/360 to the S/370 was the addition of paged virtual memory.

    As for the base and range mechanism, that is what much of this
    discussion is about. S/360 used arbitrary GPRs for the base registers,
    which prevented programs from being moved once they were initially
    loaded, whereas other contemporaneous systems used hidden base registers visible only to the OS. That is precisely what I regard, and have
    stated before, as a mistake.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Sat Jun 14 20:34:37 2025
    On 6/14/2025 2:49 PM, quadibloc wrote:
    On Sat, 14 Jun 2025 21:26:10 +0000, quadibloc wrote:

    On Sat, 14 Jun 2025 19:23:59 +0000, Stephen Fuld wrote:

    That is precisely my point.  The mechanism that IBM chose effectively
    *prevents* program relocation.  That is why I believe it was a mistake
    to choose that mechanism.

    It prevents relocation of programs currently in use that are already in
    memory.

    Actually, to be more precise, it prevents doing this _in a manner that
    is fully transparent to the programmer_.

    So IBM could have created a time-sharing operating system that ran on
    models of the System/360 other than the model 67 with its Dynamic
    Address Translation hardware as follows:

    Require that programs only use one set of static base registers for
    their entire run;

    Require that programs describe the base registers they use in a standard header;

    Require that programs set a flag when they have finished initializing
    those base registers (and do so very quickly after being started).

    If those conditions are met, then a program in memory can indeed be
    moved to somewhere else in memory, as the operating system will know
    which base registers to adjust.

    Well, sort of. Such programs would not be able to use flat addresses to
    pass pointers between routines, because they would not be valid between relocations. A workaround for this issue may be possible, requiring
    changes to calling conventions; for example, all routines in a program
    might need to share a common area for data values, and always use the
    same base register to point to it.

    So you would have special time-sharing versions of all the compilers.

    And this is better than what I proposed, and what other vendors did??????


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Sun Jun 15 07:37:44 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sun, 15 Jun 2025 4:04:04 +0000, quadibloc wrote:

    I have found out that I was mistaken in my earlier posting.

    TSS/360 may have been a slow, inefficient, and poorly received
    time-sharing operating system for the System/360 by IBM.

    However, it only ran on the System/360 Model 67, and so it did *not*
    attempt the kind of kludge I described as a desperate way of working
    without the availability of address translation. Its poor performance
    must have been the result of other causes.

    IBM also had something called TSO, for Time-Sharing Option, and that did
    run on System/360 models other than the Model 67, and so IBM may
    actually have used the kind of kludge I had described after all.

    IIRC, TSO came later.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Sun Jun 15 08:01:32 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.

    Surely that is an unfair characterization.

    After all, as Ivan Godard has reminded us on several occasions, out of
    order execution has a very large cost in transistors. So, while it is a
    way of achieving high performance, it comes at a cost both in die size
    and in power consumption.

    If the same benefits could be obtained through VLIW techniques without
    those costs - but with an overhead cost of extra bits in the
    instructions - that would be a very promising technology. So their
    problem wasn't that they forgot what they knew about OoO, but rather
    perhaps that their knowledge of the limitations of VLIW was
    insufficient.

    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Also, for development projects, there are always differences in
    opinion if choosing path A or B is the right way, because both will
    have advantages and disadvantages, and people will have different
    opinions of what is likely to succeed. Even after termination of
    a project, you will in all likelihood find people who say "But it
    could have succeeded, we should have tried this or that".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sun Jun 15 07:10:39 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic >>linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address.

    I guess you mean dynamically linked libraries. My contact with SVR3
    has not been intimate enough to experience that. I guess that DG/UX,
    which I worked with in 1990 and 1991 was a SVR3 derivative, but I
    never noticed that the libraries were dynamically linked (were they?).

    Anyway, it also happened in Linux.

    Care was required to ensure that libraries which
    were used in the same application were statically link at unique
    and non-overlapping addresses. Difficult when you only had half
    the 386 linear address space available.

    Address space was not the problem. HDD sizes in the early 1990s were
    well below 1GB, so all the libraries plus all the executables
    installed on one system (or available in one Linux distribution) could
    easily fit in that address space with ample address space left for
    data. The problem was that it required a lot of coordination, at
    least in the way that was used on Linux, don't know about SVR3.

    Every library binary was linked for a specific address, so those
    producing the library binaries had to coordinate which addresses they
    could use. When a library grew to need more space than allocated for
    it, my guess is that this resulted in work for a lot of people; when a
    new library was to be added, its maintainer needed to ask the
    coordinator for address space. That approach did not scale, so they
    switched from a.out to ELF and either position-independent code or
    relocation at library-loading time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Sun Jun 15 07:00:56 2025
    On 6/15/2025 12:37 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sun, 15 Jun 2025 4:04:04 +0000, quadibloc wrote:

    I have found out that I was mistaken in my earlier posting.

    TSS/360 may have been a slow, inefficient, and poorly received
    time-sharing operating system for the System/360 by IBM.

    However, it only ran on the System/360 Model 67, and so it did *not*
    attempt the kind of kludge I described as a desperate way of working
    without the availability of address translation. Its poor performance
    must have been the result of other causes.

    IBM also had something called TSO, for Time-Sharing Option, and that did
    run on System/360 models other than the Model 67, and so IBM may
    actually have used the kind of kludge I had described after all.

    IIRC, TSO came later.

    I used TSO on an S/360 model 65 in 1972. It was a dog. In contrast to
    TSO on the later virtual systems, it was a separate batch program that
    ran under the OS. That program controlled the terminals and swapped
    user programs to/from its own memory. So it added another layer of "OS
    like" functionality and the resulting overhead. IIRC the site specified
    the size and number of user areas within TSO, and users competed for one
    of those. It may be (I don't remember) that once a user program was
    assigned to a user slot, if it got swapped out (by the TSO program), it
    had to be swapped back into the same user area. This eliminated the
    relocation problem we have been discussing. It was very slow with even
    a few users.

    Later, when virtual memory came along with the S/370s, they kept the
    same name for a totally different implementation, one totally integrated
    into the OS.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to quadibloc on Sun Jun 15 09:48:24 2025
    quadibloc wrote:
    On Sat, 14 Jun 2025 21:26:10 +0000, quadibloc wrote:

    On Sat, 14 Jun 2025 19:23:59 +0000, Stephen Fuld wrote:

    That is precisely my point. The mechanism that IBM chose effectively
    *prevents* program relocation. That is why I believe it was a mistake
    to choose that mechanism.

    It prevents relocation of programs currently in use that are already in
    memory.

    Actually, to be more precise, it prevents doing this _in a manner that
    is fully transparent to the programmer_.

    So IBM could have created a time-sharing operating system that ran on
    models of the System/360 other than the model 67 with its Dynamic
    Address Translation hardware as follows:

    Require that programs only use one set of static base registers for
    their entire run;

    Require that programs describe the base registers they use in a standard header;

    Require that programs set a flag when they have finished initializing
    those base registers (and do so very quickly after being started).

    If those conditions are met, then a program in memory can indeed be
    moved to somewhere else in memory, as the operating system will know
    which base registers to adjust.

    Well, sort of. Such programs would not be able to use flat addresses to
    pass pointers between routines, because they would not be valid between relocations. A workaround for this issue may be possible, requiring
    changes to calling conventions; for example, all routines in a program
    might need to share a common area for data values, and always use the
    same base register to point to it.

    So you would have special time-sharing versions of all the compilers.

    John Savard

    Won't work.
    The 360 program counter contains an absolute physical address that
    already includes the program base offset. Also any BAL link registers
    (there could be many, different for each subroutine) plus any spilled
    link registers (which may be saved at static addresses,
    or could be saved at runtime allocated locations).
    To relocate, all those links would have to be found and patched.

    To be easily relocatable there must be a clear distinction between
    program's logical addresses and physical ones, and physical addresses
    are only generated during the actual memory access (as MMU's do).
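
    A toy model of that distinction (plain C, everything invented for
    illustration): the program only ever holds small logical offsets, a
    hidden base is applied at each access, and "relocation" is just a
    change of that base.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        typedef struct { uint8_t *base; size_t bound; } Mapping;

        /* the physical address is formed only at the access itself */
        static uint8_t *translate(Mapping *m, size_t logical)
        {
            if (logical >= m->bound) { fprintf(stderr, "fault\n"); exit(1); }
            return m->base + logical;
        }

        int main(void)
        {
            Mapping m = { malloc(4096), 4096 };
            *translate(&m, 100) = 7;

            /* "relocate": move the region and change only the hidden
               base; logical address 100 is still valid */
            uint8_t *newmem = malloc(4096);
            memcpy(newmem, m.base, 4096);
            free(m.base);
            m.base = newmem;

            printf("%d\n", *translate(&m, 100));   /* still 7 */
            return 0;
        }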

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sun Jun 15 13:51:32 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically. In particular, for
    high-budget decisions that also involve many other groups, in
    particular marketing (which did a perfect job of hyping IA-64), but
    also chipset and board groups, and also involve decisions not to
    pursue other projects (such as canceling the other P7 project, and not
    to pursue an AMD64-like architecture inside Intel), and also involve high-budget decisions from other corporations, that's something that
    is decided at the top of the hierarchy.

    I have read about meetings about IA-64 where top management and people
    from different groups were present. Accoring to what I read (from one
    of the IA-32 implementors), the IA-64 people showed hand-optimized
    assembly code for some inner loops, and the account gave the
    impression that that's the only performance results that Intel
    management used to decide for IA-64 and against extending IA-32 to 64
    bits (and other IA-32 projects, such as the original P7).

    Also, for development projects, there are always differences in
    opinion if choosing path A or B is the right way, because both will
    have advantages and disadvantages, and people will have different
    opinions of what is likely to succeed. Even after termination of
    a project, you will in all likelihood find people who say "But it
    could have succeeded, we should have tried this or that".

    Certainly. Even for the EPIC ideas which in the case of IA-64
    certainly did not fail for lack of funding or marketing, some people
    just fail to accept that they just produce worse performance for many
    programs than OoO.

    For projects that were killed at an earlier stage rather than pushed
    through and failed in the marketplace, there are many more "but what
    if"s. Also for projects that are pushed through and fail in the
    marketplace, but look technically superior (e.g., Alpha).

    But for IA-64, the verdict is pretty clear: For the kind of market it
    targets, OoO is superior in performance. And this means that Intel
    could not pull people over from IA-32 with better performance, so the
    market disadvantage of introducing a new software ecosystem also came
    in full force.

    Could Intel and HP management have known this at the time? They
    certainly knew about the difficulties of introducing a new software
    ecosystem, especially Intel, which had enjoyed the benefits of
    compatibility for so many years.

    Could they have known about OoO performance benefits? I think that
    with their in-house OoO projects, they could, pretty early, and
    certainly when the Pentium Pro was released; it showed that OoO
    does not slow down the clock, on the contrary (1.5 times faster clock
    than the P54C at the same time); it also showed that IA-32 can compete
    with RISCs. As for the EPIC side, if they really only showed
    hand-optimized kernels, the management should have been pretty
    sceptical. If they showed something more real-worldish, the results
    should also have made the management sceptical.

    HP certainly hedged their bets and did not cancel Onyx (PA-8000) or
    disable the 64-bitness of Onyx, but then that was pretty far along at
    the time, and the competitive pressure to have a 64-bit architecture
    was too large in Unix space to disable the 64-bitness. But they also
    followed it with a long sequence of PA-8x00 machines until PA-8800/8900 with
    up to 1/1.1GHz in 2004/2005 (up from the PA-8000 with up to 180MHz).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun Jun 15 15:07:38 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Jun 15 16:01:41 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address.

    I guess you mean dynamically linked libraries.

    No, I meant statically linked libraries. Shared objects were
    introduced in SVR4 with the SUN Solaris collaboration. I may have
    misunderstood your statement about linux vis-a-vis static shared libraries.


    Anyway, it also happened in Linux.

    Care was required to ensure that libraries which
    were used in the same application were statically link at unique
    and non-overlapping addresses. Difficult when you only had half
    the 386 linear address space available.

    Address space was not the problem. HDD sizes in the early 1990s were
    well below 1GB, so all the libraries plus all the executables
    installed on one system (or available in one Linux distribution) could
    easily fit in that address space with ample address space left for
    data. The problem was that it required a lot of coordination, at
    least in the way that was used on Linux, don't know about SVR3.

    Yes, it took a lot of coordination. For some applications (e.g.
    X11), all the X11 libraries would be combined into a single static
    library rather than trying to independently load them.


    Every library binary was linked for a specific address, so those
    producing the library binaries had to coordinate which addresses they
    could use.

    Yes, this was the same problem in SVR3.2. SVR4 showed up around
    1990 with shared objects and static libraries went the way of
    the Dodo bird.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Sun Jun 15 15:55:24 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 6/13/2025 11:42 AM, MitchAlsup1 wrote:
    On Fri, 13 Jun 2025 18:11:18 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures.  Scott gave one example, the Burroughs Medium systems. >>>> The Univac 1108 and followons is another.  There may be others, perhaps >>>> the some of CDC systems as they and the Univac systems shared a common >>>> designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space.

    A base register not part of GPRs is a descriptor (or segment).
    And we don't want to go there.

    The other machines made it invisible from user space
    and added its contents to every memory access.  This does not take
    up opcode space and allows swapping in and out of the whole process,

    It also fails when one has 137 different things to track with those
    descriptors.

    Fails isn't the correct word, but more awkward certainly is. I can't
    speak to the Burroughs machines (I am sure Scott can), but on the Univac
    1100 series it was a single instruction to change the base register to
    any other entry from a table of them that was set up at link time. The
    table could contain a lot of entries (it varied over time), but
    certainly many more than 137 (could be thousands).

    The Burroughs medium systems supported 8 active base registers
    backed by a set of in-memory translation tables. A task
    (thread in modern parlance) could have up to a million environments,
    each of which could have from two to eight memory areas[**]. When
    the MCP dispatched the task, it used a special instruction called
    'branch reinstate virtual (BRV)' which would load the eight base registers
    from a selected environment (the hardware task table entry for
    the task had a field which described the current environment for
    the task). The virtual enter (VEN) instruction would call
    a function/subroutine in either the same environment (within
    the code segment) or in a new environment (after loading the
    appropriate set of base registers[*]). Negative environment
    numbers (indicated by a 0b1101 in the most significant digit)
    were reserved for the MCP. The HCL (Hypercall) instruction
    was used to request service from the MCP.

    [*] and rolling the segment in from disk if it had been
    rolled out.

    [**] up to 500 kilobytes each (1 million digits). This was
    to support legacy B3500/B4700/B4800 compatibility.

    The original B3500 had one base register, loaded by the
    MCP when dispatching a task. This limited the application
    to 500Kbytes in total memory size (which at the time was
    plenty, but by the late 70's it prompted the design of an
    architecture that allowed more addressability while maintaining
    binary compatibility with existing applications).

    I'm not as familiar with the Burroughs large systems, but as
    it was a stack-based machine (48-bit) using protected descriptors
    for all data items with the descriptors stored on the stack
    (in specially tagged words), there was no real concept of a
    base register, other than the root of the current stack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Sun Jun 15 16:05:06 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    Define large.

    When Burroughs bought Sperry in 1986, Unisys had 120,000 employees.

    (a decade later, that was down to 20,000, three decades after
    that, it's up to 22,000).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sun Jun 15 16:53:03 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address.

    I guess you mean dynamically linked libraries.

    No, I meant statically linked libraries.

    Static linking does not require any coordination. Every executable
    gets its own copy of the library parts it uses linked to fit into the executable's address space, i.e., with static linking libraries are
    not shared.

    I may have
    misunderstood your statement about linux vis-a-vis static shared libraries.

    I have never heard "static shared libraries". When I search for
    "static shared libraries" (without quotes) in duckduckgo, the first
    ten links I get treat "static" and "shared" as separate terms and
    usually as opposites. When I search for the term with quotes, it
    redirects me to google, which actually gives me a link:

    https://people.cs.nycu.edu.tw/~shieyuan/course/spb/lectures/sp15.ppt

    and a snippet:

    |With static shared libraries, symbols are still bound to addresses at
    |link time, but library code is not bound to the executable code until
    |run time. [...]

    Certainly in Linux the terminology was that they were all shared
    libraries (which was synonymous to dynamic linking), and that there
    was a transition from a.out to ELF (also often called the transition
    from libc4 to libc5).

    I had contact with various proprietary Unixes (HP/UX, DG/UX, Ultrix,
    and very little with SunOS), but I only remember that Ultrix did not
    support dynamic linking.

    Yes, this was the same problem in SVR3.2. SVR4 showed up around
    1990 with shared objects and static libraries went the way of
    the Dodo bird.

    In Linux, the transition was in 1995. Solaris (Sun's port of SVR4)
    appeared in 1992.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Sun Jun 15 17:59:29 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    Define large.

    Let's say > 10000 employees, for the sake of a definition.

    When Burroughs bought Sperry in 1986, Unisys had 120,000 employees.

    (a decade later, that was down to 20,000, three decades after
    that, it's up to 22,000).

    Both would qualify, I think. Intel had ~ 32500 employees at year's
    end in 1994, so it would also qualify.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sun Jun 15 18:42:42 2025
    On Sun, 15 Jun 2025 18:21:48 +0000, BGB wrote:

    On 6/12/2025 7:00 PM, MitchAlsup1 wrote:
    On Thu, 12 Jun 2025 21:30:39 +0000, BGB wrote:

    On 6/12/2025 2:13 PM, MitchAlsup1 wrote:
    ------------------------------
    The modern interpretation is that the dynamic rounding mode can be set
    prior to any FP instruction. So, you better be able to set it rapidly
    and without pipeline drain, and you need to mark the downstream FP
    instructions as dependent on this.

    Errm, there is likely to be a delay here, otherwise one will get a stale
    rounding mode.

    RM is "just 3-bits" that get read from control register and piped
    through instruction queue to function unit. Think of the problem
    one would have if a hyperthreaded core had to stutter step through
    changing RM ...


    To do it more quickly, one would likely need special logic in the
    pipeline for getting the updated RM to the FPU in a more timely manner.

    Realistically, it (RM) is no different than a condition code;
    except that RM is main effect of an instruction instead of a
    side effect of performing an instruction.

    If done (as-is) in a lax way: Held in the HOBs of GP/GBR or similar,
    which is handled as an SPR that gets broadcast out of the regfile.

    Then one has the latency issue:
    The new value needs to reach the regfile (WB stage);
    The value then needs to make its way to the relevant ID2/RF stage (next
    cycle after WB).

    Once again, this is no different than a condition code.

    A lazy option would be to add an interlock so that any dynamic rounding
    mode instruction would generate pipeline stalls for any in-flight modifications to GBR (as opposed to using a branch or a series of NOPs).
    This was not done in my existing implementation.

    Just track RM as if it were a condition code.

    But, IME, the "fenv.h" stuff, and FENV_ACCESS, is rarely used.
    So, making "fesetround()" or similar faster doesn't seem like a high priority.

    If having "fsetround()" as a function call, can also ensure the needed
    delay as-is by using a non-default register during the return (mostly to hinder the branch predictor).



    So, setting the rounding mode might be something like:
       MOV .L0, R14
       MOVTT GP, 0x8001, GP  //Set to rounding mode 1, clear flag bits
       JMP R14         //use branch to flush pipeline
       .L0:            //updated FPSR now ready
       FADDG R11, R12, R10  //FADD, dynamic mode

    Setting RM to a constant (known) value::

        HRW  rd,RM,#imm3    // rd gets old value


    It is possible,

    Of course it is possible, especially if you make it work like it
    is supposed to work and not as a hard-to-do thing.

    Could almost alias the bits to part of SR, where SR does generally have
    a more timely update process (could reduce latency to 2 cycles).

    I can see constant writes to RM as taking ZERO cycles.

    At present, the RM field is held in GBR(51:48), with fast update options either being a MOVTT (can replace the high 16 bits, *1) or BITMOV,

    It does not matter where it is--what matters is that it can be overwritten
    every execution cycle, and that instructions dependent on its current
    value are properly sequenced. Reservation stations do that for register
    and memory data, why not add RM (and carry) to them ??

    *1: There is a MOVTT Imm5/Imm6 variant, currently can only modify
    (63:60) though.


    Though, this strategy is only directly usable in XG3 (where GBR is
    mapped to R3/X3), N/A in XG1 or XG2, where GBR is in CR space and so
    would require 3 instructions.

    Implicitly, the fragment assumed XG3, but then this leaves open the
    issue of whether to use my former ASM syntax or RISC-V style ASM syntax (BGBCC can sorta accept either, with my newer X3VM experiment defaulting
    to RISC-V syntax).


    Can note that the RISC-V F/D instructions encode a fixed rounding mode
    in the instruction, with one encoding selecting the dynamic rounding mode
    (though, IIRC, no way to update the dynamic RM within the scope of the
    base ISA; so one needs Zicsr or similar to pull it off).

    The IEEE 754-2019 specification causes languages that adopt 754
    semantics to follow 754--which has a rather kludgy means to modify
    RM.
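
    For reference, the C-level face of that kludge is <fenv.h> (standard
    C99; a minimal sketch, with the caveat that compilers honor
    FENV_ACCESS to varying degrees, and glibc wants -lm):

        #include <fenv.h>
        #include <stdio.h>

        #pragma STDC FENV_ACCESS ON

        int main(void)
        {
            volatile float one = 1.0f, three = 3.0f;
            int old = fegetround();      /* save the dynamic RM */
            fesetround(FE_UPWARD);
            float third = one / three;   /* rounded up under FE_UPWARD */
            fesetround(old);             /* restore the dynamic RM */
            printf("%.9f\n", third);
            return 0;
        }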

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to Anton Ertl on Sun Jun 15 19:24:27 2025
    On Sun, 15 Jun 2025 16:53:03 GMT, Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    Yes, this was the same problem in SVR3.2. SVR4 showed up around 1990
    with shared objects and static libraries went the way of the Dodo bird.

    In Linux, the transition was in 1995. Solaris (Sun's port of SVR4)
    appeared in 1992.

    There is a section on the SVR3.2 shared library implementation in this:

    <https://www.mirrorservice.org/sites/www.bitsavers.org/pdf/att/unix/System_V_386_Release_3.2/UNIX_System_V_386_Release_3.2_Programmers_Guide_Vol1_1989.pdf>

    It was using COFF object files.

    I think that a lot of the design of current dynamic shared libraries came
    from pre-SVR4 SunOS using a.out object files.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Mon Jun 16 14:15:07 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address.

    I guess you mean dynamically linked libraries.

    No, I meant statically linked libraries.

    Static linking does not require any coordination. Every executable
    gets its own copy of the library parts it uses linked to fit into the
    executable's address space, i.e., with static linking libraries are
    not shared.

    They were shared between multiple processes on SVR3.2.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Jun 16 12:17:21 2025
    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.
    This is going to hurt register allocation.

    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register specifiers to just 4 bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.
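
    A minimal sketch of the adjustment described above (names and the toy
    adjacency matrix are mine, not from any particular allocator): every
    register a given instruction's shortened specifier cannot encode is
    added as an interference edge for the value it defines, and ordinary
    graph coloring then avoids those registers automatically.

        #include <stdio.h>

        #define NUM_REGS  32
        #define MAX_NODES 64   /* nodes 0..31 = registers, 32.. = values */

        /* Toy interference graph as an adjacency matrix. */
        static unsigned char adj[MAX_NODES][MAX_NODES];

        static void add_edge(int a, int b)
        {
            adj[a][b] = adj[b][a] = 1;
        }

        /* encodable: bit r set iff register r is reachable through this
           instruction's (randomized) 4-bit specifier. */
        static void restrict_def(int value_node, unsigned encodable)
        {
            for (int r = 0; r < NUM_REGS; r++)
                if (!(encodable & (1u << r)))
                    add_edge(value_node, r);  /* value can't live in r */
        }

        int main(void)
        {
            restrict_def(32, 0x0000FFFFu);  /* value 32 limited to r0..r15 */
            printf("value 32 interferes with r16? %d\n", adj[32][16]);
            return 0;
        }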


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Mon Jun 16 23:37:05 2025
    On Mon, 16 Jun 2025 22:14:14 +0000, BGB wrote:

    On 6/12/2025 7:11 PM, quadibloc wrote:
    On Thu, 12 Jun 2025 19:24:36 +0000, MitchAlsup1 wrote:
    On Thu, 12 Jun 2025 15:38:30 +0000, quadibloc wrote:
    -----------------------

    But, this sort of thing is semi-common in VLIWs, along with things like
    not having either pipeline interlocks or register forwarding (so, say,
    every instruction has a 3 cycle latency, so you need to wait 2 bundles
    to use a result else you get a stale value, etc).


    Contrast, spending 1 or 2 bits per instruction word or similar (to daisy
    chain groups of instructions), and still having things like forwarding
    and interlocks, does not result in the same severe hit to code density.

    General purpose ISAs use zero bits to daisy chain instructions; they
    use register specifiers, and a memory ordering logic block.


    However, register forwarding does have a dark side: It has a fairly
    steep cost curve. So, with forwarding, once you try to cross ~ 2 or 3
    wide, the costs here grow out of control.

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.
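
    Plugging assumed per-instruction numbers into that formula makes the
    curve concrete (a throwaway calculation; 1 result and 2 operands per
    instruction assumed):

        #include <stdio.h>

        int main(void)
        {
            /* bypass paths ~ Results x Operands = (W)(2W) = 2W^2 */
            for (int w = 1; w <= 8; w++)
                printf("%d-wide: %3d forwarding paths\n",
                       w, (w * 1) * (w * 2));
            return 0;
        }

    which prints 2, 8, 18, 32, 50, 72, 98, 128: past 2-3 wide the network
    is growing much faster than the number of function units.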

    So, as the core gets wider, the cost of the register file will exceed
    that of the function units (and one may find that it is cheaper to go multi-core than to make the core wider).

    Now, guess what happens with an execution window of size 300
    instructions
    and a GPR file, an FPR file, and an SIMD file ??? Do you provision the
    files for 300 instructions each, 100 instructions each, or something in
    between ???

    Or, to try to go wider while keeping cost under control, give up on
    niceties like register forwarding.

    Apple has shown it is not just possible but can be power efficient.
    --------------------

    One of the dominant use-cases for VLIW is in GPUs and similar.

    But, then seemingly battles for control against "lots of in-order
    superscalar RISC cores".

    So, for more traditional 3D rendering tasks, VLIW did well, but for
    things like GPGPU or ray-tracing, the "crapton of in-order cores"
    strategy works well.

    There are intermediate choices, too::

    Blocks not quite worthy of CPU status, but perform workloads
    nonetheless::

    A) texture
    B) interpolation
    C) rasterization
    D) WARP initialization
    E) WARP rebalancing
    F) Transcendental calculations

    The main merit of OoO is when the overriding priority is maximizing per-thread performance. But, in other cases, cramming more cores on the
    die may offer more performance than one can get from a smaller number of faster cores.

    G) Crypto engines
    H) programmable DMA engines

    Well, and all the battles over things like memory coherence.
    For a small number of fast cores, coherence makes sense.
    For large number of cores, weaker models may be preferable (or, say, essentially treating parts of the memory map as read-only to most of the cores).

    Why not BOTH !! and switch between coherent and incoherent as a side
    effect of a memory instruction touching "something other than DRAM".

    Where, LIW (in a partial contrast to VLIW) can have merit if the goal is
    to optimize for per-core cheapness. The per-core cost for a LIW can be
    lower than that of an in-order superscalar, but with the drawback that
    the compiler will need to be aware of pipeline specifics.

    Say one could have cores designed like, say:
    2 or 3 wide;
    Explicit parallelism;
    No register forwarding;
    Maybe optional interlocks;
    Weak memory coherence;
    ...

    6 to 10 wide
    Explicit Concurrency
    Standard Register dependence order
    Standard Memory dependence order
    Lamport Atomic dependence order
    Lessened Memory consistency When reasonable with
    Sequential consistency When required and
    Strong consistency When absolutely required

    And then trying to optimize for fitting as many cores as possible on the chip, even if per-thread performance is relatively low, and trying to prioritize having very high memory bandwidth.

    Currently, inter-core signals/messages are too expensive: especially
    when a HyperVisor has to get in the way. Then again, SW over the last
    15 years has demonstrated no ability to use "lots" of small cheap cores.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Stefan Monnier on Mon Jun 16 18:26:01 2025
    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.
    This is going to hurt register allocation.

    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.

    I can see that it isn't too hard on the logic for the register
    allocator, but I suspect it will lead to more register saving and
    restoring. Consider a two instruction sequence where the output of the
    first instruction is an input to the second. The first instruction has
    only a choice of 16 registers, not 32. And the second instruction also
    has 16 registers, but on average only half of them will be in the 16
    included in the first instruction. So instead of 32 registers to choose
    from you only have 8. So the odds increase that you must save one of
    those 8 and perhaps restore it after the two instructions have completed.

    It sure seems ugly to me.
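
    The "only 8" figure is right in expectation for independently
    randomized specifier maps: each of the second instruction's 16
    encodable registers has a 16/32 chance of also being encodable by the
    first, giving 16 x 16/32 = 8. A quick check (C; __builtin_popcount is
    a GCC/Clang extension):

        #include <stdio.h>
        #include <stdlib.h>

        /* Random 16-of-32 register subset as a bitmask (Fisher-Yates). */
        static unsigned pick16(void)
        {
            int pool[32];
            unsigned mask = 0;
            for (int i = 0; i < 32; i++) pool[i] = i;
            for (int i = 0; i < 16; i++) {
                int j = i + rand() % (32 - i);
                int t = pool[i]; pool[i] = pool[j]; pool[j] = t;
                mask |= 1u << pool[i];
            }
            return mask;
        }

        int main(void)
        {
            const int trials = 1000000;
            long long total = 0;
            for (int i = 0; i < trials; i++)
                total += __builtin_popcount(pick16() & pick16());
            printf("average overlap: %.3f (analytic: 8)\n",
                   (double)total / trials);
            return 0;
        }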


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Tue Jun 17 11:44:33 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Without data, it's all speculation. Given, however, that there
    doesn't seem to be a rush to replace x86 or arm64 with
    armhf or riscv64, I don't believe that the text size is
    particularly interesting to the general user.

    Probably not, but I don't think the reason is that "working set size"
    would produce significantly different results.

    However, apparently code size is important enough in some markets that
    ARM introduced not just Thumb, but Thumb2 (aka ARM T32), and MIPS
    followed with MIPS16 (repeating the Thumb mistake) and later MicroMIPS
    (which allows mixing 16-bit and 32-bit encodings); Power specified VLE
    (are there any implementations of that?); and RISC-V specified the C extension, which is implemented widely AFAICT.

    AFAICS the main target of those are small embedded microcontrollers
    running code mostly from flash. Apparently flash size has an important
    impact on microcontroller cost. Also, the biggest available flash
    was limited, and if a program exceeded available flash one would have
    to switch to different (possibly significantly more expensive)
    hardware.

    Some Linux distributions took advantage of smaller instructions
    and compiled a lot of programs to save space. But I doubt that
    the possibility of such a saving would be enough to motivate
    development of a more space-efficient encoding. IIUC 64-bit
    ARM dropped most of the space-saving features of Thumb2, so
    apparently they did not consider them important enough for
    bigger machines.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Waldek Hebisch on Tue Jun 17 16:00:34 2025
    On 17/06/2025 13:44, Waldek Hebisch wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Without data, it's all speculation. Given, however, that there
    doesn't seem to be a rush to replace x86 or arm64 with
    armhf or riscv64, I don't believe that the text size is
    particularly interesting to the general user.

    Probably not, but I don't think the reason is that "working set size"
    would produce significantly different results.

    However, apparently code size is important enough in some markets that
    ARM introduced not just Thumb, but Thumb2 (aka ARM T32), and MIPS
    followed with MIPS16 (repeating the Thumb mistake) and later MicroMIPS
    (which allows mixing 16-bit and 32-bit encodings); Power specified VLE
    (are there any implementations of that?); and RISC-V specified the C
    extension, which is implemented widely AFAICT.

    Power PC cores with VLE have definitely been implemented - I used one
    many years ago. You find them in PPC-based microcontrollers such as the MPC5534 with the e200z3 core, made by Freescale (now part of NXP). The
    PPC core microcontrollers are popular in the automotive industry, but I
    don't know off-hand if any of the modern families have VLE.


    AFAICS the main target of those are small embedded microcontrollers
    running code mostly from flash. Apparently flash size has an important
    impact on microcontroller cost. Also, the biggest available flash
    was limited, and if a program exceeded available flash one would have
    to switch to different (possibly significantly more expensive)
    hardware.


    Flash size can be a very real part of the cost of small
    microcontrollers. There are ARM (typically Cortex-M0+) core
    microcontrollers down to 4 KB flash, though for current devices, there
    is rarely much money to save going below 16 KB. Flash size is also a
    limiting factor for future development - typically, you can get the same microcontroller in the same package with different flash and RAM sizes,
    for relatively cheap upgrades. But once you exceed the largest flash
    size in the family, you face a costly board redesign.

    Small microcontrollers also don't have normal caches - but they might
    have a very small amount of buffer attached to the flash memory
    controller. Code density is definitely significant for these devices.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Tue Jun 17 17:41:10 2025
    On Tue, 17 Jun 2025 1:20:40 +0000, quadibloc wrote:

    On Mon, 16 Jun 2025 23:37:05 +0000, MitchAlsup1 wrote:

    Then again, SW over the last
    15 years has demonstrated no ability to use "lots" of small cheap cores.

    And as long as that remains true, out-of-order execution will continue
    to be popular, and there will also be strong pressure to find exotic materials that can be used to make faster transistors - and faster interconnects between them.

    While I am willing to agree that we can do better in using multiple
    cores, I also think that even after we do all that we can in that area,
    a single core that is N times faster will still be better than N cores.

    But that is NOT the arithmetic you are looking at::

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.

    So the arithmetic is: can your OoO core be 10× faster than the LBIO core
    ??
    And the answer is NO.

    The highest performing OoO is M4 right now and it is 2× faster than
    Opteron Rev F (after normalizing frequency)--perhaps I should say
    2× more instructions per clock. If M4 area was equal to Opteron area
    (highly doubtful after normalizing) it would still be a factor of 5-6×
    more area than 12 LBIO cores.

    But on the other hand, no matter what new technologies we discover to
    make cores faster, there will still be a hard limit to how fast a core
    can be.

    Right now, you could make transistors infinitely fast, and clock speeds
    would not move much due to the dominance of wire delay in CPU design.

    So both faster cores, and more efficient ways to use multiple cores,
    will always be important.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Jun 17 17:45:23 2025
    On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:

    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.
    This is going to hurt register allocation.

    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.

    I can see that it isn't too hard on the logic for the register
    allocator,

    You are missing the BIG problem::

    The register allocator allocates Rk for calculation j and later allocates
    Rm for instruction p; then a few instructions later the code generator
    notices that Rk and Rm need to be paired or shared and they were not
    originally. How does one fix this kind of problem without adding more
    passes over the intermediate representation ??

    but I suspect it will lead to more register saving and
    restoring.

    And reg-reg MOVment.

    Consider a two instruction sequence where the output of the
    first instruction is an input to the second. The first instruction has
    only a choice of 16 registers, not 32. And the second instruction also
    has 16 registers, but on average only half of them will be in the 16
    included in the first instruction. So instead of 32 registers to choose
    from you only have 8. So the odds increase that you must save one of
    those 8 and perhaps restore it after the two instructions have
    completed.

    It sure seems ugly to me.

    It has been under study by compiler people since at least 1963 without
    much forward progress.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Tue Jun 17 17:52:50 2025
    On Tue, 17 Jun 2025 13:12:27 +0000, quadibloc wrote:

    On Tue, 17 Jun 2025 12:58:06 +0000, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost of
    the forwarding exceeds that of the function units (such as an
    additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a CPU.
    But since a faster CPU - or, more specifically, a CPU with faster
    single-thread performance - is much more useful than more CPUs, it's
    still well worth the cost.

    In fact, just as I've included VLIW as a basic feature in the Concertina
    II, as a way to explicitly do what OoO does transparently for the
    programmer, one advanced feature I wish to include in the Concertina II
    - I think I took a stab at it in the original Concertina - is dataflow computing.

    Dataflow computing is where the program explicitly states how arithmetic units are to be connected together to perform multiple operations in a chained fashion, usually taking vectors as input and producing vectors
    as output.
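
    As a loose software analogue of such a chained instruction (purely
    illustrative; the chain shape and names are invented, not Concertina's
    encoding), the intermediate value flows directly from one operation to
    the next rather than through architectural registers or memory:

        #include <stddef.h>
        #include <stdio.h>

        /* One fixed chain: multiply -> add -> clamp, applied elementwise. */
        static void chain(const float *a, const float *b, const float *c,
                          float *out, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                float t = a[i] * b[i];          /* unit 1: multiply        */
                t = t + c[i];                   /* unit 2: add, fed by 1   */
                out[i] = (t > 0.0f) ? t : 0.0f; /* unit 3: clamp, fed by 2 */
            }
        }

        int main(void)
        {
            float a[4] = {1, 2, 3, 4}, b[4] = {2, 2, 2, 2},
                  c[4] = {-10, 0, 1, 2}, out[4];
            chain(a, b, c, out, 4);
            for (int i = 0; i < 4; i++) printf("%g ", out[i]);
            printf("\n");
            return 0;
        }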

    Do you remember WHY data-flow failed ???

    It failed because it exposed TOO MUCH ILP and then this, in turn,
    required
    too much logic to manage efficiently--often running into queue overflow problems (Reservation station entries) that could cause lock up if not
    managed correctly.

    GBOoO machines don't have this problem since DECODE will stall if the instruction queues overflow.

    The ENIAC, "before von Neumann ruined it", worked that way. So data
    isn't even being put in registers, let alone memory, between several
    operations, thus making the computer faster.

    In Concertina, unlike Concertina II, I didn't worry about having
    instructions that had awkward and special rules for length decoding,
    though. A dataflow instruction would involve a chain of operations with
    an arbitrary length up to some limit.

    However, the solution suggests itself.

    I used pointers to pseudo-immediates to prevent the variation in length
    of immediate values from making length decoding for instructions
    complicated.

    In some early iterations of Concertina II, I used a similar pointer
    mechanism as my method of allowing instructions longer than 32 bits. The pointers were four bits long for that, instead of five, since now they
    were halfword addresses instead of byte addresses. This could be brought
    back - but just for dataflow instructions and any similar exotic cases.
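
    A hypothetical decode helper showing why the halfword granularity
    saves a bit (the block layout here is my assumption for illustration,
    not Concertina II's actual definition): a 4-bit field selects one of
    16 halfword slots, where byte granularity would have needed 5 bits.

        #include <stdint.h>
        #include <stdio.h>

        /* Byte address of a pseudo-immediate: base of the current
           instruction block plus 2 bytes per halfword slot. */
        static uint32_t pseudo_imm_addr(uint32_t block_base, unsigned ptr4)
        {
            return block_base + 2u * (ptr4 & 0xFu);
        }

        int main(void)
        {
            printf("slot 5 -> byte address %#x\n",
                   pseudo_imm_addr(0x1000, 5));
            return 0;
        }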

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Jun 17 11:09:33 2025
    On 6/17/2025 10:45 AM, MitchAlsup1 wrote:
    On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:

    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.
    This is going to hurt register allocation.

    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.

    I can see that it isn't too hard on the logic for the register
    allocator,

    You are missing the BIG problem::

    The register allocator allocates Rk for calculation j and later allocates
    Rm for instruction p; then a few instructions later the code generator
    notices that Rk and Rm need to be paired or shared and they were not
    originally. How does one fix this kind of problem without adding more
    passes over the intermediate representation ??

    Good point. Thanks.



    but I suspect it will lead to more register saving and
    restoring.

    And reg-reg MOVment.

    Yes. I should have mentioned that as well.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jun 17 18:14:19 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 17 Jun 2025 13:12:27 +0000, quadibloc wrote:

    On Tue, 17 Jun 2025 12:58:06 +0000, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost of
    the forwarding exceeds that of the function units (such as an
    additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a CPU.
    But since a faster CPU - or, more specifically, a CPU with faster
    single-thread performance - is much more useful than more CPUs, it's
    still well worth the cost.

    In fact, just as I've included VLIW as a basic feature in the Concertina
    II, as a way to explicitly do what OoO does transparently for the
    programmer, one advanced feature I wish to include in the Concertina II
    - I think I took a stab at it in the original Concertina - is dataflow
    computing.

    Dataflow computing is where the program explicitly states how arithmetic
    units are to be connected together to perform multiple operations in a
    chained fashion, usually taking vectors as input and producing vectors
    as output.

    Do you remember WHY data-flow failed ???

    It failed because it exposed TOO MUCH ILP and then this, in turn,
    required too much logic to manage efficiently--often running into queue
    overflow problems (Reservation station entries) that could cause lock
    up if not managed correctly.

    Yes. Dataflow (something my advisor specialized in in the late 70's
    and 80s) works best with macro operations, not micro operations.

    Was part of a startup in 2000 that did dataflow using XML datagrams
    as the unit of transport (instead of a bit). An XML document would
    be received by the "system" and routed through a series of
    transformations using a dataflow engine resulting in an XML document
    (in degenerate form, HTML) as output (with side effects such as
    database updates along the way).

    The transformations included applying XSL stylesheets, making
    database accesses and updating fields in the XML with the results,
    and a few other macro-operations.

    There was a nice GUI to create the flow graph for the engine.

    Was eventually purchased by Verisign at the end of the dot-bomb.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Jun 17 18:52:19 2025
    On Tue, 17 Jun 2025 18:34:20 +0000, BGB wrote:

    On 6/17/2025 7:58 AM, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost of
    the forwarding exceeds that of the function units (such as an
    additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a CPU.
    But since a faster CPU - or, more specifically, a CPU with faster
    single-thread performance - is much more useful than more CPUs, it's
    still well worth the cost.


    It depends on the use case.

    But, at least in my experience with ARM hardware, in-order actually
    seems to hold up pretty well here.

    Like, if there were a 200% or more difference for OoO performance vs
    in-order performance relative to clock speed; maybe...

    GBOoO like Opteron is actually 2× faster than LBIO
    GBOoO like M4 is actually 4× faster than LBIO

    But, seemingly the delta is often modest enough that one can still make
    a case for in-order in cases where you don't actually need maximum single-thread performance.

    Forwarding decreases latency. Lower Latency is ALWAYS better when
    measured in picoseconds (not clocks).
    -----------------
    Often, as noted, each population member (or test member, or whatever it
    is called) would usually be represented in some bit-redundant format
    (such as each bit expanded out to a full byte for majority-8, or 3
    parallel copies for majority-3).
    Majority-8 was usually lookup table driven.
    Majority-3 was usually (A&B)|(B&C)|(A&C).

    Majority-3 is 1 gate delay (inverting) 222AOI
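
    In software the same (A&B)|(B&C)|(A&C) form votes all 64 bit
    positions of three redundant copies at once:

        #include <stdint.h>
        #include <stdio.h>

        /* Bitwise 2-of-3 majority vote. */
        static uint64_t maj3(uint64_t a, uint64_t b, uint64_t c)
        {
            return (a & b) | (b & c) | (a & c);
        }

        int main(void)
        {
            /* One copy has a flipped bit; the vote recovers the value. */
            printf("%#llx\n", (unsigned long long)maj3(0xF0, 0xF1, 0xF0));
            return 0;
        }
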
    ------------
    Can note, the Zen+ in my main PC seemingly has an odd property:
    Under 25% CPU load, per thread performance is at maximum;
    Around 25-50%, per-thread drops, but still often positive benefit;
    Over 50%, per-thread drops notably,
    so 100% isn't much better than 50%.

    Granted, the 50-100% domain is mostly hyperthreading territory.

    If it hurts, stop doing it. That is, turn off HypoThreading.

    But, it seems like there is some shared resource that becomes a
    bottleneck by around the time one hits 4 threads.

    Scheduling across multiple cores is known to be more than cubic.

    Had noted that it seems to apply mostly to memory-medium and
    memory-heavy use-cases, where:
    memory-medium: ~ 10 to 100MB of working data;
    memory-heavy: over 100MB of working data.
    Where, most of the data is touched continuously.

    Suspect cache hierarchy.

    If the task is primarily bound by things like branching or ALU/FPU,
    there does not seem to be a fall-off.

    Suspect cache.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to BGB on Tue Jun 17 21:58:17 2025
    On Tue, 17 Jun 2025 13:34:20 -0500
    BGB <cr88192@gmail.com> wrote:

    On 6/17/2025 7:58 AM, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or
    sometimes SUM( Results[type] × Operands[type] ); hardly more than
    quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost
    of the forwarding exceeds that of the function units (such
    as an additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a
    CPU. But since a faster CPU - or, more specifically, a CPU with
    faster single-thread performance - is much more useful than more
    CPUs, it's still well worth the cost.


    It depends on the use case.

    But, at least in my experience with ARM hardware, in-order actually
    seems to hold up pretty well here.

    Like, if there were a 200% or more difference for OoO performance vs in-order performance relative to clock speed; maybe...

    But, seemingly the delta is often modest enough that one can still
    make a case for in-order in cases where you don't actually need
    maximum single-thread performance.



    For Arm architecture, the difference in single-thread performance
    between the fastest available in-order cores (ARM Cortex-A520) and
    the fastest available OoO cores (Apple M4, Qualcomm Oryon) is huge.
    Probably, over 5x. Even Arm's own ARM Cortex-X925 is several times
    faster than A520.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to BGB on Tue Jun 17 22:33:36 2025
    On Tue, 17 Jun 2025 14:13:02 -0500
    BGB <cr88192@gmail.com> wrote:

    On 6/17/2025 1:58 PM, Michael S wrote:
    On Tue, 17 Jun 2025 13:34:20 -0500
    BGB <cr88192@gmail.com> wrote:

    On 6/17/2025 7:58 AM, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or
    sometimes SUM( Results[type] × Operands[type] ); hardly more
    than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the
    cost of the forwarding exceeds that of the function
    units (such as an additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding
    stops making sense.

    Out of order execution can cost more than the functional units in
    a CPU. But since a faster CPU - or, more specifically, a CPU with
    faster single-thread performance - is much more useful than more
    CPUs, it's still well worth the cost.


    It depends on the use case.

    But, at least in my experience with ARM hardware, in-order actually
    seems to hold up pretty well here.

    Like, if there were a 200% or more difference for OoO performance
    vs in-order performance relative to clock speed; maybe...

    But, seemingly the delta is often modest enough that one can still
    make a case for in-order in cases where you don't actually need
    maximum single-thread performance.



    For Arm architecture, the difference in single-thread performance
    between the fastest available in-order cores (ARM Cortex-A520) and
    the fastest available OoO cores (Apple M4, Qualcomm Oryon) is huge. Probably, over 5x. Even Arm's own ARM Cortex-X925 is several times
    faster than A520.


    For ARM, main reference points I had was A53 vs A72.
    A72 was faster, but not drastically...


    Arm Cortex A15/A57/A72 family of OoO cores designed in Austin, TX was no
    good.
    Arm Cortex A9/A12/A73/A75 family of OoO cores designed in Sophia
    Antipolis was significantly better.
    The next Austin-designed family (all current middle and high end Arm
    cores starting from A76) is better yet. Less so in perf/W and perf/area,
    more so in absolute performance.

    Cambridge-designed Cortex-A53 is a very good design, but it was never
    meant to have top performance. And it is old.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Jun 17 16:51:17 2025
    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
    ^^
    still?

    I understand that was your sales pitch, and I assume you had good
    reasons to think it was indeed true, but is it still the case now?

    AFAICT (see for example Anton's benchmarks in this regard) with current
    CPUs, "LBIO cores" are not terribly more power-efficient than big OoO cores.

    Or at least, it seems that the big OoO cores are not significantly less power-efficient when they are computing at the same speed as LBIO cores.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Jun 17 16:43:00 2025
    MitchAlsup1 [2025-06-17 17:45:23] wrote:
    On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:
    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.
    I can see that it isn't too hard on the logic for the register
    allocator,
    You are missing the BIG problem::

    The register allocator allocates Rk for calculation j and later allocates
    Rm for instruction p; then a few instructions later the code generator
    notices that Rk and Rm need to be paired or shared and they were not
    originally.

    What do you mean by "a few instructions later"? The above was stated in
    the context of a register allocator based on something like Chaitin's algorithm, which does not proceed "instruction by instruction" but
    instead takes a whole function (or basic block), builds an interference
    graph from it, then chooses registers for the vars looking only at that interference graph.

    but I suspect it will lead to more register saving and
    restoring.
    And reg-reg MOVment.

    Of course. The point is simply that in practice (for some particular
    compiler at least), the cost of restricting register access by using
    only 4bits despite the existence of 32 registers was found to be small.

    Note also that you can reduce this cost by relaxing the constraint and
    using 5bit for those instructions where there's enough encoding space.
    (or inversely, increase the cost by using yet fewer bits for those
    instructions where the encoding space is really tight).

    There's also a good chance that you can further reduce the cost by using
    a sensible mapping from 4bit specifiers instead of a randomized one.

    IOW, the point is that just because you have chosen to have 2^N
    registers in your architecture doesn't mean you have to offer access to
    all 2^N registers in every instruction that can access registers.
    It's clearly more convenient if you can offer that access, but if needed
    you can steal a bit here and there without having too serious an impact
    on performance.

    Consider a two instruction sequence where the output of the
    first instruction is an input to the second. The first instruction has
    only a choice of 16 registers, not 32. And the second instruction also
    has 16 registers, but on average only half of them will be in the 16
    included in the first instruction. So instead of 32 registers to choose
    from you only have 8.

    Right. But in practice, the register allocator can often choose the
    rest of the register assignment such that one of those 8 is available.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Jun 17 21:11:11 2025
    On Tue, 17 Jun 2025 20:51:17 +0000, Stefan Monnier wrote:

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
    ^^
    still?

    I understand that was your sales pitch, and I assume you had good
    reasons to think it was indeed true, but is it still the case now?

    I have seen no dramatic change in the ratio of logic to SRAM to pins
    in the last 15 years, and if anything, the more layers of metal, the
    smaller the LBIO can be whereas GBOoO tend to use more of the layers.

    AFAICT (see for example Anton's benchmarks in this regard) with current
    CPUs, "LBIO cores" are not terribly more power-efficient than big OoO
    cores.

    I understand this point, and do not disagree. A lot of work has gone
    into decreasing the power consumed by instruction queueing--converting
    value-capturing reservation stations into value-free reservation
    stations has done a lot of this, while, at the same time taking pressure
    off of forwarding (at a minor cost in latency).

    Execution power is up only by the instruction rate multiplier and
    some minor term in power consumed in instructions that get thrown away
    via misprediction, or get run more than once due to replay.

    Or at least, it seems that the big OoO cores are not significantly less power-efficient when they are computing at the same speed as LBIO cores.

    The LBIO cores are more heavily dependent on low latency whereas
    GBOoO cores are more tolerant of latency and of order.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Jun 17 21:19:23 2025
    On Tue, 17 Jun 2025 19:04:49 +0000, BGB wrote:

    On 6/17/2025 12:59 PM, quadibloc wrote:
    On Tue, 17 Jun 2025 17:41:10 +0000, MitchAlsup1 wrote:

    But that is NOT the arithmetic you are looking at::

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
    So the arithmetic is: can your OoO core be 10× faster than the LBIO core
    ??
    And the answer is NO.

    But my code runs faster on the OoO core than on ten LBIO cores, because
    nobody knows how to make effective use of ten cores to solve the
    problem.

    So the fact that it uses 10x the electrical power, while only having 2x
    the raw power - for an embarrassingly parallel problem, which doesn't
    happen to be the one I need to solve - doesn't matter.

    That's why OoO chips sell so well.


    Errm, this doesn't agree with my experience.

    More like the OoO chips are around 20-40% faster, but depending on
    workload.

    Then you are latency bound, not compute bound.


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Jun 17 18:14:48 2025
    MitchAlsup1 [2025-06-17 21:18:29] wrote:
    On Tue, 17 Jun 2025 20:43:00 +0000, Stefan Monnier wrote:
    What do you mean by "a few instructions later"? The above was stated in
    the context of a register allocator based on something like Chaitin's
    algorithm, which does not proceed "instruction by instruction" but
    instead takes a whole function (or basic block), builds an interference
    graph from it, then chooses registers for the vars looking only at that
    interference graph.

    I am regurgitating conversations I have had with compiler people over
    the last 40 years. Nothing I have seen in ISA design has moderated
    these problems--but I, personally, have not been inside a compiler
    for 41 years, either (1983). So, find a compiler writer to set this
    record straight. I continue to be told: it is enough harder that you
    should design ISA so you don't need pairing or sharing, ever.

    Ah, well, pairing is a different problem than the "incomplete register specifiers" I'm talking about. Indeed, it can be much more difficult to
    adapt a Chaitin-style allocator to handle pairing because it can't be
    expressed simply in the interference graph.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Jun 17 21:18:29 2025
    On Tue, 17 Jun 2025 20:43:00 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-06-17 17:45:23] wrote:
    On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:
    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.
    I can see that it isn't too hard on the logic for the register
    allocator,
    You are missing the BIG problem::

    The register allocator allocates Rk for calculation j and later allocates
    Rm for instruction p; then a few instructions later the code generator
    notices that Rk and Rm need to be paired or shared and they were not
    originally.

    What do you mean by "a few instructions later"? The above was stated in
    the context of a register allocator based on something like Chaitin's algorithm, which does not proceed "instruction by instruction" but
    instead takes a whole function (or basic block), builds an interference
    graph from it, then chooses registers for the vars looking only at that interference graph.

    I am regurgitating conversations I have had with compiler people over
    the last 40 years. Nothing I have seen in ISA design has moderated
    these problems--but I, personally, have not been inside a compiler
    for 41 years, either (1983). So, find a compiler writer to set this
    record straight. I continue to be told: it is enough harder that you
    should design ISA so you don't need pairing or sharing, ever.

    but I suspect it will lead to more register saving and
    restoring.
    And reg-reg MOVment.

    Of course. The point is simply that in practice (for some particular compiler at least), the cost of restricting register access by using
    only 4bits despite the existence of 32 registers was found to be small.

    Note also that you can reduce this cost by relaxing the constraint and
    using 5bit for those instructions where there's enough encoding space.
    (or inversely, increase the cost by using yet fewer bits for those instructions where the encoding space is really tight).

    There's also a good chance that you can further reduce the cost by using
    a sensible mapping from 4bit specifiers instead a randomized one.

    IOW, the point is that just because you have chosen to have 2^N
    registers in your architecture doesn't mean you have to offer access to
    all 2^N registers in every instruction that can access registers.
    It's clearly more convenient if you can offer that access, but if needed
    you can steal a bit here and there without having too serious an impact
    on performance.

    Consider a two instruction sequence where the output of the
    first instruction is an input to the second. The first instruction has
    only a choice of 16 registers, not 32. And the second instruction also
    has 16 registers, but on average only half of them will be in the 16
    included in the first instruction. So instead of 32 registers to choose
    from you only have 8.

    Right. But in practice, the register allocator can often choose the
    rest of the register assignment such that one of those 8 is available.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Stephen Fuld on Tue Jun 17 16:47:55 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).

    OS/360 "relocatable" ... included address constants in executable images
    that had to be modified when first loaded into real storage (which
    continued after move to virtual storage).

    The initial decision to add virtual memory to all 370s was based on the
    fact that OS/360 "MVT" storage management was so bad that (concurrently
    loaded) executable sizes had to be specified four times larger than used
    ... so typical 1mbyte (real storage) 370/165 only ran four concurrently executing regions, insufficient to keep 165 busy and justified. Running
    MVT in a (single) 16mbyte virtual address space, aka VS2/SVS (sort of
    like running MVT in a CP67 16mbyte virtual machine) allowed concurrently running regions to be increased by a factor of four (modulo 4bit storage protection keys required for isolating each region) with little or no
    paging.

    As systems got larger they needed to run more than 15 concurrent regions (storage protect key=0 for kernel, 1-15 for regions). As a result they
    move to VS2/MVS ... a separate 16mbyte virtual address space for each
    region (to eliminate storage protect key 15 limit on concurrently
    executing regions). However since OS/360 APIs were heavily pointer
    passing, they map an 8mbyte kernel image into every virtual address
    space (allowing pointer passing kernel calls to use passed pointer
    directly) ... leaving 8mbyte for each region.

    However kernel subsystems were also mapped into their own, separate
    16mbyte virtual address space. For (pointer passing) application calls
    to subsystem, a one megabyte "common segment area" ("CSA") was mapped
    into every 16mbyte virtual address space for pointer passing API calls
    to subsystems ... leaving 7mbytes for every application.

    However, by later half of 70s & 3033 processor, since the total common
    segment API data space was somewhat proportional to number of subsystems
    and number of concurrently executing regions ... the one mbyte "common
    SEGMENT area" was becoming 5-6mbyte "common SYSTEM area", leaving only 2-3mbytes for applications ... but frequently threatening to become
    8mbyte (leaving zero bytes for applications).

    This was part of desperate need to migrate from MVS to 370/XA and MVS/XA
    with 31-bit addressing as well as "access registers" ... where call to subsystem switched the caller's address space pointer to the secondary
    address space and loads the called subsystem address space pointer into
    the primary address space ... allowing subsystem to directly address
    caller's API data in (secondary address space) private area (not needing
    to be placed in a "CSA"). The subsystem then returns to the caller
    ... and the caller's address space pointer is switched back from
    secondary to primary.

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to BGB on Tue Jun 17 23:35:23 2025
    On 6/17/2025 11:16 PM, BGB wrote:
    On 6/17/2025 4:19 PM, MitchAlsup1 wrote:
    On Tue, 17 Jun 2025 19:04:49 +0000, BGB wrote:

    On 6/17/2025 12:59 PM, quadibloc wrote:
    On Tue, 17 Jun 2025 17:41:10 +0000, MitchAlsup1 wrote:

    But that is NOT the arithmetic you are looking at::

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the
    power.

    So the arithmetic is: can your OoO core be 10× faster than the LBIO
    core
    ??
    And the answer is NO.

    But my code runs faster on the OoO core than on ten LBIO cores, because
    nobody knows how to make effective use of ten cores to solve the
    problem.

    So the fact that it uses 10x the electrical power, while only having 2x
    the raw power - for an embarrassingly parallel problem, which doesn't
    happen to be the one I need to solve - doesn't matter.

    That's why OoO chips sell so well.


    Errm, this doesn't agree with my experience.

    More like the OoO chips are around 20-40% faster, but depending on
    workload.

    Then you are latency bound, not compute bound.


    Possibly...

    A lot of the code doesn't do that much math or dense logic on the data.
    But, a whole lot of mostly shoveling data around, often through lookup
    tables or similar.


    But, if the usual claim is that it is N times faster, this would imply
    it is N times faster across the board, rather than "N times fast, but
    only if the logic happens to have lots of complex math expressions and similar."

    I disagree. To me, N times faster doesn't mean across the board, i.e.
    on every workload, but on average across a variety of workloads.

    And, of course, one of the big advantages of OoO is better latency
    tolerance i.e. it can often do something useful while waiting for a load instruction to complete. That may explain your interpreter/compiler
    results.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Wed Jun 18 06:58:51 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
    ^^
    still?

    I understand that was your sales pitch, and I assume you had good
    reasons to think it was indeed true, but is it still the case now?

    AFAICT (see for example Anton's benchmarks in this regard) with current
    CPUs, "LBIO cores" are not terribly more power-efficient than big OoO cores.

    I have not done any measurements on power-efficiency of in-order
    vs. OoO cores myself, but Andrei Frumusanu measured the in-order
    Cortex-A55, the OoO Cortex-A75 and the OoO Samsung M4 in the Exynos
    9820 and published those measurements on anandtech <https://www.anandtech.com/show/14072/the-samsung-galaxy-s10plus-review/4>

    Concerning power-efficiency (on SPEC2006 Int+FP Geomean), <https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
    is most relevant: If you use the A75 and the A55 at their most
    efficient points, the A55 is slightly more efficient, but more than 3
    times slower. As soon as you want more performance from the A55, it
    becomes so inefficient that the A75 will beat it at power-efficiency.

    Concerning area, in <2024Jan24.225412@mips.complang.tuwien.ac.at> I
    estimated that the A75 has 3-4 times the size of the A55 (on the same
    chip, i.e., the same number of metal layers etc.), for 3-4 times more performance. So in-order does not look more area-efficient, either.

    So why do ARM still do in-order Cortex-A cores? Maybe for the bottom
    of the smartphone market who only care about cost. And maybe the
    smartphone manufacturers want to brag about the number of cores their
    SoC has without paying the licensing and area costs for so many OoO
    cores.

    Or at least, it seems that the big OoO cores are not significantly less >power-efficient when they are computing at the same speed as LBIO cores.

    Looking at <https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
    for the 5 SPEC2006 Int+FP Geomean speed, the OoO A75 is more than 3
    times more efficient and the OoO Samsung M4 is more than 2 times more
    efficient than the in-order A55. OTOH, for the 1.1 SPEC2006 Int+FP
    Geomean speed, the A75 at its slowest speed is slightly less
    efficient, and the M4 is about 1.5 times less efficient than the A55
    (assuming in both cases that the OoO cores go into a low-power state
    after they have finished the job and consume too little additional
    energy while waiting to cause a significant change in power
    consumption).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Wed Jun 18 07:31:55 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Ah, well, pairing is a different problem than the "incomplete register
    specifiers" I'm talking about. Indeed, it can be much more difficult to
    adapt a Chaitin-style allocator to handle pairing because it can't be
    expressed simply in the interference graph.

    I remember reading a paper about register allocation for register
    pairs, but don't find that paper right now. Anyway, what I read was
    based on graph-colouring IIRC and it looked pretty plausible and not
    too complicated. But the devil is in the details.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Lynn Wheeler on Wed Jun 18 13:48:15 2025
    Lynn Wheeler <lynn@garlic.com> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).



    As systems got larger they needed to run more than 15 concurrent regions
    (storage protect key=0 for kernel, 1-15 for regions).

    Back in the late 70's, _The Adolescence of P-1_ was published, wherein
    the protagonist uses a timing loop to obtain a storage protect key
    of zero. Which led to the development of a massively distributed
    AI. It's still a fine tale, albeit somewhat dated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Wed Jun 18 14:10:42 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Tue, 17 Jun 2025 12:58:06 +0000, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost of
    the forwarding exceeds that of the function units (such as an
    additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a CPU.
    But since a faster CPU - or, more specifically, a CPU with faster
    single-thread performance - is much more useful than more CPUs, it's
    still well worth the cost.

    In fact, just as I've included VLIW as a basic feature in the Concertina
    II, as a way to explicitly do what OoO does transparently for the
    programmer,

    Have enough instructions in the queue to deal with memory delays which
    cannot be determined by the compiler in a reasonable way? How does
    it do that?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Wed Jun 18 14:45:24 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 17 Jun 2025 1:20:40 +0000, quadibloc wrote:

    On Mon, 16 Jun 2025 23:37:05 +0000, MitchAlsup1 wrote:

    Then again, SW over the last
    15 years has demonstrated no ability to use "lots" of small cheap cores.

    And as long as that remains true, out-of-order execution will continue
    to be popular, and there will also be strong pressure to find exotic
    materials that can be used to make faster transistors - and faster
    interconnects between them.

    While I am willing to agree that we can do better in using multiple
    cores, I also think that even after we do all that we can in that area,
    a single core that is N times faster will still be better than N cores.

    But that is NOT the arithmetic you are looking at::

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.

    So the arithmetic is: can your OoO core be 10× faster than the LBIO core
    ??
    And the answer is NO.

    The highest performing OoO is M4 right now and it is 2× faster than
    Opteron Rev F (after normalizing frequency)--perhaps I should say
    2× more instructions per clock. If M4 area was equal to Opteron area
    (highly doubtful after normalizing) it would still be a factor of 5-6×
    more area than 12 LBIO cores.

    I think the situation is much more nuanced. I recently bought
    a few mini PCs and did a comparison with bigger machines. A single
    core of a 6-core Zen 2 machine (Ryzen 5 3600) is about 3 times
    faster than a reasonably modern Celeron, about 10 times faster
    than a Celeron N3060 and about 7 times faster than a core in
    an Allwinner H2 chip. The newer 12-core Ryzen 9 7900 has an about
    60% faster core than Zen 2 and has a TDP of 65 W, that is 5.5 W
    per core. IIUC low end PC-s have a TDP of about 2-3 W and two cores,
    so about 1 W per core. Of course, single core performance on a
    multicore processor is inflated due to the increase of clock
    frequency (and power) when only one core is active. Also,
    I used Dhrystone to get numbers. But I have similar
    performance ratios on my real loads (including parallel
    ones which use all cores). IIUC in-order mini PC-s
    have really poor performance; the best ones are out of
    order, but still significantly slower than a big core.

    Anyway, my results are in line with results given by Anton,
    and do not agree with your estimate.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jun 18 11:22:12 2025
    Chris M. Thomasson [2025-06-18 00:47:51] wrote:
    On 6/17/2025 1:43 PM, Stefan Monnier wrote:
    What do you mean by "a few instructions later"? The above was stated in
    the context of a register allocator based on something like Chaitin's
    algorithm, which does not proceed "instruction by instruction" but
    [...]
    Fwiw, here is some old code of mine, a region allocator in C that should still work today... Sorry for butting in:

    Hmmm "region" != "register".


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jun 18 11:50:26 2025
    Anton Ertl [2025-06-18 07:31:55] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Ah, well, pairing is a different problem than the "incomplete register
    specifiers" I'm talking about. Indeed, it can be much more difficult to
    adapt a Chaitin-style allocator to handle pairing because it can't be
    expressed simply in the interference graph.
    I remember reading a paper about register allocation for register
    pairs, but don't find that paper right now. Anyway, what I read was
    based on graph-colouring IIRC and it looked pretty plausible and not
    too complicated. But the devil is in the details.

    Preston Briggs (who used to be a regular here) discusses such an
    allocator in his PhD thesis (https://repository.rice.edu/items/2ea2032a-0872-43a1-90c0-564c1dd2275f).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Wed Jun 18 09:55:21 2025
    On 6/15/2025 12:29 PM, quadibloc wrote:
    On Sat, 14 Jun 2025 16:45:23 +0000, Stephen Fuld wrote:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:

    Which made nonsense the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why.  When the program is linked, the COMMON
    block is at some fixed displacement from the start of the program.  So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    The purpose of a COMMON block is to share variables between the main
    program and subroutines.

    Sure.

    On the System/360, a FORTRAN compiler typically compiled each subroutine
    in a program separately from every other subroutine. They just got
    linked together by the linking loader in order to run.

    I'm not sure what you mean by "linking loader". The linkage editor (IIRC
    IEWL) linked together all of the object modules created by the compiler.
    Loading the program was a different operation (again IIRC, done by the
    initiator in each partition).

    So no subroutine would know where a COMMON block created by the loader
    for the main program would be unless that information was given to
    it - and
    the loader would give it that information, in the form of a full 24-bit address constant, so it didn't have to be passed as a parameter.

    So the linkage editor knows where the common block (and hence all of the variables within it) is located relative to the start of the program,
    but it doesn't know where in real storage the program will be loaded.
    So all references to variables in the common block are resolved,
    relative to the start of the program at link time. But you still need
    the base register scheme to resolve the address based on where in real
    storage the program gets loaded. This all has nothing to do with
    addresses being passed as parameters.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Wed Jun 18 18:16:37 2025
    On Wed, 18 Jun 2025 15:14:06 +0000, quadibloc wrote:

    On Wed, 18 Jun 2025 14:10:42 +0000, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:

    In fact, just as I've included VLIW as a basic feature in the Concertina
    II, as a way to explicitly do what OoO does transparently for the
    programmer,

    Have enough instructions in the queue to deal with memory delays which
    cannot be determined by the compiler in a reasonable way? How does
    it do that?

    VLIW only deals with one of the things OoO solves: stuff like
    read-after-write pipeline hazards. It doesn't address cache misses in
    any way.

    Statically scheduled VLIW is dependent on there being no variable
    latency results. So,
    a) cache misses
    b) TLB misses
    c) FDIV/SQRT taking variable cycles
    d) some kinds of Store buffering

    are all off the table in a VLIW design, but are on the table in any
    other design with discovered forwarding.

    For example, in 1991 we were working on an FDIV algorithm in FMUL
    (Goldschmidt) that always delivered correct results in 17 cycles
    (rather standard for the day). We discovered that we could deliver
    a result in 12 cycles that needed to be fixed up 1/128 of the time.
    So, would you rather have an FDIV in 12.25 RMS clocks or 17 static
    clocks?
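
    A sketch of that expected-latency arithmetic in C (the 32-cycle fixup
    cost is my inference from the quoted 12.25 figure, not a number given
    in the post):

        #include <stdio.h>

        int main(void) {
            /* 12-cycle fast path, fixed up once per 128 divides; the
               32-cycle fixup cost is an inference, not a given number */
            double expected = 12.0 + (1.0 / 128.0) * 32.0;
            printf("expected FDIV latency: %.2f cycles (vs 17 static)\n",
                   expected);
            return 0;
        }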

    Statically scheduled VLIW takes this game away.

    So that's total bad news, right?

    Grim, maybe. Bad, not necessarily.

    It proves VLIW is useless?

    Not at all--it demonstrates that VLIW is less than ideal when dealing
    with unpredictable latencies.


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 19 01:03:37 2025
    On Wed, 18 Jun 2025 22:00:54 +0000, quadibloc wrote:

    On Wed, 18 Jun 2025 18:16:37 +0000, MitchAlsup1 wrote:
    On Wed, 18 Jun 2025 15:14:06 +0000, quadibloc wrote:

    So that's total bad news, right?

    Grim, maybe, Bad, not necessarily.

    It proves VLIW is useless?

    Not at all--it demonstrates the VLIW is less than ideal when dealing
    with unpredictable latencies.

    I'm surprised, though, that you did not continue onwards, and comment on
    the part where I blamed you for finding a resolution to this problem.

    The resolution to the problem means the VLIW-ness of that ISA is no
    longer necessary.

    Because, unless my memory is very faulty, you noted that the OoO implementation of the 6600 _is_ adequate for dealing with unpredictable latencies, such as those from cache misses (even if the 6600 didn't have
    a cache; instead, it had extra memory under program control)... and so
    it seemed to me that since VLIW can theoretically handle register
    hazards almost as well as Tomasulo, it could complement a 6600-style
    pipeline to provide a match for the resource hog OoO style in common use today.

    But once you have dynamic scheduling*, you no longer need VLIW-ness
    to jam all the instructions in per clock--you can do it for a typical
    von Neumann instruction set.

    VLIWs, as used in the past (MultiFlow...to...Mill), are all statically
    scheduled.

    Once you are no longer statically scheduled, the VLI part is not needed,
    and indeed serves as an arbitrary limit to the widths you can achieve
    in practice.

    (*) dynamic scheduling is how one tolerates unknowable latencies.
    VLIW scheduling is how one tolerates only known latencies.


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jun 19 08:56:34 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Anton Ertl [2025-06-18 07:31:55] wrote:
    I remember reading a paper about register allocation for register
    pairs, but don't find that paper right now. Anyway, what I read was
    based on graph-colouring IIRC and it looked pretty plausible and not
    too complicated. But the devil is in the details.

    Preston Briggs (who used to be a regular here) discusses such an
    allocator in his PhD thesis

    Given that I found nothing else, that was probably it.

    (https://repository.rice.edu/items/2ea2032a-0872-43a1-90c0-564c1dd2275f).

    Oh boy, not only did Rice produce the pdf from the scan of a paper
    copy, one of the pages in the pairing chapter was also not properly
    scanned. Fortunately, I have a digital copy of Briggs' Thesis, and I
    have now temporarily put it online. I will send a copy to Rice, maybe
    they will update their copy.

    http://www.complang.tuwien.ac.at/anton/tmp/briggs-thesis.ps.gz

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 19 13:17:39 2025
    On Thu, 19 Jun 2025 1:32:07 +0000, quadibloc wrote:

    On Thu, 19 Jun 2025 1:03:37 +0000, MitchAlsup1 wrote:

    The resolution to the problem means the VLIW-ness of that ISA is no
    longer necessary.

    That may be.

    But because I'm not the expert on things like this that you are, I don't
    feel that I can dispute the conventional wisdom. The conventional
    wisdom, as practiced by Intel and AMD and pretty much the whole CPU
    industry is that the dynamic scheduling design as used in the Control
    Data 6600 is inadequate, and one has to go to register rename or the equivalent Tomasulo Algorithm in order to achieve acceptable
    performance.

    Nothing prevents a Scoreboard from using renamed registers.

    The 6600 doesn't cover all register hazards.

    It covers RAW, WAR, and WAW, but ignores the RAR hazard.
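
    A minimal illustration in C, with variables standing in for registers
    (a hypothetical three-instruction sequence, not 6600 code):

        #include <stdio.h>

        int main(void) {
            int r1, r4, r2 = 2, r3 = 3, r5 = 5, r6 = 6, r7 = 7;
            r1 = r2 + r3;   /* i1: writes r1                   */
            r4 = r1 + r5;   /* i2: RAW -- must see i1's result */
            r1 = r6 + r7;   /* i3: WAW vs i1, WAR vs i2        */
            printf("%d %d\n", r4, r1);  /* 10 13 only if order holds */
            return 0;
        }

    Two reads of the same register (RAR) impose no ordering at all, which
    is why the scoreboard can ignore them.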

    VLIW can deal with register hazards, but it doesn't help at all with
    cache misses. I have no reason to doubt your claim that the 6600
    mechanism is adequate to deal with cache misses, though; that's why I
    noted combining the two as an option.

    Maybe you are right that this is useless, but I'm not in a position to
    dispute what Patterson and Hennessy have proclaimed and the industry
    has accepted.

    But I'm saying that even if Patterson and Hennessy _are_ right, adding
    VLIW provides a method by which your goal - getting rid of the bulk of
    the transistor and power overhead of OoO by going to the 6600 design -
    would _still_ be achievable, since adding VLIW is essentially trivial.

    That is not my goal. My goal is VAX-instruction count with RISC-like pipelineability.

    Sure, I could be wrong - and 6600 by itself is plenty good enough. But
    given all the naysayers out there, a way out of the GBOoO rut that
    people might be willing to believe could work has got to have some
    value.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 19 13:26:28 2025
    On Thu, 19 Jun 2025 2:29:27 +0000, quadibloc wrote:

    I've decided that this would be a good time to review the difference
    between the 6600 scoreboard and modern OoO.

    Having refreshed my memory, I see the issue is that when there is a WAR hazard, a 6600-style scoreboard simply stalls the functional unit.

    The Scoreboard stalls DELIVERY of the result (W) until all consumers
    of the previous value have consumed that value (AR), something Tomasulo
    does not even try to do.

    Tomasulo or register renaming provides extra storage, either in the reservation stations or in rename registers, so that if the desired
    result register is not yet available, the result can just go in an extra place.

    One can use a skid buffer per function unit to hold calculated results
    awaiting write permission, avoiding the WAR-hazard stalls in getting
    instruction calculations started.

    This suggests that, just as some caches are designed in a very simple
    fashion, one could have a "stupid" form of register rename - say, each
    register has its own rename register - that could be added to a
    scoreboard. I would have thought that people are already doing this,
    but they're calling it full register-rename OoO and not
    scoreboarding-plus, because that's better marketing-wise if nothing
    else.

    Nothing in the scoreboard prevents you from using renamed registers.

    RISC mitigates WAR hazards by having 32 registers instead of, say, 8 or
    16.

    And since the Scoreboard is quadratic in area, going from 8 registers
    (6600) to 32 (R3000) makes the gate count in the scoreboard go up by
    (32/8)^2 = 16×.

    VLIW marks out groups of instructions that don't have RAW or WAR
    hazards. A scoreboard keeps track of dependencies, so it can delay only
    those instructions affected by a cache miss. Since a 6600 scoreboard
    does have to _detect_ WAR hazards, even if it doesn't handle them as
    well as Tomasulo, putting in a bit to indicate one is present is indeed
    not needed, so you are right there... at least for an older-style
    computer.

    But a lot of computers these days have multiple copies of each
    functional unit; that is, they're superscalar. So indicating that
    several instructions can be executed together with no need for any
    thought would seem to make things go faster.

    Except, of course, when there's a chance one is trying to execute instructions at a time when all dependencies are not resolved - some registers aren't loaded yet with the data some of those instructions
    will need. They all have to go through the scoreboard to check that. But
    the instructions in a group are guaranteed not to depend on *each
    other*, so they can be checked against the scoreboard _in parallel_.
    That's what the VLIW bits can help with.

    So you say ...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Fri Jun 20 05:56:56 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 12 Jun 2025 8:38:06 +0000, Anton Ertl wrote:
    [program counter as GPR]
    Nowadays, you can afford it, but the question still is whether it is
    cost-effective. Looking at recent architectures, none has PC
    addressable like a GPR, so no, it does not look to be cost-effective
    to have the PC addressable like a GPR. AMD64 and ARM A64 have
    PC-relative addressing modes, while RISC-V does not.

    Consider that in an 8-wide machine, IP gets added to 8 times per cycle,
    whereas no GPR has a property anything like that.

    Actually the 6-wide renamer of Golden Cove (Alder Lake P-core) can
    handle 6 dependent adds of small constants to GPRs per cycle, and I
    would be surprised if the 8-wide Lion Cove (Arrow Lake P-Core) would
    not be able to do 8 dependent adds of small constants to GPRs.

    However, in the case of the PC, I think you can produce the 8 PC
    values with less effort. The decoder knows where each instruction has
    started, so it just needs to propagate this information as the
    instruction-specific PC value to the execution engine.
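
    A minimal sketch in C of that propagation (the 8-wide figure and the
    per-instruction length array are assumptions for illustration):

        #include <stdint.h>

        /* the decoder knows each instruction's start, so the eight
           per-instruction PC values are just a running sum of lengths */
        void assign_pcs(uint64_t block_pc, const uint8_t len[8],
                        uint64_t pc[8]) {
            uint64_t off = 0;
            for (int i = 0; i < 8; i++) {
                pc[i] = block_pc + off;
                off += len[i];
            }
        }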

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to quadibloc on Fri Jun 20 17:11:31 2025
    quadibloc wrote:
    On Sat, 14 Jun 2025 8:35:31 +0000, Robert Finch wrote:

    Packing and unpacking decimal floats can be done inexpensively and fast
    relative to the size, speed of the decimal float operations. For my own
    implementation I just unpack and repack for all ops and then registers
    do not need any more than 128-bits.

    I also unpack the hidden first bit on IEEE-754 floats.

    The idea is that the ISA may be used for a wide variety of
    implementations, and on at least some of them, anything that takes an
    amount of time above zero may make a difference.

    You might not be aware that by unpacking the hidden bit, you at the same
    time destroy the very nice feature of a maximally dense packing, where a
    simple unsigned increment always brings you to the next possible
    floating point value, and on rounding up you get from exp + 0xfff..f
    mantissa directly to exp+1 + 0x000..0

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Fri Jun 20 15:59:14 2025
    On Fri, 20 Jun 2025 15:30:59 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:

    Ideally, one has an ISA where nearly all registers are the same:
    No distinction between base/index/data registers;
    No distinction between integer and floating point registers;
    No distinction between general registers and SIMD registers;
    ...

    ------------------------
    But I felt that this was OK, since as everybody knows, strings really
    only have to be able to be at least 80 characters long. Hmm... wait a
    moment, aren't 132-character strings sometimes needed?

    Line printers are/were 132 characters wide.

    Oh, well.

    Oh well, indeed!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From moi@21:1/5 to All on Fri Jun 20 17:12:51 2025
    On 20/06/2025 16:59, MitchAlsup1 wrote:
    On Fri, 20 Jun 2025 15:30:59 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:

    Ideally, one has an ISA where nearly all registers are the same:
       No distinction between base/index/data registers;
       No distinction between integer and floating point registers;
       No distinction between general registers and SIMD registers;
       ...

    ------------------------
    But I felt that this was OK, since as everybody knows, strings really
    only have to be able to be at least 80 characters long. Hmm... wait a
    moment, aren't 132-character strings sometimes needed?

    Line printers are/were 132 characters wide.

    Some were. Others were only 120 wide and others were 160 wide.

    --
    Bill F.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Jun 20 12:43:29 2025
    quadibloc [2025-06-14 15:22:20] wrote:
    I also unpack the hidden first bit on IEEE-754 floats.
    The idea is that the ISA may be used for a wide variety of
    implementations, and on at least some of them, anything that takes an
    amount of time above zero may make a difference.

    Do you have any evidence that hiding the leading 1 bit takes more time
    than not hiding? I can think of reasons why either of the two options
    could be marginally cheaper than the other, but in all cases I can think
    of, it would make *very little* difference, if any.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Fri Jun 20 18:34:33 2025
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were >>>>> pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore >>>>> it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    Hydro had 77K employees in 130 countries, there was no such thing as a
    simple hierarchical setup.

    Rather more like a loose federation across varying local environments.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Jun 20 17:38:59 2025
    On Fri, 20 Jun 2025 16:43:29 +0000, Stefan Monnier wrote:

    quadibloc [2025-06-14 15:22:20] wrote:
    I also unpack the hidden first bit on IEEE-754 floats.
    The idea is that the ISA may be used for a wide variety of
    implementations, and on at least some of them, anything that takes an
    amount of time above zero may make a difference.

    Do you have any evidence that hiding the leading 1 bit takes more time
    than not hiding? I can think of reasons why either of the two options
    could be marginally cheaper than the other, but in all cases I can think
    of, it would make *very little* difference, if any.

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).
    Hiding the top bit after rounded number is 0-gates.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Jun 20 13:48:20 2025
    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Fri Jun 20 18:47:53 2025
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were >>>>>> pretty far along the way (both were released in November 1995), so >>>>>> it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore >>>>>> it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    Hydro had 77K employees in 130 countries, there was no such thing as a
    simple hierarchical setup.

    Rather more like a loose federation across varying local environments.

    My point was that, even if the org chart shows a hierarchy, actual
    dynamics are _much_ more complex.

    That there are a lot of personal and group interests, a lot of
    communications upwards are targeted towards what people think the
    respective manager wants to hear. There are people trying to prove
    that their work is valuable (and most people think theirs is).
    Plus, there is a lot of talk like "X likes this" or "Y likes that",
    or "don't let management worry". Plus, there are managers who
    routinely ignore feedback because they don't have the spine to
    tell their own managers bad news, or who discourage it (later
    complaining that people don't inform them).

    One classic example was something that I only heard about,
    that was way above my level.

    At a meeting, some higher-up told his direct reports that anybody
    who said "Yes, but" would be removed from his leadership position,
    the only acceptable expression was "Yes, and". And guess what -
    said manager later sunk a huge project because, ahem, people didn't
    tell him the bad news, and ignoring a problem usually doesn't make
    it better.

    And now I'll stop before I get to the bad stuff :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Jun 20 20:46:44 2025
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?

    hidden = operand.exponent != 0

    Which is an 11-input NAND gate. I suspect you could assume it is 1
    and special case the result, but even special casing the result
    cannot be less than 1-gate (a multiplexer).

    You DO end up special casing Infinities and NaNs anyway.

    Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.
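
    In C, the same two tests applied while unpacking a binary64 (a sketch;
    the variable names are mine):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        int main(void) {
            double d = 0.75;
            uint64_t bits;
            memcpy(&bits, &d, sizeof bits);
            int exp     = (int)((bits >> 52) & 0x7FF);
            int hidden  = (exp != 0);       /* the 11-input "!= 0"   */
            int special = (exp == 0x7FF);   /* Inf/NaN: all-ones exp */
            uint64_t sig = ((uint64_t)hidden << 52)
                         | (bits & 0xFFFFFFFFFFFFFULL);
            printf("exp=%d hidden=%d special=%d sig=%014llx\n",
                   exp, hidden, special, (unsigned long long)sig);
            return 0;
        }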


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Jun 20 18:13:56 2025
    MitchAlsup1 [2025-06-20 20:46:44] wrote:
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:
    Creating the hidden bit is 2-gates of delay (H,F,D,Q).
    How come it's not free in hardware?
    Is it only because of denormalized?

    hidden = operand.exponent != 0

    IOW denormalized.
    I'd attribute the cost to "denormalized" rather than to "hidden bit", then 🙂

    You DO end up special casing Infinities and NaNs; anyway.

    Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.

    Right, but this one is quite different since the result doesn't depend
    on the actual numerical computation on the mantissa (I'd assume that the
    number of cycles (or gate delays) to determine the desired output in the
    case of Inf/NaN inputs is smaller than to compute an add/mul, so this
    test of those 11 bits is not on the critical path).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Sat Jun 21 01:48:51 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?

    hidden = operand.exponent != 0

    Which is an 11-input NAND gate. I suspect you could assume it is 1
    and special case the result, but even special casing the result
    cannot be less than 1-gate (a multiplexer).

    You DO end up special casing Infinities and NaNs; anyway.

    Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.

    But does an explicit bit lead to a difference? IIUC the FPU needs
    special cases anyway. I would guess that a normal/special flag
    could save some time, but once the FPU knows that it is dealing with
    normal numbers the hidden bit should be effectively free.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Waldek Hebisch on Sat Jun 21 02:51:25 2025
    On Sat, 21 Jun 2025 1:48:51 +0000, Waldek Hebisch wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?

    hidden = operand.exponent != 0

    Which is an 11-input NAND gate. I suspect you could assume it is 1
    and special case the result, but even special casing the result
    cannot be less than 1-gate (a multiplexer).

    You DO end up special casing Infinities and NaNs; anyway.

    Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.

    But does an explicit bit lead to a difference? IIUC the FPU needs
    special cases anyway. I would guess that a normal/special flag
    could save some time, but once the FPU knows that it is dealing with
    normal numbers the hidden bit should be effectively free.

    FADD/SUB starts out with an exponent subtract, so you have time to
    "invent" the hidden bit at almost zero cost.

    FMUL/MAC/DIV/SQRT starts out with a big multiplexer (float,double)
    that allows one to invent the hidden bit at near zero cost.

    FCMP does not need to "invent" the hidden bit because it is a sign-
    magnitude integer compare (with special cases).

    So, the cost is between actual zero and nearly zero if your
    circuit designer is worth being on the payroll.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Tue Jun 24 08:15:35 2025
    MitchAlsup1 wrote:
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?

        hidden = operand.exponent != 0

    Which is an 11-input NAND gate. I suspect you could assume it is 1
    and special case the result, but even special casing the result
    cannot be less than 1-gate (a multiplexer).

    You DO end up special casing Infinities and NaNs; anyway.

       Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.

    The way I understand hardware, you would do this in parallel with the
    regular fp op, so that all special inputs have a short-circuit result,
    and then a final mux which selects either the normal or the special result?

    I.e. only that single mux (one or two gate delays?) is part of the
    actual latency for fpu ops?


    This was the way I implemented FPU emulation on the Mill which has a
    "free" mux as one of the phases of all operations.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to BGB on Wed Jun 25 22:24:00 2025
    BGB <cr88192@gmail.com> wrote:

    But, if the usual claim is that it is N times faster, this would imply
    it is N times faster across the board, rather than "N times faster, but
    only if the logic happens to have lots of complex math expressions and similar."

    I have a contrived program which on machines from about 2010 peaked
    at about 10 MIPS; on earlier machines, starting from about 1990, it
    was closer to 2 MIPS. Basically the program is doing pointer chasing
    in a somewhat irregular pattern covering the whole memory. AFAICS it
    needed 2 RAM accesses per instruction (one for the second-level page
    table entry, one for the actual data); on modern machines with
    multilevel page tables it may be more (but modern machines tend to
    have quite large caches, and the few top levels of the page tables
    may fit in the on-chip cache).
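
    A minimal sketch in C of that kind of pointer chasing (the sizes and
    the mixing constant are arbitrary choices of mine, not the original
    program):

        #include <stdio.h>
        #include <stdlib.h>

        int main(void) {
            size_t n = (size_t)1 << 24;       /* larger than any cache */
            size_t *next = malloc(n * sizeof *next);
            if (!next) return 1;
            /* odd multiplier mod a power of two gives a permutation, so
               the chain wanders irregularly over the whole array */
            for (size_t i = 0; i < n; i++)
                next[i] = (i * 2654435761u + 1) & (n - 1);
            size_t i = 0;
            for (long s = 0; s < 100000000L; s++)
                i = next[i];          /* each load depends on the last */
            printf("%zu\n", i);       /* keep the chain observable */
            free(next);
            return 0;
        }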

    While this is a very unnatural program, it clearly shows that modern
    machines are fast only when caching/prefetching works as expected,
    and that badly behaving programs may be much slower than the execution
    speed of the core. And of course, within the core there are
    more factors that can cause slowdown.

    So any speed claims are probabilistic and implicitly or explicitly
    assume some program behaviour.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Waldek Hebisch on Wed Jun 25 23:00:14 2025
    On Wed, 25 Jun 2025 22:24:00 +0000, Waldek Hebisch wrote:

    BGB <cr88192@gmail.com> wrote:

    But, if the usual claim is that it is N times faster, this would imply
    it is N times faster across the board, rather than "N times faster, but
    only if the logic happens to have lots of complex math expressions and
    similar."

    I have a contrived program which on machines from about 2010 peaked
    at about 10 MIPS; on earlier machines, starting from about 1990, it
    was closer to 2 MIPS. Basically the program is doing pointer chasing
    in a somewhat irregular pattern covering the whole memory. AFAICS it
    needed 2 RAM accesses per instruction (one for the second-level page
    table entry, one for the actual data); on modern machines with
    multilevel page tables it may be more (but modern machines tend to
    have quite large caches, and the few top levels of the page tables
    may fit in the on-chip cache).

    In 1992 over the July 4th weekend, I forgot to stop a simulation
    of my 6-wide Mc 88120 processor running Matrix 300. The below
    numbers are for the 16KB DM cache:

    The first ~2B instructions ran at 5.98 IPC
    The next  ~2B instructions ran at 2.4 IPC
    The next  ~2B instructions ran at 0.6 IPC
    And the disk on the VAX ran out of space on the last transpose of
    Matrix300.

    When we traced it all down, it was due to TLB thrashing. Converting
    from a 64-entry FA TLB to a 256-entry DM TLB made the anomalies go
    away completely.

    While this is a very unnatural program, it clearly shows that modern
    machines are fast only when caching/prefetching works as expected,
    and that badly behaving programs may be much slower than the execution
    speed of the core. And of course, within the core there are
    more factors that can cause slowdown.

    Caches, TLBs, and predictors all have to work as anticipated, or
    performance drops precipitously.

    So any speed claims are probabilistic and implicitly or explicitly
    assume some program behaviour.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Sat Jun 28 14:32:44 2025
    On Thu, 12 Jun 2025 19:24:36 +0000, MitchAlsup1 wrote:
    On Thu, 12 Jun 2025 15:38:30 +0000, quadibloc wrote:

    VLIW, in the sense of the Itanium or the TMS 320C6000, offers the
    promise of achieving OoO level performance without the costs of OoO.

    Pick a VLIW that was successful like x86 or ARM in the marketplace.

    I now can see your point more clearly; I've researched the Itanium and the Intel i860 after watching a YouTube video on the i860 in order to update
    my web pages on the history of computers.

    Intel sold a few of its Touchstone Delta prototypes of its Paragon supercomputer to NASA and the like. This supercomputer was based on the
    Intel i860, and apparently it worked well enough there, having an
    appropriate instruction load.

    According to the video, the i860 failed because the interrupts that
    would be encountered frequently on a general-purpose computer severely
    degraded performance on the i860 with its long pipeline. Even CISC
    machines get longer and longer pipelines in order to improve performance;
    but then, Intel's Pentium IV _also_ failed in the marketplace, and was
    replaced by Intel Core microprocessors, which were an improved version of
    the Pentium III microarchitecture.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to quadibloc on Tue Jul 22 04:30:28 2025
    On Tue, 10 Jun 2025 22:53:27 +0000, quadibloc wrote:

    Include pairs of short instructions as part of the ISA, but make the
    short instructions 14 bits long instead of 15 so they get only 1/16 of
    the opcode space. This way, the compromise is placed in something that's
    less important. In the CISC mode, 17-bit short instructions will still
    be present, after all.

    After this change, I have been busily making minor tweaks to the ISA.

    The latest one involved a header format which allowed room for fourteen alternate 17-bit short instructions in a block, in order to permit
    a higher level of superscalar operation.

    I made opcode space for this header by using two opcodes from the standard memory-reference instruction set for it; they were the ones formerly used
    for load address and jump to subroutine with offset.

    I was not happy with doing this, however. Right now, I am engaging in a
    mighty struggle to squeeze the available opcode space to avoid doing this. However, try as I may, it may well be that the cost of this will turn out
    to be too great. But if I can manage it, a significant restructuring of
    the opcodes of this iteration of Concertina II may be coming soon.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to John Savard on Tue Jul 22 16:29:28 2025
    On Tue, 22 Jul 2025 04:30:28 +0000, John Savard wrote:

    However, try as I may, it may well be that the cost of this will turn
    out to be too great. But if I can manage it, a significant restructuring
    of the opcodes of this iteration of Concertina II may be coming soon.

    I have now revised my pages on Concertina II to reflect this latest
    change. Its most shocking result is that the three-operand arithmetic instructions in the basic 32-bit instruction set now only have six-bit
    opcodes. However, this didn't actually result in the omission of any
    useful instructions that had been defined for them when they had seven-bit opcodes.

    And the header mechanism, of course, allows the instruction set to be
    massively extended. Thus, I shouldn't really view this as an unacceptable
    cost requiring me to do a major rollback of the design... I think.

    But I'm not sure; cramming more and more stuff in has brought me to a point
    of being uneasy.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to quadibloc on Wed Jul 23 16:03:33 2025
    On Fri, 20 Jun 2025 19:46:42 +0000, quadibloc wrote:

    More importantly, I need 256-character strings if I'm using them as
    translate tables. Fine, I can use a pair of registers for a long string.

    I've realized now that I can have eight 256-character string registers
    if I instead use the extended register bank of 128 floating-point
    registers for the string registers; this provides another use for a
    set of registers that would otherwise be little used outside of VLIW
    code.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to Stephen Fuld on Thu Jul 24 10:47:09 2025
    On Sat, 14 Jun 2025 09:24:02 -0700, Stephen Fuld wrote:

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register.

    That's almost right. However, you can't really "BALR to the next
    instruction", because BALR is a register-to-register instruction.
    Therefore, it doesn't reference memory.

    It's the register-to-register version of BAL, the jump to subroutine instruction (Branch and Link), and because of that, it doesn't do any branching, and has no branch target.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Sat Jul 26 05:57:49 2025
    On Thu, 22 May 2025 17:42:14 +0000, MitchAlsup1 wrote:
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:

    What is Concertina 2?

    Roughly speaking, it is a design where most of the non-power of 2 data
    types are being supported {36-bits, 48-bits, 60-bits} along with the
    standard power of 2 lengths {8, 16, 32, 64}.

    As this is such a fondly remembered feature, I have finally gotten
    around to adding one header type to the ISA that enables it. I do,
    however, carefully note that this is a highly specialized feature,
    and thus it is not expected to be included in most implementations.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Sat Jul 26 06:14:47 2025
    On Thu, 22 May 2025 17:42:14 +0000, MitchAlsup1 wrote:

    This creates "interesting" situations with respect to instruction
    formatting and to the requirements of constants in support of those instructions; and interesting requirements in other areas of ISA.

    Oh, there are indeed challenges, but they're hardly insurmountable.

    Compilers are the obvious case. Since the instruction set is built
    around 32-bit instructions, obviously the architecture will need to
    be running in conventional mode for compilation.

    The data width is, of course, specified by the block header. It
    isn't a switchable mode. So a program can have memory allocated to
    it of different widths, put pointers to those regions of memory in
    different base registers, and include code operating on data of
    those various lengths.

    So the compiler can call subroutines designed to craft things like
    36-bit floats for inclusion in object modules, from data placed in
    registers by normal code.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Mon Jul 28 23:18:52 2025
    On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now while this
    makes the puny 16-register file seem larger,
    what VAX designers forgot, is that each address mode was an instruction
    in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    Fancy addressing modes certainly aren't _free_. However, they are,
    in my opinion, often cheaper than achieving the same thing with an
    extra instruction.

    So it makes sense to add an addressing mode _if_ what that addressing
    mode does is pretty common.

    That being said, though, designing a new machine today like the VAX
    would be a huge mistake.

    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to BGB on Fri Aug 1 04:42:28 2025
    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to Chris M. Thomasson on Fri Aug 1 04:31:00 2025
    On Tue, 17 Jun 2025 12:45:44 -0700, Chris M. Thomasson wrote:
    On 6/17/2025 10:59 AM, quadibloc wrote:

    So the fact that it uses 10x the electrical power, while only having 2x
    the raw power - for an embarrassingly parallel problem, which doesn't
    happen to be the one I need to solve - doesn't matter.

    Can you break your processing down into units that can be executed in parallel, or do you get into an interesting issue where step B cannot
    proceed until step A is finished?

    I'm assuming that the latter case is true often enough for real-world
    programs that out-of-order processors with massive overhead and power consumption are worth using instead of many small processors in
    parallel with greater throughput.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Savard on Fri Aug 1 05:03:07 2025
    John Savard <quadibloc@invalid.invalid> schrieb:
    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    Which is what everybody does. Loading a register with the address
    of a small array on the stack is a simple addition, usually one
    cycle latency. If the array came as an argument, it is (usually)
    in a register to start with. If you allocate the array dynamically,
    you get its address for free after the function call. If you have
    enough GP registers, chances are it will still be in a register;
    otherwise you can spill it to the stack and restore it, with restoring
    needing one L1 cache access.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Tue Aug 26 21:46:24 2025
    BGB <cr88192@gmail.com> posted:

    On 7/28/2025 6:18 PM, John Savard wrote:
    On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now while this
    makes the puny 16-register file seem larger,
    what VAX designers forgot, is that each address mode was an instruction
    in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    Fancy addressing modes certainly aren't _free_. However, they are,
    in my opinion, often cheaper than achieving the same thing with an
    extra instruction.

    So it makes sense to add an addressing mode _if_ what that addressing
    mode does is pretty common.


    The use of addressing modes drops off pretty sharply though.

    Like, if one could stat it out, one might see a static-use pattern
    something like:
    80%: [Rb+disp]
    15%: [Rb+Ri*Sc]
    3%: (Rb)+ / -(Rb)
    1%: [Rb+Ri*Sc+Disp]
    <1%: Everything else

    Since RISC-V only has [Rb+disp12], the other 20% take at least 2
    instructions. Simple math indicates this requires 1.2+
    instructions/mem-ref instead of 1.0 instructions/mem-ref. disp12 does
    not help either.
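
    The weighted average behind that 1.2 figure, as a one-liner in C
    (using the static-use percentages from the list above):

        #include <stdio.h>

        int main(void) {
            /* 80% of mem-refs need 1 instruction, ~20% need at least 2 */
            printf("instructions per mem-ref: %.2f\n",
                   0.80 * 1.0 + 0.20 * 2.0);   /* -> 1.20 */
            return 0;
        }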

    My 66000 does not have (Rb)+ or -(Rb), and most RISC machines don't
    either. On the other hand, I see more [Rb+Ri<<s+disp] than 1%--more
    like 3%-4%--this is partially due to using indexing rather than
    incrementation when doing loops:

        MOV   Ri,#0
        VEC   R15,{}
        LDD   R9,[R3,Ri<<3+disp]
        calc                       ; loop body
        LOOP  LT,Ri,#1,Rn
    instead of:
        MOV   Ri,#0
        LDA   R14,[R3+disp]
        VEC   R15,{}
        LDD   R9,(R14)+
        calc                       ; loop body
        LOOP  LT,Ri,#1,Rn
    {and the second loop has an additional ADD in it}

    Though, I am counting [PC+Disp] and [GP+Disp] as part of [Rb+Disp] here.

    Granted, the dominance of [Rb+Disp] does drop off slightly when
    considering dynamic instruction use. Part of it is due to the
    prolog/epilog sequences.

    I have a lot of [IP,DISP] due to the way the compiler places data.

    If one had instead used (SP)+ and -(SP) addressing for prologs and
    epilogs, then one might see around 20% or so going to these instead.
    Or, if one had PUSH/POP, to PUSH/POP.

    ENTER and EXIT compress prologues and epilogues to a single instruction
    each. They also have the option of placing the preserved registers in
    a place where the called subroutine cannot damage them.

    The discrepancy between static and dynamic instruction counts is then
    mostly due to things like loops and similar.

    Estimating the effect of loops in a compiler is hard, but I had noted
    that assuming a scale factor of around 1.5^D for loop nesting depth (D)
    seemed to be in the right area. Many loops end up unreached or only
    running a few times, so, perhaps counter-intuitively, it is often
    faster to assume that a loop body will likely only cycle 2 or 3 times
    rather than 100s or 1000s, and trying to aggressively optimize loops
    by assuming large N tends to be detrimental to performance.

    VAX compilers set the loop count = 10 and did OK for their era. A
    low count (like 10) balances the small loops (letters in a name)
    against the larger loops like Matrix300.

    Well, and at least thus far, profiler-driven optimization isn't really a thing in my case.


    -----------------------

    One could maybe argue for some LoadOp instructions, but even this is debatable. If the compiler is designed mostly for Load/Store, and the
    ISA has a lot of registers, the relative benefit of LoadOp is reduced.

    LoadOp being mostly a benefit if the value is loaded exactly once, and
    there is some other ALU operation or similar that can be fused with it.

    Practically, it limits the usefulness of LoadOp mostly to saving an instruction for things like:
    z=arr[i]+x;


    But, the relative incidence of things like this is low enough as to not
    save that much.

    The other thing is that one has to implement it in a way that does not increase pipeline length,

    This is the key point about LD-OPs: if you build a pipeline to support
    them, then you will suffer when the instruction stream is independent
    RISC-like instructions--conversely, if you build the pipeline for
    RISC-like instructions, LD-OPs take a penalty unless you buy off on
    medium OoO, at least.

    since if one makes the pipeline longer for the sake of LoadOp or
    OpStore, then this is likely to be a net negative for performance vs
    prioritizing Load/Store, unless the pipeline had already needed to be
    lengthened for other reasons.

    And thus, this is why RISC-machines largely avoid LD-OPs.

    One can be like, "But what if the local variables are not in registers?"
    but on a machine with 32 or 64 registers, most likely your local
    variable is already going to be in a register.

    So, the main potential merit of LoadOp is that it "doesn't hurt as bad
    on a register-starved machine".

    So does poking your eye with a hot knife.

    That being said, though, designing a new machine today like the VAX
    would be a huge mistake.

    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.


    Yeah.

    There are some living descendants of that family, but pretty much
    everything now is Reg/Mem or Load/Store with a greatly reduced set of addressing modes.


    John Savard


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Wed Aug 27 01:01:21 2025
    John Savard <quadibloc@invalid.invalid> posted:

    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That is actually what we did on the Mc88100, and while a lot better
    than just [Base+Disp] it is still not as good as [Rb+Ri<<s+Disp]; the
    latter saves instructions that merely create constants.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)