• The Impending Return of Concertina III

    From Quadibloc@21:1/5 to All on Tue Jan 23 04:07:50 2024
    As I have noted, the original Concertina architecture was not a
    serious proposal for a computer architecture, but merely a
    description of an architecture intended to illustrate how
    computers work.

    Concertina II was a step above that; somewhat serious, but
    not fully so; still too idiosyncratic to be taken seriously
    as an alternative.

    But in a discussion of Concertina II - or, rather, in a thread
    that started with Concertina II, but went on to discussing
    other things - it was noted that RISC-V is badly flawed.

    In that case, an alternative is needed. I need to go beyond
    Concertina II - with which I am satisfied now as meeting its
    goals, finally - to something that could be considered genuinely
    serious.

    At the moment, only a link to Concertina III is present on my
    main page; no content exists there yet.

    John Savard

  • From Quadibloc@21:1/5 to All on Tue Jan 23 06:41:20 2024
    On Tue, 23 Jan 2024 04:07:50 +0000, I wrote:

    At the moment, only a link to Concertina III is present on my
    main page; no content exists there yet.

    The first few pages, with diagrams of this ultimate simplification
    of Concertina II, are now present, starting at

    http://www.quadibloc.com/arch/ct19int.htm

    I've gone to 15-bit displacements, in order to avoid compromising
    addressing modes, while allowing 16-bit instructions without
    switching to an alternate instruction set.

    Possibly using only three base registers is also sufficiently
    non-violent to the addressing modes that I should have done that
    instead, so I will likely give consideration to that option in
    the days ahead.

    Unfortunately, since I have been convinced of the necessity of
    pseudo-immediate values, I could not get rid of block structure,
    which is, of course, as noted, the major impediment to this ISA
    being considered for widespread adoption.

    John Savard

  • From Quadibloc@21:1/5 to All on Tue Jan 23 09:50:29 2024
    On Tue, 23 Jan 2024 06:41:20 +0000, I wrote:

    I've gone to 15-bit displacements, in order to avoid compromising
    addressing modes, while allowing 16-bit instructions without
    switching to an alternate instruction set.

    Possibly using only three base registers is also sufficiently
    non-violent to the addressing modes that I should have done that
    instead, so I will likely give consideration to that option in
    the days ahead.

    I have indeed decided that using three base registers for the
    basic load-store instructions is much preferable to shortening the
    length of the displacement even by one bit.

    John Savard

  • From Quadibloc@21:1/5 to Robert Finch on Tue Jan 23 13:07:49 2024
    On Tue, 23 Jan 2024 07:06:47 -0500, Robert Finch wrote:

    Packing and unpacking DFP numbers does not take a lot of logic, assuming
    one of the common DPD packing methods.

    Well, I'm thinking of the method used by IBM. It is true that this
    method was designed to use a minimal amount of logic.

    The number of registers handling
    DFP values could be doubled if they were unpacked and packed for each operation.

    Not doubled, only increased from 24 to 32.

    Since DFP arithmetic has a high latency anyway; in Q+, for example,
    the DFP unit unpacks, performs the operation, then repacks the DFP
    number. So, registers only need be 128 bits wide.

    I don't believe in wasting any time. And the latency of DFP operations
    can be reduced; it is possible to design a Wallace Tree multiplier for
    BCD arithmetic.

    256 bits seems a little narrow for a vector register.

    The original Concertina architecture, which had short vector registers
    of that size, was designed before AVX-512 was invented. Rather than attempting to keep revising the size of the short vector registers to keep up, the
    ISA also includes long vector registers.

    These are patterned after the vector registers of the Cray I, and have room
    for 64 double-precision floating-point numbers each.
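
    As a size picture only (a sketch in C; these types say nothing
    about the actual register file organization):

    typedef struct { unsigned char b[32]; } short_vec_reg; /* 256 bits              */
    typedef struct { double d[64];        } long_vec_reg;  /* 64 x FP64 = 4096 bits */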

    I have seen several other architectures with vector registers
    supporting 16 or more 32-bit values, or a length of 512 bits. This
    is also the width of a typical cache line.

    Having the base register implicitly encoded in the instruction is a way
    to reduce the number of bits used to represent the base register.

    Instead of base registers, then, there would be a code segment register
    and a data segment register, like on x86. But then how do I access data belonging to another subroutine? Without variable length instructions,
    segment prefixes like on x86 aren't an option. (There actually are
    instruction prefixes in the ISA, but they're not intended to be
    _common_!)

    There seem to be a lot of different base register usages. Won't
    that make the compiler more difficult to write?

    I suppose it could. The idea is basically that a program would pick
    one memory model and stick with it - a normal program would use the
    base registers connected with 16-bit displacements for everything...
    except that, where different routines share access to a small area of
    memory, that pointer can be put in a base register used with 12-bit
    displacements.

    Does array addressing mode have memory indirect addressing? It seems
    like a complex mode to support.

    It does indeed use indirect addressing. The idea is that if your
    program has a large number of arrays which are over 64K in size,
    it shouldn't be necessary to either consume a base register for
    each array, or freshly load a base register with the array address
    every time it's referenced.

    Using the mode is simple enough; basically, the address in the
    instruction is effectively the name of the array instead of its
    address, and the array is indexed normally.

    Of course, there's the overhead of indirection on every access.
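
    To make the calculation concrete, here is a minimal C sketch of how
    I picture the effective-address computation for this mode (the
    function and parameter names are mine, purely for illustration):

    #include <stdint.h>
    #include <string.h>

    /* The displacement names a 64-bit pointer slot holding the array's
       absolute address; the array itself is then indexed normally.   */
    uint64_t array_mode_ea(const uint8_t *mem, uint64_t base_reg,
                           uint32_t disp, uint64_t index_reg,
                           unsigned scale /* element size in bytes */)
    {
        uint64_t array_base;                  /* the extra, indirect access */
        memcpy(&array_base, mem + base_reg + disp, sizeof array_base);
        return array_base + index_reg * scale; /* normal indexed access */
    }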

    So in Concertina II, I had added a new addressing mode which
    simply uses the same mechanism as immediate values to tack a
    64-bit absolute address on to an instruction. (Since it looks
    like a 64-bit number, the linking loader can relocate it.)
    That fancy feature, though, was too much complication for this
    stripped-down ISA.

    Block headers are tricky to use. They need to follow the output of the instructions in the assembler so that the assembler has time to generate
    the appropriate bits for the header. The entire instruction block needs
    to be flushed at the end of a function.

    I don't see an alternative, though, to block structure to allow instructions
    to have, in the instruction stream, immediate values of any length, and yet allow instructions to be rapidly decoded in parallel as if they were all
    32 bits long.

    And block structure also allows instruction parallelism to be explicitly indicated.
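
    To illustrate the general idea only - the layout below is purely
    hypothetical, not the actual Concertina III header format - a
    decoder could use one header word per block to mark which 32-bit
    slots hold instructions and which hold pseudo-immediate data, so
    every instruction slot can be located in a single pass:

    #include <stdint.h>

    #define SLOTS_PER_BLOCK 7   /* assumed: one header word + 7 slots */

    /* Collect pointers to the instruction slots of one block; slots
       whose header bit is set are skipped as constant data.          */
    static unsigned find_insn_slots(const uint32_t *block,
                                    const uint32_t *insns[SLOTS_PER_BLOCK])
    {
        uint32_t header = block[0];
        unsigned n = 0;
        for (unsigned i = 0; i < SLOTS_PER_BLOCK; i++)
            if (!(header & (1u << i)))
                insns[n++] = &block[1 + i];
        return n;
    }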

    If you decide not to use the block header feature, though, what you have
    left is still a perfectly good ISA. So people can support the architecture
    with a basic compiler which doesn't make full use of the chip's features,
    and then a fancier compiler which produces more optimal code can make the effort to handle the block headers.

    John Savard

  • From Quadibloc@21:1/5 to Quadibloc on Tue Jan 23 14:01:07 2024
    On Tue, 23 Jan 2024 13:07:49 +0000, Quadibloc wrote:

    So in Concertina II, I had added a new addressing mode which
    simply uses the same mechanism as immediate values to tack a
    64-bit absolute address on to an instruction. (Since it looks
    like a 64-bit number, the linking loader can relocate it.)
    That fancy feature, though, was too much complication for this
    stripped-down ISA.

    This discussion has convinced me that this addressing mode,
    although relegated to an alternate instruction set in Concertina II,
    is important enough for maximizing performance that it does need
    to be included in Concertina III, and the appropriate changes
    have been made.

    John Savard

  • From Quadibloc@21:1/5 to BGB on Tue Jan 23 21:00:01 2024
    On Tue, 23 Jan 2024 13:56:32 -0600, BGB wrote:

    Agreed. Would not be in favor of block-headers or block structuring.
    Linear instruction formats are preferable, ideally in 32-bit chunks.

    The good news is that, although Concertina III still has block structure,
    it gives you a choice. The ISA is similar to a RISC architecture, but
    with a number of added features, if you just use 32-bit instructions.

    On Concertina II, you need to use block structure for:

    - 17-bit instructions
    - Immediate constants other than 8-bit or 16-bit
    - Absolute array addresses
    - Instruction prefixes
    - Explicit indication of parallelism
    - Instruction predication

    On Concertina III, you need to use block structure for immediate
    constants other than 8-bit, but the 16-bit instructions and the
    absolute array addresses are available without block structure.

    As it stands, Concertina III doesn't have instruction predication at
    all; that is a deficiency I will need to see if I can remedy.

    John Savard

  • From MitchAlsup1@21:1/5 to BGB on Tue Jan 23 22:10:21 2024
    BGB wrote:

    On 1/23/2024 6:06 AM, Robert Finch wrote:


    IME, the main address modes are:
    (Rm, Disp) // ~ 66% +/- 10%
    (Rm, Ro*FixSc) // ~ 33% +/- 10%
    Where: FixSc matches the element size.
    Pretty much everything else falls into the noise.

    With dynamically linked libraries one needs:: k is constant at link time

    LD Rd,[IP,GOT[k]] // get a pointer to the external variable
    and
    CALX [IP,GOT[k]] // call external entry point

    But now that you have the above you can easily get::

    CALX [IP,Ri<<3,Table] // call indexed method
    // can also be used for threaded JITs
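
    In plain C terms (my own illustration; the table names and
    parameters here are hypothetical, not part of any particular ABI),
    those two accesses look like this:

    typedef int (*method_t)(void);

    /* got[] and method_table[] stand in for structures the dynamic
       linker / runtime would actually fill in.                      */
    int example(void **got, method_t *method_table, int k, int i)
    {
        /* LD Rd,[IP,GOT[k]]   -- load the address of an external object */
        int *ext = (int *)got[k];

        /* CALX [IP,Ri<<3,Table] -- indexed call through a method table  */
        return method_table[i]() + *ext;
    }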

    RISC-V only has the former, but kinda shoots itself in the foot:
    GCC is good at eliminating most SP relative loads/stores;
    That means, the nominal percentage of indexed is even higher...

    A funny thing happens when you get rid of the "extra instructions"
    most RISC ISAs cause you to have in your instruction stream::
    a) the number of instructions goes down
    b) you get rid of the easy instructions
    c) leaving all the complicated ones remaining

    As a result, the code is basically left doing excessive amounts of
    shifts and adds, which (vs BJX2) effectively dethrone the memory
    load/store ops for top-place.

    These are the easy instructions that are not necessary when the ISA
    is properly conceived.

    Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
    also shoots itself in the foot. Because, not only has one hit the limits
    of the ALU and LD/ST ops, there are no cheap fallbacks for intermediate
    range constants.

    My 66000 has constants of all sizes for all instructions.

    If my compiler, with its arguably poor optimizer and barely functional register allocation, is beating GCC for performance (when targeting
    RISC-V), I don't really consider this a win for some of RISC-V's design choices.

    When you benchmark against a strawman, cows get to eat.

    And, if GCC in its great wisdom, is mostly loading constants from memory (having apparently offloaded most of them into the ".data" section),
    this is also not a good sign.

    Loading constants:
    a) pollutes the data cache
    b) wastes energy
    c) wastes instructions

    Also, needing to use shift-pairs to sign and zero extend things is a bit
    weak as well, ...

    See cows eat above.



    Also, as a random annoyance, RISC-V's instruction layout is very
    difficult to decipher from a hexadecimal view. One basically needs to
    dump it in binary to make it viable to mentally parse and look up
    instructions, which sucks.

    When you consume 3/4ths of the instruction space for 16-bit
    instructions, you create stress in other areas of the ISA.
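
    For reference, a quick sketch of where the base 32-bit (R-type)
    fields sit - the helper below is mine, but the bit positions are
    the standard RISC-V ones. Several field boundaries fall mid-nibble,
    which is why the hex view is hard to read, and the low two bits
    carry the 16-bit/32-bit split (00/01/10 are compressed quadrants,
    11 means 32-bit):

    #include <stdint.h>
    #include <stdio.h>

    static void dump_rtype(uint32_t insn)
    {
        printf("quadrant %u\n",  insn        & 0x3);  /* bits  1:0  */
        printf("opcode   %u\n",  insn        & 0x7f); /* bits  6:0  */
        printf("rd       %u\n", (insn >> 7)  & 0x1f); /* bits 11:7  */
        printf("funct3   %u\n", (insn >> 12) & 0x07); /* bits 14:12 */
        printf("rs1      %u\n", (insn >> 15) & 0x1f); /* bits 19:15 */
        printf("rs2      %u\n", (insn >> 20) & 0x1f); /* bits 24:20 */
        printf("funct7   %u\n", (insn >> 25) & 0x7f); /* bits 31:25 */
    }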

  • From Quadibloc@21:1/5 to Quadibloc on Tue Jan 23 21:27:38 2024
    On Tue, 23 Jan 2024 21:00:01 +0000, Quadibloc wrote:

    the absolute array addresses are
    available without block structure.

    No; they may not be in an alternate instruction set, but
    they still are like pseudo-immediates, so they do need
    the block structure.

    John Savard

  • From Scott Lurndal@21:1/5 to Chris M. Thomasson on Wed Jan 24 14:45:34 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 1/23/2024 4:11 PM, Brian G. Lucas wrote:
    On 1/23/24 16:10, MitchAlsup1 wrote:

    When you benchmark against a strawman, cows get to eat.

    Not a farm boy I'll bet.  Cows eat hay, but not straw.

    https://en.wikipedia.org/wiki/Nord_and_Bert_Couldn%27t_Make_Head_or_Tail_of_It

    Although a strawman can be made from hay or leaves and twigs, or any
    other stuffing, straw, as a waste product from grain production,
    is traditional.

  • From MitchAlsup1@21:1/5 to BGB on Wed Jan 24 20:23:56 2024
    BGB wrote:

    On 1/23/2024 4:10 PM, MitchAlsup1 wrote:

    Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
    also shoots itself in the foot. Because, not only has one hit the
    limits of the ALU and LD/ST ops, there are no cheap fallbacks for
    intermediate range constants.

    My 66000 has constants of all sizes for all instructions.

    ------------------------
    And, if GCC in its great wisdom, is mostly loading constants from
    memory (having apparently offloaded most of them into the ".data"
    section), this is also not a good sign.

    Loading constants:
    a) pollutes the data cache
    b) wastes energy
    c) wastes instructions


    Yes.

    But, I guess it does improve code density in this case... Because the constants are "somewhere else" and thus don't contribute to the size of '.text'; the program just puts a few kB worth of constants into '.data' instead...

    Consider the store of a constant to a constant address::

    array[7] = bigFPconstant;

    RISC-V
    text
       auipc Ra,high(&bigFPconstant)
       ldd   Rd,[Ra+low(&bigFPconstant)]
       auipc Ra,high(&array+48)
       std   Rd,[Ra+low(&array+48)]
    data
       double bigFPconstant

    4 instructions 6 words of memory 2 registers

    My 66000:
    STD #bigFPconstant,[IP,,&array+48]

    1 instruction 4 words of memory all in .text 0 registers

    Also note: RISC-V has no real way to support 64-bit displacements other
    than resorting to LDs of pointers (ala GOT and similar).

    Does make the code density slightly less impressive.

    Granted, one can argue the same of prolog/epilog compression in my case:
    Save some space on prolog/epilog by calling or branching to prior
    versions (since the code to save and restore GPRs is fairly repetitive).

    ENTER and EXIT eliminate the additional control transfers and can allow
    FETCH of the return address to start before the restores are finished.

  • From MitchAlsup1@21:1/5 to BGB on Thu Jan 25 17:18:08 2024
    BGB wrote:

    On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
    BGB wrote:

    Granted, one can argue the same of prolog/epilog compression in my case:
    Save some space on prolog/epilog by calling or branching to prior
    versions (since the code to save and restore GPRs is fairly repetitive).

    ENTER and EXIT eliminate the additional control transfers and can allow
    FETCH of the return address to start before the restores are finished.

    Possible, but branches are cheaper to implement in hardware, and would
    have been implemented already...

    Are you intentionally misreading what I wrote ??

    There is a se

    Granted, it is a similar thing to the recent addition of a memcpy()
    slide for intermediate-sized memcpy.

    Where, if one expresses the slide in reverse order, copying any multiple
    of N bytes can be expressed as a branch into the slide (with less
    overhead than a loop).


    But, I guess in theory, the memcpy slide could be implemented in plain C
    with a switch.
    uint64_t *dst, *src;
    uint64_t li0, li1, li2, li3;
    ... copy final bytes ...
    switch(sz>>5)        /* each case copies 32 bytes, falling through */
    {
       ...
       case 2:
          li0=src[4]; li1=src[5];
          li2=src[6]; li3=src[7];
          dst[4]=li0; dst[5]=li1;
          dst[6]=li2; dst[7]=li3;
       case 1:
          li0=src[0]; li1=src[1];
          li2=src[2]; li3=src[3];
          dst[0]=li0; dst[1]=li1;
          dst[2]=li2; dst[3]=li3;
       case 0:
          break;
    }

    Like, in theory one could have a special hardware feature, but a plain software solution is reasonably effective.
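
    For what it's worth, here is a self-contained version of the same
    idea (the function name, the 32-byte step implied by sz>>5, and the
    two-case depth are my assumptions; the real slide has more cases):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy sz bytes, 32 at a time, by falling through the cases as in
       Duff's device; the sz & 31 tail would be handled separately.   */
    void memcpy_slide32(void *dstp, const void *srcp, size_t sz)
    {
        uint64_t *dst = dstp;
        const uint64_t *src = srcp;
        switch (sz >> 5)
        {
        case 2:
            dst[4] = src[4]; dst[5] = src[5];
            dst[6] = src[6]; dst[7] = src[7];
            /* fall through */
        case 1:
            dst[0] = src[0]; dst[1] = src[1];
            dst[2] = src[2]; dst[3] = src[3];
            /* fall through */
        case 0:
            break;
        }
    }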

  • From MitchAlsup1@21:1/5 to BGB on Thu Jan 25 17:26:50 2024
    BGB wrote:

    On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
    BGB wrote:

    Granted, one can argue the same of prolog/epilog compression in my case:
    Save some space on prolog/epilog by calling or branching to prior
    versions (since the code to save and restore GPRs is fairly repetitive).

    ENTER and EXIT eliminate the additional control transfers and can allow
    FETCH of the return address to start before the restores are finished.

    Possible, but branches are cheaper to implement in hardware, and would
    have been implemented already...

    Are you intentionally misreading what I wrote ??

    Epilogue is a sequence of loads leading to a jump to the return address.

    Your ISA cannot jump to the return address while performing the loads
    so FETCH does not get the return address and can't start fetching
    instructions until the jump is performed.

    Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
    the return address from the stack and fetch the instructions at the
    return address while still loading the preserved registers (that were
    saved) so that the instructions are ready for execution by the time
    the last LD is performed.

    In addition, if one is performing an EXIT and fetch runs into a CALL,
    it can fetch the called address, and if there is an ENTER instruction
    there, it can cancel the remainder of the EXIT and some of the ENTER,
    because the preserved registers are already on the stack where they
    are supposed to be.

    Doing these with STs and LDs cannot save those cycles.

    Granted, it is a similar thing to the recent addition of a memcpy()
    slide for intermediate-sized memcpy.

    Where, if one expresses the slide in reverse order, copying any multiple
    of N bytes can be expressed as a branch into the slide (with less
    overhead than a loop).


    But, I guess in theory, the memcpy slide could be implemented in plain C
    with a switch.
    uint64_t *dst, *src;
    uint64_t li0, li1, li2, li3;
    ... copy final bytes ...
    switch(sz>>5)        /* each case copies 32 bytes, falling through */
    {
       ...
       case 2:
          li0=src[4]; li1=src[5];
          li2=src[6]; li3=src[7];
          dst[4]=li0; dst[5]=li1;
          dst[6]=li2; dst[7]=li3;
       case 1:
          li0=src[0]; li1=src[1];
          li2=src[2]; li3=src[3];
          dst[0]=li0; dst[1]=li1;
          dst[2]=li2; dst[3]=li3;
       case 0:
          break;
    }

    Looks like Duff's device.

    But why not just::

    MM Rto,Rfrom,Rcount

    Like, in theory one could have a special hardware feature, but a plain software solution is reasonably effective.

  • From MitchAlsup1@21:1/5 to BGB on Thu Jan 25 21:25:26 2024
    BGB wrote:

    On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
    BGB wrote:

    On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
    BGB wrote:

    Granted, one can argue the same of prolog/epilog compression in my
    case:
    Save some space on prolog/epilog by calling or branching to prior
    versions (since the code to save and restore GPRs is fairly
    repetitive).

    ENTER and EXIT eliminate the additional control transfers and can allow
    FETCH of the return address to start before the restores are finished.

    Possible, but branches are cheaper to implement in hardware, and would
    have been implemented already...

    Are you intentionally misreading what I wrote ??


    ?? I don't understand.



    Epilogue is a sequence of loads leading to a jump to the return address.

    Your ISA cannot jump to the return address while performing the loads
    so FETCH does not get the return address and can't start fetching
    instructions until the jump is performed.


    You can put the load for the return address before the other loads.
    Then, if the epilog is long enough (so that this load is no longer in
    flight once it hits the final jump), the branch predictor will lead to
    it starting to load the post-return instructions before the jump is
    reached.

    Yes, you can read RA early.
    What you cannot do is JMP early so the FETCH stage fetches instructions
    at the return address early.
    {{If you JMP early, then the rest of the LDs won't happen}}

    This is likely a non-issue as I see it.

    It is only really an issue if one demands that reloading the return
    address be done as one of the final instructions in the epilog, and not
    one of the first instructions.

    I make no such demand--I merely demand the JMP RA is the last instruction.

    Granted, one would have to do it as one of the final ops, if it were implemented as a slide, but it is not. There are "practical reasons" why
    a slide would not be a workable strategy in this case.

    So, generally, these parts of the prolog/epilog sequences are emitted
    for every combination of saved/restored registers that has been
    encountered.

    Though, granted, when this is used, it does mean that any such function
    effectively needs two sets of stack-pointer adjustments:
    One set for the save/restore area (in the reused part);
    One set for the function itself (for its data and local/temporary
    variables and similar).


    Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
    the return address from the stack and fetch the instructions at the
    return address while still loading the preserved registers (that were
    saved) so that the instructions are ready for execution by the time
    the last LD is performed.

    In addition, If one is performing an EXIT and fetch runs into a CALL;
    it can fetch the Called address and if there is an ENTER instruction
    there, it can cancel the remainder of EXIT and cancel some of ENTER
    because the preserved registers are already on the stack where they are
    supposed to be.

    Doing these with STs and LDs cannot save those cycles.


    I don't see why not; the branch-predictor can still do its thing
    regardless of whether or not LD/ST ops were used.

    Consider::

    main:
    ...
    CALL funct1
    CALL funct2

    funct2:
    SUB Sp,SP,stackArea2
    ST R0,[SP,offset20]
    ST R30,[SP,offset230]
    ST R29,[SP,offset229]
    ST R28,[SP,offset228]
    ST R27,[SP,offset227]
    ST R26,[SP,offset226]
    ST R25,[SP,offset225]
    ...

    funct1:
    ...
    LD R0,[SP,offset10]
    LD R30,[SP,offset130]
    LD R29,[SP,offset129]
    LD R28,[SP,offset128]
    LD R27,[SP,offset127]
    LD R26,[SP,offset126]
    LD R25,[SP,offset125]
    LD R24,[SP,offset124]
    LD R23,[SP,offset123]
    LD R22,[SP,offset122]
    LD R21,[SP,offset121]
    ADD SP,SP,stackArea1
    JMP R0

    The above would have to observe that all offset1's are equal to all
    offset2's in order to short circuit the data movements. A single::

    LD R26,[SP,someotheroffset]

    ruins the short circuit.

    Whereas:

    funct2:
    ENTER R25,R0,stackArea2
    ...

    funct1:
    ...
    EXIT R21,R0,stackArea1

    will have registers R0,R25..R30 in the same positions on the stack
    guaranteed by ISA definition!!

  • From Quadibloc@21:1/5 to Quadibloc on Fri Jan 26 08:21:04 2024
    On Tue, 23 Jan 2024 09:50:29 +0000, Quadibloc wrote:

    I have indeed decided that using three base registers for the
    basic load-store instructions is much preferable to shortening the
    length of the displacement even by one bit.

    Another change has been made to Concertina III, based on the work
    done for Concertina IV. The instruction prefix has been eliminated
    as a possible meaning of the header word; instead, instruction
    predication can be specified by the header.

    John Savard

  • From MitchAlsup1@21:1/5 to Robert Finch on Fri Jan 26 21:30:58 2024
    Robert Finch wrote:

    On 2024-01-25 4:25 p.m., MitchAlsup1 wrote:


    Whereas:

    funct2:
         ENTER   R25,R0,stackArea2
         ...

    funct1:
         ...
         EXIT    R21,R0,stackArea1

    will have registers R0,R25..R30 in the same positions on the stack
    guaranteed by ISA definition!!

    I like the ENTER / EXIT instructions and the safe stack idea, and have
    incorporated them into Q+ as ENTER and LEAVE, since EXIT makes me think
    of program exit(). They can improve code density. I gather that the
    stack used for ENTER and EXIT is not the same stack as is available for
    the rest of the app. This means managing two stack pointers, the
    regular stack and the safe stack. Q+ could have the safe stack pointer
    as a register that is not even accessible by the app and not part of
    the GPR file.

    LEAVE has older x86 connotations, so I used a different word.

    Registers R16..R31 go on the safe stack (when enabled), addressed by SSP.
    Registers R01..R15 go on the regular stack, addressed by SP.

    When the safe stack is enabled, the return address goes directly onto
    the safe stack without passing through R0, and comes off the safe stack
    without passing through R0.

    SSP requires privilege to access.
    The safe stack pages are required to have RWE = 3'B000 rights; so SW
    cannot read or write these containers directly or indirectly.

    For ENTER/LEAVE Q+ has the number of registers to save specified as a
    four-bit number and saves only the saved registers, link register and
    frame pointer according to the ABI. So, “ENTER 3,64” will save s0 to s2,
    the frame pointer and link register, and allocate 64 bytes plus the
    return block on the stack. The return block contains the frame pointer,
    link register, and two slots that are zeroed out, intended for exception
    handlers. The saved registers are limited to s0 to s9.

    I specify start and stop registers in ENTER and EXIT. In addition, the
    16-bit immediate field is used to allocate/deallocate space other than
    the saved/restored registers. Since the stack is always doubleword
    aligned, the low-order 3 bits are used "for special things"::
    bit<0> decides if SP is saved on the stack (or not 99%)
    bit<1> decides if FP is saved and updated (or restored)
    bit<2> decides if a return is performed (used when SW walks a stack
    back when doing try-throw-catch stuff.)
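
    In C terms, that immediate decodes roughly like this (the field
    names are just for illustration; the encoding details beyond what
    is stated above are my assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    struct exit_imm {
        uint32_t extra_bytes;  /* space beyond the saved registers,
                                  doubleword aligned                 */
        bool     sp_on_stack;  /* bit<0>: SP saved/restored on stack */
        bool     fp_on_stack;  /* bit<1>: FP saved/updated, or
                                  restored                           */
        bool     do_return;    /* bit<2>: perform the return (clear
                                  when SW walks the stack for
                                  try-throw-catch)                   */
    };

    static struct exit_imm decode_exit_imm(uint16_t imm)
    {
        struct exit_imm d;
        d.sp_on_stack = imm & 1u;
        d.fp_on_stack = (imm >> 1) & 1u;
        d.do_return   = (imm >> 2) & 1u;
        d.extra_bytes = imm & ~7u;   /* low 3 bits freed by alignment */
        return d;
    }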

    I use the HoB of the register index to signal which stack pointer
    to select.

    Q+ also has PUSHA / POPA instructions to push or pop all the
    registers, meant for interrupt handlers. PUSH and POP instructions by
    themselves can push or pop up to five registers.

    By the time control arrives at interrupt dispatch, the old registers
    have been saved and the registers of the ISR have been loaded; so have
    ASID and ROOT, ... Thus an ISR can keep pointers in its register file
    to quicken access when invoked.

    Some thought has been given to modifying ENTER and LEAVE to support
    interrupt handlers, rather than having separate PUSHA / POPA
    instructions. ENTER 15,0 would save all the registers, and LEAVE 15,0
    would restore them all and return using an interrupt return.

  • From MitchAlsup1@21:1/5 to Robert Finch on Sat Jan 27 17:25:59 2024
    Robert Finch wrote:

    On 2024-01-26 11:10 p.m., BGB wrote:
    On 1/26/2024 10:58 AM, Robert Finch wrote:
    <snip>

    Admittedly, it can make sense for an ISA intended for higher-end
    hardware, but not necessarily for something intended to aim for
    hardware costs similar to those of an in-order RISC-V core.

    Once there is micro-code or a state machine to handle an instruction
    with multiple micro-ops, it is not that costly to add other operations.
    The Q+ micro-code costs something like < 1k LUTs. Many early micros
    used micro-code.

    The FMAC unit has a sequencer that performs FDIV, SQRT, and transcendental polynomials. The memory unit has a sequencer to perform LDM, STM, MM, and
    ENTER and EXIT.

    <snip>

    Q+ uses a 128-bit system bus; the bus tag is not the same tag as used
    for the cache. Q+ burst-loads the cache with 4 128-bit accesses for 512
    bits, and the 64B cache line is tagged with a single tag. The
    instruction / data cache controller takes care of adjusting the bus
    size between the cache and the system.

    A four (4) Beat burst is de rigueur for FPGA implementations.

    I think I suggested this before, and the idea got shot down, but I
    cannot find the post. The idea is mystery operations, where the opcode
    comes from a register value. I was thinking of adding an instruction
    modifier to do this. The instruction modifier would supply the opcode
    bits for the next instruction from a register value. This would only
    be applied to specific classes of instructions, in particular
    register-register operate instructions. Many of the register-register
    functions are not decoded until execute time. The function code is
    simply copied to the execution unit. It does not have to run through
    the decode and rename stage. I think this field could easily come from
    a register. Seems like it would be easy to update the opcode while the
    instruction is sitting in the reorder buffer.

    Classic 360 EXECUTE instruction ??
    Basically, it sounds dangerous. {Side channels in plenty}
