• Re: Tonights Tradeoff - Carry and Overflow

    From MitchAlsup1@21:1/5 to Robert Finch on Sat Oct 12 18:50:35 2024
    On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:

    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

    add r9,r1,r5,r0
    addgc r13,r1,r5,r0
    add r10,r2,r6,r13
    addgc r13,r2,r6,r13
    add r11,r7,r3,r13
    addgc r13,r7,r3,r13
    add r12,r8,r4,r13

    My 66000 version::

    CARRY R8,{{IO}{IO}{IO}{O}}
    ADD R4,R12,R16
    ADD R5,R13,R17
    ADD R6,R14,R18
    ADD R7,R15,R19
    // R{8,7,6,5,4} contain the 257-bit result.

    256-bit add giving 257-bit result.

    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Sat Oct 12 23:28:26 2024
    On Sat, 12 Oct 2024 22:20:50 +0000, Robert Finch wrote:

    On 2024-10-12 4:14 p.m., BGB wrote:
    On 10/12/2024 1:50 PM, MitchAlsup1 wrote:
    On Sat, 12 Oct 2024 9:38:01 +0000, Robert Finch wrote:

    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to >>>> generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

        add r9,r1,r5,r0
        addgc r13,r1,r5,r0
        add r10,r2,r6,r13
        addgc r13,r2,r6,r13
        add r11,r7,r3,r13
        addgc r13,r7,r3,r13
        add r12,r8,r4,r13

    My 66000 version::

           CARRY   R8,{{IO}{IO}{IO}{O}}
           ADD     R4,R12,R16
           ADD     R5,R13,R17
           ADD     R6,R14,R18
           ADD     R7,R15,R19
                // R{8,7,6,5,4} contain the 257-bit result.

    256-bit add giving 257-bit result.

    BJX2 / XG2, assuming in-register (A/D=R4..R7, B=R20..R23):
      CLRT
      ADDC  R20, R4
      ADDC  R21, R5
      ADDC  R22, R6
      ADDC  R23, R7

    Or, D=R16..R19
      MOV.X R4, R16
      MOV.X R6, R18
      CLRT
      ADDC  R20, R16
      ADDC  R21, R17
      ADDC  R22, R18
      ADDC  R23, R19

    ADDC is itself mostly a holdover from SH.

    Could almost make sense to make it have a 3R form though and move it to
    updating SR.S instead, since SR.T is likely better left exclusively to
    predication (vs mostly predication, and obscure edge-case ops like ADDC/
    SUBC/ROTCL/...).

    Could almost add an ADDC.X op which operates 128 bits at a time, say:
      CLRT
      ADDC.X R4, R20, R16
      ADDC.X R6, R22, R18

    Except that it would be rarely used enough to make its existence
    debatable at best.



    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.

    BJX2 / XG2 has destroys the value of the one source operand, I noted the extra code to preserve the one operand. Is that only for the ADDC instruction?

    What is the limit on the My66000 CARRY modifier for the number of
    carries? Assuming the sequence is interruptible there must be a few bits
    of state that need to be preserved.

    CARRY casts its modification over 8 subsequent instructions using its
    16-bit immediate.

    I found incorporating modifiers have a tendency to turn my code into spaghetti. Maybe my grasp of implementation is not so great though.

    DECODE has a shift register to attach 2-bits to subsequent instructions
    each. However, the Rd provided by CARRY carries 64-bits from instruction
    to instruction--which makes 256×64 -bit multiplication straightforward.

    The add, addgc can execute at the same time. So, it is 4 clocks at the
    worst to add two 256-bit numbers. (The first / last instructions may
    execute at the same time as other instructions).
    I wanted to avoid using instruction modifiers and special flags
    registers as much as possible. It is somewhat tricky to have a carry
    flag in flight. Q+ is not very code dense, but the add can be done. It
    is also possible to put the carry bit in a predicate register.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Robert Finch on Tue Oct 15 09:49:42 2024
    Robert Finch wrote:
    On 2024-10-09 6:44 a.m., Robert Finch wrote:
    On 2024-10-05 5:43 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    On 2024-10-04 2:19 a.m., Anton Ertl wrote:
    4) Keep the flags results along with GPRs: have carry and overflow as >>>>> bit 64 and 65, N is bit 63, and Z tells something about bits 0-63.
    The advantage is that you do not have to track the flags separately
    (and, in case of AMD64, track each of C, O, and NZP separately), but >>>>> instead can use the RAT that is already there for the GPRs. You can >>>>> find a preliminary paper on that on
    <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
    ...
    One solution, not mentioned in your article, is to support arithmetic
    with two bits less than the number of bit a register can support, so
    that the carry and overflow can be stored. On a 64-bit machine have all >>>> operations use only 62-bits. It would solve the issue of how to load or >>>> store the carry and overflow bits associated with a register.

    Yes, that's a solution, but the question is how well existing software
    would react to having no int64_t (and equivalent types, such as long
    long), but instead an int62_t (or maybe int63_t, if the 64th bit is
    used for both signed and unsigned overflow, by having separate signed
    and unsigned addition etc.). I expect that such an architecture would
    have low acceptance. By contrast, in my paper I suggest an addition
    to existing 64-bit architectures that has fewer of the same
    disadvantages as the widely-used condition-code-register approach has,
    but still has a few of them.

    Sometimes
    arithmetic is performed with fewer bits, as for pointer representation. >>>> I wonder if pointer masking could somehow be involved. It may be useful >>>> to have a bit indicating the presence of a pointer. Also thinking of
    how
    to track a binary point position for fixed point arithmetic. Perhaps
    using the whole upper byte of a register for status/control bits
    would work.

    There are some extensions for AMD64 in that direction.

    It may be possible with Q+ to support a second destination register
    which is in a subset of the GPRs. For example, one of eight registers
    could be specified to holds the carry/overflow status. That effectively >>>> ties up a second ALU though as an extra write port is needed for the
    instruction.

    Needing only one write port is an advantage of my approach.

    - anton

    Been thinking some about the carry and overflow and what to do about
    register spills and reloads during expression processing. My thought
    was that on the machine with 256 registers, simply allocate a
    ridiculous number of registers for expression processing, for example
    25 or even 50. Then if the expression is too complex, have the
    compiler spit out an error message to the programmer to simplify the
    expression. Remnants of the ‘expression too complex’ error in BASIC.
    So, there are no spills or reloads during expression processing. I
    think the storextra / loadextra registers used during context
    switching would work okay. But in Q+ there are 256 regs which require
    eight storextra / loadextra registers. I think the store extra / load
    extra registers could be hidden in the context save and restore
    hardware. Not even requiring access via CSRs or whatever. I suppose
    context loads and stores could be done in blocks of 32 registers. An
    issue is that the load extra needs to be done before registers are
    loaded. So, the extra word full of carry/overflow bits would need to
    be fetched in a non-sequential fashion. Assuming for instance, that
    saving register values is followed by a save of the CO word. Then it
    is positioned wrong for a sequential load. It may be better to have
    the wrong position for a store, so loads can proceed sequentially.
    It strikes me that there is no real good solution, only perhaps an
    engineered one. Toyed with the idea of having 16 separate flags
    registers, but not liking that as a solution as much as the store/load
    extra.

    Another thought is to store additional info such as a CRC check of the
    register file on context save and restore.

    *****

    Finally wrote the SM to walk the ROB backwards and restore register
    mappings for a checkpoint restore. Cannot get Q+ to do more than light
    up one LED in SIM. Register values are not propagating properly.


    Mulled over carry and overflow in arithmetic operations. Looked at
    widening the datapath to 66-bits to hold carry and overflow bits.
    Thinking it may increase the size of the design by over 3% just to
    support carry and overflow. For now, an instruction, ADDGC, was added to generate the carry bit as a result. A 256-bit add looks like:

    ; 256 bit add
    ; A = r1,r2,r3,r4
    ; B = r5,r6,r7,r8
    ; S = r9,r10,r11,r12

    add r9,r1,r5,r0
    addgc r13,r1,r5,r0
    add r10,r2,r6,r13
    addgc r13,r2,r6,r13
    add r11,r7,r3,r13
    addgc r13,r7,r3,r13
    add r12,r8,r4,r13

    Not very elegant a solution, but it is simple. I think it requires
    minimal hardware. Three input ADD is already present and ADDGC just
    routes the carry bit to the output.

    I started with that ADDGC approach too.
    But if you want to both generate and propagate carrys in an integer
    register it implies a 3 source register ADD.
    Since a general 3 integer register ADD can generate 2 bits of carry-out,
    it implies two dest registers.
    Which is how I end up with various forms of double-wide adds.

    The carry bits propagate through back-to-back register forwarding.

    ; A = (r4,r3,r2,r1)
    ; B = (r8,r7,r6,r5)
    ; S = (r13,r12,r11,r10,r9)

    addw2 (r10,r9) = r1 + r5
    addw3 (r11,r10) = r2 + r6 + r10
    addw3 (r12,r11) = r3 + r7 + r11
    addw3 (r13,r12) = r4 + r8 + r12

    Unsigned carry out if r13 != 0.
    Signed overflow is still detected with the idiom
    overflow = (((r12 xor r4) and (r12 xor r8)) < 0);

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)