• Decrement And Branch

    From Lawrence D'Oliveiro@21:1/5 to All on Tue Aug 13 09:00:25 2024
    I thought loop-control instructions had fallen out of favour in the RISC
    era. But reading some IBM POWER (and PowerPC) docs has reminded me that
    that family does have such instructions. I don’t think any other RISC architecture does, though. POWER even has a special register (CTR, the “counter” register) for use with loop instructions, though it could also (along with LR, the “link” register) be used for indirect branches. (Obviously you need at least two registers with this property.)

    The original designers of POWER clearly thought there was a point to
    having such instructions; do you agree?

    The most common form of these will decrement the counter register, and
    only branch back to the top of the loop if the counter has not reached
    zero; if it is now zero, then fall through. However, the good old VAX (in
    its usual kitchen-sink fashion) had a whole set of variations, including
    one that decremented down to -1 instead of zero. And the Motorola 68000
    family only had the decrement down to -1 version.

    This seemed to mystify quite a few assembly-language programmers. I wonder
    why it wasn’t a more popular idea ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Aug 13 13:15:10 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    I thought loop-control instructions had fallen out of favour in the RISC
    era. But reading some IBM POWER (and PowerPC) docs has reminded me that
    that family does have such instructions. I don’t think any other RISC architecture does, though. POWER even has a special register (CTR, the “counter” register) for use with loop instructions, though it could also (along with LR, the “link” register) be used for indirect branches. (Obviously you need at least two registers with this property.)

    The original designers of POWER clearly thought there was a point to
    having such instructions; do you agree?

    The most common form of these will decrement the counter register, and
    only branch back to the top of the loop if the counter has not reached
    zero;

    PDP-11 SOB (Subtract One and Branch).

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Tue Aug 13 13:28:07 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    The original designers of POWER clearly thought there was a point to
    having such instructions; do you agree?

    Sure. The question is what it was. Maybe they wanted to look good on
    some kernels. In the same vein they also added loads and stores with
    update (i.e., autoincrement/decrement addressing), and in one version
    of the architecture reference manual I found the warning that these
    may be as slow as a separate load and update.

    AMD64 has LOOP. I looked at it here several times. Theoretically one
    can branch-predict it perfectly, but when I measured that <2016Jun16.103617@mips.complang.tuwien.ac.at> <2017Mar14.183125@mips.complang.tuwien.ac.at>, I found that they just
    use history-based branch prediction for these instructions like
    everybody else.

    I think that the major reason is that in an OoO CPU the OoO part would
    need to move the count to the front end, and either let the front end
    wait until that is done, or introduce some mechanism to let the front
    end run ahead and, when the count finally becomes available to the
    front end, update it to the right value where the front end is now.

    Moreover, at least some AMD64 CPUs take more cycles for a LOOP than
    for the equivalent "sub; jne" sequence <2017Mar15.141411@mips.complang.tuwien.ac.at>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Aug 13 17:18:13 2024
    On Tue, 13 Aug 2024 13:28:07 +0000, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    The original designers of POWER clearly thought there was a point to
    having such instructions; do you agree?

    Sure. The question is what it was. Maybe they wanted to look good on
    some kernels. In the same vein they also added loads and stores with
    update (i.e., autoincrement/decrement addressing), and in one version
    of the architecture reference manual I found the warning that these
    may be as slow as a separate load and update.

    AMD64 has LOOP. I looked at it here several times. Theoretically one
    can branch-predict it perfectly, but when I measured that <2016Jun16.103617@mips.complang.tuwien.ac.at> <2017Mar14.183125@mips.complang.tuwien.ac.at>, I found that they just
    use history-based branch prediction for these instructions like
    everybody else.

    I think that the major reason is that in an OoO CPU the OoO part would
    need to move the count to the front end, and either let the front end
    wait until that is done, or introduce some mechanism to let the front
    end run ahead and, when the count finally becomes available to the
    front end, update it to the right value where the front end is now.

    Actually that is not necessary, but there are additional advantages.

    Imagine a GBOoO machine with reservation stations and one runs into
    a recognizable loop. Once the RSs are set up, one turns off the FETCH
    stage, adds an increment to each station, and then each time the
    loop instruction is encountered, you just fire off the RSs again.
    This saves around 1/3 of the power being consumed at no loss in
    perf.

    Moreover, at least some AMD64 CPUs take more cycles for a LOOP than
    for the equivalent "sub; jne" sequence <2017Mar15.141411@mips.complang.tuwien.ac.at>

    - anton

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Aug 13 17:15:00 2024
    On Tue, 13 Aug 2024 9:00:25 +0000, Lawrence D'Oliveiro wrote:

    I thought loop-control instructions had fallen out of favour in the RISC
    era. But reading some IBM POWER (and PowerPC) docs has reminded me that
    that family does have such instructions. I don’t think any other RISC architecture does, though. POWER even has a special register (CTR, the “counter” register) for use with loop instructions, though it could also (along with LR, the “link” register) be used for indirect branches. (Obviously you need at least two registers with this property.)

    The original designers of POWER clearly thought there was a point to
    having such instructions; do you agree?

    Yes, there is a point !

    One can calculate ADD-CMP-BC in 1 gate delay longer than ADD. Thus,
    the loop instruction can perform 3 instructions for you.

    My 66000 has 3 looping instructions::
    a) for( ; i<max; i++),
    b) for( ; x != y; i++),
    c) for( ; i<max && x ; i++)
    With these almost every subroutine in /lib/str* and /lib/mem* vectorize.
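
    A minimal C sketch of the three loop shapes listed above, using routines in the spirit of /lib/mem* and /lib/str*. The function names are made up for illustration, and nothing here shows how My 66000 actually encodes or executes these loops; it only shows the source-level patterns.

```c
#include <stddef.h>

/* (a) counted loop: for (; i < max; i++) */
void fill_bytes(unsigned char *dst, unsigned char value, size_t max)
{
    for (size_t i = 0; i < max; i++)
        dst[i] = value;
}

/* (b) sentinel loop: for (; x != y; i++), where y is the terminating NUL */
size_t string_length(const char *s)
{
    size_t i = 0;
    for (; s[i] != '\0'; i++)
        ;
    return i;
}

/* (c) counted loop with early out: for (; i < max && x; i++) */
size_t bounded_copy(char *dst, const char *src, size_t max)
{
    size_t i = 0;
    for (; i < max && src[i] != '\0'; i++)
        dst[i] = src[i];
    return i;   /* bytes copied, terminator not included */
}
```

    Each function has a single exit test at the loop bottom, which is the property a combined loop instruction would need.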

    The most common form of these will decrement the counter register, and

    I made mine go in either direction by allowing a constant as the loop increment.

    only branch back to the top of the loop if the counter has not reached
    zero; if it is now zero, then fall through. However, the good old VAX (in
    its usual kitchen-sink fashion) had a whole set of variations, including
    one that decremented down to -1 instead of zero. And the Motorola 68000
    family only had the decrement down to -1 version.

    This seemed to mystify quite a few assembly-language programmers. I wonder
    why it wasn’t a more popular idea ...

    VVM is based entirely on LOOP[123], and the architectural semantics
    allows this to provide for vectorization and SIMDization. Thus, My 66000
    gets 2,000 instructions at the price of 2 actual instructions (4 if you
    are picky).
    A byte-copy loop can move 16 bytes per clock--effectively 40 instructions
    per clock (5/c if you could write it in 64-bit form--but you don't have
    to write it in 64-bit form to get 64-bit performance). The above is on
    an IO 1-wide machine. Multiply by 4 for the 6-wide OoO machine.

    The logic is simple--these are frequent enough to warrant "doing a bit
    more than 'nothing'" but not so much you crater the whole architecture.

  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Aug 13 22:00:12 2024
    On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:

    However, the good old VAX (in
    its usual kitchen-sink fashion) had a whole set of variations, including
    one that decremented down to -1 instead of zero. And the Motorola 68000 family only had the decrement down to -1 version.

    VAX example of how to use SOBGEQ instead of SOBGTR:

    movl «loop count», Rn
    br bottom_of_loop
    top_of_loop:
    .... body of loop ...
    bottom_of_loop:
    sobgeq Rn, top_of_loop

    Like I said, I wondered why this sort of thing wasn’t more common ...

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed Aug 14 01:33:32 2024
    On Tue, 13 Aug 2024 22:00:12 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:

    However, the good old VAX (in
    its usual kitchen-sink fashion) had a whole set of variations, including
    one that decremented down to -1 instead of zero. And the Motorola 68000
    family only had the decrement down to -1 version.

    VAX example of how to use SOBGEQ instead of SOBGTR:

    movl «loop count», Rn
    br bottom_of_loop
    top_of_loop:
    .... body of loop ...
    bottom_of_loop:
    sobgeq Rn, top_of_loop

    Like I said, I wondered why this sort of thing wasn’t more common ...

    Perhaps the RISC mantra has permeated the minds of ISA designers.

    Mark Horowitz: Decode should be as simple as possible.

    Albert Einstein: Everything should be as simple as possible,
    but no simpler.

    One of the above got it right...

  • From Lawrence D'Oliveiro@21:1/5 to All on Wed Aug 14 08:53:22 2024
    On Wed, 14 Aug 2024 01:33:32 +0000, MitchAlsup1 wrote:

    On Tue, 13 Aug 2024 22:00:12 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:

    However, the good old VAX (in its usual kitchen-sink fashion) had a
    whole set of variations, including one that decremented down to -1
    instead of zero. And the Motorola 68000 family only had the decrement
    down to -1 version.

    VAX example of how to use SOBGEQ instead of SOBGTR:

    movl «loop count», Rn
    br bottom_of_loop
    top_of_loop:
    .... body of loop ...
    bottom_of_loop:
    sobgeq Rn, top_of_loop

    Like I said, I wondered why this sort of thing wasn’t more common ...

    Perhaps the RISC mantra has permeated the minds of ISA designers.

    Would you prefer it with a decrement+separate conditional-jump instruction pair?

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Aug 14 09:10:01 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Like I said, I wondered why this sort of thing wasn't more common ...

    For the early RISCs, the pipeline was designed for early branch
    execution. Performing an ALU op before the branch did not fit that
    kind of pipeline.

    However, having a branch-and-subtract would have been possible. But
    how would that have interacted with the branch delay slots that many
    of them had? I guess one could perform the subtract before the
    instruction in the delay slot, and take the branch afterwards (if it
    is taken).

    So it would actually fit. Why was it not done? Maybe the idea was
    that induction-variable elimination would usually eliminate the
    subtract anyway, so why complicate the architecture with such an
    instruction?

    For over a decade, Intel decoders have decoded many sequences of ALU
    and branch instructions into one uop, so they can do at a
    microarchitectural level what you are asking about at the architecture
    level. Other microarchitectures have followed this pattern, and
    RISC-V seems to make a philosophy out of this.

    ARM A64 OTOH seems to put everything into an instruction that fits in
    32 bits, and while they have instructions (TBNZ and TBZ) that tests a
    specific bit in a register and branch if the bit is set or clear, they
    have not added a subtract-and-branch or branch-and-subtract
    instruction. Apparently the uses for such an instruction are not that frequent.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Aug 14 23:58:58 2024
    On Wed, 14 Aug 2024 09:10:01 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Like I said, I wondered why this sort of thing wasn't more common ...

    For the early RISCs, the pipeline was designed for early branch
    execution.

    Note that I was referring to the decrement-down-to-minus-1 form, as
    opposed to the decrement-down-to-zero form.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Aug 15 00:25:01 2024
    On Wed, 14 Aug 2024 23:58:58 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 14 Aug 2024 09:10:01 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Like I said, I wondered why this sort of thing wasn't more common ...

    For the early RISCs, the pipeline was designed for early branch
    execution.

    Note that I was referring to the decrement-down-to-minus-1 form, as
    opposed to the decrement-down-to-zero form.

    Once one has FMAC with 3 source operands, one has encoding to have
    ADD-CMP-BC as 1 instruction.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Aug 15 00:15:40 2024
    On Wed, 14 Aug 2024 8:53:22 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 14 Aug 2024 01:33:32 +0000, MitchAlsup1 wrote:

    On Tue, 13 Aug 2024 22:00:12 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:

    However, the good old VAX (in its usual kitchen-sink fashion) had a
    whole set of variations, including one that decremented down to -1
    instead of zero. And the Motorola 68000 family only had the decrement
    down to -1 version.

    VAX example of how to use SOBGEQ instead of SOBGTR:

    movl «loop count», Rn
    br bottom_of_loop
    top_of_loop:
    .... body of loop ...
    bottom_of_loop:
    sobgeq Rn, top_of_loop

    Like I said, I wondered why this sort of thing wasn’t more common ...

    Perhaps the RISC mantra has permeated the minds of ISA designers.

    Would you prefer it with a decrement+separate conditional-jump
    instruction pair?

    I have real LOOP instructions:: ADD-CMP-BC and access to constants
    so one can::
    ADD #{1,2,3...31}, ADD #-{1,2,3,...31}, ADD register,
    CMP #{1,2,3...31}, CMP #-{1,2,3...31}, CMP register,
    BC {EQ, NE, LE, LT, GE, GT, LO, LS, HI, HS}
    Any way you want.
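
    A rough C model of such an add-compare-branch step, to make the combinations above concrete. The type and function names are hypothetical, the signed/unsigned reading of LO, LS, HI and HS is an assumption borrowed from common usage, and this does not represent the real My 66000 encoding or semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition names follow the list in the post; the enum itself is made up. */
typedef enum { BC_EQ, BC_NE, BC_LE, BC_LT, BC_GE, BC_GT,
               BC_LO, BC_LS, BC_HI, BC_HS } loop_cond;

static bool cond_holds(loop_cond c, int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    switch (c) {
    case BC_EQ: return a == b;
    case BC_NE: return a != b;
    case BC_LE: return a <= b;
    case BC_LT: return a <  b;
    case BC_GE: return a >= b;
    case BC_GT: return a >  b;
    case BC_LO: return ua <  ub;   /* assumed: unsigned lower */
    case BC_LS: return ua <= ub;   /* assumed: unsigned lower or same */
    case BC_HI: return ua >  ub;   /* assumed: unsigned higher */
    case BC_HS: return ua >= ub;   /* assumed: unsigned higher or same */
    }
    return false;
}

/* One ADD-CMP-BC step: add the increment, compare with the limit, and
   report whether the branch back to the loop top would be taken. */
bool loop_step(int64_t *counter, int64_t increment, int64_t limit, loop_cond c)
{
    /* unsigned addition avoids signed-overflow undefined behaviour */
    *counter = (int64_t)((uint64_t)*counter + (uint64_t)increment);
    return cond_holds(c, *counter, limit);
}

/* Illustration helper: how many times would the loop body run? */
unsigned count_iterations(int64_t start, int64_t increment,
                          int64_t limit, loop_cond c)
{
    int64_t i = start;
    unsigned n = 0;
    do {
        n++;                       /* loop body would go here */
    } while (loop_step(&i, increment, limit, c));
    return n;
}
```

    For example, count_iterations(0, 1, 10, BC_LT) models an upward-counting loop and count_iterations(10, -1, 0, BC_GT) a downward-counting one.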

  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Aug 15 00:23:41 2024
    On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Like I said, I wondered why this sort of thing wasn't more common ...

    For the early RISCs, the pipeline was designed for early branch
    execution. Performing an ALU op before the branch did not fit that
    kind of pipeline.

    MIPS would disagree.

    However, having a branch-and-subtract would have been possible. But
    how would that have interacted with the branch delay slots that many
    of them had? I guess one could perform the subtract before the
    instruction in the delay slot, and take the branch afterwards (if it
    is taken).

    MIPS pipeline performed Branch Target Calculation by pasting bits
    from the instruction onto bits vacated from IP.

    Most of the rest of us performed BTC in the Decode stage of the
    pipeline.

    So it would actually fit. Why was it not done? Maybe the idea was
    that induction-variable elimination would usually eliminate the
    subtract anyway, so why complicate the architecture with such an
    instruction?

    For over a decade, Intel decoders have decoded many sequences of ALU
    and branch instructions into one uop, so they can do at a
    microarchitectural level what you are asking about at the architecture
    level. Other microarchitectures have followed this pattern, and
    RISC-V seems to make a philosophy out of this.

    On the Intel side they mostly depend on prediction.

    On the RISC-V side they mostly depend on fusion. As far as I understand,
    They only fuse pairs not ADD-CMP-BCs.

    ARM A64 OTOH seems to put everything into an instruction that fits in
    32 bits, and while they have instructions (TBNZ and TBZ) that tests a specific bit in a register and branch if the bit is set or clear, they
    have not added a subtract-and-branch or branch-and-subtract
    instruction. Apparently the uses for such an instruction are not that frequent.

    My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.

    I subscribe to the notion that what one can fit into an instruction
    should fit in an instruction--where I differ is access to constants
    as operands {immediates and displacements} of all convenient sizes;
    with the disclaimer that not everything should be an instruction.

    - anton

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Thu Aug 15 10:29:11 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Wed, 14 Aug 2024 09:10:01 GMT, Anton Ertl wrote:
    Like I said, I wondered why this sort of thing wasn't more common ...

    For the early RISCs, the pipeline was designed for early branch
    execution.

    Note that I was referring to the decrement-down-to-minus-1 form, as
    opposed to the decrement-down-to-zero form.

    I guess what you want to point out is that

    x = x-1
    if (x!=-1) goto ...

    is equivalent to

    flag = x!=0; x = x-1; if (flag) goto ...

    but in the latter the branch does not need to wait for the decrement
    to complete. As for x!=0 vs. x!=-1, the CPU may already have special
    circuits for x!=0.

    Ok, so this is not the reason for not having this instruction. Which
    leaves: It is not useful that often.
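
    The equivalence of the two forms can be checked with a small C sketch. The function names are invented for illustration, and the decrement is done in unsigned arithmetic so that wrap-around at the most negative value is well defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Form 1: decrement first, then test the new value against -1. */
bool dec_then_test(int32_t *x)
{
    *x = (int32_t)((uint32_t)*x - 1u);   /* wraps instead of overflowing */
    return *x != -1;
}

/* Form 2: test the old value against 0, then decrement. */
bool test_then_dec(int32_t *x)
{
    bool flag = (*x != 0);
    *x = (int32_t)((uint32_t)*x - 1u);
    return flag;
}

/* Returns true if both forms agree on every start value in [lo, hi]. */
bool forms_agree(int32_t lo, int32_t hi)
{
    for (int64_t v = lo; v <= hi; v++) {
        int32_t a = (int32_t)v, b = (int32_t)v;
        if (dec_then_test(&a) != test_then_dec(&b) || a != b)
            return false;
    }
    return true;
}
```

    In the second form the branch decision depends only on the old value, so it does not have to wait for the subtraction.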

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Aug 15 10:39:28 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Like I said, I wondered why this sort of thing wasn't more common ...

    For the early RISCs, the pipeline was designed for early branch
    execution. Performing an ALU op before the branch did not fit that
    kind of pipeline.

    MIPS would disagree.

    In nearly all of the MIPS history, there is no ALU op before the
    branch, only a comparison of two registers for equality. They revised
    the branches significantly in 2014, but that's not early MIPS, and by
    that time branch predictors were so good that resolving the branch one
    cycle later was not a big issue.

    MIPS pipeline performed Branch Target Calculation by pasting bits
    from the instruction onto bits vacated from IP.

    Conditional branches in MIPS are relative. Only J and JAL have this misfeature.

    For over a decade, Intel decoders have decoded many sequences of ALU
    and branch instructions into one uop, so they can do at a
    microarchitectural level what you are asking about at the architecture
    level. Other microarchitectures have followed this pattern, and
    RISC-V seems to make a philosophy out of this.

    On the Intel side they mostly depend on prediction.

    Every high-performance CPU depends on prediction. Your point is what?

    On the RISC-V side they mostly depend on fusion. As far as I understand,
    They only fuse pairs not ADD-CMP-BCs.

    RISC-V has compare-and-branch instructions; I don't know if any
    implementations fuse that with a preceding addition/subtraction, but
    if so, it's a fusion of a pair of instructions.

    As for only fusing pairs, one of the patterns, in a section called
    "Fusion Pair Candidates" Celio et al.
    <https://arxiv.org/pdf/1607.02318> give the sequence

    slli rd, rs1, {1,2,3}
    add rd, rd, rs2
    ld rd, 0(rd)

    However, as they point out, this may be the result of first pairing
    the first two instructions and then pairing the result with the third instruction.

    The paper does not describe any implementation that actually performs
    such instruction fusions, so any real implementation may perform the
    fusions shown there, or more or fewer fusion patterns.
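
    For readers less familiar with RISC-V assembly, the three-instruction sequence above corresponds to a scaled, indexed load. The following C sketch mirrors it step by step; the function names are made up, and whether any given compiler or core treats the two versions identically is an assumption.

```c
#include <stdint.h>

/* slli rd, rs1, k ; add rd, rd, rs2 ; ld rd, 0(rd) as three C steps */
uint64_t load_indexed_steps(uint64_t rs1, uint64_t rs2, unsigned k)
{
    uint64_t rd;
    rd = rs1 << k;                            /* slli rd, rs1, k  */
    rd = rd + rs2;                            /* add  rd, rd, rs2 */
    rd = *(const uint64_t *)(uintptr_t)rd;    /* ld   rd, 0(rd)   */
    return rd;
}

/* The source-level form such a sequence typically comes from. */
uint64_t load_indexed(const uint64_t *table, uint64_t index)
{
    return table[index];
}
```

    With 8-byte elements the shift amount is 3, which is one of the {1,2,3} cases in the pattern.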

    ARM A64 OTOH seems to put everything into an instruction that fits in
    32 bits, and while they have instructions (TBNZ and TBZ) that tests a
    specific bit in a register and branch if the bit is set or clear, they
    have not added a subtract-and-branch or branch-and-subtract
    instruction. Apparently the uses for such an instruction are not that
    frequent.

    My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.

    If an architecture has the 88000-style treatment of comparison results
    (fill a GPR with conditions, one bit per condition), instructions like
    TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
    register with NZCV flags for dealing with conditions, so what is TBNZ
    and TBZ used for on this architecture? Looking at a binary I have at
    hand, I see a lot of checking bit #63 and some checking of #31, #15,
    #7, i.e., checking for whether a 64-bit, ... 8-bit number is negative.
    There are also a number of uses coming from libgcc, e.g.,

    6f0a8: 37e001c3 tbnz w3, #28, 6f0e0 <__aarch64_sync_cache_range+0x50>
    6f0e8: 37e801e2 tbnz w2, #29, 6f124 <__aarch64_sync_cache_range+0x94>
    6f6dc: b7980b84 tbnz x4, #51, 6f84c <__addtf3+0x71c>
    6fb28: b79000a3 tbnz x3, #50, 6fb3c <__addtf3+0xa0c>
    6fc30: b79000a3 tbnz x3, #50, 6fc44 <__addtf3+0xb14>
    70248: b7980d02 tbnz x2, #51, 703e8 <__multf3+0x728>
    7036c: b79809a2 tbnz x2, #51, 704a0 <__multf3+0x7e0>
    70430: b77801a2 tbnz x2, #47, 70464 <__multf3+0x7a4>
    7048c: b79ffae2 tbnz x2, #51, 703e8 <__multf3+0x728>
    70498: b79ffa82 tbnz x2, #51, 703e8 <__multf3+0x728>

    The tf3 stuff probably is the implementation of long doubles. In any
    case, in this binary with 26473 instructions, there are 30 occurrences
    of tbnz and 41 of tbz, for a total of 71 (0.3% of static instruction
    count).
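
    The percentage can be reproduced from the counts given above; this is plain arithmetic wrapped in a hypothetical helper function.

```c
/* Static share of TBNZ/TBZ in the binary, using the counts quoted above. */
double tb_share_percent(void)
{
    const int tbnz = 30, tbz = 41, total_instructions = 26473;
    return 100.0 * (tbnz + tbz) / total_instructions;   /* about 0.27 */
}
```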

    Apparently the usefulness of decrement-and-branch is even lower.

    Certainly in my code most loops count upwards.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Aug 15 20:00:20 2024
    On Thu, 15 Aug 2024 10:39:28 +0000, Anton Ertl wrote:

    As for only fusing pairs, one of the patterns, in a section called
    "Fusion Pair Candidates" Celio et al.
    <https://arxiv.org/pdf/1607.02318> give the sequence

    slli rd, rs1, {1,2,3}
    add rd, rd, rs2
    ld rd, 0(rd)

    The second half of the title is:: "Removing ISA-bloat with Op-Fusion"

    And RISC-V ends up with over 448 instructions whereas My 66000 has but
    65.

    I wonder how much Ozempic they are taking....

  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Aug 15 20:02:09 2024
    On Thu, 15 Aug 2024 10:39:28 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.

    If an architecture has the 88000-style treatment of comparison results
    (fill a GPR with conditions, one bit per condition), instructions like
    TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
    register with NZCV flags for dealing with conditions, so what is TBNZ
    and TBZ used for on this architecture?

    if( x & (1<<7) )

    if( !(x & (1<<7)) )
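
    Written out as complete C functions, single-bit tests of this kind look as follows. The function names are illustrative, and whether a particular compiler turns them into TBNZ or TBZ on A64 depends on the compiler and options, so that part is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Single-bit tests of the kind shown above. */
bool bit7_set(uint64_t x)   { return (x & (UINT64_C(1) << 7)) != 0; }
bool bit7_clear(uint64_t x) { return !(x & (UINT64_C(1) << 7)); }

/* Sign checks look at the top bit of the operand width. */
bool is_negative64(int64_t x) { return ((uint64_t)x >> 63) != 0; }
bool is_negative8(int8_t x)   { return ((uint8_t)x >> 7) != 0; }
```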

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Aug 16 01:52:42 2024
    On Thu, 15 Aug 2024 10:29:11 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Note that I was referring to the decrement-down-to-minus-1 form, as
    opposed to the decrement-down-to-zero form.

    I guess what you want to point out is ...

    That the example I gave will correctly handle the case where the loop
    count is initially zero (fall out the bottom without executing the loop
    once), without the need for a separate test.
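
    A small C model of the two loop shapes makes the zero-count behaviour visible. The function names are invented, the register is modelled as a plain int, and only small non-negative counts are considered.

```c
/* SOBGTR-style loop: body first, then decrement and branch while > 0.
   A count of 0 still runs the body once unless guarded separately. */
unsigned run_sobgtr(int count)
{
    unsigned executions = 0;
    int r = count;
    do {
        executions++;          /* loop body */
        r--;
    } while (r > 0);           /* sobgtr Rn, top_of_loop */
    return executions;
}

/* SOBGEQ-style loop entered at the bottom, as in the VAX example. */
unsigned run_sobgeq(int count)
{
    unsigned executions = 0;
    int r = count;
    goto bottom_of_loop;       /* br bottom_of_loop */
top_of_loop:
    executions++;              /* loop body */
bottom_of_loop:
    r--;
    if (r >= 0)                /* sobgeq Rn, top_of_loop */
        goto top_of_loop;
    return executions;
}
```

    With a count of 0, run_sobgeq executes the body zero times while run_sobgtr executes it once; for positive counts both execute it exactly count times.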

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Fri Aug 16 05:23:30 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    And RISC-V ends up with over 448 instructions

    How do you count this? Looking at chapter 19 of https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf, I
    count for RV64G:

    47 RV32I
    15 RV64I additional instructions
    8 RV32M
    5 RV64M additional instructions
    11 RV32A
    11 RV64A additional instructions
    26 RV32F
    4 RV64F additional instructions
    26 RV32D
    6 RV64D additional instructions
    ---------------------------------
    159 RV64G
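
    The total can be rechecked by summing the per-extension counts in the order listed; the helper name is hypothetical.

```c
/* Per-extension counts as listed in the post above. */
int rv64g_instruction_count(void)
{
    const int counts[] = { 47, 15, 8, 5, 11, 11, 26, 4, 26, 6 };
    int total = 0;
    for (unsigned i = 0; i < sizeof counts / sizeof counts[0]; i++)
        total += counts[i];
    return total;   /* 159 */
}
```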

    whereas My 66000 has but 65.

    There are also One-instruction set computer designs <https://en.wikipedia.org/wiki/One-instruction_set_computer>, and by
    that metric they are the best, no?

    The main thing I dislike about Celio's talk and work is that he uses
    the same metric for advocating his approach without giving any reason
    why it should be relevant.

    He also makes the mistake of using instruction count for discerning
    between RISC and non-RISC (which would make the PDP-11, 6502 and
    probably 8086 more RISC than RV64G) instead of using John Mashey's
    approach of identifying common traits; and instruction count was not
    among the criteria that John Mashey identified as discerning between
    RISC and non-RISC (not surprising given non-RISCs like PDP-11).

    Patterson (who is also on that paper and who failed to define RISC
    when he wrote the papers that introduced the term) makes the same
    mistake when arguing for his vector approach (which, I think, resulted
    in RV64V) over the approach taken in, e.g., AVX512. So maybe Celio
    just was Patterson's voice in his talk, but he appeared to speak his conviction.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Aug 16 07:06:11 2024
    On Fri, 16 Aug 2024 05:23:30 GMT, Anton Ertl wrote:

    ... instruction count was not
    among the criteria that John Mashey identified as discerning between
    RISC and non-RISC (not surprising given non-RISCs like PDP-11).

    Why is that particular criterion, of all of them, in the name, then?

    At one point I thought it should be “IRSC”, for “Increased Register Set Computer” ...

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Fri Aug 16 07:43:31 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Fri, 16 Aug 2024 05:23:30 GMT, Anton Ertl wrote:

    ... instruction count was not
    among the criteria that John Mashey identified as discerning between
    RISC and non-RISC (not surprising given non-RISCs like PDP-11).

    Why is that particular criterion, of all of them, in the name, then?

    It is not. It's not Reduced InstructionS Computer, but "Reduced
    Instruction Set Computer", and Mashey argued convincingly that this
    should be read as "reduced-instruction set computer", not as "reduced instruction-set computer".

    If it was "reduced instruction-set computer", then the RISCs should
    have kept the VAX shift instruction, which shifted in either
    direction, depending on the sign of the shift count. Instead, RISCs
    generally split this instruction into a shift-left and shift-right
    instruction, increasing the instruction count.
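
    The difference can be sketched in C. This is a simplified model on 32-bit unsigned values with logical shifts only; the VAX instruction itself is an arithmetic shift with its own operand rules, and the function names here are made up.

```c
#include <stdint.h>

/* One operation, direction chosen by the sign of the count
   (simplified to a logical shift on 32 bits). */
uint32_t shift_by_signed_count(uint32_t value, int count)
{
    if (count >= 32 || count <= -32)
        return 0;                       /* everything shifted out */
    return count >= 0 ? value << count : value >> -count;
}

/* The split form typical of RISC instruction sets. */
uint32_t shift_left(uint32_t value, unsigned count)
{
    return count < 32 ? value << count : 0;
}

uint32_t shift_right(uint32_t value, unsigned count)
{
    return count < 32 ? value >> count : 0;
}
```

    The combined form needs a run-time test of the count's sign, while the split form fixes the direction in the opcode.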

    At one point I thought it should be “IRSC”, for “Increased Register Set Computer” ...

    This is one commonality of RISCs, but does not discern between RISCs
    like the original IBM 801 (16 registers) and ARM A32 on one hand, and
    S/360, VAX and AMD64 on the other hand (and especially not AMD64 with
    APX). In any case, number of registers certainly is one of the
    criteria that John Mashey uses, but he uses a number of criteria, and
    these work well for classifying architectures that he did not classify
    in his original postings
    <2024Jan12.145502@mips.complang.tuwien.ac.at>.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Aug 16 10:00:32 2024
    On Fri, 16 Aug 2024 5:23:30 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    And RISC-V ends up with over 448 instructions

    How do you count this? Looking at chapter 19 of https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf, I
    count for RV64G:

    https://en.wikipedia.org/wiki/RISC-V

    47 RV32I
    15 RV64I additional instructions
    8 RV32M
    5 RV64M additional instructions
    11 RV32A
    11 RV64A additional instructions
    26 RV32F
    4 RV64F additional instructions
    26 RV32D
    6 RV64D additional instructions
    43 B
    40 C
    187 V
    43 Zk
    15 H
    ---------------------------------
    159 RV64G
    492

    whereas My 66000 has but 65.

    There are also One-instruction set computer designs <https://en.wikipedia.org/wiki/One-instruction_set_computer>, and by
    that metric they are the best, no?

    Everything should be as simple as possible, but no simpler. A.E.

  • From MitchAlsup1@21:1/5 to quadibloc on Fri Aug 16 18:40:23 2024
    On Fri, 16 Aug 2024 10:54:43 +0000, quadibloc wrote:


    But if it's programmed in a higher-level language, usually what a loop construct does is not the same as what a loop instruction does, so the instruction is not used.

    I designed My 66000 LOOP instructions to cover 3 main cases:
    a) standard iterated loop, where the step can be + or -, constant or
       register, and the comparison can be any of the 10 integer CMPs
       against a constant or register.
    b) standard early-out iterated loop, e.g. strncpy()
    c) both
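
    For readers who want the loop shapes in source form, here is a minimal C
    sketch of a plain counted loop and of a strncpy()-style loop that is both
    counted and early-out. The function names are hypothetical, and how a
    compiler would map either loop onto a My 66000 LOOP instruction is not
    shown in the post, so nothing here should be read as that mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Counted loop: fixed step, trip count known on entry. */
void fill_zero(long *dst, size_t n)
{
    assert(dst != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = 0;
}

/* Counted loop with an early out, in the manner of strncpy():
 * stop copying at n characters or at the first NUL, whichever
 * comes first, then pad the rest of the buffer with NULs. */
char *copy_bounded(char *dst, const char *src, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);
    for (; i < n && src[i] != '\0'; i++)
        dst[i] = src[i];
    for (; i < n; i++)
        dst[i] = '\0';
    return dst;
}
```

    The first copy loop has two exit conditions: the count (i < n) and the
    data-dependent test (src[i] != '\0').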


    These LOOPs come with different execution semantics for the
    instructions inside:
    a) Cache allocation is relaxed when the loop is "long enough",
       so that vector strip-mines do not erase the current cache
       footprint.
    b) Multiple iterations can be performed simultaneously (SIMD).
    c) The width of execution is primarily the width of the cache
       port(s), not the register ports.

  • From Kent Dickey@21:1/5 to Anton Ertl on Mon Sep 9 03:31:00 2024
    In article <2024Aug15.123928@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Like I said, I wondered why this sort of thing wasn't more common ... [snip]
    My 66000 finds use cases all the time, and I also have Branch on bit
    instructions and have my CMP instructions build bit-vectors of outcomes.

    If an architecture has the 88000-style treatment of comparison results
    (fill a GPR with conditions, one bit per condition), instructions like
    TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
    register with NZCV flags for dealing with conditions, so what is TBNZ
    and TBZ used for on this architecture? Looking at a binary I have at
    hand, I see a lot of checking bit #63 and some checking of #31, #15,
    #7, i.e., checking for whether a 64-bit, ... 8-bit number is negative.
    There are also a number of uses coming from libgcc, e.g.,

    6f0a8: 37e001c3 tbnz w3, #28, 6f0e0 <__aarch64_sync_cache_range+0x50>
    6f0e8: 37e801e2 tbnz w2, #29, 6f124 <__aarch64_sync_cache_range+0x94>
    6f6dc: b7980b84 tbnz x4, #51, 6f84c <__addtf3+0x71c>
    6fb28: b79000a3 tbnz x3, #50, 6fb3c <__addtf3+0xa0c>
    6fc30: b79000a3 tbnz x3, #50, 6fc44 <__addtf3+0xb14>
    70248: b7980d02 tbnz x2, #51, 703e8 <__multf3+0x728>
    7036c: b79809a2 tbnz x2, #51, 704a0 <__multf3+0x7e0>
    70430: b77801a2 tbnz x2, #47, 70464 <__multf3+0x7a4>
    7048c: b79ffae2 tbnz x2, #51, 703e8 <__multf3+0x728>
    70498: b79ffa82 tbnz x2, #51, 703e8 <__multf3+0x728>

    The tf3 stuff probably is the implementation of long doubles. In any
    case, in this binary with 26473 instructions, there are 30 occurrences
    of tbnz and 41 of tbz, for a total of 71 (0.3% of static instruction
    count).
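
    The sign checks on bits #63, #31, #15 and #7 correspond to source-level
    tests such as "x < 0" on a 64-, 32-, 16- or 8-bit integer. The C sketch
    below writes the test as an explicit top-bit test to make the
    equivalence visible; the function names are made up, and whether a given
    compiler actually emits tbnz/tbz for it depends on the compiler and its
    options.

```c
#include <stdbool.h>
#include <stdint.h>

/* A two's-complement value is negative exactly when its top bit is set,
 * so "x < 0" can be decided by testing a single bit. */
bool is_negative64(int64_t x) { return (((uint64_t)x >> 63) & 1u) != 0; }
bool is_negative32(int32_t x) { return (((uint32_t)x >> 31) & 1u) != 0; }
bool is_negative16(int16_t x) { return (((uint16_t)x >> 15) & 1u) != 0; }
bool is_negative8(int8_t x)   { return (((uint8_t)x >> 7) & 1u) != 0; }
```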

    Apparently the usefulness of decrement-and-branch is even lower.

    Certainly in my code most loops count upwards.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    PA-RISC had "ADDIB,cond,n imm,reg,target". Add a 5-bit signed
    immediate to reg, and then branch on comparing the result to 0
    (effectively), allowing branching on <, <=, =, >, >=, overflow, carry,
    etc. And a non-immediate version ADDB. The target was +/-8KB.

    Really simple loops could be done with the loop operation in the delay
    slot of ADDIB.

    The HP C/C++ Compiler pretty much converted all for() loops to count down
    to 0, when it wasn't too awkward. So:

    for(i = 0; i < 100; i++) {
        array[i] = 0;
    }

    would be effectively transformed to:

    ptr = &array[0];
    for(i = 99; i >= 0; i--) {
        *ptr++ = 0;
    }

    Which becomes (PA-RISC lists the target register last and has delay
    slots; with nullification, a branch nullifies the next instruction if
    the branch is not taken):

          MOV        array,r8
          LDI        99,r9
    LOOP: ADDIB,>=,n -1,r9,LOOP    ; r9=r9-1. If r9 >= 0, jump to LOOP
          STD,ma     r0,8(r8)      ; (r8)=r0; r8=r8+8

    So it could use ADDIB for many "for" loops. The way nullification works,
    it works properly even if the loop should never execute. If r9 starts
    at 0, no STD will be done. There was no reason to change the source
    code, the compiler would do the transform for you. PA-RISC also had
    CMPIB which just does the compare and branch. ADDIB is a very simple instruction which costs very little to add, and saves 2 instructions for
    many loops (ADDI,CMP_0,Bcc -> ADDIB). I think it is a mistake for ARM to
    not have it. I see a lot of "ADD, CMP, Bcc" in ARM assembly code.
    To avoid inverting the counter, "ADD1CMPBcc" would ADD 1 to a counter,
    compare the counter to another register, and branch on condition.
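
    A minimal C model of the ADDIB loop above, assuming the semantics as
    described in this post: add the immediate, compare the result with 0,
    branch if the condition holds, and nullify the delay-slot store when the
    branch is not taken. The function names and the stores counter are made
    up for illustration. In this model the counter starts at the number of
    iterations wanted, so a count of 0 performs no store.

```c
#include <assert.h>
#include <stddef.h>

/* Count-up form, as a programmer would normally write it. */
void clear_up(long *array, long n)
{
    for (long i = 0; i < n; i++)
        array[i] = 0;
}

/* Count-down form modelled on the ADDIB loop. Returns the number of
 * stores performed so that the trip count can be checked. */
long clear_down(long *array, long n)
{
    long *ptr = array;   /* plays the role of r8 */
    long r9 = n;         /* plays the role of the counter register */
    long stores = 0;

    assert(array != NULL || n <= 0);
    while (--r9 >= 0) {  /* ADDIB,>=,n -1,r9,LOOP */
        *ptr++ = 0;      /* STD,ma in the delay slot; skipped on exit */
        stores++;
    }
    return stores;
}
```

    Decrementing before the test is what lets one add-and-branch replace the
    separate add, compare and conditional branch of the count-up form.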

    As for ARM TBNZ and TBZ, I see it used all the time in my code where I
    often use single bit flags in control variables:

    if(flags & FLAG_SPECIAL1) { // FLAG_SPECIAL1 = 0x40
        // Do "SPECIAL1" stuff
    }

    In one program I've written on ARM, 2.3% of all instructions are TBZ or
    TBNZ.
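
    A self-contained version of that flag-test pattern, as a sketch.
    FLAG_SPECIAL1 = 0x40 is the value from the snippet above; FLAG_SPECIAL2
    and the function classify() are hypothetical. Each "if" tests exactly
    one bit, which is the shape a compiler can turn into a single
    test-bit-and-branch; whether it does so depends on the compiler and its
    options.

```c
#include <stdint.h>

#define FLAG_SPECIAL1 0x40u  /* bit 6, value taken from the post */
#define FLAG_SPECIAL2 0x80u  /* bit 7, hypothetical second flag */

/* Returns a small bitmask describing which branches were taken:
 * bit 0 is set when FLAG_SPECIAL1 is set in flags,
 * bit 1 is set when FLAG_SPECIAL2 is clear in flags. */
unsigned classify(uint32_t flags)
{
    unsigned actions = 0;

    if (flags & FLAG_SPECIAL1)      /* branch on one bit being set */
        actions |= 1u;
    if (!(flags & FLAG_SPECIAL2))   /* branch on one bit being clear */
        actions |= 2u;
    return actions;
}
```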

    Kent
