Forum: >>> Magnum BBS <<<

Decrement And Branch

From Lawrence D'Oliveiro@21:1/5 to All on Tue Aug 13 09:00:25 2024

I thought loop-control instructions had fallen out of favour in the RISC
era. But reading some IBM POWER (and PowerPC) docs has reminded me that
that family does have such instructions. I don’t think any other RISC architecture does, though. POWER even has a special register (CTR, the “counter” register) for use with loop instructions, though it could also (along with LR, the “link” register) be used for indirect branches. (Obviously you need at least two registers with this property.)

The original designers of POWER clearly thought there was a point to
having such instructions; do you agree?

The most common form of these will decrement the counter register, and
only branch back to the top of the loop if the counter has not reached
zero; if it is now zero, then fall through. However, the good old VAX (in
its usual kitchen-sink fashion) had a whole set of variations, including
one that decremented down to -1 instead of zero. And the Motorola 68000
family only had the decrement down to -1 version.

This seemed to mystify quite a few assembly-language programmers. I wonder
why it wasn’t a more popular idea ...

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Lawrence D'Oliveiro on Tue Aug 13 13:15:10 2024

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

I thought loop-control instructions had fallen out of favour in the RISC
era. But reading some IBM POWER (and PowerPC) docs has reminded me that
that family does have such instructions. I don’t think any other RISC architecture does, though. POWER even has a special register (CTR, the “counter” register) for use with loop instructions, though it could also (along with LR, the “link” register) be used for indirect branches. (Obviously you need at least two registers with this property.)

The original designers of POWER clearly thought there was a point to
having such instructions; do you agree?

The most common form of these will decrement the counter register, and
only branch back to the top of the loop if the counter has not reached
zero;

PDP-11 SOB (Subtract One and Branch).


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Tue Aug 13 13:28:07 2024

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

The original designers of POWER clearly thought there was a point to
having such instructions; do you agree?

Sure. The question is what it was. Maybe they wanted to look good on
some kernels. In the same vein they also added loads and stores with
update (i.e., autoincrement/decrement addressing), and in one version
of the architecture reference manual I found the warning that these
may be as slow as a separate load and update.

AMD64 has LOOP. I looked at it here several times. Theoretically one
can branch-predict it perfectly, but when I measured that <2016Jun16.103617@mips.complang.tuwien.ac.at> <2017Mar14.183125@mips.complang.tuwien.ac.at>, I found that they just
use history-based branch prediction for these instructions like
everybody else.

I think that the major reason is that in an OoO CPU the OoO part would
need to move the count to the front end, and either let the front end
wait until that is done, or introduce some mechanism to let the front
end run ahead and, when the count finally becomes available to the
front end, update it to the right value where the front end is now.

Moreover, at least some AMD64 CPUs take more cycles for a LOOP than
for the equivalent "sub; jne" sequence <2017Mar15.141411@mips.complang.tuwien.ac.at>

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>


From MitchAlsup1@21:1/5 to Anton Ertl on Tue Aug 13 17:18:13 2024

On Tue, 13 Aug 2024 13:28:07 +0000, Anton Ertl wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

The original designers of POWER clearly thought there was a point to
having such instructions; do you agree?

Sure. The question is what it was. Maybe they wanted to look good on
some kernels. In the same vein they also added loads and stores with
update (i.e., autoincrement/decrement addressing), and in one version
of the architecture reference manual I found the warning that these
may be as slow as a separate load and update.

AMD64 has LOOP. I looked at it here several times. Theoretically one
can branch-predict it perfectly, but when I measured that <2016Jun16.103617@mips.complang.tuwien.ac.at> <2017Mar14.183125@mips.complang.tuwien.ac.at>, I found that they just
use history-based branch prediction for these instructions like
everybody else.

I think that the major reason is that in an OoO CPU the OoO part would
need to move the count to the front end, and either let the front end
wait until that is done, or introduce some mechanism to let the front
end run ahead and, when the count finally becomes available to the
front end, update it to the right value where the front end is now.

Actually that is not necessary, but there are additional advantages.

Imagine a GBOoO machine with reservation stations and one runs into
a recognizable loop. Once the RSs are set up, one turns off the FETCH
stage, adds an increment to each station, and then each time the
loop instruction is encountered, you just fire off the RSs again.
This saves around 1/3 of the power being consumed at no loss in
perf.

Moreover, at least some AMD64 CPUs take more cycles for a LOOP than
for the equivalent "sub; jne" sequence <2017Mar15.141411@mips.complang.tuwien.ac.at>

- anton


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Aug 13 17:15:00 2024

On Tue, 13 Aug 2024 9:00:25 +0000, Lawrence D'Oliveiro wrote:

I thought loop-control instructions had fallen out of favour in the RISC
era. But reading some IBM POWER (and PowerPC) docs has reminded me that
that family does have such instructions. I don’t think any other RISC architecture does, though. POWER even has a special register (CTR, the “counter” register) for use with loop instructions, though it could also (along with LR, the “link” register) be used for indirect branches. (Obviously you need at least two registers with this property.)

The original designers of POWER clearly thought there was a point to
having such instructions; do you agree?

Yes, there is a point !

One can calculate ADD-CMP-BC in 1 gate delay longer than ADD. Thus,
the loop instruction can perform 3 instructions for you.

My 66000 has 3 looping instructions::
a) for( ; i<max; i++),
b) for( ; x != y; i++),
c) for( ; i<max && x ; i++)
With these almost every subroutine in /lib/str* and /lib/mem* vectorize.
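As a rough illustration only, the three loop shapes listed above can be written as plain C loops of the kind found in string and memory routines. The function names below are hypothetical, and the sketch says nothing about how My 66000 actually encodes or executes its LOOP instructions.

```c
#include <stddef.h>

/* (a) counted loop: for( ; i<max; i++) */
void fill_bytes(unsigned char *dst, unsigned char value, size_t max)
{
    for (size_t i = 0; i < max; i++)
        dst[i] = value;
}

/* (b) loop until a value matches: for( ; x != y; i++) */
size_t string_length(const char *s)
{
    size_t i = 0;
    for ( ; s[i] != '\0'; i++)
        ;                       /* nothing to do but advance i */
    return i;
}

/* (c) counted loop with an early exit: for( ; i<max && x ; i++) */
size_t copy_bounded(char *dst, const char *src, size_t max)
{
    size_t i = 0;
    for ( ; i < max && src[i] != '\0'; i++)
        dst[i] = src[i];
    return i;                   /* number of bytes copied */
}
```

Each loop has a single induction variable plus at most one data-dependent exit test, which is the pattern the three forms describe.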

The most common form of these will decrement the counter register, and

I made mine go in either direction by allowing a constant as the loop increment.

only branch back to the top of the loop if the counter has not reached
zero; if it is now zero, then fall through. However, the good old VAX (in
its usual kitchen-sink fashion) had a whole set of variations, including
one that decremented down to -1 instead of zero. And the Motorola 68000
family only had the decrement down to -1 version.

This seemed to mystify quite a few assembly-language programmers. I wonder
why it wasn’t a more popular idea ...

VVM is based entirely on LOOP[123], and the architectural semantics
allow this to provide for vectorization and SIMDization. Thus, My 66000
gets 2,000 instructions at the price of 2 actual instructions (4 if you
are picky).

A byte-copy loop can move 16 bytes per clock--effectively 40 instructions
per clock (5/c if you could write it in 64-bit form--but you don't have
to write it in 64-bit form to get 64-bit performance). The above is on
an IO 1-wide machine. Multiply by 4 for the 6-wide OoO machine.

The logic is simple--these are frequent enough to warrant "doing a bit
more than 'nothing'" but not so much you crater the whole architecture.


From Lawrence D'Oliveiro@21:1/5 to All on Tue Aug 13 22:00:12 2024

On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:

However, the good old VAX (in
its usual kitchen-sink fashion) had a whole set of variations, including
one that decremented down to -1 instead of zero. And the Motorola 68000 family only had the decrement down to -1 version.

VAX example of how to use SOBGEQ instead of SOBGTR:

movl «loop count», Rn
br bottom_of_loop
top_of_loop:
.... body of loop ...
bottom_of_loop:
sobgeq Rn, top_of_loop

Like I said, I wondered why this sort of thing wasn’t more common ...


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed Aug 14 01:33:32 2024

On Tue, 13 Aug 2024 22:00:12 +0000, Lawrence D'Oliveiro wrote:

On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:

However, the good old VAX (in
its usual kitchen-sink fashion) had a whole set of variations, including
one that decremented down to -1 instead of zero. And the Motorola 68000
family only had the decrement down to -1 version.

VAX example of how to use SOBGEQ instead of SOBGTR:

movl «loop count», Rn
br bottom_of_loop
top_of_loop:
.... body of loop ...
bottom_of_loop:
sobgeq Rn, top_of_loop

Like I said, I wondered why this sort of thing wasn’t more common ...

Perhaps the RISC mantra has permeated the minds of ISA designers.

Mark Horowitz: Decode should be as simple as possible.

Albert Einstein: Everything should be as simple as possible,
but no simpler.

One of the above got it right...


From Lawrence D'Oliveiro@21:1/5 to All on Wed Aug 14 08:53:22 2024

On Wed, 14 Aug 2024 01:33:32 +0000, MitchAlsup1 wrote:

On Tue, 13 Aug 2024 22:00:12 +0000, Lawrence D'Oliveiro wrote:

On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:

However, the good old VAX (in its usual kitchen-sink fashion) had a
whole set of variations, including one that decremented down to -1
instead of zero. And the Motorola 68000 family only had the decrement
down to -1 version.

VAX example of how to use SOBGEQ instead of SOBGTR:

movl «loop count», Rn
br bottom_of_loop
top_of_loop:
.... body of loop ...
bottom_of_loop:
sobgeq Rn, top_of_loop

Like I said, I wondered why this sort of thing wasn’t more common ...

Perhaps the RISC mantra has permeated the minds of ISA designers.

Would you prefer it with a decrement+separate conditional-jump instruction pair?


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Aug 14 09:10:01 2024

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

Like I said, I wondered why this sort of thing wasn't more common ...

For the early RISCs, the pipeline was designed for early branch
execution. Performing an ALU op before the branch did not fit that
kind of pipeline.

However, having a branch-and-subtract would have been possible. But
how would that have interacted with the branch delay slots that many
of them had? I guess one could perform the subtract before the
instruction in the delay slot, and take the branch afterwards (if it
is taken).

So it would actually fit. Why was it not done? Maybe the idea was
that induction-variable elimination would usually eliminate the
subtract anyway, so why complicate the architecture with such an
instruction?

For over a decade, Intel decoders have decoded many sequences of ALU
and branch instructions into one uop, so they can do at a
microarchitectural level what you are asking about at the architecture
level. Other microarchitectures have followed this pattern, and
RISC-V seems to make a philosophy out of this.

ARM A64 OTOH seems to put everything into an instruction that fits in
32 bits, and while they have instructions (TBNZ and TBZ) that test a
specific bit in a register and branch if the bit is set or clear, they
have not added a subtract-and-branch or branch-and-subtract
instruction. Apparently the uses for such an instruction are not that frequent.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Aug 14 23:58:58 2024

On Wed, 14 Aug 2024 09:10:01 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

Like I said, I wondered why this sort of thing wasn't more common ...

For the early RISCs, the pipeline was designed for early branch
execution.

Note that I was referring to the decrement-down-to-minus-1 form, as
opposed to the decrement-down-to-zero form.


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Aug 15 00:25:01 2024

On Wed, 14 Aug 2024 23:58:58 +0000, Lawrence D'Oliveiro wrote:

On Wed, 14 Aug 2024 09:10:01 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

Like I said, I wondered why this sort of thing wasn't more common ...

For the early RISCs, the pipeline was designed for early branch
execution.

Note that I was referring to the decrement-down-to-minus-1 form, as
opposed to the decrement-down-to-zero form.

Once one has FMAC with 3 source operands, one has encoding to have
ADD-CMP-BC as 1 instruction.


From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Aug 15 00:15:40 2024

On Wed, 14 Aug 2024 8:53:22 +0000, Lawrence D'Oliveiro wrote:

On Wed, 14 Aug 2024 01:33:32 +0000, MitchAlsup1 wrote:

On Tue, 13 Aug 2024 22:00:12 +0000, Lawrence D'Oliveiro wrote:

On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:

However, the good old VAX (in its usual kitchen-sink fashion) had a
whole set of variations, including one that decremented down to -1
instead of zero. And the Motorola 68000 family only had the decrement
down to -1 version.

VAX example of how to use SOBGEQ instead of SOBGTR:

movl «loop count», Rn
br bottom_of_loop
top_of_loop:
.... body of loop ...
bottom_of_loop:
sobgeq Rn, top_of_loop

Like I said, I wondered why this sort of thing wasn’t more common ...

Perhaps the RISC mantra has permeated the minds of ISA designers.

Would you prefer it with a decrement+separate conditional-jump
instruction pair?

I have real LOOP instructions:: ADD-CMP-BC and access to constants
so one can::
ADD #{1,2,3...31}, ADD #-{1,2,3,...31}, ADD register,
CMP #{1,2,3...31}, CMP #-{1,2,3...31}, CMP register,
BC {EQ, NE, LE, LT, GE, GT, LO, LS, HI, HS}
Any way you want.


From MitchAlsup1@21:1/5 to Anton Ertl on Thu Aug 15 00:23:41 2024

On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

Like I said, I wondered why this sort of thing wasn't more common ...

For the early RISCs, the pipeline was designed for early branch
execution. Performing an ALU op before the branch did not fit that
kind of pipeline.

MIPS would disagree.

However, having a branch-and-subtract would have been possible. But
how would that have interacted with the branch delay slots that many
of them had? I guess one could perform the subtract before the
instruction in the delay slot, and take the branch afterwards (if it
is taken).

MIPS pipeline performed Branch Target Calculation by pasting bits
from the instruction onto bits vacated from IP.

Most of the rest of us performed BTC in the Decode stage of the
pipeline.

So it would actually fit. Why was it not done? Maybe the idea was
that induction-variable elimination would usually eliminate the
subtract anyway, so why complicate the architecture with such an
instruction?

For over a decade, Intel decoders have decoded many sequences of ALU
and branch instructions into one uop, so they can do at a
microarchitectural level what you are asking about at the architecture
level. Other microarchitectures have followed this pattern, and
RISC-V seems to make a philosophy out of this.

On the Intel side they mostly depend on prediction.

On the RISC-V side they mostly depend on fusion. As far as I understand,
they only fuse pairs, not ADD-CMP-BCs.

ARM A64 OTOH seems to put everything into an instruction that fits in
32 bits, and while they have instructions (TBNZ and TBZ) that test a
specific bit in a register and branch if the bit is set or clear, they
have not added a subtract-and-branch or branch-and-subtract
instruction. Apparently the uses for such an instruction are not that frequent.

My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.

I subscribe to the notion that what one can fit into an instruction
should fit in an instruction--where I differ is access to constants
as operands {immediates and displacements} of all convenient sizes;
with the disclaimer that not everything should be an instruction.

- anton


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Thu Aug 15 10:29:11 2024

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Wed, 14 Aug 2024 09:10:01 GMT, Anton Ertl wrote:

Like I said, I wondered why this sort of thing wasn't more common ...

For the early RISCs, the pipeline was designed for early branch
execution.

Note that I was referring to the decrement-down-to-minus-1 form, as
opposed to the decrement-down-to-zero form.

I guess what you want to point out is that

x = x-1
if (x!=-1) goto ...

is equivalent to

flag = x!=0; x = x-1; if (flag) goto ...

but in the latter the branch does not need to wait for the decrement
to complete. As for x!=0 vs. x!=-1, the CPU may already have special
circuits for x!=0.

Ok, so this is not the reason for not having this instruction. Which
leaves: It is not useful that often.
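The equivalence stated above can be sketched in C. The two helper functions are hypothetical names chosen for illustration; each updates the counter and returns whether the branch would be taken. The sketch assumes the counter is not at the minimum representable value, so the decrement does not overflow.

```c
#include <stdbool.h>

/* Form 1: decrement first, then compare the new value with -1. */
bool branch_after_decrement(int *x)
{
    *x = *x - 1;
    return *x != -1;
}

/* Form 2: compare the old value with 0, then decrement.
   The branch decision does not depend on the result of the subtract. */
bool branch_before_decrement(int *x)
{
    bool flag = (*x != 0);
    *x = *x - 1;
    return flag;
}
```

Both functions leave the same value in the counter and return the same decision for any starting value within range.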

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>


From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Aug 15 10:39:28 2024

mitchalsup@aol.com (MitchAlsup1) writes:

On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

Like I said, I wondered why this sort of thing wasn't more common ...

For the early RISCs, the pipeline was designed for early branch
execution. Performing an ALU op before the branch did not fit that
kind of pipeline.

MIPS would disagree.

In nearly all of the MIPS history, there is no ALU op before the
branch, only a comparison of two registers for equality. They revised
the branches significantly in 2014, but that's not early MIPS, and by
that time branch predictors were so good that resolving the branch one
cycle later was not a big issue.

MIPS pipeline performed Branch Target Calculation by pasting bits
from the instruction onto bits vacated from IP.

Conditional branches in MIPS are relative. Only J and JAL have this misfeature.

For over a decade, Intel decoders have decoded many sequences of ALU
and branch instructions into one uop, so they can do at a
microarchitectural level what you are asking about at the architecture
level. Other microarchitectures have followed this pattern, and
RISC-V seems to make a philosophy out of this.

On the Intel side they mostly depend on prediction.

Every high-performance CPU depends on prediction. Your point is what?

On the RISC-V side they mostly depend on fusion. As far as I understand,
they only fuse pairs, not ADD-CMP-BCs.

RISC-V has compare-and-branch instructions; I don't know if any
implementations fuse that with a preceding addition/subtraction, but
if so, it's a fusion of a pair of instructions.

As for only fusing pairs: in a section called "Fusion Pair Candidates",
Celio et al. <https://arxiv.org/pdf/1607.02318> give, as one of the
patterns, the sequence

slli rd, rs1, {1,2,3}
add rd, rd, rs2
ld rd, 0(rd)

However, as they point out, this may be the result of first pairing
the first two instructions and then pairing the result with the third instruction.

The paper does not describe any implementation that actually performs
such instruction fusions, so any real implementation may perform the
fusions shown there, or more or fewer fusion patterns.
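For context, the slli/add/ld sequence quoted above has the shape of an indexed load from an array of 8-byte elements. A minimal C sketch follows; the function name is hypothetical, and whether a given RV64 compiler emits exactly that sequence for it is an assumption, not something the paper or this thread states.

```c
#include <stdint.h>

/* Load element i of an array of 64-bit values.
   The address is base + (i << 3); the load then reads 8 bytes from it,
   which matches the shift, add, load pattern with a shift amount of 3. */
int64_t load_indexed(const int64_t *base, uint64_t i)
{
    return base[i];
}
```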

ARM A64 OTOH seems to put everything into an instruction that fits in
32 bits, and while they have instructions (TBNZ and TBZ) that test a
specific bit in a register and branch if the bit is set or clear, they
have not added a subtract-and-branch or branch-and-subtract
instruction. Apparently the uses for such an instruction are not that
frequent.

My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.

If an architecture has the 88000-style treatment of comparison results
(fill a GPR with conditions, one bit per condition), instructions like
TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
register with NZCV flags for dealing with conditions, so what is TBNZ
and TBZ used for on this architecture? Looking at a binary I have at
hand, I see a lot of checking bit #63 and some checking of #31, #15,
#7, i.e., checking for whether a 64-bit, ... 8-bit number is negative.
There are also a number of uses coming from libgcc, e.g.,

6f0a8: 37e001c3 tbnz w3, #28, 6f0e0 <__aarch64_sync_cache_range+0x50>
6f0e8: 37e801e2 tbnz w2, #29, 6f124 <__aarch64_sync_cache_range+0x94>
6f6dc: b7980b84 tbnz x4, #51, 6f84c <__addtf3+0x71c>
6fb28: b79000a3 tbnz x3, #50, 6fb3c <__addtf3+0xa0c>
6fc30: b79000a3 tbnz x3, #50, 6fc44 <__addtf3+0xb14>
70248: b7980d02 tbnz x2, #51, 703e8 <__multf3+0x728>
7036c: b79809a2 tbnz x2, #51, 704a0 <__multf3+0x7e0>
70430: b77801a2 tbnz x2, #47, 70464 <__multf3+0x7a4>
7048c: b79ffae2 tbnz x2, #51, 703e8 <__multf3+0x728>
70498: b79ffa82 tbnz x2, #51, 703e8 <__multf3+0x728>

The tf3 stuff probably is the implementation of long doubles. In any
case, in this binary with 26473 instructions, there are 30 occurrences
of tbnz and 41 of tbz, for a total of 71 (0.3% of static instruction
count).

Apparently the usefulness of decrement-and-branch is even lower.

Certainly in my code most loops count upwards.
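The tests of bit #63, #31, #15 and #7 mentioned above amount to sign checks on two's-complement values. A small C sketch, with hypothetical helper names, shows the correspondence for the 64-bit and 8-bit cases:

```c
#include <stdbool.h>
#include <stdint.h>

/* Test a single bit of a 64-bit value, the operation TBNZ/TBZ branch on. */
bool bit_is_set(uint64_t x, unsigned bit)
{
    return ((x >> bit) & 1u) != 0;
}

/* A 64-bit two's-complement number is negative exactly when bit 63 is set. */
bool is_negative64(int64_t x)
{
    return bit_is_set((uint64_t)x, 63);
}

/* Likewise an 8-bit number is negative exactly when bit 7 is set. */
bool is_negative8(int8_t x)
{
    return bit_is_set((uint8_t)x, 7);
}
```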

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>


From MitchAlsup1@21:1/5 to Anton Ertl on Thu Aug 15 20:00:20 2024

On Thu, 15 Aug 2024 10:39:28 +0000, Anton Ertl wrote:

As for only fusing pairs, one of the patterns, in a section called
"Fusion Pair Candidates" Celio et al.
<https://arxiv.org/pdf/1607.02318> give the sequence

slli rd, rs1, {1,2,3}
add rd, rd, rs2
ld rd, 0(rd)

The second half of the title is:: "Removing ISA-bloat with Op-Fusion"

And RISC-V ends up with over 448 instructions whereas My 66000 has but
65.

I wonder how much Ozempic they are taking....


From MitchAlsup1@21:1/5 to Anton Ertl on Thu Aug 15 20:02:09 2024

On Thu, 15 Aug 2024 10:39:28 +0000, Anton Ertl wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.

If an architecture has the 88000-style treatment of comparison results
(fill a GPR with conditions, one bit per condition), instructions like
TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
register with NZCV flags for dealing with conditions, so what is TBNZ
and TBZ used for on this architecture?

if( x & (1<<7) )

if( !(x & (1<<7)) )


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Aug 16 01:52:42 2024

On Thu, 15 Aug 2024 10:29:11 GMT, Anton Ertl wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

Note that I was referring to the decrement-down-to-minus-1 form, as
opposed to the decrement-down-to-zero form.

I guess what you want to point out is ...

That the example I gave will correctly handle the case where the loop
count is initially zero (fall out the bottom without executing the loop
once), without the need for a separate test.
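To make the zero-count behaviour concrete, here is a small C model of the two loop arrangements. It is a sketch, not VAX code: the function names are hypothetical, and each function simply returns how many times the loop body would run for a given initial count.

```c
/* SOBGTR at the bottom of a loop that is entered from the top:
   decrement, then branch back while the counter is greater than zero. */
int body_runs_sobgtr(int count)
{
    int runs = 0;
    int rn = count;
    do {
        runs++;             /* body of loop */
        rn--;               /* SOBGTR: subtract one ... */
    } while (rn > 0);       /* ... and branch if greater than zero */
    return runs;
}

/* SOBGEQ with an initial branch to the bottom of the loop. */
int body_runs_sobgeq(int count)
{
    int runs = 0;
    int rn = count;
    goto bottom_of_loop;    /* br bottom_of_loop */
top_of_loop:
    runs++;                 /* body of loop */
bottom_of_loop:
    rn--;                   /* SOBGEQ: subtract one ... */
    if (rn >= 0)            /* ... and branch if greater than or equal to zero */
        goto top_of_loop;
    return runs;
}
```

With a count of zero the first form still runs the body once unless a separate test guards the loop, while the second form falls straight through.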


From Anton Ertl@21:1/5 to mitchalsup@aol.com on Fri Aug 16 05:23:30 2024

mitchalsup@aol.com (MitchAlsup1) writes:

And RISC-V ends up with over 448 instructions

How do you count this? Looking at chapter 19 of https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf, I
count for RV64G:

47 RV32I
15 RV64I additional instructions
8 RV32M
5 RV64M additional instructions
11 RV32A
11 RV64A additional instructions
26 RV32F
4 RV64F additional instructions
26 RV32D
6 RV64D additional instructions
---------------------------------
159 RV64G

whereas My 66000 has but 65.

There are also One-instruction set computer designs <https://en.wikipedia.org/wiki/One-instruction_set_computer>, and by
that metric they are the best, no?

The main thing I dislike about Celio's talk and work is that he uses
the same metric for advocating his approach without giving any reason
why it should be relevant.

He also makes the mistake of using instruction count for discerning
between RISC and non-RISC (which would make the PDP-11, 6502 and
probably 8086 more RISC than RV64G) instead of using John Mashey's
approach of identifying common traits; and instruction count was not
among the criteria that John Mashey identified as discerning between
RISC and non-RISC (not surprising given non-RISCs like PDP-11).

Patterson (who is also on that paper and who failed to define RISC
when he wrote the papers that introduced the term) makes the same
mistake when arguing for his vector approach (which, I think, resulted
in RV64V) over the approach taken in, e.g., AVX512. So maybe Celio
just was Patterson's voice in his talk, but he appeared to speak his conviction.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>


From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Fri Aug 16 07:06:11 2024

On Fri, 16 Aug 2024 05:23:30 GMT, Anton Ertl wrote:

... instruction count was not
among the criteria that John Mashey identified as discerning between
RISC and non-RISC (not surprising given non-RISCs like PDP-11).

Why is that particular criterion, of all of them, in the name, then?

At one point I thought it should be “IRSC”, for “Increased Register Set Computer” ...


From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Fri Aug 16 07:43:31 2024

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

On Fri, 16 Aug 2024 05:23:30 GMT, Anton Ertl wrote:

... instruction count was not
among the criteria that John Mashey identified as discerning between
RISC and non-RISC (not surprising given non-RISCs like PDP-11).

Why is that particular criterion, of all of them, in the name, then?

It is not. It's not Reduced InstructionS Computer, but "Reduced
Instruction Set Computer", and Mashey argued convincingly that this
should be read as "reduced-instruction set computer", not as "reduced instruction-set computer".

If it was "reduced instruction-set computer", then the RISCs should
have kept the VAX shift instruction, which shifted in either
direction, depending on the sign of the shift count. Instead, RISCs
generally split this instruction into a shift-left and shift-right
instruction, increasing the instruction count.
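The contrast can be sketched in C: one operation whose direction depends on the sign of the count, versus two fixed-direction operations. This is a hypothetical model for illustration, not the exact VAX instruction semantics; counts are assumed to lie in the range -31..31.

```c
#include <stdint.h>

/* One combined shift: a non-negative count shifts left,
   a negative count shifts right by the magnitude of the count. */
int32_t shift_signed_count(int32_t value, int count)
{
    if (count >= 0)
        return (int32_t)((uint32_t)value << count);
    /* Right-shifting a negative signed value is implementation-defined
       in C; most compilers perform an arithmetic shift. */
    return value >> -count;
}

/* The split form: two operations, each with a fixed direction. */
int32_t shift_left(int32_t value, unsigned count)
{
    return (int32_t)((uint32_t)value << count);
}

int32_t shift_right(int32_t value, unsigned count)
{
    return value >> count;
}
```

The combined form needs a run-time decision on the sign of the count, which the split form moves into the choice of instruction.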

At one point I thought it should be “IRSC”, for “Increased Register Set Computer” ...

This is one commonality of RISCs, but does not discern between RISCs
like the original IBM 801 (16 registers) and ARM A32 on one hand, and
S/360, VAX and AMD64 on the other hand (and especially not AMD64 with
APX). In any case, number of registers certainly is one of the
criteria that John Mashey uses, but he uses a number of criteria, and
these work well for classifying architectures that he did not classify
in his original postings
<2024Jan12.145502@mips.complang.tuwien.ac.at>.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>


From MitchAlsup1@21:1/5 to Anton Ertl on Fri Aug 16 10:00:32 2024

On Fri, 16 Aug 2024 5:23:30 +0000, Anton Ertl wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

And RISC-V ends up with over 448 instructions

How do you count this? Looking at chapter 19 of https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf, I
count for RV64G:

https://en.wikipedia.org/wiki/RISC-V

47 RV32I
15 RV64I additional instructions
8 RV32M
5 RV64M additional instructions
11 RV32A
11 RV64A additional instructions
26 RV32F
4 RV64F additional instructions
26 RV32D
6 RV64D additional instructions

43 B
40 C
187 V
43 Zk
15 H

---------------------------------
159 RV64G

492

whereas My 66000 has but 65.

There are also One-instruction set computer designs <https://en.wikipedia.org/wiki/One-instruction_set_computer>, and by
that metric they are the best, no?

Everything should be as simple as possible, but no simpler. A.E.


From MitchAlsup1@21:1/5 to quadibloc on Fri Aug 16 18:40:23 2024

On Fri, 16 Aug 2024 10:54:43 +0000, quadibloc wrote:

But if it's programmed in a higher-level language, usually what a loop construct does is not the same as what a loop instruction does, so the instruction is not used.

I designed My 66000 LOOP instructions to cover 3 main cases::
a) std iterated loop where iteration can be + or -, constant or
register, and comparison can be any of the 10 integer CMPs
against a constant or register.
b) std early out iterated loop:: strncpy()
c) both

These LOOPs come with different execution semantics of the insts
inside::
a) Cache allocation is relaxed when the loop is "long enough"
so that vector strip-mines do not erase the current cache
footprint.
b) Multiple iterations can be performed simultaneously (SIMD)
c) The width of execution is primarily the width of the cache
port(s) not the register ports.


From Kent Dickey@21:1/5 to Anton Ertl on Mon Sep 9 03:31:00 2024

In article <2024Aug15.123928@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:

Lawrence D'Oliveiro <ldo@nz.invalid> writes:

Like I said, I wondered why this sort of thing wasn't more common ... [snip]

My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.

If an architecture has the 88000-style treatment of comparison results
(fill a GPR with conditions, one bit per condition), instructions like
TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
register with NZCV flags for dealing with conditions, so what is TBNZ
and TBZ used for on this architecture? Looking at a binary I have at
hand, I see a lot of checking bit #63 and some checking of #31, #15,
#7, i.e., checking for whether a 64-bit, ... 8-bit number is negative.
There are also a number of uses coming from libgcc, e.g.,

6f0a8: 37e001c3 tbnz w3, #28, 6f0e0 <__aarch64_sync_cache_range+0x50>
6f0e8: 37e801e2 tbnz w2, #29, 6f124 <__aarch64_sync_cache_range+0x94>
6f6dc: b7980b84 tbnz x4, #51, 6f84c <__addtf3+0x71c>
6fb28: b79000a3 tbnz x3, #50, 6fb3c <__addtf3+0xa0c>
6fc30: b79000a3 tbnz x3, #50, 6fc44 <__addtf3+0xb14>
70248: b7980d02 tbnz x2, #51, 703e8 <__multf3+0x728>
7036c: b79809a2 tbnz x2, #51, 704a0 <__multf3+0x7e0>
70430: b77801a2 tbnz x2, #47, 70464 <__multf3+0x7a4>
7048c: b79ffae2 tbnz x2, #51, 703e8 <__multf3+0x728>
70498: b79ffa82 tbnz x2, #51, 703e8 <__multf3+0x728>

The tf3 stuff probably is the implementation of long doubles. In any
case, in this binary with 26473 instructions, there are 30 occurrences
of tbnz and 41 of tbz, for a total of 71 (0.3% of static instruction
count).

Apparently the usefulness of decrement-and-branch is even lower.

Certainly in my code most loops count upwards.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

PA-RISC had "ADDIB,cond,n imm,reg,target". Add a 5-bit signed
immediate to reg, and then branch on comparing the result to 0
(effectively), allowing branching on <, <=, =, >, >=, overflow, carry,
etc. And a non-immediate version ADDB. The target was +/-8KB.

Really simple loops could be done with the loop operation in the delay
slot of ADDIB.

The HP C/C++ Compiler pretty much converted all for() loops to count down
to 0, when it wasn't too awkward. So:

for(i = 0; i < 100; i++) {
array[i] = 0;
}

would be effectively transformed to:

ptr = &array[0];
for(i = 99; i >= 0; i--) {
*ptr++ = 0;
}

Which becomes (PA-RISC has target register listed last, and delay slots,
and nullification where on branches it nullifies next instruction if it
is not taken):

MOV array,r8
LDI 100,r9 ; 100 iterations: r9 is decremented before each test
LOOP: ADDIB,>=,n -1,r9,LOOP ; r9=r9-1. If r9 >= 0, jump to LOOP
STD,ma r0,8(r8) ; (r8)=r0; r8=r8+8

So it could use ADDIB for many "for" loops. The way nullification works,
it works properly even if the loop should never execute. If r9 starts
at 0, no STD will be done. There was no reason to change the source
code, the compiler would do the transform for you. PA-RISC also had
CMPIB which just does the compare and branch. ADDIB is a very simple instruction which costs very little to add, and saves 2 instructions for
many loops (ADDI,CMP_0,Bcc -> ADDIB). I think it is a mistake for ARM to
not have it. I see a lot of "ADD, CMP, Bcc" in ARM assembly code.
To avoid inverting the counter, "ADD1CMPBcc" would ADD 1 to a counter,
compare the counter to another register, and branch on condition.
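The behaviour of the two-instruction loop can be modelled in C. This is a sketch under the assumptions stated in the post: ADDIB adds the immediate and branches on the result compared with zero, and the ",n" completer nullifies the delay-slot instruction when the backward branch is not taken. The function name is hypothetical; it returns the number of stores performed for a given starting counter.

```c
#include <stddef.h>
#include <stdint.h>

size_t addib_loop(int64_t *array, long count)
{
    int64_t *r8 = array;    /* store pointer */
    long r9 = count;        /* loop counter */
    size_t stores = 0;

    for (;;) {
        r9 += -1;           /* ADDIB: add the immediate -1 to r9 */
        if (!(r9 >= 0))     /* condition ">=" tested on the new value */
            break;          /* not taken: delay slot nullified, fall through */
        *r8++ = 0;          /* delay slot: store zero, post-increment pointer */
        stores++;
        /* branch taken: back to LOOP */
    }
    return stores;
}
```

In this model a starting counter of n gives exactly n stores, and a starting counter of 0 gives none, which is the "works even if the loop should never execute" property described above.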

As for ARM TBNZ and TBZ, I see it used all the time in my code where I
often use single bit flags in control variables:

if(flags & FLAG_SPECIAL1) { // FLAG_SPECIAL1 = 0x40
// Do "SPECIAL1" stuff
}

In one program I've written on ARM, 2.3% of all instructions are TBZ or
TBNZ.

Kent

