I thought loop-control instructions had fallen out of favour in the RISC
era. But reading some IBM POWER (and PowerPC) docs has reminded me that
that family does have such instructions. I don’t think any other RISC architecture does, though. POWER even has a special register (CTR, the “counter” register) for use with loop instructions, though it could also (along with LR, the “link” register) be used for indirect branches. (Obviously you need at least two registers with this property.)
The original designers of POWER clearly thought there was a point to
having such instructions; do you agree?
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
The original designers of POWER clearly thought there was a point to
having such instructions; do you agree?
Sure. The question is what it was. Maybe they wanted to look good on
some kernels. In the same vein they also added loads and stores with
update (i.e., autoincrement/decrement addressing), and in one version
of the architecture reference manual I found the warning that these
may be as slow as a separate load and update.
AMD64 has LOOP. I looked at it here several times. Theoretically one
can branch-predict it perfectly, but when I measured that <2016Jun16.103617@mips.complang.tuwien.ac.at> <2017Mar14.183125@mips.complang.tuwien.ac.at>, I found that they just
use history-based branch prediction for these instructions like
everybody else.
I think that the major reason is that in an OoO CPU the OoO part would
need to move the count to the front end, and either let the front end
wait until that is done, or introduce some mechanism to let the front
end run ahead and, when the count finally becomes available to the
front end, update it to the right value where the front end is now.
Moreover, at least some AMD64 CPUs take more cycles for a LOOP than
for the equivalent "sub; jne" sequence <2017Mar15.141411@mips.complang.tuwien.ac.at>
- anton
I thought loop-control instructions had fallen out of favour in the RISC
era. But reading some IBM POWER (and PowerPC) docs has reminded me that
that family does have such instructions. I don’t think any other RISC architecture does, though. POWER even has a special register (CTR, the “counter” register) for use with loop instructions, though it could also (along with LR, the “link” register) be used for indirect branches. (Obviously you need at least two registers with this property.)
The original designers of POWER clearly thought there was a point to
having such instructions; do you agree?
The most common form of these will decrement the counter register, and
only branch back to the top of the loop if the counter has not reached
zero; if it is now zero, then fall through. However, the good old VAX (in
its usual kitchen-sink fashion) had a whole set of variations, including
one that decremented down to -1 instead of zero. And the Motorola 68000
family only had the decrement down to -1 version.
This seemed to mystify quite a few assembly-language programmers. I wonder
why it wasn’t a more popular idea ...
On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:
However, the good old VAX (in
its usual kitchen-sink fashion) had a whole set of variations, including
one that decremented down to -1 instead of zero. And the Motorola 68000
family only had the decrement down to -1 version.
VAX example of how to use SOBGEQ instead of SOBGTR:
movl «loop count», Rn
br bottom_of_loop
top_of_loop:
.... body of loop ...
bottom_of_loop:
sobgeq Rn, top_of_loop
Like I said, I wondered why this sort of thing wasn’t more common ...
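To make the difference concrete, here is a minimal Python sketch of the two loop shapes: the SOBGEQ form above (enter at the bottom, branch while the decremented count is >= 0) and the plain SOBGTR form (body first, branch while the decremented count is > 0). The function names are made up for illustration, and the model ignores 32-bit overflow.

```python
def run_sobgeq_loop(count):
    """Model: movl count,Rn; br bottom; top: body; bottom: sobgeq Rn,top."""
    rn = count
    seen = []            # values of Rn observed inside the loop body
    while True:
        rn -= 1          # SOBGEQ: subtract one ...
        if rn < 0:       # ... and branch back only if the result is >= 0
            break
        seen.append(rn)  # loop body
    return seen, rn


def run_sobgtr_loop(count):
    """Model: movl count,Rn; top: body; sobgtr Rn,top (no initial branch)."""
    rn = count
    seen = []
    while True:
        seen.append(rn)  # loop body runs before the first test
        rn -= 1          # SOBGTR: subtract one ...
        if rn <= 0:      # ... and branch back only if the result is > 0
            break
    return seen, rn
```

With a count of 3 the SOBGEQ form sees Rn = 2, 1, 0 in the body and the SOBGTR form sees 3, 2, 1. With a count of 0 the SOBGEQ form skips the body entirely, while the SOBGTR form still runs it once.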
On Tue, 13 Aug 2024 22:00:12 +0000, Lawrence D'Oliveiro wrote:
On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:
However, the good old VAX (in its usual kitchen-sink fashion) had a
whole set of variations, including one that decremented down to -1
instead of zero. And the Motorola 68000 family only had the decrement
down to -1 version.
VAX example of how to use SOBGEQ instead of SOBGTR:
movl «loop count», Rn
br bottom_of_loop
top_of_loop:
.... body of loop ...
bottom_of_loop:
sobgeq Rn, top_of_loop
Like I said, I wondered why this sort of thing wasn’t more common ...
Perhaps the RISC mantra has permeated the minds of ISA designers.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Like I said, I wondered why this sort of thing wasn't more common ...
For the early RISCs, the pipeline was designed for early branch
execution.
On Wed, 14 Aug 2024 09:10:01 GMT, Anton Ertl wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Like I said, I wondered why this sort of thing wasn't more common ...
For the early RISCs, the pipeline was designed for early branch
execution.
Note that I was referring to the decrement-down-to-minus-1 form, as
opposed to the decrement-down-to-zero form.
On Wed, 14 Aug 2024 01:33:32 +0000, MitchAlsup1 wrote:
On Tue, 13 Aug 2024 22:00:12 +0000, Lawrence D'Oliveiro wrote:
On Tue, 13 Aug 2024 09:00:25 -0000 (UTC), I wrote:
However, the good old VAX (in its usual kitchen-sink fashion) had a
whole set of variations, including one that decremented down to -1
instead of zero. And the Motorola 68000 family only had the decrement
down to -1 version.
VAX example of how to use SOBGEQ instead of SOBGTR:
movl «loop count», Rn
br bottom_of_loop
top_of_loop:
.... body of loop ...
bottom_of_loop:
sobgeq Rn, top_of_loop
Like I said, I wondered why this sort of thing wasn’t more common ...
Perhaps the RISC mantra has permeated the minds of ISA designers.
Would you prefer it with a decrement+separate conditional-jump
instruction pair?
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Like I said, I wondered why this sort of thing wasn't more common ...
For the early RISCs, the pipeline was designed for early branch
execution. Performing an ALU op before the branch did not fit that
kind of pipeline.
However, having a branch-and-subtract would have been possible. But
how would that have interacted with the branch delay slots that many
of them had? I guess one could perform the subtract before the
instruction in the delay slot, and take the branch afterwards (if it
is taken).
So it would actually fit. Why was it not done? Maybe the idea was
that induction-variable elimination would usually eliminate the
subtract anyway, so why complicate the architecture with such an
instruction?
For over a decade, Intel decoders have decoded many sequences of ALU
and branch instructions into one uop, so they can do at a
microarchitectural level what you are asking about at the architecture
level. Other microarchitectures have followed this pattern, and
RISC-V seems to make a philosophy out of this.
ARM A64 OTOH seems to put everything into an instruction that fits in
32 bits, and while they have instructions (TBNZ and TBZ) that test a
specific bit in a register and branch if the bit is set or clear, they
have not added a subtract-and-branch or branch-and-subtract
instruction. Apparently the uses for such an instruction are not that
frequent.
- anton
On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Like I said, I wondered why this sort of thing wasn't more common ...
For the early RISCs, the pipeline was designed for early branch
execution. Performing an ALU op before the branch did not fit that
kind of pipeline.
MIPS would disagree.
MIPS pipeline performed Branch Target Calculation by pasting bits
from the instruction onto bits vacated from IP.
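As a rough illustration of "pasting bits", here is a Python sketch of the target calculation for the MIPS32 J/JAL format, which is the case I am sure works this way: the 26-bit field is shifted left by two and dropped into the low 28 bits of the delay-slot address, with no adder involved. The conditional branches add a sign-extended offset instead. The function name is hypothetical.

```python
def mips_jump_target(pc, instr):
    """Target of a MIPS32 J/JAL located at address pc (pseudo-direct form)."""
    index = instr & 0x03FFFFFF            # low 26 bits of the instruction word
    delay_slot = (pc + 4) & 0xFFFFFFFF    # address of the delay-slot instruction
    # Keep the top 4 address bits, paste the shifted index into the rest.
    return (delay_slot & 0xF0000000) | (index << 2)
```

For example, the word 0x08100004 at address 0x00400000 yields the target 0x00400010.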
For over a decade, Intel decoders have decoded many sequences of ALU
and branch instructions into one uop, so they can do at a
microarchitectural level what you are asking about at the architecture
level. Other microarchitectures have followed this pattern, and
RISC-V seems to make a philosophy out of this.
On the Intel side they mostly depend on prediction.
On the RISC-V side they mostly depend on fusion. As far as I understand,
they only fuse pairs, not ADD-CMP-BC sequences.
ARM A64 OTOH seems to put everything into an instruction that fits in
32 bits, and while they have instructions (TBNZ and TBZ) that test a
specific bit in a register and branch if the bit is set or clear, they
have not added a subtract-and-branch or branch-and-subtract
instruction. Apparently the uses for such an instruction are not that
frequent.
My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.
As for only fusing pairs: in a section called "Fusion Pair Candidates",
Celio et al. <https://arxiv.org/pdf/1607.02318> give, as one of the
patterns, the three-instruction sequence
slli rd, rs1, {1,2,3}
add rd, rd, rs2
ld rd, 0(rd)
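What fusing this sequence would buy is a single indexed load. A toy Python sketch, assuming a register file held in a dict, a memory held as a dict from address to 64-bit value, and rd different from rs2 (otherwise the slli clobbers the base before the add reads it); the helper names are hypothetical and this is not how any real decoder is written:

```python
MASK64 = (1 << 64) - 1


def run_sequence(regs, mem, rd, rs1, rs2, shamt):
    """Execute slli/add/ld one instruction at a time on the toy machine."""
    regs = dict(regs)
    regs[rd] = (regs[rs1] << shamt) & MASK64    # slli rd, rs1, shamt
    regs[rd] = (regs[rd] + regs[rs2]) & MASK64  # add  rd, rd, rs2
    regs[rd] = mem[regs[rd]]                    # ld   rd, 0(rd)
    return regs


def run_fused(regs, mem, rd, rs1, rs2, shamt):
    """Same architectural effect expressed as one indexed load."""
    regs = dict(regs)
    regs[rd] = mem[((regs[rs1] << shamt) + regs[rs2]) & MASK64]
    return regs
```

Both functions leave the same value in rd, which is the property a fusing decoder relies on.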
mitchalsup@aol.com (MitchAlsup1) writes:
My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.
If an architecture has the 88000-style treatment of comparison results
(fill a GPR with conditions, one bit per condition), instructions like
TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
register with NZCV flags for dealing with conditions, so what is TBNZ
and TBZ used for on this architecture?
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Note that I was referring to the decrement-down-to-minus-1 form, as
opposed to the decrement-down-to-zero form.
I guess what you want to point out is ...
And RISC-V ends up with over 448 instructions
whereas My 66000 has but 65.
... instruction count was not
among the criteria that John Mashey identified as discerning between
RISC and non-RISC (not surprising given non-RISCs like PDP-11).
On Fri, 16 Aug 2024 05:23:30 GMT, Anton Ertl wrote:
... instruction count was not
among the criteria that John Mashey identified as discerning between
RISC and non-RISC (not surprising given non-RISCs like PDP-11).
Why is that particular criterion, of all of them, in the name, then?
At one point I thought it should be “IRSC”, for “Increased Register Set Computer” ...
mitchalsup@aol.com (MitchAlsup1) writes:
And RISC-V ends up with over 448 instructions
How do you count this? Looking at chapter 19 of https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf, I
count for RV64G:
 47 RV32I
 15 RV64I additional instructions
  8 RV32M
  5 RV64M additional instructions
 11 RV32A
 11 RV64A additional instructions
 26 RV32F
  4 RV64F additional instructions
 26 RV32D
  6 RV64D additional instructions
---
159 RV64G
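The tally can be checked mechanically. A small Python sketch, with the per-extension counts copied from the list above (the variable names are mine):

```python
# Instruction counts per extension, as tallied in the post.
rv64g_counts = [
    ("RV32I", 47),
    ("RV64I additional", 15),
    ("RV32M", 8),
    ("RV64M additional", 5),
    ("RV32A", 11),
    ("RV64A additional", 11),
    ("RV32F", 26),
    ("RV64F additional", 4),
    ("RV32D", 26),
    ("RV64D additional", 6),
]

# Sum the second field of every (name, count) pair.
rv64g_total = sum(count for _, count in rv64g_counts)
print(rv64g_total)
```

The sum comes to 159, matching the RV64G line.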
whereas My 66000 has but 65.
There are also One-instruction set computer designs <https://en.wikipedia.org/wiki/One-instruction_set_computer>, and by
that metric they are the best, no?
But if it's programmed in a higher-level language, usually what a loop construct does is not the same as what a loop instruction does, so the instruction is not used.
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Like I said, I wondered why this sort of thing wasn't more common ... [snip]
My 66000 finds use cases all the time, and I also have Branch on bit instructions and have my CMP instructions build bit-vectors of outcomes.
If an architecture has the 88000-style treatment of comparison results
(fill a GPR with conditions, one bit per condition), instructions like
TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
register with NZCV flags for dealing with conditions, so what is TBNZ
and TBZ used for on this architecture? Looking at a binary I have at
hand, I see a lot of checking bit #63 and some checking of #31, #15,
#7, i.e., checking for whether a 64-bit, ... 8-bit number is negative.
There are also a number of uses coming from libgcc, e.g.,
6f0a8: 37e001c3 tbnz w3, #28, 6f0e0
<__aarch64_sync_cache_range+0x50>
6f0e8: 37e801e2 tbnz w2, #29, 6f124
<__aarch64_sync_cache_range+0x94>
6f6dc: b7980b84 tbnz x4, #51, 6f84c <__addtf3+0x71c>
6fb28: b79000a3 tbnz x3, #50, 6fb3c <__addtf3+0xa0c>
6fc30: b79000a3 tbnz x3, #50, 6fc44 <__addtf3+0xb14>
70248: b7980d02 tbnz x2, #51, 703e8 <__multf3+0x728>
7036c: b79809a2 tbnz x2, #51, 704a0 <__multf3+0x7e0>
70430: b77801a2 tbnz x2, #47, 70464 <__multf3+0x7a4>
7048c: b79ffae2 tbnz x2, #51, 703e8 <__multf3+0x728>
70498: b79ffa82 tbnz x2, #51, 703e8 <__multf3+0x728>
The tf3 stuff probably is the implementation of long doubles. In any
case, in this binary with 26473 instructions, there are 30 occurrences
of tbnz and 41 of tbz, for a total of 71 (0.3% of static instruction
count).
Apparently the usefulness of decrement-and-branch is even lower.
Certainly in my code most loops count upwards.
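For anyone who wants to check the disassembly lines above by hand, here is a hedged Python sketch of a TBZ/TBNZ decoder. It assumes the A64 field layout as I understand it (b5 in bit 31, the fixed pattern 011011 in bits 30-25, op in bit 24, b40 in bits 23-19, imm14 in bits 18-5, Rt in bits 4-0). The function name is made up, and register 31 is printed as a plain number rather than as the zero register.

```python
def decode_test_branch(addr, word):
    """Decode an A64 TBZ/TBNZ word; return None for any other encoding."""
    if (word >> 25) & 0x3F != 0b011011:
        return None
    b5 = (word >> 31) & 1            # high bit of the tested bit number
    op = (word >> 24) & 1            # 0 = TBZ, 1 = TBNZ
    b40 = (word >> 19) & 0x1F        # low five bits of the tested bit number
    imm14 = (word >> 5) & 0x3FFF     # branch offset in words
    rt = word & 0x1F                 # register being tested
    if imm14 & 0x2000:               # sign-extend the 14-bit offset
        imm14 -= 0x4000
    bit = (b5 << 5) | b40
    reg = ("x" if b5 else "w") + str(rt)
    mnemonic = "tbnz" if op else "tbz"
    return mnemonic, reg, bit, addr + 4 * imm14
```

Feeding it 0x37e001c3 at address 0x6f0a8 gives tbnz w3, #28 with target 0x6f0e0, which agrees with the first line of the listing.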
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>