My 66000 should look into the z/Architecture High Word Facility, as that
would give you another 65% more registers or so. You have the opcode
space, and it is another nice boost for some customers.
IBM supports Linux, so the compiler support should exist. X86 solved the aliasing issue with finer tracking.
Thanks,
Brett
On Sat, 10 Aug 2024 18:17:54 +0000, Brett wrote:
My 66000 should look into the z/Architecture High Word Facility, as that
would give you another 65% more registers or so. You have the opcode
space, and it is another nice boost for some customers.
The article posted by Andy Glew was lukewarm at best. Now, while
IBM has figured out that 16 GPRs are insufficient, there is scant
data that 32 are insufficient {witness how few RISCs went with
bigger files}.
Since My 66000 is a 64-bit architecture with a modicum of support for
8-bit, 16-bit, and 32-bit stuff, and since 32 true GPRs seem to be
enough (judging by compiler output), I think I will pass.
Due to its pervasive access to constants, My 66000 with only 32 actual
registers performs as well in most codes as RISC-V does with 32I+32F,
so there does not seem to be an insufficient number of registers. I
even have ASM examples where RISC-V runs out of registers and My
66000 does not !! Not wasting registers to hold onto big immediates,
big displacements, or big addresses goes a long way toward thinning out
the register-count necessities.
In My 66000 one can utilize all 32 registers, with none reserved for
{linking, splicing, GOT access, ...}; these "effective constants"
become actual constants, meaning one does not have to consume a
register to gain access through that constant address value.
IBM supports Linux, so the compiler support should exist. X86 solved the
aliasing issue with finer tracking.
Neither of which would worry me.
Thanks,
Brett
Compilers love unrolling loops because it saves an instruction, which for a short loop could mean 10% faster. Point out that your code gets more unrolls, and more performance.
The lack of CPUs with 64 registers is what makes for a market; that 4% that could benefit have no options to pick from.
Brett <ggtgp@yahoo.com> writes:
The lack of CPU’s with 64 registers is what makes for a market, that 4%
that could benefit have no options to pick from.
They had:
SPARC: Ok, only 32 GPRs available at a time, but more in hardware
through the Window mechanism.
AMD29K: IIRC a 128-register stack and 64 additional registers
IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
files to make good use of them.
The additional registers obviously did not give these architectures a decisive advantage.
When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
GPRs, but they decided not to. Apparently the benefits do not
outweigh the disadvantages.
Where is your 4% number coming from?
On 8/11/2024 9:33 AM, Anton Ertl wrote:
Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market, that 4%
that could benefit have no options to pick from.
They had:
SPARC: Ok, only 32 GPRs available at a time, but more in hardware
through the Window mechanism.
AMD29K: IIRC a 128-register stack and 64 additional registers
IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
files to make good use of them.
The additional registers obviously did not give these architectures a
decisive advantage.
When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
GPRs, but they decided not to. Apparently the benefits do not
outweigh the disadvantages.
In my experience:
For most normal code, the advantage of 64 GPRs is minimal;
But, there is some code, where it does have an advantage.
Mostly involving big loops with lots of variables.
Sometimes, it is preferable to be able to map functions entirely to registers, and 64 does increase the probability of being able to do so (though neither count covers 100% of functions, and functions which map entirely to GPRs with 32 will see no advantage with 64).
Well, and to some extent the compiler needs to be selective about which functions it allows to use all of the registers, since in some cases saving/restoring more registers in the prolog/epilog can cost more than the register spills it avoids.
But, I have noted that 32 GPRs can get clogged up pretty quickly when
using them for FP-SIMD and similar (if working with 128-bit vectors as register pairs), or otherwise when working with 128-bit data as pairs.
Similarly, one can't fit a 4x4 matrix multiply entirely in 32 GPRs, but
one can in 64 GPRs. It takes 8 registers to hold a 4x4 Binary32
matrix, and 16 registers to perform a matrix transpose, ...
Granted, arguably, doing a matrix-multiply directly in registers using
SIMD ops is a bit niche (traditional option being to use scalar
operations and fetch numbers from memory using "for()" loops, but this
is slower). Most of the programs don't need fast MatMult though.
Annoyingly, it has led to my ISA fragmenting into two variants:
Baseline: Primarily 32 GPR, 16/32/64/96 encoding;
Supports R32..R63 for only a subset of the ISA for 32-bit ops.
For ops outside this subset, needs 64-bit encodings in these cases.
XG2: Supports R32..R63 everywhere, but loses 16-bit ops.
By itself, would be easier to decode than Baseline,
as it drops a bunch of wonky edge cases.
Though, some cases were dropped from Baseline when XG2 was added:
"Op40x2" was dropped as it was hairy and had become mostly moot.
Then, a common subset exists known as Fix32, which can be decoded in
both Baseline and XG2 Mode, but only has access to R0..R31.
Well, and a 3rd sub-variant:
XG2RV: Uses XG2's encodings but RISC-V's register space.
R0..R31 are X0..X31;
R32..R63 are F0..F31.
Arguably the main use-case for XG2RV mode is for ASM blobs intended to be
called natively from RISC-V mode; but...
It is debatable whether such an operating mode actually makes sense, and
it might have made more sense to simply fake it in the ASM parser:
ADD R24, R25, R26 //Uses BJX2 register numbering.
ADD X14, X15, X16 //Uses RISC-V register remapping.
Likely, as a sub-mode of either Baseline or XG2 Mode.
Since, the register remapping scheme is known as part of the ISA spec,
it could be done in the assembler.
It is possible that XG2RV mode may eventually be dropped due to "lack of relevance".
Well, and similarly any ABI thunks would need to be done in Baseline or
XG2 mode, since neither RV mode nor XG2RV Mode has access to all the registers used for argument passing in BJX2.
In this case, RISC-V mode only has ~ 26 GPRs (the remaining 6, X0..X5,
being SPRs or CRs). In the RV modes R0/R4/R5/R14 are inaccessible.
Well, and likewise one wants to limit the number of inter-ISA branches,
as the branch predictor can't predict these, and they need a full
pipeline flush (a few extra cycles are needed to make sure the L1 I$ is fetching in the correct mode). Technically, the L1 I$ also needs to flush
any cache lines which were fetched in a different mode (the I$ uses
internal tag bits to figure out things like instruction length and bundling, and to try to help with superscalar in RV mode, *; mostly for timing/latency reasons, ...).
*: The way the BJX2 core deals with superscalar is essentially to
pretend as if RV64 had WEX flag bits, which can be synthesized partly
when fetching cache lines (putting some of the latency in the I$ miss handling, rather than during instruction fetch). In the ID stage, it
sees the longer PC step and infers that two instructions are being
decoded as superscalar.
...
Where is your 4% number coming from?
I guess it could make sense, arguably, to come up with test cases to get a quantitative measurement of the effect of 64 GPRs for programs which can make effective use of them...
Would be kind of a pain to test as 64 GPR programs couldn't run on a
kernel built in 32 GPR mode, but TKRA-GL runs most of its backend in kernel-space (and is the main thing in my case that seems to benefit
from 64 GPRs).
But, technically, a 32 GPR kernel couldn't run RISC-V programs either.
So, would likely need to switch GLQuake and similar over to baseline
mode (and probably messing with "timedemo").
Checking, as-is, timedemo results for "demo1" are "969 frames 150.5
seconds 6.4 fps", but this is with my experimental FP8U HDR mode (would
be faster with RGB555 LDR), at 50 MHz.
GLQuake, LDR RGB555 mode: "969 frames 119.0 seconds 8.1 fps".
But, yeah, both are with builds that use 64 GPRs.
Software Quake: "969 frames 147.4 seconds 6.6 fps"
Software Quake (RV64G): "969 frames 157.3 seconds 6.2 fps"
Not going to bother with GLQuake in RISC-V mode, would likely take a painfully long time.
Well, decided to run this test anyways:
"969 frames 687.3 seconds 1.4 fps"
IOW: TKRA-GL runs horribly bad in RV64G mode (and not much can be done
to make it fast within the limits of RV64G). Though, this is with it
running GL entirely in RV64 mode (it might fare better as a userland application where the GL backend is running in kernel space in BJX2 mode).
Though, much of this is likely due more to RV64G's lack of SIMD and
similar, rather than due to having fewer GPRs.
BGB <cr88192@gmail.com> wrote:
In my experience:
For most normal code, the advantage of 64 GPRs is minimal;
But, there is some code, where it does have an advantage.
Mostly involving big loops with lots of variables.
Sometimes, it is preferable to be able to map functions entirely to
registers, and 64 does increase the probability of being able to do so
(though, neither achieves 100% of functions; and functions which map
entirely to GPRs with 32 will not see an advantage with 64).
Well, and to some extent the compiler needs to be selective about which
functions it allows to use all of the registers, since in some cases a
situation can come up where the saving/restoring more registers in the
prolog/epilog can cost more than the associated register spills.
Another benefit of 64 registers is more inlining removing calls.
A call can cause a significant amount of garbage code all around that call, as it splits your function and burns registers that would otherwise get used.
I can understand the reluctance to go to 6-bit register specifiers; it burns up your opcode space and makes encoding everything more difficult.
But today that is an unserviced market, one which will get customers to give you a look. Put out some vaporware and see what customers say.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market, that 4%
that could benefit have no options to pick from.
They had:
SPARC: Ok, only 32 GPRs available at a time, but more in hardware
through the Window mechanism.
AMD29K: IIRC a 128-register stack and 64 additional registers
IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
files to make good use of them.
All antiques no longer available.
Where is your 4% number coming from?
The 4% number is poor memory and a guess.
Here is an antique paper on the issue:
https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf
Brett <ggtgp@yahoo.com> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market, that 4%
that could benefit have no options to pick from.
They had:
SPARC: Ok, only 32 GPRs available at a time, but more in hardware
through the Window mechanism.
AMD29K: IIRC a 128-register stack and 64 additional registers
IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
files to make good use of them.
All antiques no longer available.
SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:
|Fujitsu will also discontinue their SPARC production [...] end-of-sale
|in 2029, of UNIX servers and a year later for their mainframe.
No word of when Oracle will discontinue (or has discontinued) sales,
but both companies introduced their last SPARC CPUs in 2017.
In any case, my point still stands: these architectures were
available, and the large number of registers failed to give them a
decisive advantage. Maybe it even gave them a decisive disadvantage:
AMD29K and IA-64 never had OoO implementations, and SPARC got them
only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
Power and Alpha switched in 1998 (POWER3, 21264).
Where is your 4% number coming from?
The 4% number is poor memory and a guess.
Here is an antique paper on the issue:
https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf
Interesting. I only skimmed the paper, but I read a lot about
inlining and interprocedural register allocation. SPARC's register
windows and AMD29K's and IA-64's register stacks were intended to be
useful for that, but somehow the other architectures did not suffer a big-enough disadvantage to make them adopt one of these concepts, and
that's despite register windows/stacks working even for indirect calls
(e.g., method calls in the general case), where interprocedural
register allocation or inlining don't help.
It seems to me that with OoO the cycle cost of spilling and refilling
on call boundaries was lowered: the spills can be delayed until the computation is complete, and the refills can start early because the
stack pointer tends to be available early.
And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
even if the called function is short, the spilling and refilling
around it (if any) does not increase the latency of the value that's
spilled and refilled. But that consideration is only relevant for
Intel APX; ARM A64 and RISC-V went for 32 registers several years
before zero-cycle store-to-load forwarding was implemented.
One other optimization that they use the additional registers for is "register promotion", i.e., putting values from memory into registers
for a while (if absence of aliasing can be proven). One interesting
aspect here is that register promotion with 64 or 256 registers (RP-64
and RP-256) is usually not much better (if better at all) than
register promotion with 32 registers (RP-32); see Figure 1. So
register promotion does not make a strong case for more registers,
either, at least in this paper.
- anton
On 8/12/2024 12:36 PM, MitchAlsup1 wrote:
On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:
One other optimization that they use the additional registers for is
"register promotion", i.e., putting values from memory into registers
for a while (if absence of aliasing can be proven). One interesting
aspect here is that register promotion with 64 or 256 registers (RP-64
and RP-256) is usually not much better (if better at all) than
register promotion with 32 registers (RP-32); see Figure 1. So
register promotion does not make a strong case for more registers,
either, at least in this paper.
With full access to constants, there is even less need to promote
addresses or immediates into registers, as you can simply poof up
anything you want.
There are tradeoffs still, if constants need space to encode...
An inline constant is still better than a memory load, granted.
It may make sense to consolidate multiple uses of a value into a register
rather than encoding it as an immediate each time.
On 8/12/2024 3:12 PM, MitchAlsup1 wrote:
See polpak:: r8_erf()
r8_erf: ; @r8_erf
; %bb.0:
fabs r2,r1
fcmp r3,r2,#0x3EF00000
bngt r3,.LBB141_5
; %bb.1:
fcmp r3,r2,#4
bngt r3,.LBB141_6
; %bb.2:
fcmp r3,r2,#0x403A8B020C49BA5E
bnlt r3,.LBB141_7
; %bb.3:
fmul r3,r1,r1
fdiv r3,#1,r3
mov r4,#0x3F90B4FB18B485C7
fmac r4,r3,r4,#0x3FD38A78B9F065F6
fadd r5,r3,#0x40048C54508800DB
fmac r4,r3,r4,#0x3FD70FE40E2425B8
fmac r5,r3,r5,#0x3FFDF79D6855F0AD
fmac r4,r3,r4,#0x3FC0199D980A842F
fmac r5,r3,r5,#0x3FE0E4993E122C39
fmac r4,r3,r4,#0x3F9078448CD6C5B5
fmac r5,r3,r5,#0x3FAEFC42917D7DE7
fmac r4,r3,r4,#0x3F4595FD0D71E33C
fmul r4,r3,r4
fmac r3,r3,r5,#0x3F632147A014BAD1
fdiv r3,r4,r3
fadd r3,#0x3FE20DD750429B6D,-r3
fdiv r3,r3,r2
br .LBB141_4
LBB141_5:
fmul r3,r1,r1
fcmp r2,r2,#0x3C9FFE5AB7E8AD5E
sra r2,r2,#8,#1
cvtsd r4,#0
mux r2,r2,r3,r4
mov r3,#0x3FC7C7905A31C322
fmac r3,r2,r3,#0x400949FB3ED443E9
fadd r4,r2,#0x403799EE342FB2DE
fmac r3,r2,r3,#0x405C774E4D365DA3
fmac r4,r2,r4,#0x406E80C9D57E55B8
fmac r3,r2,r3,#0x407797C38897528B
fmac r4,r2,r4,#0x40940A77529CADC8
fmac r3,r2,r3,#0x40A912C1535D121A
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD
fdiv r2,r1,r2
mov r1,r2
ret
LBB141_6:
mov r3,#0x3E571E703C5F5815
fmac r3,r2,r3,#0x3FE20DD508EB103E
fadd r4,r2,#0x402F7D66F486DED5
fmac r3,r2,r3,#0x4021C42C35B8BC02
fmac r4,r2,r4,#0x405D6C69B0FFCDE7
fmac r3,r2,r3,#0x405087A0D1C420D0
fmac r4,r2,r4,#0x4080C972E588749E
fmac r3,r2,r3,#0x4072AA2986ABA462
fmac r4,r2,r4,#0x4099558EECA29D27
fmac r3,r2,r3,#0x408B8F9E262B9FA3
fmac r4,r2,r4,#0x40A9B599356D1202
fmac r3,r2,r3,#0x409AC030C15DC8D7
fmac r4,r2,r4,#0x40B10A9E7CB10E86
fmac r3,r2,r3,#0x40A0062821236F6B
fmac r4,r2,r4,#0x40AADEBC3FC90DBD
fmac r3,r2,r3,#0x4093395B7FD2FC8E
fmac r4,r2,r4,#0x4093395B7FD35F61
fdiv r3,r3,r4
LBB141_4:
fmul r4,r2,#16
fmul r4,r4,#0x3D800000
rnd r4,r4,#5
fadd r5,r2,-r4
fadd r2,r2,r4
fmul r4,r4,-r4
fexp r4,r4
fmul r2,r2,-r5
fexp r2,r2
fmul r2,r4,r2
fadd r2,#0,-r2
fmac r2,r2,r3,#0x3F000000
fadd r2,r2,#0x3F000000
pdlt r1,T
fadd r2,#0,-r2
mov r1,r2
ret
LBB141_7:
fcmp r1,r1,#0
sra r1,r1,#8,#1
cvtsd r2,#-1
cvtsd r3,#1
mux r2,r1,r3,r2
mov r1,r2
ret
All of the constants are used once !
RISC-V takes 240 instructions and uses 342 words of
memory {.text, .data, .rodata}
My 66000 takes 85 instructions and uses 169 words of
memory {.text, .data, .rodata}
FWIW:
FADD Rm, Imm64f, Rn //XG2 Only
FADD Rm, Imm56f, Rn //
And:
FMUL Rm, Imm64f, Rn //XG2 Only
FMUL Rm, Imm56f, Rn //
On 8/12/2024 5:35 PM, MitchAlsup1 wrote:<snip>
On Mon, 12 Aug 2024 20:58:45 +0000, BGB wrote:
On 8/12/2024 3:12 PM, MitchAlsup1 wrote:
See polpak:: r8_erf()
r8_erf: ; @r8_erf
Why don't you download polpak, compile it, and state how many
instructions it takes and how many words of storage it takes ??
Found what I assume you are talking about.
Needed to add "polpak_test.c" as otherwise BGBCC lacks a main and prunes everything;
Also needed to hack around some compiler holes related to "double
_Complex" to get it to build;
Also needed to stub over some library functions that were added in C99
but missing in my C library.
As for "r8_erf()":<snip>
<===
r8_erf:
On 8/12/2024 8:23 PM, MitchAlsup1 wrote:
On Tue, 13 Aug 2024 0:34:55 +0000, BGB wrote:
On 8/12/2024 5:35 PM, MitchAlsup1 wrote:<snip>
On Mon, 12 Aug 2024 20:58:45 +0000, BGB wrote:
On 8/12/2024 3:12 PM, MitchAlsup1 wrote:
See polpak:: r8_erf()
r8_erf: ; @r8_erf
Why don't you download polpak, compile it, and state how many
instructions it takes and how many words of storage it takes ??
Found what I assume you are talking about.
Needed to add "polpak_test.c" as otherwise BGBCC lacks a main and prunes >>> everything;
Also needed to hack over some compiler holes related to "complex
_Double" to get it to build;
Also needed to stub over some library functions that were added in C99
but missing in my C library.
I only ask for r8_erf()
<snip>
As for "r8_erf()":<snip>
<===
r8_erf:
I count 283 instructions compared to my 85, including the 104 instructions
it takes your compiler to get to the 1st instruction in My 66000 code !!
Yeah, this is a compiler issue...
It might have been less if the code was like:
static const double somearr[8]={ ... };
But, this would still have used memory loads.
Getting the constants into expressions would likely require using
#define or similar...
This is admittedly more how I would have imagined performance-oriented
code to be written. Not so much with dynamically initialized arrays.
But, as I will note, even with this general level of lackluster code generation, I have still been managing to often beat RV64G performance...
Anybody claiming RISC-V has a good ISA should have their degree revoked.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Anybody claiming RISC-V has a good ISA should have their degree revoked.
Interesting datapoint: In the GhostWrite paper, they say that 84.03%
of the RISC-V instruction space is taken up.
I could probably gather the same statistic for My 66000...
On Tue, 13 Aug 2024 19:21:04 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Anybody claiming RISC-V has a good ISA should have their degree revoked.
Interesting datapoint: In the GhostWrite paper, they say that 84.03%
of the RISC-V instruction space is taken up.
I could probably gather the same statistic for My 66000...
Most of my groups have a bit under ½ of their space left.
Major:: 22 of 64 left
Mem:::: 32 of 64 left
2-OP::: 33 of 64 left
3-OP::: 4 of 8 left
1-OP::: 56 of 64 left
misc::: 9 of 16 left
MitchAlsup1 <mitchalsup@aol.com> schrieb:
On Tue, 13 Aug 2024 19:21:04 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
Anybody claiming RISC-V has a good ISA should have their degree revoked.
Interesting datapoint: In the GhostWrite paper, they say that 84.03%
of the RISC-V instruction space is taken up.
I could probably gather the same statistic for My 66000...
Most of my groups have a bit under ½ of their space left.
Major:: 22 of 64 left
Mem:::: 32 of 64 left
2-OP::: 33 of 64 left
3-OP::: 4 of 8 left
1-OP::: 56 of 64 left
misc::: 9 of 16 left
Yep, but there are also gaps in there.
On 8/13/2024 12:24 PM, MitchAlsup1 wrote:
Assuming I use all of the ISA features that currently exist:
r8_erf: ; @r8_erf
MOV R4, R1
FABS R1,R2
FCMPGT 0x3780, R2 //Half
BF .LBB141_5
FCMPGT 0x4400, R2 //Half
BF .LBB141_6
FCMPGE 0x403A8B020C49BA5E, R2
BT .LBB141_7
FMUL R1, R1, R3
FLDCH 0x3C00, R2
FDIV R2, R3, R3
MOV 0x3F90B4FB18B485C7, R4
MOV 0x3FD38A78B9F065F6, R16
FMAC R3, R16, R4, R4
FADD R3, 0x40048C54508800DB, R5
MOV 0x3FD70FE40E2425B8, R16
FMAC R3, R16, R4, R4
MOV 0x3FFDF79D6855F0AD, R16
FMAC R3, R16, R5, R5
MOV 0x3FC0199D980A842F, R16
FMAC R3, R16, R4, R4
MOV 0x3FE0E4993E122C39, R16
FMAC R3, R16, R5, R5
MOV 0x3F9078448CD6C5B5, R16
FMAC R3, R16, R4, R4
MOV 0x3FAEFC42917D7DE7, R16
FMAC R3, R16, R5, R5
MOV 0x3F4595FD0D71E33C, R16
FMAC R3, R16, R4, R4
FMUL R4,R3,R4
MOV 0x3F632147A014BAD1, R16
FMAC R5, R3, R16, R3
FDIV R4, R3, R3
FNEG R3, R3
FADD R3, 0x3FE20DD750429B6D, R3
FDIV R3, R2, R3
BRA .LBB141_4
LBB141_5:
FMUL R1, R1, R3
MOV 0, R4
FCMPGT 0x3C9FFE5AB7E8AD5E, R2
CSELT R3, R4, R2
MOV 0x3FC7C7905A31C322, R3
MOV 0x400949FB3ED443E9, R16
fmac R2, R16, R3, R3
FADD R2,#0x403799EE342FB2DE, R4
MOV 0x405C774E4D365DA3, R16
FMAC R2, R16, R3, R3
MOV 0x406E80C9D57E55B8, R16
FMAC R2, R16, R4, R4
MOV 0x407797C38897528B, R16
FMAC R2, R16, R3, R3
MOV 0x40940A77529CADC8, R16
FMAC R2, R16, R4, R4
MOV 0x40A912C1535D121A, R16
FMAC R2, R16, R3, R3
FMUL R3, R1, R1
MOV 0x40A63879423B87AD, R16
FMAC R2, R16, R4, R2
FDIV R1, R2, R2
RTS
LBB141_6:
MOV 0x3E571E703C5F5815, R3
fmac r3,r2,r3,#0x3FE20DD508EB103E
fadd r4,r2,#0x402F7D66F486DED5
fmac r3,r2,r3,#0x4021C42C35B8BC02
fmac r4,r2,r4,#0x405D6C69B0FFCDE7
fmac r3,r2,r3,#0x405087A0D1C420D0
fmac r4,r2,r4,#0x4080C972E588749E
fmac r3,r2,r3,#0x4072AA2986ABA462
fmac r4,r2,r4,#0x4099558EECA29D27
fmac r3,r2,r3,#0x408B8F9E262B9FA3
fmac r4,r2,r4,#0x40A9B599356D1202
fmac r3,r2,r3,#0x409AC030C15DC8D7
fmac r4,r2,r4,#0x40B10A9E7CB10E86
fmac r3,r2,r3,#0x40A0062821236F6B
fmac r4,r2,r4,#0x40AADEBC3FC90DBD
fmac r3,r2,r3,#0x4093395B7FD2FC8E
fmac r4,r2,r4,#0x4093395B7FD35F61
fdiv r3,r3,r4
LBB141_4:
FMUL R2, 0x40300000, R4
FMUL R4, 0x3FB00000, R4
FSTCI R4, R4
FLDCI R4, R4
FNEG R4, R6
fadd R2, R6, R5
fadd R2, R4, R2
fmul R4, R6, R4
fexp r4,r4 //?
fmul R2,R7, R2
fexp r2,r2
fmul R4, R2, R2
FNEG R2, R2
fmac r2,r2,r3,#0x3F000000
fadd r2,r2,#0x3F000000
pdlt r1,T //?
fadd r2,#0,-r2
RTS
LBB141_7:
FLDCH 0xBC00, R2
FLDCH 0x3C00, R3
FCMPGT 0, R1
CSELT R2,R3,R2
RTS
On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:
BGB <cr88192@gmail.com> wrote:
Another benefit of 64 registers is more inlining removing calls.
A call can cause a significant amount of garbage code all around that
call,
as it splits your function and burns registers that would otherwise get
used.
What I see around calls is MOV instructions grabbing arguments from the preserved registers and putting return values into the proper preserved register. Inlining does get rid of these MOVs, but what else ??
I can understand the reluctance to go to 6 bit register specifiers, it
burns up your opcode space and makes encoding everything more difficult.
I am on record as stating the proper number of bits in an instruction
specifier is 34 bits. This is after designing the Mc88K ISA, doing 3
generations of SPARC chips, 7 years of x86-64, and the Samsung GPU
(and my own efforts). Making the registers 6 bits would increase that
count to 36 bits.
34 bits comes from having enough entropy to encode what needs encoding,
making careful data-driven choices on "what to put in and what to
leave out", and finding a clever means to access vectorization and
multi-precision calculations. Without both of those, 36 would likely
be the best option for the 32-register variants.
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:
BGB <cr88192@gmail.com> wrote:
Another benefit of 64 registers is more inlining removing calls.
A call can cause a significant amount of garbage code all around that
call,
as it splits your function and burns registers that would otherwise get
used.
What I see around calls is MOV instructions grabbing arguments from the
preserved registers and putting return values in to the proper preserved
register. Inlining does get rid of these MOVs, but what else ??
For middling functions, I spent my time optimizing heavy code, the 10% that matters.
The first half of a big function will have some state that has to be
reloaded after a call, or worse yet saved and reloaded.
Inlining is limited by register count; with twice the registers the
compiler will generate far larger leaf calls with less call depth, which removes more of those MOVs.
I can understand the reluctance to go to 6 bit register specifiers, it
burns up your opcode space and makes encoding everything more difficult.
I am on record as stating the proper number of bits in an instruction
specifier is 34-bits. This is after designing the Mc88K ISA, doing 3
generations of SPARC chips, 7 years of x86-64, and Samsung GPU (and my
own efforts). Making the registers 6-bits would increase that count to
36-bits.
34-bits comes from having enough entropy to encode what needs encoding,
making careful data-driven choices on "what to put in and what to
leave out", and finding a clever means to access vectorization and
multi-precision calculations. Without both of those, 36 would likely be
the best option for the 32-register variants.
Brett <ggtgp@yahoo.com> wrote:
My 66000 hurts less with 6-bit specifiers, as more constant bits get
moved to extension words, which is almost free by most metrics.
Only My 66000 could reasonably implement 6-bit register specifiers.
The market is yours for the taking.
6-bits will make you stand out and get noticed.
The only downside I see is a few percent in code density.
On Sun, 11 Aug 2024 14:33:33 +0000, Anton Ertl wrote:
Brett <ggtgp@yahoo.com> writes:
The lack of CPU’s with 64 registers is what makes for a market, that 4% >>>that could benefit have no options to pick from.
They had:
SPARC: Ok, only 32 GPRs available at a time, but more in hardware
through the Window mechanism.
SPARCs FPGA through UltraSPARC used 1 full cycle to access the windowed
register file while MIPS, 88K, and early Alphas used 1/2 cycle.
Oh, and BTW, that 1/2 cycle of delay getting started should have cost
~5% IPC. But SAPRC never achieved high clock frequencies nor did IA-64.
mitchalsup@aol.com (MitchAlsup1) writes:
On Sun, 11 Aug 2024 14:33:33 +0000, Anton Ertl wrote:
Brett <ggtgp@yahoo.com> writes:
The lack of CPU’s with 64 registers is what makes for a market,
that 4% that could benefit have no options to pick from.
They had:
SPARC: Ok, only 32 GPRs available at a time, but more in hardware
through the Window mechanism.
SPARCs FPGA through UltraSPARC used 1 full cycle to access the
windowed register file while MIPS, 88K, and early Alphas used 1/2
cycle.
Maybe. Obviously that did not prevent them from having ALU instructions
with one-cycle latency and loads with 2-cycle latency in the early
implementations, just like the MIPS R2000. And the clock rate of the
SPARC MB86900 (14.28MHz) is not worse than the clock rate of the MIPS
R2000 (8.3, 12.5, and 15MHz grades), and that despite having the
interlocks that MIPS were so proud of not having.
Oh, and BTW, that 1/2 cycle of delay getting started should have cost
~5% IPC. But SAPRC never achieved high clock frequencies nor did
IA-64.
As mentioned above, the clock rate was competitive with the early
MIPS. If we look at more recent times, the in-order UltraSPARC IV+
(90nm) achieved 2100MHz in 2007; Intel sold the 3GHz 65nm Core 2 Duo
E6850 at the time, so the UltraSPARC IV+ was not that far off. This
undermines my theory that in-order designs have problems achieving
high clock rates.
Going for OoO implementations, the Fujitsu SPARC64 V+ (90nm) was
shipped in 2004 with 1.89GHz and in 2006 with 2.16GHz. AMD shipped
the 2.2GHz Athlon 64 3500+ (90nm) in 2004 and a 2.4GHz 90nm version in
2006, so the SPARC64 V+ was not far off.
Fujitsu continued their line until the 4.25GHz SPARC64 XII in 2017.
For comparison: AMD released the Ryzen 1800X in 2017 and that
supposedly can turbo up to 4GHz (but when I just measured it (with 1
core loaded), it achieved <3.7GHz). Intel sold the Core i7-8700K
starting on Oct 5, 2017, which achieved 4.7GHz.
Oracle released the 5000MHz SPARC M8 in 2017.
Maybe SAPCR (sic!) did not achieve high clock rates, but SPARC did.
- anton
On Wed, 14 Aug 2024 22:06:46 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
SPARCs FPGA <snip>
F - Fujitsu (?)
P - ???
G - gate
A - array
On Thu, 15 Aug 2024 08:45:30 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
Wasn't Mitch himself involved in the design of hyperSPARC, which
eventually reached a very respectable clock frequency?
On 8/15/2024 9:33 AM, Michael S wrote:
On Wed, 14 Aug 2024 22:06:46 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
SPARCs FPGA <snip>
F - Fujitsu (?)
P - ???
G - gate
A - array
Half right. Field Programmable Gate Array, i.e. a "gate array" that
can be programmed in the field, as opposed to at the factory.
On Thu, 15 Aug 2024 10:14:21 -0700
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 8/15/2024 9:33 AM, Michael S wrote:
On Wed, 14 Aug 2024 22:06:46 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
SPARCs FPGA <snip>
F - Fujitsu (?)
P - ???
G - gate
A - array
Half right. Field Programmable Gate Array, i.e. a "gate array" that
can be programmed in the field, as opposed to at the factory.
Don't you think that if I am asking, then I have reasons to think that
Mitch didn't mean "Field Programmable"?
BTW, logic (HDL) design of FPGA-based embedded systems is part of what
I have been doing for a living for the last 25 years.
On 8/14/2024 5:54 PM, Brett wrote:
The only down side I see is a few percent in code density.
Also longer context switch times, as more registers to save/restore.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 8/14/2024 5:54 PM, Brett wrote:
The only down side I see is a few percent in code density.
Actually, due to the removal of MOVs and reloads, the code density may
be basically the same.
Also longer context switch times, as more registers to save/restore.
The saves should be free, as the load from RAM is so slow.
If the context is time critical it should be written to use the
registers that are reloaded first, first. In that case the code could
start doing work in the same amount of time regardless of register
count. (I doubt the CPU design is actually that smart, or that the
people who program the interrupts are.)
On Fri, 16 Aug 2024 4:30:54 +0000, Brett wrote:
Actually due to the removal of MOVs and reloads the code density may be
basically the same.
Anytime one removes more "MOVs and saves and restore" instructions
than the called subroutine contains within the prologue and epilogue
bounds, the subroutine should be inlined.
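Mitch's inlining rule above can be sketched as a tiny cost model.
Everything here is illustrative (the function name and the instruction
counts are made up, not taken from any real compiler); it just captures
the comparison he describes:

```python
# Hypothetical sketch of the inlining rule described above: inline when
# the call overhead removed (argument/result MOVs plus the caller's
# saves and restores) exceeds the callee's prologue + epilogue size.
# All instruction counts are illustrative, not from a real compiler.

def should_inline(mov_count: int, save_restore_count: int,
                  prologue_insns: int, epilogue_insns: int) -> bool:
    """True when inlining removes more instructions than it duplicates."""
    call_overhead = mov_count + save_restore_count
    callee_overhead = prologue_insns + epilogue_insns
    return call_overhead > callee_overhead

# A call site with 4 argument MOVs and 6 save/restore instructions,
# calling a leaf with a 3-instruction prologue and 2-instruction epilogue:
print(should_inline(4, 6, 3, 2))  # True: 10 > 5
print(should_inline(1, 1, 3, 2))  # False: the call is cheaper
```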
Also longer context switch times, as more registers to save/restore.
The saves should be free, as the load from RAM is so slow.
When HW is doing the saves, the saves can be performed while
waiting for the first instruction to arrive and for the first
registers to arrive. Thus, done in HW, the saves are essentially
free.
If the context is time critical it should be written to use the
registers that are reloaded first, first. In which case the code
could start doing work in the same amount of time regardless of
register count. (I doubt the CPU design is actually that smart,
or that the people that program the interrupts are.)
When HW is doing the saves, it does them in a known order and
can mark the registers "in use" or "busy" instantaneously and
clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
one to a small number at a time. HW is not so constrained.
For example, a 1-wide machine with a 4-ported register file,
generally operated as 3R1W, can be switched to 4R or 4W for
epilogue or prologue uses respectively. Simulation indicates
this gets rid of 47% of the cycles spent in prologue and
epilogue (combined) compared to a sequence of stores and loads.
Simulation also indicates that 42% of the power is saved--
mainly from Tag and TLB non-access cycles.
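A toy model of the port-widening idea, purely to show the shape of the
savings; the 47% and 42% figures come from Mitch's simulation of real
code, not from arithmetic like this:

```python
# Rough cycle-count sketch: a 1-wide machine that normally retires one
# store per cycle (1W) can retire the prologue's register saves
# 4-at-a-time when the register file is switched to 4W. The register
# counts below are illustrative only.

def seq_cycles(n_regs: int, per_cycle: int = 1) -> int:
    # ceiling division: cycles to move n_regs at per_cycle regs/cycle
    return -(-n_regs // per_cycle)

saved = 16  # e.g. a prologue that saves 16 registers
print(seq_cycles(saved, 1))  # 16 cycles as individual stores
print(seq_cycles(saved, 4))  # 4 cycles in the 4W configuration
```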
MitchAlsup1 <mitchalsup@aol.com> wrote:
When HW is doing the saves, it does them in a known order and
can mark the registers "in use" or "busy" instantaneously and
clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
one to a small number at a time. HW is not so constrained.
Ok, so the hardware is smart enough.
But has anyone told the software guys?
Of course convincing programmers to RTFM is futile. ;(
If so this is the first I have heard that more registers is not bad for interrupt response time.
So we are back to finding any downsides for 64 registers in My 66000.
Lack of actual significant benefits is irrelevant, as all the
programmers are 100% convinced that it will help some of their code. ;)
On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
When HW is doing the saves, it does them in a known order and
can mark the registers "in use" or "busy" instantaneously and
clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
one to a small number at a time. HW is not so constrained.
Ok, so the hardware is smart enough.
The Instructions and the compiler's use of them were co-developed.
But has anyone told the software guys?
Use HLLs and you don't have to.
Of course convincing programmers to RTFM is futile. ;(
Done with Instructions in HW one has to convince exactly two
people; GCC code generator and LLVM code generator.
If so this is the first I have heard that more registers is not bad for
interrupt response time.
They are also bad for pipeline stage times.
So we are back to finding any downsides for 64 registers in My 66000.
Encoding
pipeline staging
context switch times
For example, My 66000's current encoding has room for 8 instructions
in the FMAC category (4 in use); with 6-bit register specifiers
I would need 4 major OpCodes instead of 1.
For your 98%-ile source code, 32-registers is plenty.
Brett <ggtgp@yahoo.com> schrieb:
MitchAlsup1 <mitchalsup@aol.com> wrote:
Anytime one removes more "MOVs and saves and restore" instructions
than the called subroutine contains within the prologue and epilogue
bounds, the subroutine should be inlined.
In principle, yes.
You can either use C++ headers, which result in huge compilation
times, or you can use LTO. LTO, if done right, is a huge time-eater
(I was looking for an English translation of "Zeitgrab", literally
"time grave" or "time tomb"; this was the best I could come up with).
[...]
When HW is doing the saves, it does them in a known order and
can mark the registers "in use" or "busy" instantaneously and
clear that status as data arrives. When SW is doing the same,
SW ahs to wait for the instruction to arrive and then do them
one-to-small numbers at a time. HW is not so constrained.
Ok, so the hardware is smart enough.
But has anyone told the software guys?
Software guys generally work with high-level languages where this is irrelevant, except for...
Of course convincing programmers to RTFM is futile. ;(
...people writing operating systems or drivers, and they better
read the docs for the architecture they are working on.
So we are back to finding any downsides for 64 registers in My 66000.
Encoding space. Not sure if you have Mitch's document, but having
one more bit per register would reduce the 16-bit data in the
offset to 14 (no way you can expand that by a factor of four),
would require eight instead of one major opcodes for the three-
register instructions, and the four-register instructions like FMA...
This would not matter if we were still living in a 36-bit world,
but the days of the IBM 704, the PDP-10 or the UNIVAC 1100 have
passed, except for emulation.
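Thomas's bit-budget arithmetic can be checked with a few lines. The
32-bit word and 6-bit major-opcode field below are assumptions chosen
so the numbers line up with his 16-to-14 example; the actual My 66000
field widths are in Mitch's document:

```python
# Back-of-the-envelope bit budget for a base+offset load/store format:
# word = opcode + 2 register fields (Rd, Rp) + offset. The 32-bit word
# and 6-bit opcode here are assumed for illustration.

WORD = 32
OPCODE = 6

def offset_bits(reg_field_bits: int, n_reg_fields: int = 2) -> int:
    # bits left for the displacement after opcode and register fields
    return WORD - OPCODE - n_reg_fields * reg_field_bits

print(offset_bits(5))  # 16 bits of offset with 5-bit specifiers
print(offset_bits(6))  # 14 bits with 6-bit specifiers
```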
Thomas Koenig <tkoenig@netcologne.de> wrote:
Brett <ggtgp@yahoo.com> schrieb:
Software guys generally work with high-level languages where this is
irrelevant, except for...
Of course convincing programmers to RTFM is futile. ;(
...people writing operating systems or drivers, and they better
read the docs for the architecture they are working on.
So we are back to finding any downsides for 64 registers in My 66000.
Encoding space. Not sure if you have Mitch's document,
Section 4.1 Instruction Template, Figure 25, page 33-179
but having
one more bit per register would reduce the 16-bit data in the
offset to 14 (no way you can expand that by a factor of four),
14 is plenty; you can actually do 12 and pack those instructions in
with shifts, which have a pair of 6-bit fields, width and offset. This
would expand some constants, but you make it back in shorter code with
fewer MOVs and more performance.
would require eight instead of one major opcodes for the three-
register instructions,
Mitch gloats about how many major opcodes he has free; in his 7-bit
opcode he has the greater part of a bit free, so we are a good part of
the way there.
Conceptually some of the modifier bits move into the opcode space; not
as clean, but you have to squeeze those bits hard. One can come up
with a few patterns that are not hard to decode, and spread across
several instruction types.
and the four-register instructions like FMA...
Trying to wave a red flag in front of Mitch. ;)
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
When HW is doing the saves, it does them in a known order and
can mark the registers "in use" or "busy" instantaneously and
clear that status as data arrives. When SW is doing the same,
SW ahs to wait for the instruction to arrive and then do them
one-to-small numbers at a time. HW is not so constrained.
Ok, so the hardware is smart enough.
The Instructions and the compiler's use of them were co-developed.
But has anyone told the software guys?
Use HLLs and you don't have to.
I looked at interrupts in your manual and it did not say how many
registers were full of garbage leaking information because they were
not saved or restored to make interrupts faster. ;)
Of course convincing programmers to RTFM is futile. ;(
Done with Instructions in HW one has to convince exactly two
people; GCC code generator and LLVM code generator.
If so this is the first I have heard that more registers is not bad for
interrupt response time.
They are also bad for pipeline stage times.
So we are back to finding any downsides for 64 registers in My 66000.
Encoding
Admittedly painful, extremely so.
pipeline staging
A longer pipeline is slower to start up, but gets work done faster.
Is this what you mean?
context switch times
Task swapping time is way down in the noise. It’s reloading the L1 and
L2 caches that swamps the time. 64 registers is nothing compared to
32K or megabytes.
Brett <ggtgp@yahoo.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> wrote:
Conceptually some of the modifier bits move into the opcode space, not
as clean but you have to squeeze those bits hard
It is a very fine point of semantics whether the modifier bits are
part of the opcode space or not. I happen to think that they are;
they are just in a (somewhat) different place and spelled a bit
differently, but it does not really matter how you look at it -
you need the bits to encode them.
One can come up with a few patterns that are not hard to
decode, and spread across several instruction types.
So, go right ahead. Find an encoding that a) encompasses all of
Mitch's functionality, b) has six bits for registers everywhere,
and c) does not drive the assembler writer crazy (that's me,
for Mitch's design) or hardware designer bonkers (where Mitch has
the experience).
Let's start with the... BB1 instruction, which branches on bit
set in a register, so it needs a major opcode, a bit number, a
register number and a displacement. How do you propose to do that?
Shave one bit off the displacement?
and the four-register instructions like FMA...
Trying to wave a red flag in front of Mitch. ;)
I just happen to like FMA :-)
Of course, it might be possible to code FMA like AVX does, with
only three registers - 18 bits for three registers, plus two bits
for which one of them gets smashed for the result.
But - just making offhand suggestions won't cut it. You will
have to think about the layout of the instructions, how everything
fits in, and how needing one to four more bits per instruction
can be accommodated.
On Sat, 17 Aug 2024 22:05:03 +0000, Thomas Koenig wrote:
Brett <ggtgp@yahoo.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> wrote:
Conceptually some of the modifier bits move into the opcode space, not
as clean but you have to squeeze those bits hard
It is a very fine point of semantics whether the modifier bits are
part of the opcode space or not. I happen to think that they are;
they are just in a (somewhat) different place and spelled a bit
differently, but it does not really matter how you look at it -
you need the bits to encode them.
To me, an instruction has 3 components:: Operands, Routing, and
calculation. We mainly consider the calculation (ADD) to be the
instruction and fuzz over what is operands and how does one
route them to places of calculation. My 66000 ISA directly
annotates the operands and the routing. This is what the
modifier bits do; they tell how to interpret the register
specifiers (Rn or #n), (Rn or -Rn) and when to substitute
another word or doubleword in the instruction stream as an
operand directly.
This does not add gates of delay to Operand routing because
all of the constant stuff is overlapped with the comparison
of register specifiers with pipeline result specifiers to
determine forwarding. Constants forward in the network prior
to register results preventing any added delay.
One can come up with a few patterns that are not hard to
decode, and spread across several instruction types.
So, go right ahead. Find an encoding that a) encompasses all of
Mitch's functionality, b) has six bits for registers everywhere,
and c) does not drive the assembler writer crazy (that's me,
for Mitch's design) or hardware designer bonkers (where Mitch has
the experience).
Consider, for example, memory reference address modes for 1
instruction::
LDSB Rd,[Rp,disp16]
LDSB Rd,[IP,disp16]
and
LDSB Rd,[Rp,Ri<<s]
LDSB Rd,[Rp,0]
LDSB Rd,[IP,Ri<<s]
LDSB Rd,[Rp,,disp32]
LDSB Rd,[Rp,Ri<<2,disp32]
LDSB Rd,[IP,,disp32]
LDSB Rd,[IP,Ri<<s,disp32]
LDSB Rd,[Rp,,disp64]
LDSB Rd,[Rp,Ri<<s,disp64]
LDSB Rd,[Rp,,disp64]
LDSB Rd,[IP,Ri<<s,disp64]
I use 2 instructions here::
1) a major OpCode with 16-bit immediate
R0 in the Rb position is a proxy for IP
2) a major OpCode and a MEME OpCode with 5-bits of Modifiers.
R0 in the Rb position remains a proxy for IP
R0 in Ri position is a proxy for #0.
3) I still have 1-bit left over to denote participation in ATOMIC
events.
you get all sizes and signs of Load-Locked
you get up to 8 LLs
you can use as many Store-Conditionals as you need
all interested 3rd parties see memory before or after the event
and nothing in between.
Using 6-bit registers I would be down by 3-bits causing all sorts of
memory reference grief--leading to other compromises in ISA design
elsewhere.
Based on the code I read out of Brian's compiler: there is no particular
need for 64-registers. I am already using only 72% of the instructions
{72% average, 70% geomean, 69% harmonic mean} that RISC-V requires
{same compiler, same optimizations, just different code generators}.
One can argue that having 64-bit displacements is not-all-that-necessary
But how does one take dusty deck FORTRAN FEM programs and allow the
common blocks to grow bigger than 4GBs ?? This is the easiest way
to port code written 5 decades ago to use the sizes of memory they
need to run those "Great Big" FEM models today.
Let's start with the... BB1 instruction, which branches on bit
set in a register, so it needs a major opcode, a bit number, a
register number and a displacement. How do you propose to do that?
Shave one bit off the displacement?
Then proceed to Branch on Condition:: along with the standard
EQ0, NE0, GT0, GE0, LT0, LE0 conditions one gets with other encodings,
I also get FEQ0, FNE0, FGT0, FGE0, FLT0, FLE0, DEQ0, DNE0, DGT0,
DGE0, DLT0, DLE0, along with Interference, SVC, SVR, and RET.
{And I left out the unordered float/double comparisons, above.}
1 instruction, due mostly to NOT having condition codes.
and the four-register instructions like FMA...
I prefer 3-operand 1-result instead of 4-register. 4-register could
have 1 operand and 3 results and lacks decent specificity. 35 years
ago I used 3-register to describe the Mc88100 and I regret that now.
I prefer FMAC instead of FMA--in hindsight I should have made it
FMAC and DMAC, but alas... I use FMAC to cover all 4 of::
x = y * z + q
x = y * -z + q
x = y * z - q
x = y * -z - q
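Those four FMAC sign variants can be sketched with plain floats (a
real FMAC fuses the multiply and add with a single rounding, which
this toy does not model; the function and flag names are made up):

```python
# The four sign variants one FMAC opcode covers, per the list above.
# A hardware FMAC performs the whole operation with one rounding;
# this sketch only shows which sign combinations are encoded.

def fmac(y: float, z: float, q: float,
         negate_z: bool = False, negate_q: bool = False) -> float:
    zz = -z if negate_z else z
    qq = -q if negate_q else q
    return y * zz + qq

y, z, q = 2.0, 3.0, 5.0
print(fmac(y, z, q))                 # y *  z + q -> 11.0
print(fmac(y, z, q, negate_z=True))  # y * -z + q -> -1.0
print(fmac(y, z, q, negate_q=True))  # y *  z - q -> 1.0
print(fmac(y, z, q, True, True))     # y * -z - q -> -11.0
```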
Trying to wave a red flag in front of Mitch. ;)
I just happen to like FMA :-)
Of course, it might be possible to code FMA like AVX does, with
only three registers - 18 bits for three registers, plus two bits
for which one of them gets smashed for the result.
Why do I get the feeling the compiler guys would not like this ??
But - just making offhand suggestions won't cut it. You will
have to think about the layout of the instructions, how everything
fits in, and needing one to four more bits per instruction
can be accomodated.
On Sat, 17 Aug 2024 20:57:43 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
When HW is doing the saves, it does them in a known order and
can mark the registers "in use" or "busy" instantaneously and
clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
one-to-small numbers at a time. HW is not so constrained.
Ok, so the hardware is smart enough.
The Instructions and the compiler's use of them were co-developed.
But has anyone told the software guys?
Use HLLs and you don't have to.
I looked at interrupts in your manual and it did not say how many
registers were full of garbage leaking information because they were
not saved or restored to make interrupts faster. ;)
When an ISR[13] returns from handling its exception it has a register
file filled with stuff useful to future runnings of ISR[13].
When ISR[13] gains control to handle another interrupt it has a file
filled with what it was filled with the last time it ran--all 30 of
them, while registers R0..R1 contain information about the current
interrupt to be serviced.
SP points at its stack
FP points at its frame or is another register containing whatever it
contained the previous time
R29..R2 contain the value it had the previous time it ran
Of course convincing programmers to RTFM is futile. ;(
Done with Instructions in HW, one has to convince exactly two
people: the GCC code generator and the LLVM code generator.
If so this is the first I have heard that more registers is not bad for interrupt response time.
They are also bad for pipeline stage times.
So we are back to finding any downsides for 64 registers in My 66000.
Encoding
Admittedly painful, extremely so.
pipeline staging
A longer pipeline is slower to start up, but gets work done faster.
Is this what you mean?
No, I mean the feedback loops take more cycles so apparent latency
is greater.
context switch times
Task swapping time is way down in the noise. It’s reloading the L1 and
L2 cache that swamps the time. 64 registers is nothing compared to 32k
or megabytes.
While it is under 1% of all cycles, current x86s take 1,000 cycles application to application and 10,000 cycles hypervisor to hypervisor.
I want both of these down in the 20-cycle range.
Based on the code I read out of Brian's compiler: there is no particular
need for 64-registers. I am already using only 72% of the instructions
{72% average, 70% geomean, 69% harmonic mean} that RISC-V requires
{same compiler, same optimizations, just different code generators}.
One can argue that having 64-bit displacements is not-all-that-necessary
But how does one take dusty deck FORTRAN FEM programs and allow the
common blocks to grow bigger than 4GBs ?? This is the easiest way
to port code written 5 decades ago to use the sizes of memory they
need to run those "Great Big" FEM models today.
Task swapping time is way down in the noise. It’s reloading the L1 and L2 cache that swamps the time. 64 registers is nothing compared to 32k or megabytes.
Depends on the kind of swap. If you're thinking of time-sharing
preemption, then indeed context switch time is not important.
But when considering communication between processes, then very fast
context switch times allow for finer grain divisions, like
micro-kernels.
Historically, these things have never really materialized, admittedly.
Stefan
Pigs don't win the 100 yard dash at the Olympics, either.
On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:<snip>
High registers is mostly marketing vapor ware extension for you, see if
anyone cares and put them on a list for when a market for that extension
pops up.
The lack of CPU’s with 64 registers is what makes for a market, that 4%
that could benefit have no options to pick from. You would be happy to
have control of a market that big. Point customers at a compiler
configured for 64 registers and say that with high registers and
inline constants that is what they could expect for code generation.
I agree with the lead in, and disagree with where you took it.
Let us postulate that having 64 registers is a 10% win (overstating
the size of its win by 2.5×) but that 98% of subroutines don't need
64 registers. So, 98% gains nothing and 2% gains 10%
0.98*1.0 + 0.02*1.1 = 1.002
or
0.2% gain.
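The arithmetic is an Amdahl-style weighted average of per-routine speedups; a quick sketch to check it:

```python
# Weighted-average speedup: the fraction of code that benefits times its
# win, plus the rest at 1.0. Numbers from the argument above.
def overall_speedup(frac_helped, win):
    return (1 - frac_helped) * 1.0 + frac_helped * win

s = overall_speedup(0.02, 1.10)   # 98% gain nothing, 2% gain 10%
print(round(s, 3))                # 1.002, i.e. a 0.2% gain
```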
If there is demand for high registers you will probably just spin a CPU
arch with more registers, but that will never happen if you never ask.
The thing is that once you go down the GBOoO route, your lack of
registers
"namable in ASM" ceases to become a performance degrader. With renaming
one can have R7 in use 40 times in a 100 instruction deep execution
window.
This is the definition of vapor ware, a free market survey. You can
even add more registers as an incompatible extension; in fact you should.
I will leave stuff like this to you.
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:
The thing is that once you go down the GBOoO route, your lack of
registers
"namable in ASM" ceases to become a performance degrader. With renaming
one can have R7 in use 40 times in a 100 instruction deep execution
window.
If this were true we would have 16 or even 8 visible registers, and all
would be fine. x86 does mostly fine with 16; of course x86 had fab and
cubic dollar advantages that dwarfed the register limit.
64 separate registers was a bridge too far, but it was an interesting
exercise before it crashed and burned due to the bits being not quite
available. So close, yet so far. I could not make it work.
On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:
The thing is that once you go down the GBOoO route, your lack of
registers
"namable in ASM" ceases to become a performance degrader. With renaming
one can have R7 in use 40 times in a 100 instruction deep execution
window.
If this was true we would have 16 or even 8 visible registers, and all
would be fine. x86 does mostly fine with 16, of course x86 had fab and
cubic dollar advantages that dwarfed the register limit.
Careful, here::
x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
like it has 20-22 registers. Do not underestimate this phenomenon. The
gain from 16-32 registers is only 3%-ish so one would estimate that 22 registers would have already gained 1/2 of all of what is possible.
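One way to make the "22 registers already captures half the gain" estimate concrete is to assume the benefit of a larger file grows with the logarithm of the register count -- that model is my assumption, not something stated in the post:

```python
import math

# Assumed model (mine): the fraction of the 16->32 register gain captured
# at n registers scales as log2(n/16) / log2(32/16).
def fraction_of_gain(n, lo=16, hi=32):
    return math.log2(n / lo) / math.log2(hi / lo)

print(round(fraction_of_gain(22), 2))   # ~0.46: roughly half of the ~3% gain
```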
64 separate registers was a bridge too far, but it was an interesting
exercise before it crashed and burned due to the bits being not quite
available. So close, yet so far. I could not make it work.
We remain hobbled by the definition of Byte containing exactly 8-bits.
It is this which drives the 16-bit and 32-bit instruction sizes; and
it is this which drives the sizes of constants used by the instruction stream.
64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
But we must all face facts::
a) Little Endian Won
b) 8-bit Bytes Won
c) longer operands are composed of multiple bytes mostly powers of 2.
d) otherwise it is merely an academic exercise.
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:
The thing is that once you go down the GBOoO route, your lack of
registers
"namable in ASM" ceases to become a performance degrader. With renaming >>>> one can have R7 in use 40 times in a 100 instruction deep execution
window.
If this was true we would have 16 or even 8 visible registers, and all
would be fine. x86 does mostly fine with 16, of course x86 had fab and
cubic dollar advantages that dwarfed the register limit.
Careful, here::
x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
like it has 20-22 registers. Do not underestimate this phenomenon. The
gain from 16-32 registers is only 3%-ish so one would estimate that 22
registers would have already gained 1/2 of all of what is possible.
64 separate registers was a bridge too far, but it was an interesting
exercise before it crashed and burned due to the bits being not quite
available. So close, yet so far. I could not make it work.
We remain hobbled by the definition of Byte containing exactly 8-bits.
It is this which drives the 16-bit and 32-bit instruction sizes; and
it is this which drives the sizes of constants used by the instruction
stream.
64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
But we must all face facts::
a) Little Endian Won
b) 8-bit Bytes Won
c) longer operands are composed of multiple bytes mostly powers of 2.
d) otherwise it is merely an academic exercise.
If you pack 7 instructions in 8 long words that gives you an extra
nibble, more than 4 bits.
You can do lots of four operand dual operations, which may get you back
the code density lost, while improving performance.
3 instructions packed in 4 longs gives 64 registers plus four operand
dual instructions.
On Mon, 19 Aug 2024 23:35:54 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:
The thing is that once you go down the GBOoO route, your lack of
registers
"namable in ASM" ceases to become a performance degrader. With renaming >>>>> one can have R7 in use 40 times in a 100 instruction deep execution
window.
If this was true we would have 16 or even 8 visible registers, and all would be fine. x86 does mostly fine with 16, of course x86 had fab and cubic dollar advantages that dwarfed the register limit.
Careful, here::
x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
like it has 20-22 registers. Do not underestimate this phenomenon. The
gain from 16-32 registers is only 3%-ish so one would estimate that 22
registers would have already gained 1/2 of all of what is possible.
64 separate registers was a bridge too far, but it was an interesting
exercise before it crashed and burned due to the bits being not quite
available. So close, yet so far. I could not make it work.
We remain hobbled by the definition of Byte containing exactly 8-bits.
It is this which drives the 16-bit and 32-bit instruction sizes; and
it is this which drives the sizes of constants used by the instruction
stream.
64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
But we must all face facts::
a) Little Endian Won
b) 8-bit Bytes Won
c) longer operands are composed of multiple bytes mostly powers of 2.
d) otherwise it is merely an academic exercise.
If you pack 7 instructions in 8 long words that gives you an extra
nibble, more than 4 bits.
You can do lots of four operand dual operations, which may get you back
the code density lost, while improving performance.
Given 36-bit containers--how do you add 32 or 64-bit constants ??
throw 36-bits at the 32-bit needs case and 72-bits at the 64-bit
needs case ?!?
3 instructions packed in 4 longs gives 64 registers plus four operand
dual instructions.
{{ note 3 instructions in 4 longs is 85.3-bits per instruction::
I suspect you mean 3 instructions in 4 words which is 42.6-bits
per instruction, far more than is needed. You get 14 instructions
of 36-bits in 512-bits (a cache line)}}
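The packing arithmetic in this exchange can be checked directly (numbers from the posts; the 42.6 figure is 128/3 truncated):

```python
# Bits available per instruction when packing n instructions into a set
# of fixed-size containers.
def bits_per_instruction(n_instr, n_containers, container_bits):
    return n_containers * container_bits / n_instr

print(bits_per_instruction(3, 4, 64))   # 3 in 4 long words: ~85.3 bits
print(bits_per_instruction(3, 4, 32))   # 3 in 4 words: ~42.7 bits
print(512 // 36, 14 * 36)               # 14 x 36-bit instructions, 504 of 512 bits
```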
Why don't you give it a try !?!
But notice, you are starting out with a much larger instruction--
how are you going to "profitably" utilize all those bits from
source code of typical imperative languages ??
whereas 32-bit instructions don't violate the RISC tenets.
I end up needing only 72% the number of instructions RISC-V needs
(a near 40% pipelined instruction advantage).
On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
The thing is that once you go down the GBOoO route, your lack of
registers
"namable in ASM" ceases to become a performance degrader. With renaming
one can have R7 in use 40 times in a 100 instruction deep execution
window.
If this was true we would have 16 or even 8 visible registers, and all
would be fine. x86 does mostly fine with 16
Careful, here::
x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
like it has 20-22 registers.
On Mon, 19 Aug 2024 16:05:22 +0000, Stefan Monnier wrote:
Task swapping time is way down in the noise. It’s reloading the L1
and L2 cache that swamps the time. 64 registers is nothing compared
to 32k or megabytes.
Depends on the kind of swap. If you're thinking of time-sharing preemption, then indeed context switch time is not important.
But when considering communication between processes, then very fast context switch times allow for finer grain divisions, like
micro-kernels.
MicroKernels failed due to the excessive overhead of context
switching. Whether it was control delivery delay, TLB reloads, cache
reloads, register file loads and stores, ... it doesn't really matter,
as each delay adds up. When there is too much delay the system is
sluggish and unacceptable in-the-large.
Historically, these things have never really materialized,
admittedly.
Pigs don't win the 100 yard dash at the Olympics, either.
Stefan
I can understand the reluctance to go to 6 bit register specifiers, it
burns up your opcode space and makes encoding everything more difficult.
But today that is an unserviced market which will get customers to give you
a look. Put out some vapor ware and see what customers say.
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
The thing is that once you go down the GBOoO route, your lack of
registers
"namable in ASM" ceases to become a performance degrader. With renaming >>>> one can have R7 in use 40 times in a 100 instruction deep execution
window.
If this was true we would have 16 or even 8 visible registers, and all
would be fine. x86 does mostly fine with 16
And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
was first developed for an in-order microarchitecture) and are now
going to 32 GPRs with APX (no in-order excuse here). And IIRC the announcement of APX says something about 10% fewer memory accesses or somesuch.
Careful, here::
x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more like it has 20-22 registers.
Your feeling is strong (as shown by your repeatedly ignoring the counterevidence), but wrong:
LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
equivalent to 17 registers on a load/store architecture:
Let's call the 17th register r16:
On a load-store architecture you replace "LD-OP dest,src" with:
ld r16=src
op dest,dest,r16
On a load-store architecture you replace "LD-OP-ST dest,src" with:
ld r16=dest
op r16,r16,src
st dest=r16
For a VAX-like three-memory-argument instruction you need two extra registers, r16 and r17:
"mem1 = mem2 op mem3" becomes:
ld r16=mem2
ld r17=mem3
op r16,r16,r17
st mem1=r16
- anton
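Anton's rewrites can be sketched mechanically; the mnemonics and the r16/r17 labels below are illustrative, not any real toolchain output (note the final store writes r16, where the op left its result):

```python
# Expand memory-operand instruction shapes into load/store sequences and
# count the scratch registers each shape needs beyond the named file.
def expand(form):
    if form == "LD-OP":            # op dest, src(mem)
        return ["ld r16=src", "op dest,dest,r16"], 1
    if form == "LD-OP-ST":         # op dest(mem), src
        return ["ld r16=dest", "op r16,r16,src", "st dest=r16"], 1
    if form == "MEM-MEM-MEM":      # mem1 = mem2 op mem3, VAX-like
        return ["ld r16=mem2", "ld r17=mem3",
                "op r16,r16,r17", "st mem1=r16"], 2
    raise ValueError(form)

for f in ("LD-OP", "LD-OP-ST", "MEM-MEM-MEM"):
    seq, scratch = expand(f)
    print(f"{f}: {len(seq)} instructions, {scratch} scratch register(s)")
```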
On Tue, 20 Aug 2024 7:01:49 +0000, Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
The thing is that once you go down the GBOoO route, your lack of
registers
"namable in ASM" ceases to become a performance degrader. With
renaming
one can have R7 in use 40 times in a 100 instruction deep execution
window.
If this was true we would have 16 or even 8 visible registers, and all would be fine. x86 does mostly fine with 16
And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
was first developed for an in-order microarchitecture) and are now
going to 32 GPRs with APX (no in-order excuse here). And IIRC the
announcement of APX says something about 10% fewer memory accesses or
somesuch.
Careful, here::
x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
like it has 20-22 registers.
Your feeling is strong (as shown by your repeatedly ignoring the
counterevidence), but wrong:
LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
equivalent to 17 registers on a load/store architecture:
Let's call the 17th register r16:
On a load-store architecture you replace "LD-OP dest,src" with:
ld r16=src
op dest,dest,r16
On a load-store architecture you replace "LD-OP-ST dest,src" with:
ld r16=dest
op r16,r16,src
st dest=r16
For a VAX-like three-memory-argument instruction you need two extra
registers, r16 and r17:
"mem1 = mem2 op mem3" becomes:
ld r16=mem2
ld r17=mem3
op r16,r16,r17
st mem1=r16
- anton
That is not what I am talking about::
i = i + 1;
as
ADD [&i],#1
1 instruction = 1 add, 1 LD and 1 ST. And
i = i + j;
as
ADD Ri,[&j]
In neither case is an extra register needed, and you may have
several of these in a local sequence of code. ...
On Tue, 20 Aug 2024 16:40:06 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
and you may have
several of these in a local sequence of code. ...
No, you can not have several. It's always one then another one then yet another one etc... Each one can reuse the same temporary register.
MitchAlsup1 wrote:
On Tue, 20 Aug 2024 7:01:49 +0000, Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
The thing is that once you go down the GBOoO route, your lack of
registers
"namable in ASM" ceases to become a performance degrader. With
renaming
one can have R7 in use 40 times in a 100 instruction deep execution window.
If this was true we would have 16 or even 8 visible registers, and all would be fine. x86 does mostly fine with 16
And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
was first developed for an in-order microarchitecture) and are now
going to 32 GPRs with APX (no in-order excuse here). And IIRC the
announcement of APX says something about 10% fewer memory accesses or
somesuch.
Careful, here::
x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more like it has 20-22 registers.
Your feeling is strong (as shown by your repeatedly ignoring the
counterevidence), but wrong:
LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
equivalent to 17 registers on a load/store architecture:
Let's call the 17th register r16:
On a load-store architecture you replace "LD-OP dest,src" with:
ld r16=src
op dest,dest,r16
On a load-store architecture you replace "LD-OP-ST dest,src" with:
ld r16=dest
op r16,r16,src
st dest=r16
For a VAX-like three-memory-argument instruction you need two extra
registers, r16 and r17:
"mem1 = mem2 op mem3" becomes:
ld r16=mem2
ld r17=mem3
op r16,r16,r17
st mem1=r16
- anton
That is not what I am talking about::
i = i + 1;
as
ADD [&i],#1
1 instruction = 1 add, 1 LD and 1 ST. And
i = i + j;
as
ADD Ri,[&j]
In neither case is an extra register needed, and you may have
several of these in a local sequence of code. ...
On an in-order pipeline you need someplace to stash the temp value.
If you want, call it a special in-flight pseudo-register that only
exists for forwarding, it is still an identifier for a value that
is outside the architectural register set.
I think it might need two registers if you can have two such
instructions in the pipeline back-to-back as there could be
multiple temp values in-flight at once
ADD [&i],#1
ADD [&j],#1
could have &i doing its store while &j is doing its load.
On OoO, if the reservation stations are valueless, you need a real
physical register to stash the temp value as there is no guarantee
the OP part of the uOp will launch just when the LD part finishes
doing its thing and forwards the value.
On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:
On Tue, 20 Aug 2024 16:40:06 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
and you may have
several of these in a local sequence of code. ...
No, you can not have several. It's always one then another one then yet
another one etc... Each one can reuse the same temporary register.
The point is that the cost of not getting allocated into a register
is vastly lower--the count of instructions remains 1 while the
latency increases. That increase in latency does not hurt those
use once/seldom variables.
In the examples cited, the lack of register allocation triples
the instruction count due to lack of LD-OP and LD-OP-ST. The
register count I stated is how many registers would a
non-LD-OP machine need to break even on the instruction count.
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:
On Tue, 20 Aug 2024 16:40:06 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
and you may have
several of these in a local sequence of code. ...
No, you can not have several. It's always one then another one then yet
another one etc... Each one can reuse the same temporary register.
The point is that the cost of not getting allocated into a register
is vastly lower--the count of instructions remains 1 while the
latency increases. That increase in latency does not hurt those
use once/seldom variables.
In the examples cited, the lack of register allocation triples
the instruction count due to lack of LD-OP and LD-OP-ST. The
register count I stated is how many registers would a
non-LD-OP machine need to break even on the instruction count.
LD-OP-ST is a bridge too far for me.
LD-OP and OP-ST are fine with me and have benefits.
But you have not built such, you built an improved RISC…
I assume OP-ST has issues with the value getting stuck if the address is
slow to resolve. With a register the value can just spill to the
register backing file. And because of this you create a hidden register
name for the value.
You have information on how many hidden registers are in flight on
average and worst case, so I believe your numbers.
I have not looked to see if compilers generate LD-OP and OP-ST, at one
point Intel was discouraging such code.
On Tue, 20 Aug 2024 23:08:03 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:
On Tue, 20 Aug 2024 16:40:06 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
and you may have
several of these in a local sequence of code. ...
No, you can not have several. It's always one then another one then yet another one etc... Each one can reuse the same temporary register.
The point is that the cost of not getting allocated into a register
is vastly lower--the count of instructions remains 1 while the
latency increases. That increase in latency does not hurt those
use once/seldom variables.
In the examples cited, the lack of register allocation triples
the instruction count due to lack of LD-OP and LD-OP-ST. The
register count I stated is how many registers would a
non-LD-OP machine need to break even on the instruction count.
LD-OP-ST is a bridge too far for me.
LD-OP and OP-ST are fine with me and have benefits.
If you put cache write at or after register file write in the
pipeline; LD-OP-ST basically falls out for free and you can
move the intermediate values from whence they are produced
to where they are consumed with forwarding.
But you have not built such, you built an improved RISC…
I spent 7 years doing x86-64.....so much for not having.....
It is that episode that cemented me on the value of [Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
and LD-OP-STs. Then I took that and made a better RISC ISA.
That RISC ISA did not have LD-OP-STs because of OpCode
encoding reasons not from pipelining reasons.
I assume OP-ST has issues with the value getting stuck if the address is
slow to resolve. With a register the value can just spill to the
register backing file. And because of this you create a hidden register
name for the value.
Athlon and Opteron had value capturing reservation stations.
K9 had value-free RSs. It caused little headache because
while we did not give it a named physical register, we did
give it a physical register for the intermediates. SW can only
read/write named PRs getting the name from logical to physical
register renaming.
You have information on how many hidden registers are in flight on
average and worst case, so I believe your numbers.
I have not looked to see if compilers generate LD-OP and OP-ST, at one
point Intel was discouraging such code.
Partially because AMD performed "relatively" better on LD-OPs and
LD-OP-STs than Intel at that time. Where "relatively" means
significantly above the noise level but "not all that much".
On Tue, 20 Aug 2024 18:18:25 +0000, EricP wrote:
On OoO, if the reservation stations are valueless, you need a real
physical register to stash the temp value as there is no guarantee
the OP part of the uOp will launch just when the LD part finishes
doing its thing and forwards the value.
In the LD-OP-ST microarchitecture there would be some buffer
that carries the intermediate values through the execution
window. And, Yes, you can build a LD-OP-ST reservation station
(Athlon and Opteron did).
The point is that the cost of not getting allocated into a register
is vastly lower--the count of instructions remains 1 while the
latency increases. That increase in latency does not hurt those
use once/seldom variables.
In the examples cited, the lack of register allocation triples
the instruction count due to lack of LD-OP and LD-OP-ST. The
register count I stated is how many registers would a
non-LD-OP machine need to break even on the instruction count.
LD-OP-ST is a bridge too far for me.
LD-OP and OP-ST are fine with me and have benefits.
I have not looked to see if compilers generate LD-OP and OP-ST, at one
point Intel was discouraging such code.
On the Intel side, LD-OP-ST is split into three uops according to
everything I have read. Apparently they are satisfied with this
approach, or they would have gone for something else.
- anton
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:
On Tue, 20 Aug 2024 16:40:06 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
and you may have
several of these in a local sequence of code. ...
No, you can not have several. It's always one then another one then yet
another one etc... Each one can reuse the same temporary register.
The point is that the cost of not getting allocated into a register
is vastly lower--the count of instructions remains 1 while the
latency increases. That increase in latency does not hurt those
use once/seldom variables.
In the examples cited, the lack of register allocation triples
the instruction count due to lack of LD-OP and LD-OP-ST. The
register count I stated is how many registers would a
non-LD-OP machine need to break even on the instruction count.
LD-OP-ST is a bridge too far for me.
On Wed, 21 Aug 2024 12:00:47 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
On the Intel side, LD-OP-ST is split into three uops according to
everything I have read. Apparently they are satisfied with this
approach, or they would have gone for something else.
- anton
AFAIK, on the Intel side, LD-OP-ST is decoded into 4 uOps that are
immediately fused into 2 fused uOps. They travel through the rename
phase as 2 uOps. I am not sure if they are split back into 4 uOps
before or after the OoO schedulers, but would guess the former.
mitchalsup@aol.com (MitchAlsup1) writes:
The point is that the cost of not getting allocated into a register
is vastly lower--the count of instructions remains 1 while the
latency increases. That increase in latency does not hurt those
use once/seldom variables.
Latency is not the issue in modern high-performance AMD64 cores, which
have zero-cycle store-to-load forwarding <http://www.complang.tuwien.ac.at/anton/memdep/>.
And yet, putting variables in registers gives a significant speedup:
On a Rocket Lake, numbers are times in seconds:
sieve bubble matrix fib fft
0.075 0.070 0.036 0.049 0.017 TOS in reg, RP in reg, IP in reg
0.100 0.149 0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem
In the first line, I used gforth-fast and tried to disable all
optimizations except those that keep certain variables in registers:
gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs
I could not reduce the static superinstructions below 31 and still get
a result; I will have to investigate why, but that probably does not
make that much of a difference for several of these benchmarks.
In the second line I used gforth, an engine that keeps the top of
stack in memory, the return-stack pointer in memory, stores IP to
memory after every change, and does not use static superinstructions,
all for better identifying where an error happened.
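The slowdowns implied by the Rocket Lake timings above, computed from the table (this is just arithmetic on the posted numbers):

```python
# Times in seconds from the post: variables kept in registers vs. in memory.
reg = {"sieve": 0.075, "bubble": 0.070, "matrix": 0.036,
       "fib": 0.049, "fft": 0.017}
mem = {"sieve": 0.100, "bubble": 0.149, "matrix": 0.054,
       "fib": 0.106, "fft": 0.037}

for b in reg:
    print(f"{b}: {mem[b] / reg[b]:.2f}x slower with variables in memory")
```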
In the examples cited, the lack of register allocation triples
the instruction count due to lack of LD-OP and LD-OP-ST. The
register count I stated is how many registers would a
non-LD-OP machine need to break even on the instruction count.
What makes you think that instruction count is particularly relevant?
Yes, you may save some decoding resources if you use LD-OP-ST on an architecture that supports it, but you first had to invest in a more complex decoder. And in the OoO engine the difference may be gone (at
least on Intel CPUs).
There are also some savings in reduced I-cache usage (possibly leading
to higher I-cache hit rate), reduced I-fetch memory bandwidth required, etc, though these may be modest at best.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
There are also some savings in reduced I-cache usage (possibly leading
to higher I-cache hit rate), reduced I-fetch memory bandwidth
required, etc, though these may be modest at best.
Let's see how that works out. I am using the code size numbers
from <2024Jan4.101941@mips.complang.tuwien.ac.at>:
bash grep gzip
595204 107636 46744 armhf 16 regs load/store 32-bit
599832 101102 46898 riscv64 32 regs load/store 64-bit
796501 144926 57729 amd64 16 regs ld-op ld-op-st 64-bit
829776 134784 56868 arm64 32 regs load/store 64-bit
853892 152068 61124 i386 8 regs ld-op ld-op-st 32-bit
891128 158544 68500 armel 16 regs load/store 32-bit
892688 168816 64664 s390x 16 regs ld-op ld-op-st 64-bit
1020720 170736 71088 mips64el 32 regs load/store 64-bit
1168104 194900 83332 ppc64el 32 regs load/store 64-bit
So the least code size is from a load/store architecture with 16
registers, followed (or preceded in the case of grep) by a load/store
architecture with 32 registers. The instruction sets that have
load-op and load-op-st instructions result in bigger code.
The different sizes of armhf (ARMv7) and armel (ARMv4t-ARMv6t) show
that there is more to code sizes than just the architecture.
The LD-OP-STs in Athlon and Opteron had a memory OpCode and a
calculation OpCode, and were performed in such a way that the physical
address of the LD was reused for the ST when its time came. The
calculation OpCode was executed in an ALU or in the IMUL/DIV unit.
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 20 Aug 2024 18:18:25 +0000, EricP wrote:
On OoO, if the reservation stations are valueless, you need a real
physical register to stash the temp value as there is no guarantee
the OP part of the uOp will launch just when the LD part finishes
doing its thing and forwards the value.
In the LD-OP-ST microarchitecture there would be some buffer
that carries the intermediate values through the execution
window. And, Yes, you can build a LD-OP-ST reservation station
(Athlon and Opteron did).
All the material I have seen is that AMD has a load-store ROP, but the
op in between is in a separate functional unit, with a separate
scheduler entry; and I expect that the load-store ROP occupies the
load/store scheduler(s) twice: once for the load part, once for the
store part.
There is also something about macroops that can be load-op-stores,
but from what I have read, when it comes to execution, they are split
into ROPs.
If you have more details that contradict the information published up
to now, please let us know more about them.
On the Intel side, LD-OP-ST is split into three uops according to
everything I have read. Apparently they are satisfied with this
approach, or they would have gone for something else.
- anton
Michael S <already5chosen@yahoo.com> writes:
On Wed, 21 Aug 2024 12:00:47 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
On the Intel side, LD-OP-ST is split into three uops according to
everything I have read. Apparently they are satisfied with this
approach, or they would have gone for something else.
- anton
AFAIK, on the Intel side, LD-OP-ST is decoded into 4 uOps that are
immediately fused into 2 fused uOps.
Which 4 uops and 2 macroops are those? My guess is that ST is
store-data and store-address uops, and ld and op are one uop each.
They travel through rename phase
as 2 uOps.
Interesting. But yes, only two values are generated for physical
registers: the result of the load and the result of the op. So I
expect that the two store parts are tacked onto the op on the way
through the renamer, and then that macroop is split into its parts on
the way to the schedulers.
I am not sure if they are split back into 4 uOps before or
after OoO schedulers, but would guess the former.
Golden Cove is depicted as having an op scheduler, a load scheduler
and a store scheduler, so they have to split the ld-op-store into at
least three parts for scheduling.
Sunny Cove is depicted as having an op scheduler, a store data
scheduler, and two AGU schedulers, which would again mean at least
three parts, but this time with a different split.
Both based on <https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/>
- anton
I spent 7 years doing x86-64.....so much for not having.....
It is from that episode that I became cemented on the value of
[Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
and LD-OP-STs. Then I took that and made a better RISC ISA.
That RISC ISA did not have LD-OP-STs because of OpCode
encoding reasons not from pipelining reasons.
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Tue, 20 Aug 2024 23:08:03 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:
On Tue, 20 Aug 2024 16:40:06 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
and you may have
several of these in a local sequence of code. ...
No, you can not have several. It's always one then another one then yet
another one etc... Each one can reuse the same temporary register.
The point is that the cost of not getting allocated into a register
is vastly lower--the count of instructions remains 1 while the
latency increases. That increase in latency does not hurt those
use once/seldom variables.
In the examples cited, the lack of register allocation triples
the instruction count due to the lack of LD-OP and LD-OP-ST. The
register count I stated is how many registers a non-LD-OP machine
would need to break even on the instruction count.
LD-OP-ST is a bridge too far for me.
LD-OP and OP-ST are fine with me and have benefits.
If you put cache write at or after register file write in the
pipeline; LD-OP-ST basically falls out for free and you can
move the intermediate values from whence they are produced
to where they are consumed with forwarding.
LD-OP-ST mostly only fits if it is add to memory.
42-bit opcodes work: you only need one in four RISC opcodes to merge
into a LD-OP or OP-ST for code density to break even, and generally
you will do better.
The two leftover bits can be ignored, or be a template indicator, so you
can pack in a LD-OP-ST, or 31 bit RISC ops.
Or go heads and tails packing.
But you have not built such, you built an improved RISC…
I spent 7 years doing x86-64.....so much for not having.....
It is from that episode that I became cemented on the value of
[Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
and LD-OP-STs. Then I took that and made a better RISC ISA.
That RISC ISA did not have LD-OP-STs because of OpCode
encoding reasons not from pipelining reasons.
I assume OP-ST has issues with the value getting stuck if the address is
slow to resolve. With a register the value can just spill to the
register backing file. And because of this you create a hidden register
name for the value.
Athlon and Opteron had value capturing reservation stations.
K9 had value-free RSs. It caused little headache because,
while we did not give it a named physical register, we did
give it a physical register for the intermediates. SW can only
read/write named PRs, getting the name from logical-to-physical
register renaming.
You have information on how many hidden registers are in flight on
average and worst case, so I believe your numbers.
I have not looked to see if compilers generate LD-OP and OP-ST; at one
point Intel was discouraging such code.
Partially because AMD performed "relatively" better on LD-OPs and
LD-OP-STs than Intel at that time. Where "relatively" means
significantly above the noise level but "not all that much".