• Re: My 66000 and High word facility

    From MitchAlsup1@21:1/5 to Brett on Sat Aug 10 18:49:35 2024
    On Sat, 10 Aug 2024 18:17:54 +0000, Brett wrote:


My 66000 should look into the z/Architecture High Word Facility, as that
would give you roughly 65% more registers. You have the opcode
space, and it is another nice boost for some customers.

The article posted by Andy Glew was lukewarm at best. Now, while
IBM has figured out that 16 GPRs are insufficient, there is scant
data that 32 are insufficient {witness how few RISCs went with
bigger files}.

Since My 66000 is a 64-bit architecture with a modicum of support for
8-bit, 16-bit, and 32-bit stuff, and since 32 true GPRs seem to be
enough (per compiler output), I think I will pass.

Due to ready access to constants, My 66000 with only 32 actual
registers performs as well as RISC-V does with 32I+32F in most codes,
so there does not seem to be an insufficient number of registers. I
even have ASM examples where RISC-V runs out of registers where My
66000 does not !! Not wasting registers to hold big immediates,
big displacements, or big addresses goes a long way toward thinning out
the register count necessities.

In My 66000 one can utilize all 32 registers, with none reserved for
{linking, splicing, GOT access, ...}. These "effective constants"
become actual constants, meaning one does not have to consume a
register to gain access through that constant address value.

    IBM supports Linux, so the compiler support should exist. X86 solved the aliasing issue with finer tracking.

    Neither of which would worry me.


    Thanks,
    Brett

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to All on Sat Aug 10 18:17:54 2024
My 66000 should look into the z/Architecture High Word Facility, as that
would give you roughly 65% more registers. You have the opcode space,
and it is another nice boost for some customers.

    IBM supports Linux, so the compiler support should exist. X86 solved the aliasing issue with finer tracking.

    Thanks,
    Brett

  • From MitchAlsup1@21:1/5 to Brett on Sat Aug 10 21:12:12 2024
    On Sat, 10 Aug 2024 18:17:54 +0000, Brett wrote:


My 66000 should look into the z/Architecture High Word Facility, as that
would give you roughly 65% more registers. You have the opcode
space, and it is another nice boost for some customers.

    IBM supports Linux, so the compiler support should exist. X86 solved the aliasing issue with finer tracking.

x86 (the x is lower case) solved the problem at great cost to various
implementations. The AMD *doze family could not perform forwarding of
these lower or upper portions of registers, and its performance suffered.
The high/low stuff makes it very difficult to do forwarding when the
clock cycle is less than 14 gates per cycle.
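The forwarding cost described here comes from the merge that a partial-register write implies; in C terms (a sketch of the dataflow, not any particular core's hardware):

```c
#include <stdint.h>

/* Writing x86's AH (bits 8..15 of RAX) is a read-modify-write of the
 * whole 64-bit register: the old value must be read, masked, and merged
 * with the new byte. That merge is the extra work a forwarding network
 * has to squeeze into the cycle, which hurts at ~14 gates per clock. */
static inline uint64_t write_ah(uint64_t rax, uint8_t ah) {
    return (rax & ~(uint64_t)0xFF00) | ((uint64_t)ah << 8);
}
```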

After x86 grew out of its 8-register-only enclave and went with 16
(later 32) GPRs, register pressure went down markedly.


  • From Brett@21:1/5 to mitchalsup@aol.com on Sun Aug 11 00:46:09 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 10 Aug 2024 18:17:54 +0000, Brett wrote:


My 66000 should look into the z/Architecture High Word Facility, as that
would give you roughly 65% more registers. You have the opcode
space, and it is another nice boost for some customers.

The article posted by Andy Glew was lukewarm at best. Now, while
IBM has figured out that 16 GPRs are insufficient, there is scant
data that 32 are insufficient {witness how few RISCs went with
bigger files}.

Since My 66000 is a 64-bit architecture with a modicum of support for
8-bit, 16-bit, and 32-bit stuff, and since 32 true GPRs seem to be
enough (per compiler output), I think I will pass.

Due to ready access to constants, My 66000 with only 32 actual
registers performs as well as RISC-V does with 32I+32F in most codes,
so there does not seem to be an insufficient number of registers. I
even have ASM examples where RISC-V runs out of registers where My
66000 does not !! Not wasting registers to hold big immediates,
big displacements, or big addresses goes a long way toward thinning out
the register count necessities.

In My 66000 one can utilize all 32 registers, with none reserved for
{linking, splicing, GOT access, ...}. These "effective constants"
become actual constants, meaning one does not have to consume a
register to gain access through that constant address value.

    These are excellent points and need to be in your marketing information.

Compilers love unrolling loops because it saves an instruction, which for a short loop could mean 10% faster. Point out that your code allows more unrolling, and the performance that brings.

I don’t know if you are in the 14-gate-delay market that makes high
registers a fail. I can’t find Andy Glew’s article on z/Architecture, but that arch has limited opcode space that imposes constraints you don’t face.

High registers would mostly be a marketing vaporware extension for you: see if
anyone cares, and put them on a list for when a market for that extension
pops up.

The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from. You would be happy to have control of a market that big. Point customers at a compiler configured for
64 registers and say that, with high registers and inline constants, that is
what they could expect for code generation.

If there is demand for high registers you will probably just spin a CPU
arch with more registers, but that will never happen if you never ask. This
is the definition of vaporware: a free market survey. You can even add
more registers as an incompatible extension; in fact you should.

    IBM supports Linux, so the compiler support should exist. X86 solved the
    aliasing issue with finer tracking.

    Neither of which would worry me.


    Thanks,
    Brett


  • From Thomas Koenig@21:1/5 to Brett on Sun Aug 11 08:33:47 2024
    Brett <ggtgp@yahoo.com> schrieb:

Compilers love unrolling loops because it saves an instruction, which for a short loop could mean 10% faster. Point out that your code allows more unrolling, and the performance that brings.

    If you want to look at what the compiler for My 66000 does, it
    can be found at https://github.com/bagel99/llvm-my66000 .
    Installation is a bit cumbersome, but manageable.

Speaking as somebody who neither designed the ISA nor wrote
the compiler port: the Virtual Vector Method makes unrolling
vectorized loops unprofitable; all you "gain" from unrolling those
is increased register pressure and code size. Having constants
in the instruction stream also reduces register pressure.
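The register-pressure cost of unrolling shows up even in a hand-unrolled reduction (a generic C sketch, not output of the My 66000 compiler): the 4x version keeps four partial sums live where the rolled loop keeps one.

```c
/* Rolled reduction: one live accumulator. */
double sum_rolled(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* 4x-unrolled reduction: four live accumulators (four more registers),
 * in exchange for fewer branches and a shorter dependence chain.
 * Assumes n is a multiple of 4, for brevity. */
double sum_unrolled4(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```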

    In the beginning, I had my doubts that 32 general registers which
    are also used for floating point are enough, but looking at
    generated code convinced me.

Unrolling in the presence of VVM is not that easy. Non-vectorizable
loops can still be profitable to unroll, as can outer loops.
But when working with an existing compiler which has assumptions
about currently available architectures baked in, this is quite
difficult.

  • From Anton Ertl@21:1/5 to Brett on Sun Aug 11 14:33:33 2024
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    The additional registers obviously did not give these architectures a
    decisive advantage.

When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
GPRs, but they decided not to. Apparently the benefits do not
outweigh the disadvantages.

    Where is your 4% number coming from?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Brett@21:1/5 to Anton Ertl on Sun Aug 11 17:48:21 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    All antiques no longer available.

    The additional registers obviously did not give these architectures a decisive advantage.

    When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
    GPRs, but they decided not to. Apparently the benefits do not
    outweigh the disadvantages.

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

    I used to be able to find better sources, but Google is full of garbage
    now.

  • From Niklas Holsti@21:1/5 to Anton Ertl on Sun Aug 11 20:53:42 2024
    On 2024-08-11 17:33, Anton Ertl wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    SPARC also has 32 separate floating-point registers, not windowed.

  • From Brett@21:1/5 to BGB on Mon Aug 12 02:23:00 2024
    BGB <cr88192@gmail.com> wrote:
    On 8/11/2024 9:33 AM, Anton Ertl wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    The additional registers obviously did not give these architectures a
    decisive advantage.

    When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
    GPRs, but they decided not to. Apparently the benefits do not
    outweigh the disadvantages.


    In my experience:
    For most normal code, the advantage of 64 GPRs is minimal;
    But, there is some code, where it does have an advantage.
    Mostly involving big loops with lots of variables.


    Sometimes, it is preferable to be able to map functions entirely to registers, and 64 does increase the probability of being able to do so (though, neither achieves 100% of functions; and functions which map
    entirely to GPRs with 32 will not see an advantage with 64).

Well, and to some extent the compiler needs to be selective about which functions it allows to use all of the registers, since in some cases a situation can come up where saving/restoring more registers in the prolog/epilog can cost more than the associated register spills.


Another benefit of 64 registers is more inlining, removing calls.

    A call can cause a significant amount of garbage code all around that call,
    as it splits your function and burns registers that would otherwise get
    used.

I can understand the reluctance to go to 6-bit register specifiers; it
burns up your opcode space and makes encoding everything more difficult.
But today that is an unserviced market which will get customers to give you
a look. Put out some vaporware and see what customers say.


But, I have noted that 32 GPRs can get clogged up pretty quickly when
using them for FP-SIMD and similar (if working with 128-bit vectors as register pairs), or otherwise when working with 128-bit data as pairs.

    Similarly, one can't fit a 4x4 matrix multiply entirely in 32 GPRs, but
    can in 64 GPRs. Where it takes 8 registers to hold a 4x4 Binary32
    matrix, and 16 registers to perform a matrix-transpose, ...

    Granted, arguably, doing a matrix-multiply directly in registers using
    SIMD ops is a bit niche (traditional option being to use scalar
    operations and fetch numbers from memory using "for()" loops, but this
    is slower). Most of the programs don't need fast MatMult though.
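BGB's register arithmetic (16 Binary32 values = 8 paired 64-bit registers per matrix, so two sources plus a product already want 24) refers to keeping the whole of something like the following in registers; this reference C version is the scalar fallback he mentions:

```c
/* Plain scalar 4x4 single-precision matrix multiply, row-major.
 * Each matrix is 16 floats = 8 register pairs if held as 64-bit GPR
 * pairs, which is why two sources plus the destination crowd a
 * 32-GPR file but fit comfortably in 64 GPRs. */
void mat4_mul(float c[16], const float a[16], const float b[16]) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = s;
        }
    }
}
```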



    Annoyingly, it has led to my ISA fragmenting into two variants:
    Baseline: Primarily 32 GPR, 16/32/64/96 encoding;
    Supports R32..R63 for only a subset of the ISA for 32-bit ops.
    For ops outside this subset, needs 64-bit encodings in these cases.
    XG2: Supports R32..R63 everywhere, but loses 16-bit ops.
    By itself, would be easier to decode than Baseline,
    as it drops a bunch of wonky edge cases.
    Though, some cases were dropped from Baseline when XG2 was added.
"Op40x2" was dropped as it was hairy and became mostly moot.

    Then, a common subset exists known as Fix32, which can be decoded in
    both Baseline and XG2 Mode, but only has access to R0..R31.


    Well, and a 3rd sub-variant:
    XG2RV: Uses XG2's encodings but RISC-V's register space.
    R0..R31 are X0..X31;
    R32..R63 are F0..F31.

Arguably the main use-case for XG2RV mode is for ASM blobs intended to be
called natively from RISC-V mode; but...

    It is debatable whether such an operating mode actually makes sense, and
    it might have made more sense to simply fake it in the ASM parser:
    ADD R24, R25, R26 //Uses BJX2 register numbering.
    ADD X14, X15, X16 //Uses RISC-V register remapping.

    Likely, as a sub-mode of either Baseline or XG2 Mode.
    Since, the register remapping scheme is known as part of the ISA spec,
    it could be done in the assembler.

    It is possible that XG2RV mode may eventually be dropped due to "lack of relevance".


    Well, and similarly any ABI thunks would need to be done in Baseline or
    XG2 mode, since neither RV mode nor XG2RV Mode has access to all the registers used for argument passing in BJX2.
    In this case, RISC-V mode only has ~ 26 GPRs (the remaining 6, X0..X5,
    being SPRs or CRs). In the RV modes R0/R4/R5/R14 are inaccessible.


Well, and likewise one wants to limit the number of inter-ISA branches,
as the branch-predictor can't predict these, and they need a full
pipeline flush (a few extra cycles are needed to make sure the L1 I$ is fetching in the correct mode). Technically also the L1 I$ needs to flush
any cache-lines which were fetched in a different mode (the I$ uses
internal tag-bits to figure out things like instruction length and bundling, and to try to help with superscalar in RV mode, *; mostly for timing/latency reasons, ...).


*: The way the BJX2 core deals with superscalar is essentially to
pretend as if RV64 had WEX flag bits, which can be synthesized partly
when fetching cache lines (putting some of the latency in the I$ Miss handling, rather than during instruction-fetch). In the ID stage, it
sees the longer PC step and infers that two instructions are being
decoded as superscalar.

    ...


    Where is your 4% number coming from?



I guess it could make sense, arguably, to come up with test cases
to get a quantitative measurement of the effect of 64 GPRs for programs which can make effective use of them...

    Would be kind of a pain to test as 64 GPR programs couldn't run on a
    kernel built in 32 GPR mode, but TKRA-GL runs most of its backend in kernel-space (and is the main thing in my case that seems to benefit
    from 64 GPRs).

    But, technically, a 32 GPR kernel couldn't run RISC-V programs either.


    So, would likely need to switch GLQuake and similar over to baseline
    mode (and probably messing with "timedemo").




    Checking, as-is, timedemo results for "demo1" are "969 frames 150.5
    seconds 6.4 fps", but this is with my experimental FP8U HDR mode (would
    be faster with RGB555 LDR), at 50 MHz.

    GLQuake, LDR RGB555 mode: "969 frames 119.0 seconds 8.1 fps".

    But, yeah, both are with builds that use 64 GPRs.


    Software Quake: "969 frames 147.4 seconds 6.6 fps"
    Software Quake (RV64G): "969 frames 157.3 seconds 6.2 fps"

    Not going to bother with GLQuake in RISC-V mode, would likely take a painfully long time.

    Well, decided to run this test anyways:
    "969 frames 687.3 seconds 1.4 fps"


    IOW: TKRA-GL runs horribly bad in RV64G mode (and not much can be done
    to make it fast within the limits of RV64G). Though, this is with it
    running GL entirely in RV64 mode (it might fare better as a userland application where the GL backend is running in kernel space in BJX2 mode).

    Though, much of this is likely due more to RV64G's lack of SIMD and
    similar, rather than due to having fewer GPRs.

  • From Terje Mathisen@21:1/5 to Brett on Mon Aug 12 08:22:11 2024
    Brett wrote:
    BGB <cr88192@gmail.com> wrote:
    On 8/11/2024 9:33 AM, Anton Ertl wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    The additional registers obviously did not give these architectures a
    decisive advantage.

    When ARM designed A64, when the RISC-V people designed RISC-V, and
when Intel designed APX, each of them had the opportunity to go for 64
    GPRs, but they decided not to. Apparently the benefits do not
    outweigh the disadvantages.


    In my experience:
    For most normal code, the advantage of 64 GPRs is minimal;
    But, there is some code, where it does have an advantage.
    Mostly involving big loops with lots of variables.


    Sometimes, it is preferable to be able to map functions entirely to
    registers, and 64 does increase the probability of being able to do so
    (though, neither achieves 100% of functions; and functions which map
    entirely to GPRs with 32 will not see an advantage with 64).

    Well, and to some extent the compiler needs to be selective about which
    functions it allows to use all of the registers, since in some cases a
    situation can come up where the saving/restoring more registers in the
    prolog/epilog can cost more than the associated register spills.


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around that call, as it splits your function and burns registers that would otherwise get
    used.

    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.
    But today that is an unserviced market which will get customers to give you
    a look. Put out some vapor ware and see what customers say.

The solution (?) has always looked obvious to me: some form of Huffman
encoding of register specifiers, so that the most common ones (bottom 16
or 32) require just a small amount of space (as today), and then either
a prefix or a suffix provides extra bits when you want to use those
higher register numbers. Mitch's CARRY sets up a single extra register
for a set of operations; a WIDE prefix could contain two extra register
bits for four registers over the next 2 or 3 instructions.
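One way to model the cost side of this idea (my own toy model, not any real encoding): charge the usual 5 bits for r0..r31 and one extra escape bit per specifier for r32..r63, then compare against a flat 6-bit scheme that taxes every specifier.

```c
/* Toy cost model for prefix-escaped register specifiers: low registers
 * keep their 5-bit encoding; high registers pay one extra prefix bit.
 * A flat 6-bit scheme pays the extra bit on every specifier. */
int prefix_bits(const int *regs, int n) {
    int bits = 0;
    for (int i = 0; i < n; i++)
        bits += (regs[i] < 32) ? 5 : 6;   /* +1 escape bit for r32..r63 */
    return bits;
}

int flat6_bits(int n) {
    return 6 * n;
}
```

Regular code that never touches the high registers pays nothing over today's 5-bit fields, which is the "zero cost for regular code" property Terje is after.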

    As long as this doesn't make the decoder a speed limiter, it would be
    zero cost for regular code and still quite cheap except for increasing
    code size by 33-50% for the inner loops of algorithms that need 64 or
    even 128 regs.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Brett on Mon Aug 12 06:29:36 2024
    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    All antiques no longer available.

    SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:

    |Fujitsu will also discontinue their SPARC production [...] end-of-sale
    |in 2029, of UNIX servers and a year later for their mainframe.

    No word of when Oracle will discontinue (or has discontinued) sales,
    but both companies introduced their last SPARC CPUs in 2017.

    In any case, my point still stands: these architectures were
    available, and the large number of registers failed to give them a
    decisive advantage. Maybe it even gave them a decisive disadvantage:
    AMD29K and IA-64 never had OoO implementations, and SPARC got them
    only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
    Power and Alpha switched in 1998 (POWER3, 21264).

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

Interesting. I only skimmed the paper, but I read a lot about
inlining and interprocedural register allocation. SPARC's register
windows and AMD29K's and IA-64's register stacks were intended to be
useful for that, but somehow the other architectures did not suffer a big-enough disadvantage to make them adopt one of these concepts, and
that's despite register windows/stacks working even for indirect calls
(e.g., method calls in the general case), where interprocedural
register allocation or inlining don't help.

    It seems to me that with OoO the cycle cost of spilling and refilling
    on call boundaries was lowered: the spills can be delayed until the
    computation is complete, and the refills can start early because the
    stack pointer tends to be available early.

And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
even if the called function is short, the spilling and refilling
around it (if any) does not increase the latency of the value that's
spilled and refilled. But that consideration is only relevant for
Intel APX; ARM A64 and RISC-V went for 32 registers several years
before zero-cycle store-to-load forwarding was implemented.

    One other optimization that they use the additional registers for is
    "register promotion", i.e., putting values from memory into registers
    for a while (if absence of aliasing can be proven). One interesting
    aspect here is that register promotion with 64 or 256 registers (RP-64
    and RP-256) is usually not much better (if better at all) than
    register promotion with 32 registers (RP-32); see Figure 1. So
    register promotion does not make a strong case for more registers,
    either, at least in this paper.
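The transformation the paper calls register promotion, written out by hand in C (a sketch; the paper's RP-32/RP-64 passes do this automatically once aliasing is disproven):

```c
/* Register promotion by hand: because 'restrict' promises that *total
 * does not alias a[], the accumulator can legally live in a register
 * for the whole loop instead of being loaded and stored every
 * iteration. */
void accumulate(long *restrict total, const long *restrict a, int n) {
    long r = *total;        /* promote the memory cell into a register */
    for (int i = 0; i < n; i++)
        r += a[i];
    *total = r;             /* write it back exactly once */
}
```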

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Aug 12 17:36:30 2024
    On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Brett <ggtgp@yahoo.com> writes:
The lack of CPUs with 64 registers is what makes for a market; that 4%
that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    All antiques no longer available.

    SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:

    |Fujitsu will also discontinue their SPARC production [...] end-of-sale
    |in 2029, of UNIX servers and a year later for their mainframe.

    No word of when Oracle will discontinue (or has discontinued) sales,
    but both companies introduced their last SPARC CPUs in 2017.

    In any case, my point still stands: these architectures were
    available, and the large number of registers failed to give them a
    decisive advantage. Maybe it even gave them a decisive disadvantage:
    AMD29K and IA-64 never had OoO implementations, and SPARC got them
    only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
    Power and Alpha switched in 1998 (POWER3, 21264).

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

    Interesting. I only skimmed the paper, but I read a lot about
    inlining and interprocedural register allocation. SPARCs register
    windows and AMD29K's and IA-64's register stacks were intended to be
    useful for that, but somehow the other architectures did not suffer a big-enough disadvantage to make them adopt one of these concepts, and
    that's despite register windows/stacks working even for indirect calls
    (e.g., method calls in the general case), where interprocedural
    register allocation or inlining don't help.

    It seems to me that with OoO the cycle cost of spilling and refilling
    on call boundaries was lowered: the spills can be delayed until the computation is complete, and the refills can start early because the
    stack pointer tends to be available early.

    And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
    even if the called function is short, the spilling and refilling
    around it (if any) does not increase the latency of the value that's
    spilled and refilled. But that consideration is only relevant for
Intel APX; ARM A64 and RISC-V went for 32 registers several years
    before zero-cycle store-to-load-forwarding was implemented.

    One other optimization that they use the additional registers for is "register promotion", i.e., putting values from memory into registers
    for a while (if absence of aliasing can be proven). One interesting
    aspect here is that register promotion with 64 or 256 registers (RP-64
    and RP-256) is usually not much better (if better at all) than
    register promotion with 32 registers (RP-32); see Figure 1. So
    register promotion does not make a strong case for more registers,
    either, at least in this paper.

With full access to constants, there is even less need to promote
addresses or immediates into registers, as you can simply poof up
any constant you want, whenever you want one.


  • From MitchAlsup1@21:1/5 to BGB on Mon Aug 12 20:12:59 2024
    On Mon, 12 Aug 2024 19:27:22 +0000, BGB wrote:

    On 8/12/2024 12:36 PM, MitchAlsup1 wrote:
    On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Brett <ggtgp@yahoo.com> writes:
    The lack of CPU’s with 64 registers is what makes for a market,
    that 4%
    that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    AMD29K: IIRC a 128-register stack and 64 additional registers

IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
files to make good use of them.

    All antiques no longer available.

    SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:

    |Fujitsu will also discontinue their SPARC production [...] end-of-sale
    |in 2029, of UNIX servers and a year later for their mainframe.

    No word of when Oracle will discontinue (or has discontinued) sales,
    but both companies introduced their last SPARC CPUs in 2017.

    In any case, my point still stands: these architectures were
    available, and the large number of registers failed to give them a
    decisive advantage.  Maybe it even gave them a decisive disadvantage:
    AMD29K and IA-64 never had OoO implementations, and SPARC got them
    only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
    Power and Alpha switched in 1998 (POWER3, 21264).

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

    Interesting.  I only skimmed the paper, but I read a lot about
    inlining and interprocedural register allocation.  SPARCs register
    windows and AMD29K's and IA-64's register stacks were intended to be
    useful for that, but somehow the other architectures did not suffer a
    big-enough disadvantage to make them adopt one of these concepts, and
    that's despite register windows/stacks working even for indirect calls
    (e.g., method calls in the general case), where interprocedural
    register allocation or inlining don't help.

    It seems to me that with OoO the cycle cost of spilling and refilling
    on call boundaries was lowered: the spills can be delayed until the
    computation is complete, and the refills can start early because the
    stack pointer tends to be available early.

    And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
    even if the called function is short, the spilling and refilling
    around it (if any) does not increase the latency of the value that's
    spilled and refilled.  But that consideration is only relevant for
Intel APX; ARM A64 and RISC-V went for 32 registers several years
    before zero-cycle store-to-load-forwarding was implemented.

    One other optimization that they use the additional registers for is
    "register promotion", i.e., putting values from memory into registers
    for a while (if absence of aliasing can be proven).  One interesting
    aspect here is that register promotion with 64 or 256 registers (RP-64
    and RP-256) is usually not much better (if better at all) than
    register promotion with 32 registers (RP-32); see Figure 1.  So
    register promotion does not make a strong case for more registers,
    either, at least in this paper.

    With full access to constants, there is even less need to promote
    addresses or immediates into registers, as you can simply poof up
    any value you want.


    There are tradeoffs still, if constants need space to encode...

    Inline is still better than a memory load, granted.

    It may make sense to consolidate multiple uses of a value into a
    register rather than encoding it as an immediate each time.

    See polpak:: r8_erf()


    r8_erf: ; @r8_erf
    ; %bb.0:
    fabs r2,r1
    fcmp r3,r2,#0x3EF00000
    bngt r3,.LBB141_5
    ; %bb.1:
    fcmp r3,r2,#4
    bngt r3,.LBB141_6
    ; %bb.2:
    fcmp r3,r2,#0x403A8B020C49BA5E
    bnlt r3,.LBB141_7
    ; %bb.3:
    fmul r3,r1,r1
    fdiv r3,#1,r3
    mov r4,#0x3F90B4FB18B485C7
    fmac r4,r3,r4,#0x3FD38A78B9F065F6
    fadd r5,r3,#0x40048C54508800DB
    fmac r4,r3,r4,#0x3FD70FE40E2425B8
    fmac r5,r3,r5,#0x3FFDF79D6855F0AD
    fmac r4,r3,r4,#0x3FC0199D980A842F
    fmac r5,r3,r5,#0x3FE0E4993E122C39
    fmac r4,r3,r4,#0x3F9078448CD6C5B5
    fmac r5,r3,r5,#0x3FAEFC42917D7DE7
    fmac r4,r3,r4,#0x3F4595FD0D71E33C
    fmul r4,r3,r4
    fmac r3,r3,r5,#0x3F632147A014BAD1
    fdiv r3,r4,r3
    fadd r3,#0x3FE20DD750429B6D,-r3
    fdiv r3,r3,r2
    br .LBB141_4
    LBB141_5:
    fmul r3,r1,r1
    fcmp r2,r2,#0x3C9FFE5AB7E8AD5E
    sra r2,r2,#8,#1
    cvtsd r4,#0
    mux r2,r2,r3,r4
    mov r3,#0x3FC7C7905A31C322
    fmac r3,r2,r3,#0x400949FB3ED443E9
    fadd r4,r2,#0x403799EE342FB2DE
    fmac r3,r2,r3,#0x405C774E4D365DA3
    fmac r4,r2,r4,#0x406E80C9D57E55B8
    fmac r3,r2,r3,#0x407797C38897528B
    fmac r4,r2,r4,#0x40940A77529CADC8
    fmac r3,r2,r3,#0x40A912C1535D121A
    fmul r1,r3,r1
    fmac r2,r2,r4,#0x40A63879423B87AD
    fdiv r2,r1,r2
    mov r1,r2
    ret
    LBB141_6:
    mov r3,#0x3E571E703C5F5815
    fmac r3,r2,r3,#0x3FE20DD508EB103E
    fadd r4,r2,#0x402F7D66F486DED5
    fmac r3,r2,r3,#0x4021C42C35B8BC02
    fmac r4,r2,r4,#0x405D6C69B0FFCDE7
    fmac r3,r2,r3,#0x405087A0D1C420D0
    fmac r4,r2,r4,#0x4080C972E588749E
    fmac r3,r2,r3,#0x4072AA2986ABA462
    fmac r4,r2,r4,#0x4099558EECA29D27
    fmac r3,r2,r3,#0x408B8F9E262B9FA3
    fmac r4,r2,r4,#0x40A9B599356D1202
    fmac r3,r2,r3,#0x409AC030C15DC8D7
    fmac r4,r2,r4,#0x40B10A9E7CB10E86
    fmac r3,r2,r3,#0x40A0062821236F6B
    fmac r4,r2,r4,#0x40AADEBC3FC90DBD
    fmac r3,r2,r3,#0x4093395B7FD2FC8E
    fmac r4,r2,r4,#0x4093395B7FD35F61
    fdiv r3,r3,r4
    LBB141_4:
    fmul r4,r2,#16
    fmul r4,r4,#0x3D800000
    rnd r4,r4,#5
    fadd r5,r2,-r4
    fadd r2,r2,r4
    fmul r4,r4,-r4
    fexp r4,r4
    fmul r2,r2,-r5
    fexp r2,r2
    fmul r2,r4,r2
    fadd r2,#0,-r2
    fmac r2,r2,r3,#0x3F000000
    fadd r2,r2,#0x3F000000
    pdlt r1,T
    fadd r2,#0,-r2
    mov r1,r2
    ret
    LBB141_7:
    fcmp r1,r1,#0
    sra r1,r1,#8,#1
    cvtsd r2,#-1
    cvtsd r3,#1
    mux r2,r1,r3,r2
    mov r1,r2
    ret

    All of the constants are used once!

    RISC-V takes 240 instructions and uses 342 words of
    memory {.text, .data, .rodata}

    My 66000 takes 85 instructions and uses 169 words of
    memory {.text, .data, .rodata}

  • From MitchAlsup1@21:1/5 to BGB on Mon Aug 12 22:35:14 2024
    On Mon, 12 Aug 2024 20:58:45 +0000, BGB wrote:

    On 8/12/2024 3:12 PM, MitchAlsup1 wrote:

    See polpak:: r8_erf()


    r8_erf:                                 ; @r8_erf
    <snip>

    All of the constants are used once!

    RISC-V takes 240 instructions and uses 342 words of
    memory {.text, .data, .rodata}

    My 66000 takes 85 instructions and uses 169 words of
    memory {.text, .data, .rodata}


    FWIW:
    FADD Rm, Imm64f, Rn //XG2 Only
    FADD Rm, Imm56f, Rn //

    And:
    FMUL Rm, Imm64f, Rn //XG2 Only
    FMUL Rm, Imm56f, Rn //


    Why don't you download polpak, compile it, and state how many
    instructions it takes and how many words of storage it takes ??

  • From MitchAlsup1@21:1/5 to BGB on Tue Aug 13 01:23:12 2024
    On Tue, 13 Aug 2024 0:34:55 +0000, BGB wrote:

    On 8/12/2024 5:35 PM, MitchAlsup1 wrote:
    On Mon, 12 Aug 2024 20:58:45 +0000, BGB wrote:

    On 8/12/2024 3:12 PM, MitchAlsup1 wrote:

    See polpak:: r8_erf()


    r8_erf:                                 ; @r8_erf
    <snip>

    Why don't you download polpak, compile it, and state how many
    instructions it takes and how many words of storage it takes ??

    Found what I assume you are talking about.

    Needed to add "polpak_test.c" as otherwise BGBCC lacks a main and prunes everything;
    Also needed to hack over some compiler holes related to "_Complex
    double" to get it to build;
    Also needed to stub over some library functions that were added in C99
    but missing in my C library.

    I only ask for r8_erf()

    <snip>

    As for "r8_erf()":

    <===

    r8_erf:
    <snip>

    I count 283 instructions compared to my 85, including the 104
    instructions it takes your compiler to get to the 1st instruction in
    My 66000 code !!

    In the middle I see much the same problems as RISC-V has:: while
    you have the ability to poof constants, you can't use them without
    wasting registers, and in general an inefficient FP instruction set
    {no FMAC, no sign control on operands, no transcendental
    instructions; though at least your FP compare-branches are not as
    poor as RISC-V's}.

    It is true that LLVM can unroll loops, and when the loop is
    consuming only constants Brian's compiler just emits the polynomial
    directly with nary a LD or ST, just constants as operands; whereas
    your compiler poofs constants into existence rather than forwarding
    them directly into execution. Every poof costs you an instruction;
    mine just costs instruction space, not pipeline delay.

    I think this demonstrates my point perfectly--universal constants inside
    a RISC instruction set is a BIG WIN.

    It also illustrates the fact that a RISC ISA needs a good compiler.

  • From MitchAlsup1@21:1/5 to BGB on Tue Aug 13 17:24:30 2024
    On Tue, 13 Aug 2024 3:50:04 +0000, BGB wrote:

    On 8/12/2024 8:23 PM, MitchAlsup1 wrote:
    On Tue, 13 Aug 2024 0:34:55 +0000, BGB wrote:

    On 8/12/2024 5:35 PM, MitchAlsup1 wrote:
    On Mon, 12 Aug 2024 20:58:45 +0000, BGB wrote:

    On 8/12/2024 3:12 PM, MitchAlsup1 wrote:

    See polpak:: r8_erf()


    r8_erf:                                 ; @r8_erf
    <snip>

    Why don't you download polpak, compile it, and state how many
    instructions it takes and how many words of storage it takes ??

    Found what I assume you are talking about.

    Needed to add "polpak_test.c" as otherwise BGBCC lacks a main and
    prunes everything;
    Also needed to hack over some compiler holes related to "_Complex
    double" to get it to build;
    Also needed to stub over some library functions that were added in C99
    but missing in my C library.

    I only ask for r8_erf()

    <snip>

    As for "r8_erf()":

    <===

    r8_erf:
    <snip>

    I count 283 instructions compared to my 85, including the 104
    instructions it takes your compiler to get to the 1st instruction in
    My 66000 code !!


    Yeah, this is a compiler issue...

    Why not sit down and code it in ASM to see what your ISA can really do?
    Feel free to use My 66000 code as an example.

    It might have been less if the code was like:
    static const double somearr[8]={ ... };

    But, this would still have used memory loads.
    Getting the constants into expressions would likely require using
    #define or similar...

    This is admittedly more how I would have imagined performance-oriented
    code to be written. Not so much with dynamically initialized arrays.

    That particular piece of code was originally written in FORTRAN,
    probably late 1960s or early 1970s, then ported to C a while back.

    <snip>

    But, as I will note, even with this general level of lackluster
    code generation, I have still been managing to often beat RV64G
    performance...

    Anybody claiming RISC-V has a good ISA should have their degree revoked.

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Aug 13 19:21:04 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Anybody claiming RISC-V has a good ISA should have their degree revoked.

    Interesting datapoint: In the GhostWrite paper, they say that 84.03%
    of the RISC-V instruction space are taken up.

    I could probably gather the same statistic for My 66000...

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Aug 13 20:41:10 2024
    On Tue, 13 Aug 2024 19:21:04 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Anybody claiming RISC-V has a good ISA should have their degree revoked.

    Interesting datapoint: In the GhostWrite paper, they say that 84.03%
    of the RISC-V instruction space are taken up.

    I could probably gather the same statistic for My 66000...

    Most of my groups have a bit under ½ of their space left.

    Major:: 22 of 64 left
    Mem:::: 32 of 64 left
    2-OP::: 33 of 64 left
    3-OP::: 4 of 8 left
    1-OP::: 56 of 64 left
    misc::: 9 of 16 left

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Aug 14 13:15:07 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 13 Aug 2024 19:21:04 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Anybody claiming RISC-V has a good ISA should have their degree revoked.

    Interesting datapoint: In the GhostWrite paper, they say that 84.03%
    of the RISC-V instruction space are taken up.

    I could probably gather the same statistic for My 66000...

    Most of my groups have a bit under ½ of their space left.

    Major:: 22 of 64 left
    Mem:::: 32 of 64 left
    2-OP::: 33 of 64 left
    3-OP::: 4 of 8 left
    1-OP::: 56 of 64 left
    misc::: 9 of 16 left

    Yep, but there are also gaps in there.

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Aug 14 16:59:58 2024
    On Wed, 14 Aug 2024 13:15:07 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 13 Aug 2024 19:21:04 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Anybody claiming RISC-V has a good ISA should have their degree revoked.
    Interesting datapoint: In the GhostWrite paper, they say that 84.03%
    of the RISC-V instruction space are taken up.

    I could probably gather the same statistic for My 66000...

    Most of my groups have a bit under ½ of their space left.

    Major:: 22 of 64 left

    I forgot to mention that 6 of the taken are permanently reserved
    to prevent jumping into code and having anything execute,
    independent of whether --E is permitted. So only 36 of 64 are in
    use.

    Mem:::: 32 of 64 left
    2-OP::: 33 of 64 left
    3-OP::: 4 of 8 left
    1-OP::: 56 of 64 left
    misc::: 9 of 16 left

    Yep, but there are also gaps in there.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Aug 14 22:06:46 2024
    On Sun, 11 Aug 2024 14:33:33 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    The lack of CPUs with 64 registers is what makes for a market; that
    4% that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    SPARCs FPGA through UltraSPARC used 1 full cycle to access the
    windowed register file while MIPS, 88K, and early Alphas used 1/2
    cycle. So the SPARC architecture saddled them with an inherent
    disadvantage....

    AMD29K: IIRC a 128-register stack and 64 additional registers

    Similar issues.

    IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
    files to make good use of them.

    Don't know for certain, but I would expect the same as above.

    The additional registers obviously did not give these architectures a decisive advantage.

    Captain Obvious strikes again

    Oh, and BTW, that 1/2 cycle of delay getting started should have cost
    ~5% IPC. But SAPRC never achieved high clock frequencies, nor did IA-64.

    When ARM designed A64, when the RISC-V people designed RISC-V, and
    when Intel designed APX, each of them had the opportinity to go for 64
    GPRs, but they decided not to. Apparently the benefits do not
    outweigh the disadvantages.

    Where is your 4% number coming from?

    - anton

  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Aug 14 22:26:27 2024
    On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    Where is your 4% number coming from?

    The 4% number is poor memory and a guess.
    Here is an antique paper on the issue:

    https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf

    Interesting. I only skimmed the paper, but I read a lot about
    inlining and interprocedural register allocation. SPARCs register
    windows and AMD29K's and IA-64's register stacks were intended to be
    useful for that, but somehow the other architectures did not suffer a big-enough disadvantage to make them adopt one of these concepts, and
    that's despite register windows/stacks working even for indirect calls
    (e.g., method calls in the general case), where interprocedural
    register allocation or inlining don't help.

    The problem of register-windows is when "you miss the cache",
    first you have to take the exception,
    then you have to blindly push an IN or pull an OUT with no knowledge
    of how many registers are in use (or several of them),
    then you have to return from the exception.

    So, you have two exception control transfers, and a blind copy of
    fixed sized data, loss of a few TLB entries, and loss of a few
    cache lines of data+instructions.

    Whereas MIPS, 88k, Alpha, RISC-V always "hit in the cache" so to
    speak.

    There was an old paper that stated the MIPS team had an optimizing
    compiler up and running, while the SPARC team bet on HW to compensate
    for their lack. History has chosen the non-SPARC path.

  • From MitchAlsup1@21:1/5 to Brett on Wed Aug 14 22:19:32 2024
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around
    that call, as it splits your function and burns registers that would
    otherwise get used.

    What I see around calls is MOV instructions grabbing arguments from
    the preserved registers and putting return values into the proper
    preserved register. Inlining does get rid of these MOVs, but what
    else ??

    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.

    I am on record as stating the proper number of bits in an
    instruction-specifier is 34-bits. This is after designing Mc88K ISA,
    doing 3 generations of SPARC chips, 7 years of x86-64, and Samsung
    GPU (and my own efforts). Making the registers 6-bits would increase
    that count to 36-bits.

    34-bits comes from having enough Entropy to encode what needs
    encoding and making careful data-driven choices on "what to put in
    and what to leave out" and finding a clever means to access
    vectorization and multi-precision calculations. Without both of
    those, 36 would likely be the best option for the 32-register
    variants.

  • From MitchAlsup1@21:1/5 to BGB on Wed Aug 14 22:43:07 2024
    On Wed, 14 Aug 2024 10:15:58 +0000, BGB wrote:

    On 8/13/2024 12:24 PM, MitchAlsup1 wrote:

    Assuming I use all of the ISA features that currently exist:

    r8_erf: ; @r8_erf
    MOV R4, R1
    FABS R1,R2
    FCMPGT 0x3780, R2 //Half
    BF .LBB141_5

    FCMPGT 0x4400, R2 //Half
    BF .LBB141_6

    FCMPGE 0x403A8B020C49BA5E, R2
    BT .LBB141_7

    FMUL R1, R1, R3
    FLDCH 0x3C00, R2
    FDIV R2, R3, R3
    MOV 0x3F90B4FB18B485C7, R4
    MOV 0x3FD38A78B9F065F6, R16
    FMAC R3, R16, R4, R4
    FADD R3, 0x40048C54508800DB, R5

    MOV 0x3FD70FE40E2425B8, R16
    FMAC R3, R16, R4, R4

    MOV 0x3FFDF79D6855F0AD, R16
    FMAC R3, R16, R5, R5

    MOV 0x3FC0199D980A842F, R16
    FMAC R3, R16, R4, R4
    MOV 0x3FE0E4993E122C39, R16
    FMAC R3, R16, R5, R5
    MOV 0x3F9078448CD6C5B5, R16
    FMAC R3, R16, R4, R4
    MOV 0x3FAEFC42917D7DE7, R16
    FMAC R3, R16, R5, R5
    MOV 0x3F4595FD0D71E33C, R16
    FMAC R3, R16, R4, R4

    FMUL R4,R3,R4
    MOV 0x3F632147A014BAD1, R16
    FMAC R5, R3, R16, R3
    FDIV R4, R3, R3
    FNEG R3, R3
    FADD R3, 0x3FE20DD750429B6D, R3
    FDIV R3, R2, R3
    BRA .LBB141_4
    LBB141_5:
    FMUL R1, R1, R3
    MOV 0, R4
    FCMPGT 0x3C9FFE5AB7E8AD5E, R2
    CSELT R3, R4, R2
    MOV 0x3FC7C7905A31C322, R3

    MOV 0x400949FB3ED443E9, R16
    fmac R2, R16, R3, R3
    FADD R2,#0x403799EE342FB2DE, R4

    MOV 0x405C774E4D365DA3, R16
    FMAC R2, R16, R3, R3
    MOV 0x406E80C9D57E55B8, R16
    FMAC R2, R16, R4, R4

    MOV 0x407797C38897528B, R16
    FMAC R2, R16, R3, R3
    MOV 0x40940A77529CADC8, R16
    FMAC R2, R16, R4, R4
    MOV 0x40A912C1535D121A, R16
    FMAC R2, R16, R3, R3

    FMUL R3, R1, R1
    MOV 0x40A63879423B87AD, R16
    FMAC R2, R16, R4, R2
    FDIV R1, R2, R2
    RTS

    LBB141_6:
    MOV 0x3E571E703C5F5815, R3
    fmac r3,r2,r3,#0x3FE20DD508EB103E
    fadd r4,r2,#0x402F7D66F486DED5
    fmac r3,r2,r3,#0x4021C42C35B8BC02
    fmac r4,r2,r4,#0x405D6C69B0FFCDE7
    fmac r3,r2,r3,#0x405087A0D1C420D0
    fmac r4,r2,r4,#0x4080C972E588749E
    fmac r3,r2,r3,#0x4072AA2986ABA462
    fmac r4,r2,r4,#0x4099558EECA29D27
    fmac r3,r2,r3,#0x408B8F9E262B9FA3
    fmac r4,r2,r4,#0x40A9B599356D1202
    fmac r3,r2,r3,#0x409AC030C15DC8D7
    fmac r4,r2,r4,#0x40B10A9E7CB10E86
    fmac r3,r2,r3,#0x40A0062821236F6B
    fmac r4,r2,r4,#0x40AADEBC3FC90DBD
    fmac r3,r2,r3,#0x4093395B7FD2FC8E
    fmac r4,r2,r4,#0x4093395B7FD35F61
    fdiv r3,r3,r4
    LBB141_4:
    FMUL R2, 0x40300000, R4
    FMUL R4, 0x3FB00000, R4
    FSTCI R4, R4
    FLDCI R4, R4
    FNEG R4, R6
    fadd R2, R6, R5
    fadd R2, R4, R2
    fmul R4, R6, R4
    fexp r4,r4 //?

    fmul R2,R7, R2
    fexp r2,r2
    fmul R4, R2, R2
    FNEG R2, R2
    fmac r2,r2,r3,#0x3F000000
    fadd r2,r2,#0x3F000000
    pdlt r1,T //?
    fadd r2,#0,-r2
    RTS
    LBB141_7:
    FLDCH 0xBC00, R2
    FLDCH 0x3C00, R3
    FCMPGT 0, R1
    CSELT R2,R3,R2
    RTS

    Not bad: I count 101 instructions and 183 words of memory.
    {{I checked nothing}}

  • From Brett@21:1/5 to mitchalsup@aol.com on Thu Aug 15 00:36:57 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around
    that call, as it splits your function and burns registers that would
    otherwise get used.

    What I see around calls is MOV instructions grabbing arguments from
    the preserved registers and putting return values into the proper
    preserved register. Inlining does get rid of these MOVs, but what
    else ??

    For middling functions, I spent my time optimizing heavy code, the 10% that matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count; with twice the registers the
    compiler will generate far larger leaf calls with less call depth,
    which removes more of those MOVs.

    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.

    I am on record as stating the proper number of bits in an
    instruction-specifier is 34-bits. This is after designing Mc88K ISA,
    doing 3 generations of SPARC chips, 7 years of x86-64, and Samsung
    GPU (and my own efforts). Making the registers 6-bits would increase
    that count to 36-bits.

    My 66000 hurts less with 6-bit specifiers, as more constant bits get
    moved to extension words, which is almost free by most metrics.

    Only My 66000 can reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    All the customer will see is more registers, more performance, on top of
    all your other substantial improvements.

    34-bits comes from having enough Entropy to encode what needs
    encoding and making careful data-driven choices on "what to put in
    and what to leave out" and finding a clever means to access
    vectorization and multi-precision calculations. Without both of
    those, 36 would likely be the best option for the 32-register
    variants.

  • From Brett@21:1/5 to Brett on Thu Aug 15 00:54:15 2024
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around
    that call, as it splits your function and burns registers that would
    otherwise get used.

    What I see around calls is MOV instructions grabbing arguments from the
    preserved registers and putting return values into the proper preserved
    register. Inlining does get rid of these MOVs, but what else ??

    For middling functions, I spent my time optimizing heavy code, the 10% that matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count; with twice the registers the
    compiler will generate far larger leaf calls with less call depth,
    which removes more of those MOVs.

    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.

    I am on record as stating the proper number of bits in an
    instruction-specifier is 34-bits. This is after designing Mc88K ISA,
    doing 3 generations of SPARC chips, 7 years of x86-64, and Samsung
    GPU (and my own efforts). Making the registers 6-bits would increase
    that count to 36-bits.

    My 66000 hurts less with 6-bit specifiers, as more constant bits get
    moved to extension words, which is almost free by most metrics.

    Only My 66000 can reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    All the customer will see is more registers, more performance, on top of
    all your other substantial improvements.

    34-bits comes from having enough Entropy to encode what needs
    encoding and making careful data-driven choices on "what to put in
    and what to leave out" and finding a clever means to access
    vectorization and multi-precision calculations. Without both of
    those, 36 would likely be the best option for the 32-register
    variants.

  • From Stephen Fuld@21:1/5 to Brett on Wed Aug 14 22:21:28 2024
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

    A call can cause a significant amount of garbage code all around
    that call, as it splits your function and burns registers that would
    otherwise get used.

    What I see around calls is MOV instructions grabbing arguments from
    the preserved registers and putting return values into the proper
    preserved register. Inlining does get rid of these MOVs, but what
    else ??

    For middling functions, I spent my time optimizing heavy code, the
    10% that matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count; with twice the registers the
    compiler will generate far larger leaf calls with less call depth,
    which removes more of those MOVs.

    I can understand the reluctance to go to 6 bit register specifiers,
    it burns up your opcode space and makes encoding everything more
    difficult.
    I am on record as stating the proper number of bits in an
    instruction-specifier is 34-bits. This is after designing Mc88K ISA,
    doing 3 generations of SPARC chips, 7 years of x86-64, and Samsung
    GPU (and my own efforts). Making the registers 6-bits would increase
    that count to 36-bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

    Only My 66000 can reasonably be able to implement 6-bits register
    specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Also longer context switch times, as more registers to save/restore.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Aug 15 08:45:30 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 11 Aug 2024 14:33:33 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    The lack of CPUs with 64 registers is what makes for a market; that
    4% that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    SPARCs FPGA through UltraSPARC used 1 full cycle to access the
    windowed register file while MIPS, 88K, and early Alphas used 1/2
    cycle.

    Maybe.  Obviously did not prevent them from having ALU instructions
    with one-cycle latency and loads with 2-cycle latency in the early
    implementations, just like MIPS R2000.  And the clock rate of the
    SPARC MB86900 (14.28MHz) is not worse than the clock rate of the MIPS
    R2000 (8.3, 12.5, and 15MHz grades), and that despite having the
    interlocks that MIPS were so proud of not having.

    Oh, and BTW, that 1/2 cycle of delay getting started should have cost
    ~5% IPC. But SAPRC never achieved high clock frequencies, nor did IA-64.

    As mentioned above, the clock rate was competitive with the early
    MIPS. If we look at more recent times, the in-order UltraSPARC IV+
    (90nm) achieved 2100MHz in 2007; Intel sold 3GHz 65nm Core 2 Duo E6850
    at the time, so the UltraSPARC IV+ was not that far off. This
    undermines my theory that in-order designs have problems achieving
    high clock rates.

    Going for OoO implementations, the Fujitsu SPARC64 V+ (90nm) was
    shipped in 2004 with 1.89GHz and in 2006 with 2.16GHz.  AMD shipped
    the 2.2GHz Athlon 64 3500+ (90nm) in 2004 and a 2.4GHz 90nm version in
    2006, so the SPARC64 V+ was not far off.

    Fujitsu continued their line until the 4.25GHz SPARC64 XII in 2017.
    For comparison: AMD released the Ryzen 1800X in 2017 and that
    supposedly can turbo up to 4GHz (but when I just measured it (with 1
core loaded), it achieved <3.7GHz). Intel sold the Core i7-8700K
    starting on Oct 5, 2017, which achieved 4.7GHz.

    Oracle released the 5000MHz SPARC M8 in 2017.

    Maybe SAPCR (sic!) did not achieve high clock rates, but SPARC did.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Thu Aug 15 17:05:48 2024
    On Thu, 15 Aug 2024 08:45:30 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 11 Aug 2024 14:33:33 +0000, Anton Ertl wrote:

    Brett <ggtgp@yahoo.com> writes:
    The lack of CPU’s with 64 registers is what makes for a market,
    that 4% that could benefit have no options to pick from.

    They had:

    SPARC: Ok, only 32 GPRs available at a time, but more in hardware
    through the Window mechanism.

    SPARCs FPGA through UltraSPARC used 1 full cycle to access the
windowed register file while MIPS, 88K, and early Alphas used 1/2
    cycle.

Maybe. Obviously that did not prevent them from having ALU instructions
with one-cycle latency and loads with two-cycle latency in the early implementations, just like the MIPS R2000. And the clock rate of the
    SPARC MB86900 (14.28MHz) is not worse than the clock rate of the MIPS
    R2000 (8.3, 12.5, and 15MHz grades), and that despite having the
    interlocks that MIPS were so proud of not having.

    Oh, and BTW, that 1/2 cycle of delay getting started should have cost
    ~5% IPC. But SAPRC never achieved high clock frequencies nor dis
    IA-64.

As mentioned above, the clock rate was competitive with the early
    MIPS. If we look at more recent times, the in-order UltraSPARC IV+
    (90nm) achieved 2100MHz in 2007; Intel sold 3GHz 65nm Core 2 Duo E6850
    at the time, so the UltraSPARC IV+ was not that far off.

    Even more so 12 years earlier:
UltraSPARC - 200 MHz
    PPro - 200 MHz
    R10K - 195 MHz
    PA-RISC 8000 - 180 MHz, but few months later and much pricier

    This
    undermines my theory that in-order designs have problems achieving
    high clock rates.


    POWER6 (same year) is much heavier blow to your theory.


    Going for OoO implementations, the Fujitsu SPARC64 V+ (90nm) was
shipped in 2004 with 1.89GHz and in 2006 with 2.16GHz. AMD shipped
    the 2.2GHz Athlon 64 3500+ (90nm) in 2004 and a 2.4GHz 90nm version in
    2006, so the SPARC64 V+ was not far off.

    Fujitsu continued their line until the 4.25GHz SPARC64 XII in 2017.
    For comparison: AMD released the Ryzen 1800X in 2017 and that
    supposedly can turbo up to 4GHz (but when I just measured it (with 1
core loaded), it achieved <3.7GHz). Intel sold the Core i7-8700K
    starting on Oct 5, 2017, which achieved 4.7GHz.

    Oracle released the 5000MHz SPARC M8 in 2017.

    Maybe SAPCR (sic!) did not achieve high clock rates, but SPARC did.

    - anton

    Was not Mitch himself involved in design of hyperSPARC that eventually
    reached very respectable clock frequency?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Michael S on Thu Aug 15 10:14:21 2024
    On 8/15/2024 9:33 AM, Michael S wrote:
    On Wed, 14 Aug 2024 22:06:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    SPARCs FPGA <snip>

    F - Fujitsu (?)
    P - ???
    G - gate
    A - array

    Half right. Field Programmable Gate Array. I.E. a "gate array" that
    can be programmed in the field, as opposed to the factory.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Aug 15 19:33:05 2024
    On Wed, 14 Aug 2024 22:06:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    SPARCs FPGA <snip>

    F - Fujitsu (?)
    P - ???
    G - gate
    A - array

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Thu Aug 15 18:04:08 2024
    On Thu, 15 Aug 2024 14:05:48 +0000, Michael S wrote:

    On Thu, 15 Aug 2024 08:45:30 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    Was not Mitch himself involved in design of hyperSPARC that eventually reached very respectable clock frequency?

    We got HyperSPARC up to 200 MHz and had a 250 MHz version in debug.

    This was 20%-25% slower that the competition on the "average" SPARC
    workload, but for some reason the "wall street traders" bought ship-
    loads of them as they were somewhat faster than SuperSPARC or
    UltraSPARC on that kind of workload--where milliseconds faster
    means millions of dollars.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Stephen Fuld on Thu Aug 15 23:54:42 2024
    On Thu, 15 Aug 2024 10:14:21 -0700
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 8/15/2024 9:33 AM, Michael S wrote:
    On Wed, 14 Aug 2024 22:06:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    SPARCs FPGA <snip>

    F - Fujitsu (?)
    P - ???
    G - gate
    A - array

    Half right. Field Programmable Gate Array. I.E. a "gate array" that
    can be programmed in the field, as opposed to the factory.




    Don't you think that if I am asking then I have reasons to think that
    Mitch didn't mean "Field Programmable" ?

    BTW, logic (HDL) design of FPGA-based embedded systems is part of what
I have been doing for a living for the last 25 years.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Thu Aug 15 21:10:54 2024
    On Thu, 15 Aug 2024 20:54:42 +0000, Michael S wrote:

    On Thu, 15 Aug 2024 10:14:21 -0700
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 8/15/2024 9:33 AM, Michael S wrote:
    On Wed, 14 Aug 2024 22:06:46 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    SPARCs FPGA <snip>

    F - Fujitsu (?)
    P - ???
    G - gate
    A - array

    Half right. Field Programmable Gate Array. I.E. a "gate array" that
    can be programmed in the field, as opposed to the factory.




    Don't you think that if I am asking then I have reasons to think that
    Mitch didn't mean "Field Programmable" ?

    I could have been misremembering the ASIC SPARC instead of the FPGA
    SPARC.

    BTW, logic (HDL) design of FPGA-based embedded systems is part of what
I have been doing for a living for the last 25 years.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Stephen Fuld on Fri Aug 16 04:30:54 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

A call can cause a significant amount of garbage code all around that call,
as it splits your function and burns registers that would otherwise get used.

What I see around calls is MOV instructions grabbing arguments from the preserved registers and putting return values into the proper preserved register. Inlining does get rid of these MOVs, but what else ??

For middling functions, I spent my time optimizing heavy code, the 10% that matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

Inlining is limited by register count; with twice the registers the
compiler will generate far larger leaf calls with less call depth, which removes more of those MOVs.

I can understand the reluctance to go to 6-bit register specifiers; it burns up your opcode space and makes encoding everything more difficult.

I am on record as stating the proper number of bits in an instruction-specifier is 34 bits. This is after designing the Mc88K ISA, doing 3 generations of SPARC chips, 7 years of x86-64, and the Samsung GPU (and my own efforts). Making the registers 6 bits would increase that count to 36 bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

Only My 66000 could reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Actually due to the removal of MOVs and reloads the code density may be basically the same.

    Also longer context switch times, as more registers to save/restore.

The save should be free, as the load from RAM is so slow.
    If the context is time critical it should be written to use the registers
    that are reloaded first, first. In which case the code could start doing
    work in the same amount of time regardless of register count. (I doubt the
    CPU design is actually that smart, or that the people that program the interrupts are.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Brett on Fri Aug 16 18:33:36 2024
    Brett <ggtgp@yahoo.com> wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

A call can cause a significant amount of garbage code all around that call,
as it splits your function and burns registers that would otherwise get used.

What I see around calls is MOV instructions grabbing arguments from the preserved registers and putting return values into the proper preserved register. Inlining does get rid of these MOVs, but what else ??

    For middling functions, I spent my time optimizing heavy code, the 10% that
    matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count, with twice the registers the
compiler will generate far larger leaf calls with less call depth, which removes more of those MOVs.

I can understand the reluctance to go to 6-bit register specifiers; it burns up your opcode space and makes encoding everything more difficult.

I am on record as stating the proper number of bits in an instruction-specifier is 34 bits. This is after designing the Mc88K ISA, doing 3 generations of SPARC chips, 7 years of x86-64, and the Samsung GPU (and my own efforts). Making the registers 6 bits would increase that count to 36 bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

Only My 66000 could reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Actually due to the removal of MOVs and reloads the code density may be basically the same.

    Also longer context switch times, as more registers to save/restore.

The save should be free, as the load from RAM is so slow.
    If the context is time critical it should be written to use the registers that are reloaded first, first. In which case the code could start doing
    work in the same amount of time regardless of register count. (I doubt the CPU design is actually that smart, or that the people that program the interrupts are.)

    When I wrote that I was thinking of visible registers, rename messes that
    up…

    But an interrupt does not need a full register set state to start up, so my comment is valid after all.

    One might need to change how one writes interrupt code, have not done that much, and it was 20 years ago.

    Brett

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Fri Aug 16 18:50:05 2024
    On Fri, 16 Aug 2024 4:30:54 +0000, Brett wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


    Another benefit of 64 registers is more inlining removing calls.

A call can cause a significant amount of garbage code all around that call,
as it splits your function and burns registers that would otherwise get used.

What I see around calls is MOV instructions grabbing arguments from the preserved registers and putting return values into the proper preserved register. Inlining does get rid of these MOVs, but what else ??

For middling functions, I spent my time optimizing heavy code, the 10% that
    matters.

    The first half of a big function will have some state that has to be
    reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count, with twice the registers the
compiler will generate far larger leaf calls with less call depth, which removes more of those MOVs.

I can understand the reluctance to go to 6-bit register specifiers; it burns up your opcode space and makes encoding everything more difficult.

I am on record as stating the proper number of bits in an instruction-specifier is 34 bits. This is after designing the Mc88K ISA, doing 3 generations of SPARC chips, 7 years of x86-64, and the Samsung GPU (and my own efforts). Making the registers 6 bits would increase that count to 36 bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

Only My 66000 could reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Actually due to the removal of MOVs and reloads the code density may be basically the same.

    Anytime one removes more "MOVs and saves and restore" instructions
    than the called subroutine contains within the prologue and epilogue
    bounds, the subroutine should be inlined.

    Also longer context switch times, as more registers to save/restore.

The save should be free, as the load from RAM is so slow.

    When HW is doing the saves, the saves can be performed while
    waiting for the first instruction to arrive and for the first
    registers to arrive. Thus, done in HW, the saves are essentially
    free.

    If the context is time critical it should be written to use the
    registers that are reloaded first, first. In which case the code
    could start doing work in the same amount of time regardless of
    register count. (I doubt the CPU design is actually that smart,
    or that the people that program the interrupts are.)

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    For example a 1-wide machine with a 4-ported register file,
    generally operated as 3R1W can be switched to 4R or 4W for
    epilogue or prologue uses respectively. Simulation indicates
    this gets rid of 47% of the cycles spent in prologue and
    epilogue (combined compared to a sequence of stores and loads)
    Simulation also indicates that 42% of the power is saved--
    mainly from Tag and TLB non-access cycles.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sat Aug 17 00:24:24 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 16 Aug 2024 4:30:54 +0000, Brett wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 8/14/2024 5:54 PM, Brett wrote:
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:

    BGB <cr88192@gmail.com> wrote:


Another benefit of 64 registers is more inlining removing calls.
A call can cause a significant amount of garbage code all around that call,
as it splits your function and burns registers that would otherwise get used.

What I see around calls is MOV instructions grabbing arguments from the preserved registers and putting return values into the proper preserved register. Inlining does get rid of these MOVs, but what else ??

For middling functions, I spent my time optimizing heavy code, the 10% that
    matters.

The first half of a big function will have some state that has to be reloaded after a call, or worse yet saved and reloaded.

    Inlining is limited by register count, with twice the registers the
compiler will generate far larger leaf calls with less call depth, which removes more of those MOVs.

I can understand the reluctance to go to 6-bit register specifiers; it burns up your opcode space and makes encoding everything more difficult.

I am on record as stating the proper number of bits in an instruction-specifier is 34 bits. This is after designing the Mc88K ISA, doing 3 generations of SPARC chips, 7 years of x86-64, and the Samsung GPU (and my own efforts). Making the registers 6 bits would increase that count to 36 bits.

    My 66000 hurts less with 6-bits as more constants bits get moved to
    extension words, which is almost free by most metrics.

Only My 66000 could reasonably implement 6-bit register specifiers.
    The market is yours for the taking.

    6-bits will make you stand out and get noticed.

    The only down side I see is a few percent in code density.

    Actually due to the removal of MOVs and reloads the code density may be
    basically the same.

    Anytime one removes more "MOVs and saves and restore" instructions
    than the called subroutine contains within the prologue and epilogue
    bounds, the subroutine should be inlined.

    Also longer context switch times, as more registers to save/restore.

The save should be free, as the load from RAM is so slow.

    When HW is doing the saves, the saves can be performed while
    waiting for the first instruction to arrive and for the first
    registers to arrive. Thus, done in HW, the saves are essentially
    free.

    If the context is time critical it should be written to use the
    registers that are reloaded first, first. In which case the code
    could start doing work in the same amount of time regardless of
    register count. (I doubt the CPU design is actually that smart,
    or that the people that program the interrupts are.)

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.
    But has anyone told the software guys?
    Of course convincing programmers to RTFM is futile. ;(

If so, this is the first I have heard that more registers are not bad for interrupt response time.

    So we are back to finding any downsides for 64 registers in My 66000.

    Lack of actual significant benefits is irrelevant, as all the programers
    are 100% convinced that it will help some of their code. ;)

    For example a 1-wide machine with a 4-ported register file,
    generally operated as 3R1W can be switched to 4R or 4W for
    epilogue or prologue uses respectively. Simulation indicates
    this gets rid of 47% of the cycles spent in prologue and
    epilogue (combined compared to a sequence of stores and loads)
    Simulation also indicates that 42% of the power is saved--
    mainly from Tag and TLB non-access cycles.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Brett on Sat Aug 17 07:44:49 2024
    Brett <ggtgp@yahoo.com> schrieb:
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    Anytime one removes more "MOVs and saves and restore" instructions
    than the called subroutine contains within the prologue and epilogue
    bounds, the subroutine should be inlined.

    In principle, yes.

    You can either use C++ headers, which result in huge compilation
    times, or you can use LTO. LTO, if done right, is a huge time-eater
(I was looking for an English translation of "Zeitgrab", literally
    "time grave" or "time tomb", this was the best I could come up with).

    [...]

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.
    But has anyone told the software guys?

    Software guys generally work with high-level languages where this is irrelevant, except for...

    Of course convincing programmers to RTFM is futile. ;(

    ...people writing operating systems or drivers, and they better
    read the docs for the architecture they are working on.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding space. Not sure if you have Mitch's document, but having
    one more bit per register would reduce the 16-bit data in the
    offset to 14 (no way you can expand that by a factor of four),
    would require eight instead of one major opcodes for the three-
register instructions, and the four-register instructions like FMA...
    you get the picture.
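The offset arithmetic can be sketched numerically. This is a toy bit-budget model: the 32-bit instruction word, the 6-bit major opcode, and the two-register base+offset load/store format are my assumptions about the encoding, not quotes from Mitch's document.

```python
# Toy bit budget for a base+offset load/store in a fixed 32-bit word.
WORD_BITS = 32
MAJOR_OPCODE_BITS = 6   # assumed major-opcode width

def offset_bits(reg_field_bits, num_reg_fields=2):
    """Bits left for the immediate offset after opcode and registers."""
    return WORD_BITS - MAJOR_OPCODE_BITS - num_reg_fields * reg_field_bits

print(offset_bits(5))   # 16 -> the current 16-bit displacement
print(offset_bits(6))   # 14 -> what 6-bit (64-register) specifiers leave
```

Each register field widened by one bit takes a bit from the offset, so two fields cost the two bits that turn 16 into 14.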

    This would not matter if we were still living in a 36-bit world,
    but the days of the IBM 704, the PDP-10 or the UNIVAC 1100 have
    passed, except for emulation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Brett on Sat Aug 17 07:29:34 2024
    Brett <ggtgp@yahoo.com> schrieb:

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding space.

    Do you have Mitch's ISA document? Memory access instructions
    would be restricted to 14 bit offsets, standard three-register
    arithmetic would use eight instead of one major opcode, and FMA
    and friends...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sat Aug 17 20:08:55 2024
    On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.

    The Instructions and the compiler's use of them were co-developed.

    But has anyone told the software guys?

    Use HLLs and you don't have to.

    Of course convincing programmers to RTFM is futile. ;(

    Done with Instructions in HW one has to convince exactly two
    people; GCC code generator and LLVM code generator.

If so, this is the first I have heard that more registers are not bad for interrupt response time.

    They are also bad for pipeline stage times.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding
    pipeline staging
    context switch times

    For example, My 66000 current encoding has room for 8 instructions
    in the FMAC category (4 in use) with 6-bit register specifiers
    I would need 4 major OpCodes instead of 1.
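A rough way to see where the factor of 4 could come from; the field widths and the split between minor- and major-opcode bits here are my assumptions, not a description of the actual FMAC format.

```python
# A four-operand FMAC names 4 registers; going from 5- to 6-bit
# specifiers consumes 4 extra bits of a fixed 32-bit word.
extra_bits = 4 * (6 - 5)
# If 2 of those bits can be reclaimed from the minor-opcode field
# (an assumption), the other 2 must come from major-opcode space:
majors_needed = 2 ** (extra_bits - 2)
print(majors_needed)  # 4 -> four major opcodes instead of one
```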

    For your 98%-ile source code, 32-registers is plenty.

    Lack of actual significant benefits is irrelevant, as all the programers
    are 100% convinced that it will help some of their code. ;)

    For example a 1-wide machine with a 4-ported register file,
    generally operated as 3R1W can be switched to 4R or 4W for
    epilogue or prologue uses respectively. Simulation indicates
    this gets rid of 47% of the cycles spent in prologue and
    epilogue (combined compared to a sequence of stores and loads)
    Simulation also indicates that 42% of the power is saved--
    mainly from Tag and TLB non-access cycles.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sat Aug 17 20:57:43 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.

    The Instructions and the compiler's use of them were co-developed.

    But has anyone told the software guys?

    Use HLLs and you don't have to.


    I looked at interrupts in your manual and it did not say how many registers were full of garbage leaking information because they were not saved or restored to make interrupts faster. ;)


    Of course convincing programmers to RTFM is futile. ;(

    Done with Instructions in HW one has to convince exactly two
    people; GCC code generator and LLVM code generator.

    If so this is the first I have heard that more registers is not bad for
    interrupt response time.

    They are also bad for pipeline stage times.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding

    Admittedly painful, extremely so.

    pipeline staging

    A longer pipeline is slower to start up, but gets work done faster.
    Is this what you mean?

    context switch times

    Task swapping time is way down in the noise. It’s reloading the L1 and L2 cache that swamps the time. 64 registers is nothing compared to 32k or megabytes.

    For example, My 66000 current encoding has room for 8 instructions
    in the FMAC category (4 in use) with 6-bit register specifiers
    I would need 4 major OpCodes instead of 1.

    For your 98%-ile source code, 32-registers is plenty.

    Lack of actual significant benefits is irrelevant, as all the programers
    are 100% convinced that it will help some of their code. ;)

    For example a 1-wide machine with a 4-ported register file,
    generally operated as 3R1W can be switched to 4R or 4W for
    epilogue or prologue uses respectively. Simulation indicates
    this gets rid of 47% of the cycles spent in prologue and
    epilogue (combined compared to a sequence of stores and loads)
    Simulation also indicates that 42% of the power is saved--
    mainly from Tag and TLB non-access cycles.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Thomas Koenig on Sat Aug 17 20:40:55 2024
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Brett <ggtgp@yahoo.com> schrieb:
    MitchAlsup1 <mitchalsup@aol.com> wrote:

    Anytime one removes more "MOVs and saves and restore" instructions
    than the called subroutine contains within the prologue and epilogue
    bounds, the subroutine should be inlined.

    In principle, yes.

    You can either use C++ headers, which result in huge compilation
    times, or you can use LTO. LTO, if done right, is a huge time-eater
(I was looking for an English translation of "Zeitgrab", literally
    "time grave" or "time tomb", this was the best I could come up with).

    [...]

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.
    But has anyone told the software guys?

    Software guys generally work with high-level languages where this is irrelevant, except for...

    Of course convincing programmers to RTFM is futile. ;(

    ...people writing operating systems or drivers, and they better
    read the docs for the architecture they are working on.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding space. Not sure if you have Mitch's document,

    Section 4.1 Instruction Template, Figure 25, page 33-179

    but having
    one more bit per register would reduce the 16-bit data in the
    offset to 14 (no way you can expand that by a factor of four),

14 is plenty; you can actually do 12 and pack those instructions in with
shifts, which have a pair of 6-bit fields, width and offset. This would
expand some constants, but you make it back in shorter code with fewer MOVs
and more performance.

    would require eight instead of one major opcodes for the three-
    register instructions,

    Mitch gloats about how many major opcodes he has free, in his 7 bit opcode
    he has the greater part of a bit free, so we are a good part of the way
    there.

    Conceptually some of the modifier bits move into the opcode space, not as
    clean but you have to squeeze those bits hard. One can come up with a few patterns that are not hard to decode, and spread across several instruction types.

    and the four-register instructions like FMA...

    Trying to wave a red flag in front of Mitch. ;)

    This is a pain point.
I would sacrifice most or all of XCOM6, the predicate instructions.

    Does it fit or does one look at extended opcodes for FMA.

    This would not matter if we were still living in a 36-bit world,
    but the days of the IBM 704, the PDP-10 or the UNIVAC 1100 have
    passed, except for emulation.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Brett on Sat Aug 17 22:05:03 2024
    Brett <ggtgp@yahoo.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Brett <ggtgp@yahoo.com> schrieb:

    Software guys generally work with high-level languages where this is
    irrelevant, except for...

    Of course convincing programmers to RTFM is futile. ;(

    ...people writing operating systems or drivers, and they better
    read the docs for the architecture they are working on.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding space. Not sure if you have Mitch's document,

    Section 4.1 Instruction Template, Figure 25, page 33-179

    but having
    one more bit per register would reduce the 16-bit data in the
    offset to 14 (no way you can expand that by a factor of four),

    14 is plenty,

    16 is better.

    you can actually do 12 and pack those instructions in with
    shifts, which have a pair of 6-bit fields, width and offset. This would
    expand some constants, but you make it back in shorter code with fewer MOVs
    and more performance.

    Hmm... I am not convinced.

    Do you have code to back up your claims?

    would require eight instead of one major opcodes for the three-
    register instructions,

    Mitch gloats about how many major opcodes he has free, in his 7 bit opcode

    It's 6 for the major opcode, actually.

    he has the greater part of a bit free, so we are a good part of the way there.

    That sentence no parse.

    Conceptually some of the modifier bits move into the opcode space, not as clean but you have to squeeze those bits hard

    It is a very fine point of semantics whether the modifier bits are part
    of the opcode space or not. I happen to think that they are,
    they are just in a (somewhat) different place and spelled a bit
    differently, but it does not really matter how you look at it -
    you need the bits to encode them.

    One can come up with a few
    patterns that are not hard to decode, and spread across several instruction types.

    So, go right ahead. Find an encoding that a) encompasses all of
    Mitch's functionality, b) has six bits for registers everywhere,
    and c) does not drive the assembler writer crazy (that's me,
    for Mitch's design) or hardware designer bonkers (where Mitch has
    the experience).

    Let's start with the... BB1 instruction, which branches on bit
    set in a register, so it needs a major opcode, a bit number, a
    register number and a displacement. How do you propose to do that?
    Shave one bit off the displacement?


    and the four-register instructions like FMA...

    Trying to wave a red flag in front of Mitch. ;)

    I just happen to like FMA :-)

    Of course, it might be possible to code FMA like AVX does, with
    only three registers - 18 bits for three registers, plus two bits
    for which one of them gets smashed for the result.
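    The "smashed operand" scheme can be sketched in miniature. This is an
    illustrative Python model (register file as a list, function name and
    selector convention invented here), not any real ISA encoding:

```python
# Destructive 3-register FMA: three register fields name the operands,
# and a 2-bit selector says which of them is overwritten by the result.

def fma_destructive(regs, ra, rb, rc, dest_sel):
    """Compute regs[ra]*regs[rb] + regs[rc]; dest_sel in {0,1,2}
    picks which of (ra, rb, rc) receives the result."""
    result = regs[ra] * regs[rb] + regs[rc]
    regs[(ra, rb, rc)[dest_sel]] = result
    return regs

# r2 = r0*r1 + r2, i.e. the accumulator is smashed
r = fma_destructive([2.0, 3.0, 1.0], 0, 1, 2, dest_sel=2)
assert r == [2.0, 3.0, 7.0]
```

    The compiler-unfriendliness Mitch alludes to below is visible even in
    the sketch: whichever operand is smashed must be copied first if its
    old value is still live.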

    But - just making offhand suggestions won't cut it. You will
    have to think about the layout of the instructions, how everything
    fits in, and needing one to four more bits per instruction
    can be accommodated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sat Aug 17 22:15:17 2024
    On Sat, 17 Aug 2024 20:57:43 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
    SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.

    The Instructions and the compiler's use of them were co-developed.

    But has anyone told the software guys?

    Use HLLs and you don't have to.


    I looked at interrupts in your manual and it did not say how many
    registers were full of garbage leaking information because they were
    not saved or restored to make interrupts faster. ;)

    When an ISR[13] returns from handling its exception it has a register
    file filled with stuff useful to future runnings of ISR[13].

    When ISR[13] gains control to handle another interrupt it has a file
    filled with what it was filled with the last time it ran--all 30 of
    them--while registers R0..R1 contain information about the current
    interrupt to be serviced.
    SP points at its stack
    FP points at its frame or is another register containing whatever it
    contained the previous time
    R29..R2 contain the value it had the previous time it ran



    Of course convincing programmers to RTFM is futile. ;(

    Done with Instructions in HW one has to convince exactly two
    people; GCC code generator and LLVM code generator.

    If so this is the first I have heard that more registers is not bad for
    interrupt response time.

    They are also bad for pipeline stage times.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding

    Admittedly painful, extremely so.

    pipeline staging

    A longer pipeline is slower to start up, but gets work done faster.
    Is this what you mean?

    No, I mean the feedback loops take more cycles so apparent latency
    is greater.

    context switch times

    Task swapping time is way down in the noise. It’s reloading the L1 and
    L2 cache that swamps the time. 64 registers is nothing compared to 32k
    or megabytes.

    While it is under 1% of all cycles, current x86s take 1,000 cycles
    application to application and 10,000 cycles hypervisor to hypervisor.

    I want both of these down in the 20-cycle range.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Aug 17 23:03:34 2024
    On Sat, 17 Aug 2024 22:05:03 +0000, Thomas Koenig wrote:

    Brett <ggtgp@yahoo.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Conceptually some of the modifier bits move into the opcode space, not
    as clean but you have to squeeze those bits hard

    It is very fine point of semantics if the modifier bits are part
    of the opcode space or not. I happen to think that they are,
    they are just in a (somehwat) different place and spelled a bit
    differently, but it does not really matter how you look at it -
    you need the bits to encode them.

    To me, an instruction has 3 components:: Operands, Routing, and
    calculation. We mainly consider the calculation (ADD) to be the
    instruction and fuzz over what is operands and how does one
    route them to places of calculation. My 66000 ISA directly
    annotates the operands and the routing. This is what the
    modifier bits do; they tell how to interpret the register
    specifiers (Rn or #n), (Rn or -Rn) and when to substitute
    another word or doubleword in the instruction stream as an
    operand directly.

    This does not add gates of delay to Operand routing because
    all of the constant stuff is overlapped with the comparison
    of register specifiers with pipeline result specifiers to
    determine forwarding. Constants forward in the network prior
    to register results preventing any added delay.

    One can come up with a few patterns that are not hard to
    decode, and spread across several instruction types.

    So, go right ahead. Find an encoding that a) encompasses all of
    Mitch's functionality, b) has six bits for registers everywhere,
    and c) does not drive the assembler writer crazy (that's me,
    for Mitch's design) or hardware designer bonkers (where Mitch has
    the experience).

    Consider, for example, memory reference address modes for 1
    instruction::
    LDSB Rd,[Rp,disp16]
    LDSB Rd,[IP,disp16]
    and
    LDSB Rd,[Rp,Ri<<s]
    LDSB Rd,[Rp,0]
    LDSB Rd,[IP,Ri<<s]
    LDSB Rd,[Rp,,disp32]
    LDSB Rd,[Rp,Ri<<s,disp32]
    LDSB Rd,[IP,,disp32]
    LDSB Rd,[IP,Ri<<s,disp32]
    LDSB Rd,[Rp,,disp64]
    LDSB Rd,[Rp,Ri<<s,disp64]
    LDSB Rd,[IP,,disp64]
    LDSB Rd,[IP,Ri<<s,disp64]

    I use 2 instructions here::
    1) a major OpCode with 16-bit immediate
    R0 in the Rb position is a proxy for IP
    2) a major OpCode and a MEME OpCode with 5-bits of Modifiers.
    R0 in Rb position remains a proxy for IP
    R0 in Ri position is a proxy for #0.
    3) I still have 1-bit left over to denote participation in ATOMIC
    events.
    you get all sizes and signs of Load-Locked
    you get up to 8 LLs
    you can use as many Store-Conditionals as you need
    all interested 3rd parties see memory before or after the event
    and nothing in between.

    Using 6-bit registers I would be down by 3-bits causing all sorts of
    memory reference grief--leading to other compromises in ISA design
    elsewhere.

    Based on the code I read out of Brian's compiler: there is no particular
    need for 64-registers. I am already using only 72% of the instructions
    {72% average, 70% geomean, 69% harmonic mean} that RISC-V requires
    {same compiler, same optimizations, just different code generators}.

    One can argue that having 64-bit displacements is not-all-that-necessary.
    But how does one take dusty deck FORTRAN FEM programs and allow the
    common blocks to grow bigger than 4GBs ?? This is the easiest way
    to port code written 5 decades ago to use the sizes of memory they
    need to run those "Great Big" FEM models today.

    Let's start with the... BB1 instruction, which branches on bit
    set in a register, so it needs a major opcode, a bit number, a
    register number and a displacement. How do you propose to do that?
    Shave one bit off the displacement?

    Then proceed to Branch on Condition:: along with the standard::
    EQ0, NE0, GT0, GE0, LT0, LE0 conditions one gets with other encodings,
    I also get FEQ0, FNE0, FGT0, FGE0, FLT0, FLE0, DEQ0, DNE0, DGT0,
    DGE0, DLT0, DLE0 along with Interference, SVC, SVR, and RET.
    {And I left out the unordered float/double comparisons, above.}
    1-instruction due mostly to NOT having condition codes.


    and the four-register instructions like FMA...

    I prefer 3-operand 1-result instead of 4-register. 4-register could
    mean 1 operand and 3 results, so the term lacks decent specificity.
    35 years ago I used 3-register to describe Mc88100 and I regret
    that now.

    I prefer FMAC instead of FMA--in hindsight I should have made it
    FMAC and DMAC, but alas... I use FMAC to cover all 4 of::

    x = y * z + q
    x = y * -z + q
    x = y * z - q
    x = y * -z - q
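    The four sign variants can be modeled in a few lines. This is a plain
    Python sketch (a real FMAC performs the multiply and add as one
    operation with a single rounding; that detail is not modeled here):

```python
# The four FMAC sign variants, expressed as ordinary arithmetic.

def fmac(y, z, q, negate_z=False, negate_q=False):
    """x = y * (±z) ± q, covering all four sign combinations."""
    zz = -z if negate_z else z
    qq = -q if negate_q else q
    return y * zz + qq

assert fmac(2.0, 3.0, 1.0) == 7.0                              # y*z + q
assert fmac(2.0, 3.0, 1.0, negate_z=True) == -5.0              # y*-z + q
assert fmac(2.0, 3.0, 1.0, negate_q=True) == 5.0               # y*z - q
assert fmac(2.0, 3.0, 1.0, negate_z=True, negate_q=True) == -7.0
```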

    Trying to wave a red flag in front of Mitch. ;)

    I just happen to like FMA :-)

    Of course, it might be possible to code FMA like AVX does, with
    only three registers - 18 bits for three registers, plus two bits
    for which one of them gets smashed for the result.

    Why do I get the feeling the compiler guys would not like this ??

    But - just making offhand suggestions won't cut it. You will
    have to think about the layout of the instructions, how everything
    fits in, and needing one to four more bits per instruction
    can be accommodated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sun Aug 18 02:39:04 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 22:05:03 +0000, Thomas Koenig wrote:

    Brett <ggtgp@yahoo.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Conceptually some of the modifier bits move into the opcode space, not
    as clean but you have to squeeze those bits hard

    It is a very fine point of semantics whether the modifier bits are part
    of the opcode space or not. I happen to think that they are,
    they are just in a (somewhat) different place and spelled a bit
    differently, but it does not really matter how you look at it -
    you need the bits to encode them.

    To me, an instruction has 3 components:: Operands, Routing, and
    calculation. We mainly consider the calculation (ADD) to be the
    instruction and fuzz over what is operands and how does one
    route them to places of calculation. My 66000 ISA directly
    annotates the operands and the routing. This is what the
    modifier bits do; they tell how to interpret the register
    specifiers (Rn or #n), (Rn or -Rn) and when to substitute
    another word or doubleword in the instruction stream as an
    operand directly.

    This does not add gates of delay to Operand routing because
    all of the constant stuff is overlapped with the comparison
    of register specifiers with pipeline result specifiers to
    determine forwarding. Constants forward in the network prior
    to register results preventing any added delay.

    One can come up with a few patterns that are not hard to
    decode, and spread across several instruction types.

    So, go right ahead. Find an encoding that a) encompasses all of
    Mitch's functionality, b) has six bits for registers everywhere,
    and c) does not drive the assembler writer crazy (that's me,
    for Mitch's design) or hardware designer bonkers (where Mitch has
    the experience).

    Consider, for example, memory reference address modes for 1
    instruction::
    LDSB Rd,[Rp,disp16]
    LDSB Rd,[IP,disp16]
    and
    LDSB Rd,[Rp,Ri<<s]
    LDSB Rd,[Rp,0]
    LDSB Rd,[IP,Ri<<s]
    LDSB Rd,[Rp,,disp32]
    LDSB Rd,[Rp,Ri<<s,disp32]
    LDSB Rd,[IP,,disp32]
    LDSB Rd,[IP,Ri<<s,disp32]
    LDSB Rd,[Rp,,disp64]
    LDSB Rd,[Rp,Ri<<s,disp64]
    LDSB Rd,[IP,,disp64]
    LDSB Rd,[IP,Ri<<s,disp64]

    I use 2 instructions here::
    1) a major OpCode with 16-bit immediate
    R0 in the Rb position is a proxy for IP
    2) a major OpCode and a MEME OpCode with 5-bits of Modifiers.
    R0 in Rb position remains a proxy for IP
    R0 in Ri position is a proxy for #0.
    3) I still have 1-bit left over to denote participation in ATOMIC
    events.
    you get all sizes and signs of Load-Locked
    you get up to 8 LLs
    you can use as many Store-Conditionals as you need
    all interested 3rd parties see memory before or after the event
    and nothing in between.

    Using 6-bit registers I would be down by 3-bits causing all sorts of
    memory reference grief--leading to other compromises in ISA design
    elsewhere.

    Based on the code I read out of Brian's compiler: there is no particular
    need for 64-registers. I am already using only 72% of the instructions
    {72% average, 70% geomean, 69% harmonic mean} that RISC-V requires
    {same compiler, same optimizations, just different code generators}.

    One can argue that having 64-bit displacements is not-all-that-necessary.
    But how does one take dusty deck FORTRAN FEM programs and allow the
    common blocks to grow bigger than 4GBs ?? This is the easiest way
    to port code written 5 decades ago to use the sizes of memory they
    need to run those "Great Big" FEM models today.

    Let's start with the... BB1 instruction, which branches on bit
    set in a register, so it needs a major opcode, a bit number, a
    register number and a displacement. How do you propose to do that?
    Shave one bit off the displacement?

    Then proceed to Branch on Condition:: along with the standard::
    EQ0, NE0, GT0, GE0, LT0, LE0 conditions one gets with other encodings,
    I also get FEQ0, FNE0, FGT0, FGE0, FLT0, FLE0, DEQ0, DNE0, DGT0,
    DGE0, DLT0, DLE0 along with Interference, SVC, SVR, and RET.
    {And I left out the unordered float/double comparisons, above.}
    1-instruction due mostly to NOT having condition codes.


    and the four-register instructions like FMA...

    I prefer 3-operand 1-result instead of 4-register. 4-register could
    mean 1 operand and 3 results, so the term lacks decent specificity.
    35 years ago I used 3-register to describe Mc88100 and I regret
    that now.

    I prefer FMAC instead of FMA--in hindsight I should have made it
    FMAC and DMAC, but alas... I use FMAC to cover all 4 of::

    x = y * z + q
    x = y * -z + q
    x = y * z - q
    x = y * -z - q

    Trying to wave a red flag in front of Mitch. ;)

    I just happen to like FMA :-)

    Of course, it might be possible to code FMA like AVX does, with
    only three registers - 18 bits for three registers, plus two bits
    for which one of them gets smashed for the result.

    Why do I get the feeling the compiler guys would not like this ??

    But - just making offhand suggestions won't cut it. You will
    have to think about the layout of the instructions, how everything
    fits in, and needing one to four more bits per instruction
    can be accommodated.


    Yes I know and agree that you have a beautiful instruction set layout.
    And a 64 register variant would be butt ugly, but x86 won. Thumb 2 won
    over ARM32, which was better. Thumb 2 almost never happened because
    management hated it.

    I know my fellow programmers: give them a 64 register variant and they
    will make the stupid choice, like me, 80% of the time. ;)

    Ask the customers what they want, and don’t be surprised when they pick
    the stupid option. If it gets you a sale you would have lost, just
    count the money and be happy.

    I don’t expect you to do any work on 64 registers, just add a vapor
    ware option and put it on ice for a few years. Let boredom and demand
    kick in, maybe it will just die like most vapor ware.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sun Aug 18 02:16:04 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 20:57:43 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 17 Aug 2024 0:24:24 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:

    When HW is doing the saves, it does them in a known order and
    can mark the registers "in use" or "busy" instantaneously and
    clear that status as data arrives. When SW is doing the same,
    SW has to wait for the instruction to arrive and then do them
    one-to-small numbers at a time. HW is not so constrained.

    Ok, so the hardware is smart enough.

    The Instructions and the compiler's use of them were co-developed.

    But has anyone told the software guys?

    Use HLLs and you don't have to.


    I looked at interrupts in your manual and it did not say how many
    registers
    were full of garbage leaking information because they were not saved or
    restored to make interrupts faster. ;)

    When an ISR[13] returns from handling its exception it has a register
    file filled with stuff useful to future runnings of ISR[13].

    When ISR[13] gains control to handle another interrupt it has a file
    filled with what it was filled with the last time it ran--all 30 of
    them--while registers R0..R1 contain information about the current
    interrupt to be serviced.
    SP points at its stack
    FP points at its frame or is another register containing whatever it
    contained the previous time
    R29..R2 contain the value it had the previous time it ran

    I don’t remember the PlayStation using all registers in an interrupt, it
    was only a few lines of code and 8 registers was fine. This would only save
    you 3 cycles and is probably not worth the potential hassles.

    I have heard programmers complaining that interrupt response was too
    slow and so they had to add a second toy CPU just to handle interrupts.
    Probably people that made the mistake of upgrading to x86.

    Have wondered if having a scratchpad for interrupt code (and critical
    data) would solve those problems, as memory can be 150 cycles away,
    plus you can have 40 pending reads queued ahead of you. Makes servicing
    interrupts in a timely manner difficult, even if you are not touching
    much memory.

    Of course convincing programmers to RTFM is futile. ;(

    Done with Instructions in HW one has to convince exactly two
    people; GCC code generator and LLVM code generator.

    If so this is the first I have heard that more registers is not bad for
    interrupt response time.

    They are also bad for pipeline stage times.

    So we are back to finding any downsides for 64 registers in My 66000.

    Encoding

    Admittedly painful, extremely so.

    pipeline staging

    A longer pipeline is slower to start up, but gets work done faster.
    Is this what you mean?

    No, I mean the feedback loops take more cycles so apparent latency
    is greater.

    context switch times

    Task swapping time is way down in the noise. It’s reloading the L1 and
    L2 cache that swamps the time. 64 registers is nothing compared to 32k
    or megabytes.

    While it is under 1% of all cycles, current x86s take 1,000 cycles application to application and 10,000 cycles hypervisor to hypervisor.

    I want both of these down in the 20-cycle range.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sun Aug 18 06:34:34 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Based on the code I read out of Brian's compiler: there is no particular
    need for 64-registers. I am already using only 72% of the instructions
    {72% average, 70% geomean, 69% harmonic mean} that RISC-V requires
    {same compiler, same optimizations, just different code generators}.

    That's true - the code is usually expressed as a very straightforward translation of the original code, at least for C.

    Register pressure will increase for unrolling of outer loops,
    for languages which use dope vectors (aka array descriptors),
    and for more aggressive inlining.

    Consider an argument passed as an assumed-shape array in
    Fortran.

    subroutine foo(a)
    real, dimension(:,:) :: a

    where the array assumes the shape from the caller,
    two-dimensional in this case.

    For passing such an array, we need a base pointer and
    information about

    - the lower bound
    - the upper bound
    - the stride

    along each dimension, so it is 7 quantities in this case.
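    A minimal model of such a descriptor ("dope vector"), with invented
    field names (actual compiler layouts differ in detail), shows where
    the 7 quantities go: one base pointer plus (lower bound, upper bound,
    stride) for each of the two dimensions.

```python
# Hypothetical rank-2 array descriptor: 1 base + 3 values per dimension.
from dataclasses import dataclass

@dataclass
class Dim:
    lower: int   # lower bound
    upper: int   # upper bound
    stride: int  # byte stride along this dimension

@dataclass
class Descriptor:
    base: int          # base address, modeled as a plain integer
    dims: tuple        # one Dim per rank

    def element_addr(self, *idx):
        # address = base + sum((i - lower) * stride) over the dimensions
        return self.base + sum((i - d.lower) * d.stride
                               for i, d in zip(idx, self.dims))

# 4x3 array of 8-byte reals, column-major: second-dim stride = 4*8 = 32
d = Descriptor(base=1000, dims=(Dim(1, 4, 8), Dim(1, 3, 32)))
assert d.element_addr(1, 1) == 1000
assert d.element_addr(2, 1) == 1008
assert d.element_addr(1, 2) == 1032
```

    Keeping several such descriptors live at once is exactly where the
    extra register pressure Thomas mentions comes from.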


    One can argue that having 64-bit displacements is not-all-that-necessary.
    But how does one take dusty deck FORTRAN FEM programs and allow the
    common blocks to grow bigger than 4GBs ?? This is the easiest way
    to port code written 5 decades ago to use the sizes of memory they
    need to run those "Great Big" FEM models today.

    That is certainly one reason. Another is being able to have
    a "huge" model with code > 2GB without too much effort.
    Programs _are_ getting bigger...

    Of course, it might be possible to code FMA like AVX does, with
    only three registers - 18 bits for three registers, plus two bits
    for which one of them gets smashed for the result.

    Why do I get the feeling the compiler guys would not like this ??

    Because they won't? :-) It is certainly more straightforward
    this way.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sun Aug 18 22:03:01 2024
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    <snip>

    High registers is mostly a marketing vapor ware extension for you; see
    if anyone cares and put them on a list for when a market for that
    extension pops up.

    The lack of CPUs with 64 registers is what makes for a market; that 4%
    that could benefit have no options to pick from. You would be happy to
    have control of a market that big. Point customers at a compiler
    configured for 64 registers and say that with high registers and inline
    constants that is what they could expect for code generation.
    I agree with the lead in, and disagree with where you took it.

    Let us postulate that having 64 registers is a 10% win (overstating
    the size of its win by 2.5×) but that 98% of subroutines don't need
    64 registers. So, 98% gains nothing and 2% gains 10%

    0.98*1.0 + 0.02*1.1 = 1.002
    or
    0.2% gain.
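    The weighted-gain arithmetic can be checked mechanically; this is a
    sketch of the calculation above, nothing more:

```python
# 98% of subroutines gain nothing, 2% gain 10% -> 0.2% overall.
overall = 0.98 * 1.0 + 0.02 * 1.1
assert abs(overall - 1.002) < 1e-12
print(f"overall speedup: {overall:.3f}")  # 1.002
```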

    If there is demand for high registers you will probably just spin a CPU
    arch with more registers, but that will never happen if you never ask.

    The thing is that once you go down the GBOoO route, your lack of
    registers "namable in ASM" ceases to be a performance degrader. With
    renaming one can have R7 in use 40 times in a 100-instruction-deep
    execution window.

    This
    is the definition of vapor ware, a free market survey. You can even add
    more registers as an incompatible extension; in fact you should.

    I will leave stuff like this to you.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Aug 19 12:05:22 2024
    Task swapping time is way down in the noise. It’s reloading the L1 and L2 cache that swamps the time. 64 registers is nothing compared to 32k or megabytes.

    Depends on the kind of swap. If you're thinking of time-sharing
    preemption, then indeed context switch time is not important.

    But when considering communication between processes, then very fast
    context switch times allow for finer grain divisions, like
    micro-kernels.

    Historically, these things have never really materialized, admittedly.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Mon Aug 19 18:22:27 2024
    On Mon, 19 Aug 2024 16:05:22 +0000, Stefan Monnier wrote:

    Task swapping time is way down in the noise. It’s reloading the L1 and
    L2 cache that swamps the time. 64 registers is nothing compared to 32k
    or megabytes.

    Depends on the kind of swap. If you're thinking of time-sharing
    preemption, then indeed context switch time is not important.

    But when considering communication between processes, then very fast
    context switch times allow for finer grain divisions, like
    micro-kernels.

    MicroKernels failed due to the excessive overhead of context switching.
    Whether it was control delivery delay, TLB reloads, Cache reloads,
    register file loads and stores, ... it doesn't really matter as each
    delay adds up. When there is too much delay the system is sluggish
    and unacceptable in-the-large.

    Historically, these things have never really materialized, admittedly.

    Pigs don't win the 100 yard dash at the Olympics, either.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Mon Aug 19 18:34:05 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    Pigs don't win the 100 yard dash at the Olympics, either.

    Cheetahs would, but that would be cheeting.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Mon Aug 19 18:52:39 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    <snip>

    High registers is mostly a marketing vapor ware extension for you, see if
    anyone cares and put them on a list for when a market for that extension
    pops up.

    The lack of CPUs with 64 registers is what makes for a market; that 4%
    that could benefit have no options to pick from. You would be happy to
    have control of a market that big. Point customers at a compiler
    configured for 64 registers and say that with high registers and inline
    constants that is what they could expect for code generation.

    I agree with the lead in, and disagree with where you took it.

    Let us postulate that having 64 registers is a 10% win (overstating
    the size of its win by 2.5×) but that 98% of subroutines don't need
    64 registers. So, 98% gains nothing and 2% gains 10%

    0.98*1.0 + 0.02*1.1 = 1.002
    or
    0.2% gain.

    I agree with this, but you have 4% of the market where more registers gives
    a much larger speedup. You would be glad to have that much market share.

    If there is demand for high registers you will probably just spin a CPU
    arch with more registers, but that will never happen if you never ask.

    The thing is that once you go down the GBOoO route, your lack of
    registers "namable in ASM" ceases to be a performance degrader. With
    renaming one can have R7 in use 40 times in a 100-instruction-deep
    execution window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16, of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    This
    is the definition of vapor ware, a free market survey. You can even add
    more registers as an incompatible extension, in fact you should.

    I will leave stuff like this to you.

    I do agree that high registers to double your register count is far cleaner
    for the instruction set than going to 64 separate registers. You have much
    of high register implemented anyway if you support integer vector
    operations in the integer register file like MIPS, or have a unified
    register file, be it visible or not.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned due to the bits being not quite
    available. So close, yet so far. I could not make it work.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Mon Aug 19 19:31:54 2024
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:


    The thing is that once you go down the GBOoO route, your lack of
    registers "namable in ASM" ceases to be a performance degrader. With
    renaming one can have R7 in use 40 times in a 100-instruction-deep
    execution window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16, of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers. Do not underestimate this phenomenon. The
    gain from 16 to 32 registers is only 3%-ish, so one would estimate that
    22 registers would have already gained 1/2 of all of what is possible.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned due to the bits being not quite
    available. So close, yet so far. I could not make it work.

    We remain hobbled by the definition of Byte containing exactly 8-bits.
    It is this which drives the 16-bit and 32-bit instruction sizes; and
    it is this which drives the sizes of constants used by the instruction
    stream.

    64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
    But we must all face facts::
    a) Little Endian Won
    b) 8-bit Bytes Won
    c) longer operands are composed of multiple bytes mostly powers of 2.
    d) otherwise it is merely an academic exercise.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Mon Aug 19 23:35:54 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:


    The thing is that once you go down the GBOoO route, your lack of
    registers "namable in ASM" ceases to be a performance degrader. With
    renaming one can have R7 in use 40 times in a 100-instruction-deep
    execution window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16, of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers. Do not underestimate this phenomenon. The
    gain from 16 to 32 registers is only 3%-ish, so one would estimate that
    22 registers would have already gained 1/2 of all of what is possible.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned due to the bits being not quite
    available. So close, yet so far. I could not make it work.

    We remain hobbled by the definition of Byte containing exactly 8-bits.
    It is this which drives the 16-bit and 32-bit instruction sizes; and
    it is this which drives the sizes of constants used by the instruction stream.

    64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
    But we must all face facts::
    a) Little Endian Won
    b) 8-bit Bytes Won
    c) longer operands are composed of multiple bytes mostly powers of 2.
    d) otherwise it is merely an academic exercise.


    If you pack 7 instructions into 8 long words, that gives each instruction
    an extra nibble: 4 bits.
    You can do lots of four-operand dual operations, which may win back the
    code density lost while improving performance.

    3 instructions packed into 4 longs gives 64 registers plus four-operand dual instructions.

  • From MitchAlsup1@21:1/5 to Brett on Tue Aug 20 00:12:44 2024
    On Mon, 19 Aug 2024 23:35:54 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:


    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16; of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers. Do not underestimate this phenomenon. The
    gain from 16 to 32 registers is only about 3%, so one would estimate that
    22 registers would already have gained half of all that is possible.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned because the bits were not quite
    available. So close, yet so far. I could not make it work.

    We remain hobbled by the definition of Byte containing exactly 8-bits.
    It is this which drives the 16-bit and 32-bit instruction sizes; and
    it is this which drives the sizes of constants used by the instruction
    stream.

    64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
    But we must all face facts::
    a) Little Endian Won
    b) 8-bit Bytes Won
    c) longer operands are composed of multiple bytes mostly powers of 2.
    d) otherwise it is merely an academic exercise.


    If you pack 7 instructions into 8 long words, that gives each instruction
    an extra nibble: 4 bits.
    You can do lots of four operand dual operations, which may get you back
    the code density lost, while improving performance.

    Given 36-bit containers--how do you add 32 or 64-bit constants ??
    throw 36-bits at the 32-bit needs case and 72-bits at the 64-bit
    needs case ?!?

    3 instructions packed in 4 longs gives 64 registers plus four operand
    dual instructions.

    {{ note 3 instructions in 4 longs is 85.3-bits per instruction::
    I suspect you mean 3 instructions in 4 words which is 42.6-bits
    per instruction far more than is needed. You get 14 instructions
    of 36-bits in 512-bits (a cache line)}}
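    The container arithmetic in the note above is easy to verify (bit widths only; "long" means 64 bits and "word" 32 bits, as used in the thread):

```python
# Bits available per instruction under the packings discussed.
bits_per_inst_4_longs = 4 * 64 / 3   # 3 instructions in 4 longs -> ~85.3 bits
bits_per_inst_4_words = 4 * 32 / 3   # 3 instructions in 4 words -> ~42.7 bits
insts_per_cache_line  = 512 // 36    # 36-bit instructions in a 512-bit line -> 14

print(round(bits_per_inst_4_longs, 1),
      round(bits_per_inst_4_words, 1),
      insts_per_cache_line)
```

    (Mitch's 42.6 is the same 128/3 figure, truncated rather than rounded.)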

    Why don't you give it a try !?!

    But notice, you are starting out with a much larger instruction--
    how are you going to "profitably" utilize all those bits from
    source code of typical imperative languages ??

    whereas my 32-bit instructions don't violate the RISC tenets.
    I end up needing only 72% the number of instructions RISC-V needs
    (a near 40% pipelined instruction advantage).

  • From Brett@21:1/5 to mitchalsup@aol.com on Tue Aug 20 03:50:36 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 19 Aug 2024 23:35:54 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sun, 11 Aug 2024 0:46:09 +0000, Brett wrote:


    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16; of course x86 had fab and
    cubic dollar advantages that dwarfed the register limit.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers. Do not underestimate this phenomenon. The
    gain from 16 to 32 registers is only about 3%, so one would estimate that
    22 registers would already have gained half of all that is possible.

    64 separate registers was a bridge too far, but it was an interesting
    exercise before it crashed and burned because the bits were not quite
    available. So close, yet so far. I could not make it work.

    We remain hobbled by the definition of Byte containing exactly 8-bits.
    It is this which drives the 16-bit and 32-bit instruction sizes; and
    it is this which drives the sizes of constants used by the instruction
    stream.

    64 registers makes PERFECT sense in a 36-bit (or 72-bit) architecture.
    But we must all face facts::
    a) Little Endian Won
    b) 8-bit Bytes Won
    c) longer operands are composed of multiple bytes mostly powers of 2.
    d) otherwise it is merely an academic exercise.


    If you pack 7 instructions into 8 long words, that gives each instruction
    an extra nibble: 4 bits.
    You can do lots of four operand dual operations, which may get you back
    the code density lost, while improving performance.

    Given 36-bit containers--how do you add 32 or 64-bit constants ??
    throw 36-bits at the 32-bit needs case and 72-bits at the 64-bit
    needs case ?!?

    The four extra bits are four extra bits, or they could be a scale/shift
    amount, though that could stall an instruction that crossed a cache line.
    Same for 72 bits, but with more opcode flags, say extract/insert. It is
    arguable that such fields are data and not opcode, assuming the data does
    not impose a gate delay of concern the way an opcode would.

    3 instructions packed in 4 longs gives 64 registers plus four operand
    dual instructions.

    {{ note 3 instructions in 4 longs is 85.3-bits per instruction::
    I suspect you mean 3 instructions in 4 words which is 42.6-bits
    per instruction far more than is needed. You get 14 instructions
    of 36-bits in 512-bits (a cache line)}}

    10 more bits gives you a register plus a second operation.
    Add from memory and LEA being the classic examples.
    Though I am more in the line of just combining general operations.
    A more general load pair, three way add etc.

    If you can find enough combining potential then code density will not
    suffer. And for reasonably clocked devices the code will be faster. A
    three-way add only adds a few gates, which is why you allow negate on all
    sources; it's cheap and faster than two operations.

    This is what I call Post RISC.
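    A minimal sketch of the fused operation described above (the name `add3` and its interface are illustrative, not from any real ISA): a three-way add with a negate control on each source, so one instruction does the work of two chained two-input adds.

```python
def add3(a, b, c, neg=(False, False, False)):
    """Three-way add with an optional negate on each source operand.
    neg[i] selects whether source i is negated before the sum."""
    return sum(-x if n else x for x, n in zip((a, b, c), neg))
```

    For example, `add3(5, 2, 1, neg=(False, True, True))` computes 5 - 2 - 1 in one operation.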

    Why don't you give it a try !?!

    Yes it works.

    But notice, you are starting out with a much larger instruction--
    how are you going to "profitably" utilize all those bits from
    source code of typical imperative languages ??

    whereas my 32-bit instructions don't violate the RISC tenets.
    I end up needing only 72% the number of instructions RISC-V needs
    (a near 40% pipelined instruction advantage).


  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Aug 20 07:01:49 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16

    And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
    was first developed for an in-order microarchitecture) and are now
    going to 32 GPRs with APX (no in-order excuse here). And IIRC the
    announcement of APX says something about 10% fewer memory accesses or
    somesuch.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers.

    Your feeling is strong (as shown by your repeatedly ignoring the counterevidence), but wrong:

    LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
    equivalent to 17 registers on a load/store architecture:

    Let's call the 17th register r16:

    On a load-store architecture you replace "LD-OP dest,src" with:

    ld r16=src
    op dest,dest,r16

    On a load-store architecture you replace "LD-OP-ST dest,src" with:

    ld r16=dest
    op r16,r16,src
    st dest=r16

    For a VAX-like three-memory-argument instruction you need two extra
    registers, r16 and r17:

    "mem1 = mem2 op mem3" becomes:

    ld r16=mem2
    ld r17=mem3
    op r16,r16,r17
    st mem1=r16
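    As a sanity check, the expansions above can be mimicked with a toy register/memory model (a sketch only; `regs`/`mem` are plain dicts, and `r16`/`r17` play the scratch registers of the argument):

```python
def ld_op(regs, mem, dest, src, op):
    """LD-OP dest,src on a load/store machine: ld r16=src ; op dest,dest,r16"""
    regs['r16'] = mem[src]
    regs[dest] = op(regs[dest], regs['r16'])

def ld_op_st(regs, mem, dest, src, op):
    """LD-OP-ST dest,src: ld r16=dest ; op r16,r16,src ; st dest=r16"""
    regs['r16'] = mem[dest]
    regs['r16'] = op(regs['r16'], regs[src])
    mem[dest] = regs['r16']

def mem3(regs, mem, dst, src1, src2, op):
    """VAX-like mem1 = mem2 op mem3, needing two scratch registers."""
    regs['r16'] = mem[src1]
    regs['r17'] = mem[src2]
    regs['r16'] = op(regs['r16'], regs['r17'])  # result lands in r16
    mem[dst] = regs['r16']
```

    Note that the final store in the three-memory-argument case must write back the op result (`r16`), not the second loaded operand.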

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Aug 20 11:59:31 2024
    On Mon, 19 Aug 2024 18:22:27 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 19 Aug 2024 16:05:22 +0000, Stefan Monnier wrote:

    Task swapping time is way down in the noise. It’s reloading the L1
    and L2
    cache that swamps the time. 64 registers is nothing compared to
    32k or megabytes.

    Depends on the kind of swap. If you're thinking of time-sharing preemption, then indeed context switch time is not important.

    But when considering communication between processes, then very fast context switch times allow for finer grain divisions, like
    micro-kernels.

    MicroKernels failed due to the excessive overhead of context
    switching. Whether it was control delivery delay, TLB reloads, cache
    reloads, register file loads and stores, ... it doesn't really matter,
    as each delay adds up. When there is too much delay the system is
    sluggish and unacceptable in the large.


    I don't believe that the failure of uKernels to take over the world of
    OSes is related to the factors you mentioned.
    They failed because, relative to a monolithic kernel, they are a less
    convenient way to structure the OS software. Various parts of the OS are
    more dependent on each other logically, esp. in a read-only manner, than
    proponents of uKernels admit. Every change takes more developer time and
    touches more places in the code than with a monolithic kernel.

    Historically, these things have never really materialized,
    admittedly.

    Pigs don't win the 100 yard dash at the Olympics, either.


    Stefan

  • From Stefan Monnier@21:1/5 to All on Tue Aug 20 09:40:11 2024
    I can understand the reluctance to go to 6 bit register specifiers, it
    burns up your opcode space and makes encoding everything more difficult.
    But today that is an unserviced market which will get customers to give you
    a look. Put out some vapor ware and see what customers say.

    If the issue is only the encoding, then presumably, Mitch could go the
    route of a prefix instruction (like his PRED instruction or the
    instruction he uses to do wide shifts/adds/...).


    Stefan

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Aug 20 16:40:06 2024
    On Tue, 20 Aug 2024 7:01:49 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16

    And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
    was first developed for an in-order microarchitecture) and are now
    going to 32 GPRs with APX (no in-order excuse here). And IIRC the announcement of APX says something about 10% fewer memory accesses or somesuch.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers.

    Your feeling is strong (as shown by your repeatedly ignoring the counterevidence), but wrong:

    LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
    equivalent to 17 registers on a load/store architecture:

    Let's call the 17th register r16:

    On a load-store architecture you replace "LD-OP dest,src" with:

    ld r16=src
    op dest,dest,r16

    On a load-store architecture you replace "LD-OP-ST dest,src" with:

    ld r16=dest
    op r16,r16,src
    st dest=r16

    For a VAX-like three-memory-argument instruction you need two extra registers, r16 and r17:

    "mem1 = mem2 op mem3" becomes:

    ld r16=mem2
    ld r17=mem3
    op r16,r16,r17
    st mem1=r16

    - anton


    That is not what I am talking about::

    i = i + 1;
    as
    ADD [&i],#1

    1 instruction = 1 add, 1 LD and 1 ST. And

    i = i + j;
    as
    ADD Ri,[&j]

    In neither case is an extra register needed, and you may have
    several of these in a local sequence of code. ...
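    The trade can be put as a back-of-envelope count (the register name `r16` below is illustrative; memory traffic is identical in both styles, one load and one store per statement):

```python
# "i = i + 1" on the two ISA styles; memory traffic (1 LD + 1 ST) is the same.
mem_op_insts     = 1   # ADD [&i],#1
load_store_insts = 3   # ld r16=i ; add r16,r16,#1 ; st i=r16

# Each such read-modify-write statement saves 2 fetched/decoded
# instructions on the memory-operand ISA.
saved_per_statement = load_store_insts - mem_op_insts
print(saved_per_statement)  # 2
```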

  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Aug 20 20:40:50 2024
    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

  • From EricP@21:1/5 to All on Tue Aug 20 14:18:25 2024
    MitchAlsup1 wrote:
    On Tue, 20 Aug 2024 7:01:49 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With
    renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16

    And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
    was first developed for an in-order microarchitecture) and are now
    going to 32 GPRs with APX (no in-order excuse here). And IIRC the
    announcement of APX says something about 10% fewer memory accesses or
    somesuch.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers.

    Your feeling is strong (as shown by your repeatedly ignoring the
    counterevidence), but wrong:

    LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
    equivalent to 17 registers on a load/store architecture:

    Let's call the 17th register r16:

    On a load-store architecture you replace "LD-OP dest,src" with:

    ld r16=src
    op dest,dest,r16

    On a load-store architecture you replace "LD-OP-ST dest,src" with:

    ld r16=dest
    op r16,r16,src
    st dest=r16

    For a VAX-like three-memory-argument instruction you need two extra
    registers, r16 and r17:

    "mem1 = mem2 op mem3" becomes:

    ld r16=mem2
    ld r17=mem3
    op r16,r16,r17
    st mem1=r17

    - anton


    That is not what I am talking about::

    i = i + 1;
    as
    ADD [&i],#1

    1 instruction = 1 add, 1 LD and 1 ST. And

    i = i + j;
    as
    ADD Ri,[&j]

    In neither case is an extra register needed, and you may have
    several of these in a local sequence of code. ...

    On an in-order pipeline you need someplace to stash the temp value.
    If you want, call it a special in-flight pseudo-register that only exists
    for forwarding, it is still an identifier for a value that is outside
    the architectural register set.

    I think it might need two registers if you can have two such instructions
    in the pipeline back-to-back as there could be multiple temp values
    in-flight at once

    ADD [&i],#1
    ADD [&j],#1

    could have &i doing its store while &j is doing its load.

    On OoO, if the reservation stations are valueless, you need a real
    physical register to stash the temp value as there is no guarantee
    the OP part of the uOp will launch just when the LD part finishes
    doing its thing and forwards the value.

  • From MitchAlsup1@21:1/5 to Michael S on Tue Aug 20 20:59:28 2024
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a non-LD-OP
    machine would need to break even on the instruction count.

  • From MitchAlsup1@21:1/5 to EricP on Tue Aug 20 21:05:41 2024
    On Tue, 20 Aug 2024 18:18:25 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Tue, 20 Aug 2024 7:01:49 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 19 Aug 2024 18:52:39 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    The thing is that once you go down the GBOoO route, your lack of
    registers
    "namable in ASM" ceases to become a performance degrader. With
    renaming
    one can have R7 in use 40 times in a 100 instruction deep execution
    window.

    If this was true we would have 16 or even 8 visible registers, and all
    would be fine. x86 does mostly fine with 16

    And yet Intel went to 32 SIMD registers with AVX-512 (which admittedly
    was first developed for an in-order microarchitecture) and are now
    going to 32 GPRs with APX (no in-order excuse here). And IIRC the
    announcement of APX says something about 10% fewer memory accesses or
    somesuch.

    Careful, here::

    x86 has LD-OPs and LD-OP-STs which makes the 16 register file feel more
    like it has 20-22 registers.

    Your feeling is strong (as shown by your repeatedly ignoring the
    counterevidence), but wrong:

    LD-OPs and LD-OP-STs as on AMD64 and PDP-11 make the 16 registers
    equivalent to 17 registers on a load/store architecture:

    Let's call the 17th register r16:

    On a load-store architecture you replace "LD-OP dest,src" with:

    ld r16=src
    op dest,dest,r16

    On a load-store architecture you replace "LD-OP-ST dest,src" with:

    ld r16=dest
    op r16,r16,src
    st dest=r16

    For a VAX-like three-memory-argument instruction you need two extra
    registers, r16 and r17:

    "mem1 = mem2 op mem3" becomes:

    ld r16=mem2
    ld r17=mem3
    op r16,r16,r17
    st mem1=r16

    - anton


    That is not what I am talking about::

    i = i + 1;
    as
    ADD [&i],#1

    1 instruction = 1 add, 1 LD and 1 ST. And

    i = i + j;
    as
    ADD Ri,[&j]

    In neither case is an extra register needed, and you may have
    several of these in a local sequence of code. ...

    On an in-order pipeline you need someplace to stash the temp value.
    If you want, call it a special in-flight pseudo-register that only
    exists for forwarding, it is still an identifier for a value that
    is outside the architectural register set.

    The LD-OP-ST machine would have this built into the pipeline--
    such that nobody has to name the carrier of the value down the
    pipeline.

    I think it might need two registers if you can have two such
    instructions in the pipeline back-to-back as there could be
    multiple temp values in-flight at once

    ADD [&i],#1
    ADD [&j],#1

    could have &i doing its store while &j is doing its load.

    On OoO, if the reservation stations are valueless, you need a real
    physical register to stash the temp value as there is no guarantee
    the OP part of the uOp will launch just when the LD part finishes
    doing its thing and forwards the value.

    In the LD-OP-ST microarchitecture there would be some buffer
    that carries the intermediate values through the execution
    window. And, Yes, you can build a LD-OP-ST reservation station
    (Athlon and Opteron did). It becomes easier if there is some
    buffer to carry the intermediate values {address, operand, result}

  • From Brett@21:1/5 to mitchalsup@aol.com on Tue Aug 20 23:08:03 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a non-LD-OP
    machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.
    But you have not built such, you built an improved RISC…

    I assume OP-ST has issues with the value getting stuck if the address is
    slow to resolve. With a register the value can just spill to the register backing file. And because of this you create a hidden register name for the value.

    You have information on how many hidden registers are in flight on average
    and worst case, so I believe your numbers.

    I have not looked to see if compilers generate LD-OP and OP-ST, at one
    point Intel was discouraging such code.

  • From MitchAlsup1@21:1/5 to Brett on Wed Aug 21 01:40:10 2024
    On Tue, 20 Aug 2024 23:08:03 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a non-LD-OP
    machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.

    If you put cache write at or after register file write in the
    pipeline, LD-OP-ST basically falls out for free and you can
    move the intermediate values from whence they are produced
    to where they are consumed with forwarding.

    But you have not built such, you built an improved RISC…

    I spent 7 years doing x86-64.....so much for not having.....

    It was that episode that cemented me on the value of [Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
    and LD-OP-STs. Then I took that and made a better RISC ISA.
    That RISC ISA did not have LD-OP-STs because of OpCode
    encoding reasons not from pipelining reasons.

    I assume OP-ST has issues with the value getting stuck if the address is
    slow to resolve. With a register the value can just spill to the
    register backing file. And because of this you create a hidden register
    name for the value.

    Athlon and Opteron had value capturing reservation stations.
    K9 had value-free RSs. It caused little headache because
    while we did not give it a named physical register, we did
    give it a physical register for the intermediates. SW can only
    read/write named PRs getting the name from logical to physical
    register renaming.

    You have information on how many hidden registers are in flight on
    average and worst case, so I believe your numbers.

    I have not looked to see if compilers generate LD-OP and OP-ST, at one
    point Intel was discouraging such code.

    Partially because AMD performed "relatively" better on LD-OPs and
    LD-OP-STs than Intel at that time. Where "relatively" means
    significantly above the noise level but "not all that much".

  • From Brett@21:1/5 to mitchalsup@aol.com on Wed Aug 21 05:13:41 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 23:08:03 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a non-LD-OP
    machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.

    If you put cache write at or after register file write in the
    pipeline, LD-OP-ST basically falls out for free and you can
    move the intermediate values from whence they are produced
    to where they are consumed with forwarding.

    LD-OP-ST mostly only fits if it is add to memory.

    42 bit opcodes work, you only need one in four RISC opcodes to merge to
    LD-OP or OP-ST for code density to be the same, and generally you will do better.

    The two leftover bits can be ignored, or be a template indicator, so you
    can pack in a LD-OP-ST, or 31 bit RISC ops.

    Or go heads and tails packing.

    But you have not built such, you built an improved RISC…

    I spent 7 years doing x86-64.....so much for not having.....

    It was that episode that cemented me on the value of [Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
    and LD-OP-STs. Then I took that and made a better RISC ISA.
    That RISC ISA did not have LD-OP-STs because of OpCode
    encoding reasons not from pipelining reasons.

    I assume OP-ST has issues with the value getting stuck if the address is
    slow to resolve. With a register the value can just spill to the
    register backing file. And because of this you create a hidden register
    name for the value.

    Athlon and Opteron had value capturing reservation stations.
    K9 had value-free RSs. It caused little headache because
    while we did not give it a named physical register, we did
    give it a physical register for the intermediates. SW can only
    read/write named PRs getting the name from logical to physical
    register renaming.

    You have information on how many hidden registers are in flight on
    average and worst case, so I believe your numbers.

    I have not looked to see if compilers generate LD-OP and OP-ST, at one
    point Intel was discouraging such code.

    Partially because AMD performed "relatively" better on LD-OPs and
    LD-OP-STs than Intel at that time. Where "relatively" means
    significantly above the noise level but "not all that much".


  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Aug 21 12:00:47 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 20 Aug 2024 18:18:25 +0000, EricP wrote:
    On OoO, if the reservation stations are valueless, you need a real
    physical register to stash the temp value as there is no guarantee
    the OP part of the uOp will launch just when the LD part finishes
    doing its thing and forwards the value.

    In the LD-OP-ST microarchitecture there would be some buffer
    that carries the intermediate values through the execution
    window. And, Yes, you can build a LD-OP-ST reservation station
    (Athlon and Opteron did).

    All the material I have seen is that AMD has a load-store ROP, but the
    op in between is in a separate functional unit, with a separate
    scheduler entry; and I expect that the load-store ROP occupies the
    load/store scheduler(s) twice: once for the load part, once for the
    store part. There is also something about macroops that can be
    load-op-stores, but from what I have read, when it comes to execution,
    they are split into ROPs.

    If you have more details that contradict the information published up
    to now, please let us know more about them.

    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Aug 21 10:13:12 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    Latency is not the issue in modern high-performance AMD64 cores, which
    have zero-cycle store-to-load forwarding <http://www.complang.tuwien.ac.at/anton/memdep/>.

    And yet, putting variables in registers gives a significant speedup:
    On a Rocket Lake, numbers are times in seconds:

    sieve bubble matrix fib fft
    0.075 0.070 0.036 0.049 0.017 TOS in reg, RP in reg, IP in reg
    0.100 0.149 0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem

    In the first line, I used gforth-fast and tried to disable all
    optimizations except those that keep certain variables in registers:

    gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs

    I could not reduce the static superinstructions below 31 and still get
    a result; I will have to investigate why, but that probably does not
    make that much of a difference for several of these benchmarks.

    In the second line I used gforth, an engine that keeps the top of
    stack in memory, the return-stack pointer in memory, stores IP to
    memory after every change, and does not use static superinstructions,
    all for better identifying where an error happened.

    In the examples cited, the lack of register allocation triples
    the instruction count due to the lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a
    non-LD-OP machine would need to break even on the instruction count.

    What makes you think that instruction count is particularly relevant?
    Yes, you may save some decoding resources if you use LD-OP-ST on an
    architecture that supports it, but you first had to invest in a more
    complex decoder. And in the OoO engine the difference may be gone (at
    least on Intel CPUs).

    Consider the Forth program

    : squared dup * ;

    This results in the following code sequences for the two engines
    mentioned above:

    dup 1->1 dup 0->0
    mov $50[r13],r15
    add rbx,$08 add r15,$08
    mov $00[r13],r8 mov rax,[r14]
    sub r13,$08 sub r14,$08
    mov [r14],rax
    * 1->1 * 0->0
    mov $50[r13],r15
    add rbx,$08 add r15,$08
    mov rax,$08[r14]
    imul r8,$08[r13] imul rax,[r14]
    add r13,$08 add r14,$08
    mov [r14],rax
    ;s 1->1 ;s 0->0
    mov $50[r13],r15
    mov rax,$58[r13]
    mov rbx,[r14] mov r10,[rax]
    add r14,$08 add rax,$08
    mov $58[r13],rax
    mov r15,r10
    mov rax,[rbx] mov rcx,[r15]
    jmp rax jmp rcx

    TOS=r8, RP=r14, IP=rbx TOS=[r14], RP=$58[r13], IP=r15/$50[r13]

    The registers are allocated differently in the two engines; for the
    three things where the memory/register allocation differed, I have
    shown the allocation.
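    The difference between the two allocation schemes can be sketched in C.
    This is a hypothetical, much-simplified model of "dup" in the two engine
    styles above; the function names and stack layout are illustrative, not
    Gforth's actual code:

```c
/* Hypothetical, much-simplified C model of the Forth word "dup" in the
 * two engine styles shown above.  The data stack grows downward, as in
 * the assembly listings.  Names are illustrative, not Gforth's. */

/* gforth-fast style: the top of stack is cached in a C variable that
 * the compiler keeps in a register; dup only stores the old copy. */
static long *dup_tos_in_reg(long *sp, long tos)
{
    *sp = tos;      /* spill a copy of the register-held TOS */
    return sp - 1;  /* one cell pushed; TOS itself stays in the register */
}

/* plain-gforth style: the top of stack lives in memory at *sp, so dup
 * needs a load and a store on every execution. */
static long *dup_tos_in_mem(long *sp)
{
    long v = *sp;   /* load TOS from memory */
    sp -= 1;
    *sp = v;        /* store the duplicate */
    return sp;
}
```

    Compiled at -O2, the first variant tends to become a store plus a
    pointer adjustment, while the second adds a load and a dependent store,
    mirroring the instruction-count difference in the listings.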

    One interesting case is the sequence

    7FA02A77133D: mov rax,$58[r13]
    7FA02A771341: mov r10,[rax]
    7FA02A771344: add rax,$08
    7FA02A771348: mov $58[r13],rax

    Sure you could use a load-op-store instruction for adding 8 to
    $58[r13], but the mov in 7FA02A771341 still needs the value in a
    register, so apparently gcc (which produced the code snippets for the individual Forth words above) decided that it's better to save
    execution resources rather than reduce the number of instructions (at
    a higher execution resource cost) by writing the code as

    mov rax,$58[r13]
    add $58[r13], $8
    mov r10,[rax]

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Brett on Wed Aug 21 12:09:58 2024
    Brett <ggtgp@yahoo.com> writes:
    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.

    Can you name one architecture that has OP-ST? For VAX instructions
    the sources can be in registers and the target in memory, so let's
    refine this: Can you name one architecture that does not have
    LD-OP-ST, but that has OP-ST? The S/360 and PDP-11 approach of having
    one memory operand that can be either a source (ld-op) or a source and
    target (ld-op-st) seems to have had many successors, in particular 8086/IA-32/AMD64.

    I have not looked to see if compilers generate LD-OP and OP-ST; at one
    point Intel was discouraging such code.

    There is no OP-ST in the AMD64 instruction set, but gcc certainly
    generates LD-OP and LD-OP-ST; for the latter:

    Code +
    5570F6F544D1: mov $50[r13],r15
    5570F6F544D5: add r15,$08
    5570F6F544D9: lea rax,$08[r14]
    5570F6F544DD: mov rdx,[r14]
    5570F6F544E0: add [rax],rdx
    5570F6F544E3: mov r14,rax
    5570F6F544E6: mov rcx,[r15]
    5570F6F544E9: jmp rcx

    The instruction at 5570F6F544E0 is a LD-OP-ST.
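    For comparison, a minimal C case that gcc commonly turns into such a
    load-op-store. The exact instruction selection depends on compiler
    version and flags, so treat the commented asm as an illustration rather
    than guaranteed output, and `add_to_cell` as an invented name:

```c
/* A read-modify-write of a memory cell.  At -O2 on AMD64, gcc commonly
 * compiles the body to a single load-op-store instruction such as
 * "addq %rsi, (%rdi)"; exact codegen depends on compiler and flags. */
void add_to_cell(long *cell, long n)
{
    *cell += n;
}
```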

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to Anton Ertl on Wed Aug 21 16:42:33 2024
    On Wed, 21 Aug 2024 12:00:47 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton

    AFAIK, on the Intel side, LD-OP-ST is decoded into 4 uOps that are
    immediately fused into 2 fused uOps. They travel through rename phase
    as 2 uOps. I am not sure if they are split back into 4 uOps before or
    after OoO schedulers, but would guess the former.

  • From Scott Lurndal@21:1/5 to Brett on Wed Aug 21 14:28:17 2024
    Brett <ggtgp@yahoo.com> writes:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to the lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a
    non-LD-OP machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    These are some of the most important operations. Even ARM64
    supports a small set of LD-OP-ST (atomic) operations. Most CPU
    implementations delegate them to the cache subsystem or
    to an I/O device (e.g. PCIe which supports atomic operations).
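    A hedged C sketch of how such an atomic LD-OP-ST is reached from source.
    The single-instruction LDADD form requires an ARMv8.1+/LSE target; the
    function name is invented for this example:

```c
#include <stdatomic.h>

/* Atomic read-modify-write.  Built for ARMv8.1+ with LSE
 * (e.g. -march=armv8.1-a), compilers can emit a single LDADD here --
 * an atomic LD-OP-ST that the core may delegate to the cache
 * subsystem.  On AMD64 the same source becomes "lock xadd".
 * Codegen depends on target and flags. */
long fetch_add_cell(_Atomic long *cell, long n)
{
    return atomic_fetch_add(cell, n);  /* returns the old value */
}
```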

  • From Anton Ertl@21:1/5 to Michael S on Wed Aug 21 15:28:05 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 21 Aug 2024 12:00:47 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton

    AFAIK, on the Intel side, LD-OP-ST is decoded into 4 uOps that are
    immediately fused into 2 fused uOps.

    Which 4 uops and 2 macroops are those? My guess is that ST is
    store-data and store-address uops, and ld and op are one uop each.

    They travel through rename phase
    as 2 uOps.

    Interesting. But yes, only two values are generated for physical
    registers: the result of the load and the result of the op. So I
    expect that the two store parts are tacked onto the op on the way
    through the renamer, and then that macroop is split into its parts on
    the way to the schedulers.

    I am not sure if they are split back into 4 uOps before or
    after OoO schedulers, but would guess the former.

    Golden Cove is depicted as having an op scheduler, a load scheduler
    and a store scheduler, so they have to split the ld-op-store into at
    least three parts for scheduling.

    Sunny Cove is depicted as having an op scheduler, a store data
    scheduler, and two AGU schedulers, which would again mean at least
    three parts, but this time with a different split.

    Both based on <https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/>
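    The splitting being described can be mocked up as a toy routing model.
    This is purely illustrative, not Intel's actual design; the uop kinds
    and scheduler names are invented for the sketch:

```c
/* Toy model of cracking one LD-OP-ST macroop for a Golden-Cove-style
 * back end with separate op, load and store schedulers.  Purely
 * illustrative; kinds and routing are invented for this sketch. */
enum uop_kind { UOP_LOAD, UOP_OP, UOP_STA, UOP_STD }; /* STA/STD = store addr/data */
enum scheduler { SCHED_OP, SCHED_LOAD, SCHED_STORE };

/* Crack "add [mem], reg" into its four constituent uops. */
static int crack_ld_op_st(enum uop_kind out[4])
{
    out[0] = UOP_LOAD;  /* read the memory operand */
    out[1] = UOP_OP;    /* the ALU part */
    out[2] = UOP_STA;   /* store address (reuses the load's address) */
    out[3] = UOP_STD;   /* store data */
    return 4;
}

/* Route each uop to a scheduler: three different schedulers are
 * touched, matching the "at least three parts" observation above. */
static enum scheduler route(enum uop_kind k)
{
    switch (k) {
    case UOP_LOAD: return SCHED_LOAD;
    case UOP_OP:   return SCHED_OP;
    default:       return SCHED_STORE;  /* both store halves */
    }
}
```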

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Stephen Fuld@21:1/5 to Anton Ertl on Wed Aug 21 08:49:10 2024
    On 8/21/2024 3:13 AM, Anton Ertl wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    Latency is not the issue in modern high-performance AMD64 cores, which
    have zero-cycle store-to-load forwarding <http://www.complang.tuwien.ac.at/anton/memdep/>.

    And yet, putting variables in registers gives a significant speedup:
    On a Rocket Lake, numbers are times in seconds:

    sieve bubble matrix fib fft
    0.075 0.070 0.036 0.049 0.017 TOS in reg, RP in reg, IP in reg
    0.100 0.149 0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem

    In the first line, I used gforth-fast and tried to disable all
    optimizations except those that keep certain variables in registers:

    gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs

    I could not reduce the static superinstructions below 31 and still get
    a result; I will have to investigate why, but that probably does not
    make that much of a difference for several of these benchmarks.

    In the second line I used gforth, an engine that keeps the top of
    stack in memory, the return-stack pointer in memory, stores IP to
    memory after every change, and does not use static superinstructions,
    all for better identifying where an error happened.

    In the examples cited, the lack of register allocation triples
    the instruction count due to the lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a
    non-LD-OP machine would need to break even on the instruction count.

    What makes you think that instruction count is particularly relevant?
    Yes, you may save some decoding resources if you use LD-OP-ST on an
    architecture that supports it, but you first had to invest in a more
    complex decoder. And in the OoO engine the difference may be gone (at
    least on Intel CPUs).

    There are also some savings in reduced I-cache usage (possibly leading
    to a higher I-cache hit rate), reduced memory bandwidth required for
    I-fetch, etc., though these may be modest at best.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Anton Ertl@21:1/5 to Stephen Fuld on Wed Aug 21 16:45:37 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    There are also some savings in reduced I-cache usage (possibly leading
    to a higher I-cache hit rate), reduced memory bandwidth required for
    I-fetch, etc., though these may be modest at best.

    Let's see how that works out. I am using the code size numbers
    from <2024Jan4.101941@mips.complang.tuwien.ac.at>:

    bash grep gzip
    595204 107636 46744 armhf 16 regs load/store 32-bit
    599832 101102 46898 riscv64 32 regs load/store 64-bit
    796501 144926 57729 amd64 16 regs ld-op ld-op-st 64-bit
    829776 134784 56868 arm64 32 regs load/store 64-bit
    853892 152068 61124 i386 8 regs ld-op ld-op-st 32-bit
    891128 158544 68500 armel 16 regs load/store 32-bit
    892688 168816 64664 s390x 16 regs ld-op ld-op-st 64-bit
    1020720 170736 71088 mips64el 32 regs load/store 64-bit
    1168104 194900 83332 ppc64el 32 regs load/store 64-bit

    So the least code size is from a load/store architecture with 16
    registers, followed (or preceded in the case of grep) by a load/store architecture with 32 registers. The instruction sets that have
    load-op and load-op-st instructions result in bigger code. The
    different sizes of armhf (ARMv7) and armel (ARMv4t-ARMv6t) show that
    there is more to code sizes than just the architecture.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Anton Ertl on Wed Aug 21 17:54:44 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:
    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    Latency is not the issue in modern high-performance AMD64 cores, which
    have zero-cycle store-to-load forwarding <http://www.complang.tuwien.ac.at/anton/memdep/>.

    And yet, putting variables in registers gives a significant speedup:
    On a Rocket Lake, numbers are times in seconds:

    sieve bubble matrix fib fft
    0.075 0.070 0.036 0.049 0.017 TOS in reg, RP in reg, IP in reg
    0.100 0.149 0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem

    In the first line, I used gforth-fast and tried to disable all
    optimizations except those that keep certain variables in registers:

    gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs

    I could not reduce the static superinstructions below 31 and still get
    a result; I will have to investigate why, but that probably does not
    make that much of a difference for several of these benchmarks.

    Fixed that, so now with

    gforth-fast --ss-states=1 --ss-number=0 --opt-ip-updates=0 onebench.fs

    sieve bubble matrix fib fft
    0.069 0.074 0.036 0.052 0.017 TOS in reg, RP in reg, IP in reg
    0.100 0.149 0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem

    Or on a Golden Cove:

    sieve bubble matrix fib fft
    0.059 0.059 0.024 0.047 0.020 TOS in reg, RP in reg, IP in reg
    0.108 0.156 0.065 0.098 0.037 TOS in mem, RP in mem, IP write-through to mem

    So even on these advanced cores with zero-cycle store-to-load
    forwarding it hurts quite a bit to keep variables in memory.
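    A small C analogue of the same register-vs-memory effect. Illustrative
    only: the `volatile` forces a store and a reload on every iteration,
    loosely mimicking an engine that keeps its state in memory; it is not
    how Gforth is built, and the function names are invented:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: the same reduction with the accumulator forced
 * through memory on each iteration (volatile => store + reload) versus
 * left in a register.  Loosely mimics keeping TOS/IP in memory. */
uint64_t sum_via_memory(const uint64_t *a, size_t n)
{
    volatile uint64_t acc = 0;   /* accumulator lives in memory */
    for (size_t i = 0; i < n; i++)
        acc += a[i];             /* load, add, store every iteration */
    return acc;
}

uint64_t sum_via_register(const uint64_t *a, size_t n)
{
    uint64_t acc = 0;            /* the compiler keeps this in a register */
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}
```

    Both functions compute the same sum; only the placement of the
    accumulator differs, so any timing gap between them comes from the
    extra store-to-load traffic.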

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Stephen Fuld@21:1/5 to Anton Ertl on Wed Aug 21 10:20:07 2024
    On 8/21/2024 9:45 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    There are also some savings in reduced I-cache usage (possibly leading
    to a higher I-cache hit rate), reduced memory bandwidth required for
    I-fetch, etc., though these may be modest at best.

    Let's see how that works out. I am using the code size numbers
    from <2024Jan4.101941@mips.complang.tuwien.ac.at>:

    bash grep gzip
    595204 107636 46744 armhf 16 regs load/store 32-bit
    599832 101102 46898 riscv64 32 regs load/store 64-bit
    796501 144926 57729 amd64 16 regs ld-op ld-op-st 64-bit
    829776 134784 56868 arm64 32 regs load/store 64-bit
    853892 152068 61124 i386 8 regs ld-op ld-op-st 32-bit
    891128 158544 68500 armel 16 regs load/store 32-bit
    892688 168816 64664 s390x 16 regs ld-op ld-op-st 64-bit
    1020720 170736 71088 mips64el 32 regs load/store 64-bit
    1168104 194900 83332 ppc64el 32 regs load/store 64-bit

    So the least code size is from a load/store architecture with 16
    registers, followed (or preceded in the case of grep) by a load/store architecture with 32 registers. The instruction sets that have
    load-op and load-op-st instructions result in bigger code.

    Interesting, thanks.


    The
    different sizes of armhf (ARMv7) and armel (ARMv4t-ARMv6t) show that
    there is more to code sizes than just the architecture.

    Certainly. It would take a more detailed analysis (which I am not
    capable of) to determine all the causes of the results you show.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Michael S@21:1/5 to mitchalsup@aol.com on Wed Aug 21 22:31:01 2024
    On Wed, 21 Aug 2024 19:13:55 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    The LD-OP-STs in Athlon and Opteron had a memory OpCode and a
    calculation OpCode, and were performed in such a way that the physical
    address of the LD was used for the ST when its time came. The
    calculation OpCode was executed by an ALU or the IMUL/DIV unit.


    Are you sure about IMUL/DIV?
    MUL and DIV instructions have no RMW form on x86/i386/AMD64.
    OTOH, shifts do.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Aug 21 19:13:55 2024
    On Wed, 21 Aug 2024 12:00:47 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 20 Aug 2024 18:18:25 +0000, EricP wrote:
    On OoO, if the reservation stations are valueless, you need a real
    physical register to stash the temp value as there is no guarantee
    the OP part of the uOp will launch just when the LD part finishes
    doing its thing and forwards the value.

    In the LD-OP-ST microarchitecture there would be some buffer
    that carries the intermediate values through the execution
    window. And, Yes, you can build a LD-OP-ST reservation station
    (Athlon and Opteron did).

    All the material I have seen is that AMD has a load-store ROP, but the
    op in between is in a separate functional unit, with a separate
    scheduler entry; and I expect that the load-store ROP occupies the
    load/store scheduler(s) twice: once for the load part, once for the
    store part.

    The LD-OP-STs in Athlon and Opteron had a memory OpCode and a calculation
    OpCode, and were performed in such a way that the physical address of
    the LD was used for the ST when its time came. The calculation OpCode
    was executed by an ALU or the IMUL/DIV unit.

    There is also something about macroops that can be load-op-stores, but from what I have read, when it comes to execution,
    they are split into ROPs.

    If you have more details that contradict the information published up
    to now, please let us know more about them.

    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton

  • From Michael S@21:1/5 to Anton Ertl on Wed Aug 21 22:46:12 2024
    On Wed, 21 Aug 2024 15:28:05 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 21 Aug 2024 12:00:47 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    On the Intel side, LD-OP-ST is split into three uops according to
    everything I have read. Apparently they are satisfied with this
    approach, or they would have gone for something else.

    - anton

    AFAIK, on the Intel side, LD-OP-ST is decoded into 4 uOps that are
    immediately fused into 2 fused uOps.

    Which 4 uops and 2 macroops are those? My guess is that ST is
    store-data and store-address uops, and ld and op are one uop each.


    Most likely.

    They travel through rename phase
    as 2 uOps.

    Interesting. But yes, only two values are generated for physical
    registers: the result of the load and the result of the op. So I
    expect that the two store parts are tacked onto the op on the way
    through the renamer, and then that macroop is split into its parts on
    the way to the schedulers.

    I am not sure if they are split back into 4 uOps before or
    after OoO schedulers, but would guess the former.

    Golden Cove is depicted as having an op scheduler, a load scheduler
    and a store scheduler, so they have to split the ld-op-store into at
    least three parts for scheduling.

    Sunny Cove is depicted as having an op scheduler, a store data
    scheduler, and two AGU schedulers, which would again mean at least
    three parts, but this time with a different split.

    Both based on <https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/>

    - anton

    Unlike previous Intel cores, both Sunny Cove and Golden Cove have no
    universal AGUs. Each AGU is dedicated either to calculation of load
    addresses or to calculation of store addresses (2+2 on SuCo, 3+2 on
    GoCo).
    So, on these cores I see no way that fewer than 4 uOps can go to the
    schedulers. My uncertainty was about older PRF-based cores, i.e. SB
    through Skylake.

  • From Stephen Fuld@21:1/5 to All on Thu Aug 22 12:01:13 2024
    On 8/20/2024 6:40 PM, MitchAlsup1 wrote:


    snip


    I spent 7 years doing x86-64.....so much for not having.....

    It is that episode that cemented for me the value of
    [Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
    and LD-OP-STs. Then I took that and made a better RISC ISA.
    That RISC ISA did not have LD-OP-STs for OpCode encoding
    reasons, not pipelining reasons.


    I understand that providing LD-OP for all the operations would take a
    lot of opcode space. But I suspect that the utility of LD-OP varies
    with which operation is involved; e.g., there are probably more
    instances where a combined load and integer add would be useful than a
    combined load and floating-point divide. I suspect that determining
    the few most useful combinations wouldn't be too difficult.

    So the question. Does it make sense to use a few op codes to implement
    the most common LD-OPs?




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Brett@21:1/5 to Brett on Fri Aug 23 00:32:14 2024
    Brett <ggtgp@yahoo.com> wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 23:08:03 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 20 Aug 2024 17:40:50 +0000, Michael S wrote:

    On Tue, 20 Aug 2024 16:40:06 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    and you may have
    several of these in a local sequence of code. ...

    No, you can not have several. It's always one then another one then yet
    another one etc... Each one can reuse the same temporary register.

    The point is that the cost of not getting allocated into a register
    is vastly lower--the count of instructions remains 1 while the
    latency increases. That increase in latency does not hurt those
    use once/seldom variables.

    In the examples cited, the lack of register allocation triples
    the instruction count due to the lack of LD-OP and LD-OP-ST. The
    register count I stated is how many registers a
    non-LD-OP machine would need to break even on the instruction count.


    LD-OP-ST is a bridge too far for me.

    LD-OP and OP-ST are fine with me and have benefits.

    If you put the cache write at or after the register-file write in the
    pipeline, LD-OP-ST basically falls out for free, and you can
    move the intermediate values from where they are produced
    to where they are consumed with forwarding.

    LD-OP-ST mostly only fits if it is add to memory.

    42-bit opcodes work: you only need one in four RISC opcodes to merge into an
    LD-OP or OP-ST for code density to break even, and generally you will do better.

    The two leftover bits can be ignored, or be a template indicator, so you
    can pack in a LD-OP-ST, or 31 bit RISC ops.

    When you use a packet to hold 3 LD-OP-ST or 4 RISC ops, I am not talking about two separate decoders.

    75% of the data format would be shared, which, yes, means one will be
    scattered, but that does not matter in the grand scheme of things. Two
    fully separate decoders would be far, far uglier.

    Or go heads and tails packing.

    But you have not built such; you built an improved RISC…

    I spent 7 years doing x86-64.....so much for not having.....

    It is that episode that cemented for me the value of
    [Rbase+Rindex<<scale+Displacement] and the utility of LD-OPs
    and LD-OP-STs. Then I took that and made a better RISC ISA.
    That RISC ISA did not have LD-OP-STs for OpCode encoding
    reasons, not pipelining reasons.

    I assume OP-ST has issues with the value getting stuck if the address is
    slow to resolve. With a register the value can just spill to the
    register backing file. And because of this you create a hidden register
    name for the value.

    Athlon and Opteron had value-capturing reservation stations.
    K9 had value-free RSs. It caused little headache because,
    while we did not give it a named physical register, we did
    give it a physical register for the intermediates. SW can only
    read/write named PRs, getting the name from logical-to-physical
    register renaming.

    You have information on how many hidden registers are in flight on
    average and worst case, so I believe your numbers.

    I have not looked to see if compilers generate LD-OP and OP-ST; at one
    point Intel was discouraging such code.

    Partially because AMD performed "relatively" better on LD-OPs and
    LD-OP-STs than Intel at that time. Where "relatively" means
    significantly above the noise level but "not all that much".
