What is Concertina 2?
On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:
What is Concertina 2?
Roughly speaking, it is a design where most of the non-power of 2
data types are being supported {36-bits, 48-bits, 60-bits} along
with the standard power of 2 lengths {8, 16, 32, 64}.
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:
What is Concertina 2?
Roughly speaking, it is a design where most of the non-power of 2
data types are being supported {36-bits, 48-bits, 60-bits} along
with the standard power of 2 lengths {8, 16, 32, 64}.
Both sets are congruent to zero modulo 4.
Therefore, the
only proper solution becomes that modulo value, which amounts
in this case to a 4-bit digit/nibble. Any size data type can
be constructed from a variable number of nibbles up
to some architectural max (e.g. 400 bits for a 100 nibble
operand). The processor can treat them as binary or BCD
depending on the requirements of the application (e.g. BCD
fits COBOL well).
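A minimal sketch of the nibble idea above, assuming nibbles are stored least-significant first; the function names are mine, not from any ISA:

```c
#include <stdint.h>

/* Sketch of the nibble scheme above: an operand is a string of 4-bit
   digits that the processor may treat as binary or as BCD. Nibble order
   (least-significant first) and function names are my assumptions. */

/* Interpret n nibbles as a binary integer. */
uint64_t nibbles_as_binary(const uint8_t *nib, int n) {
    uint64_t v = 0;
    for (int i = n - 1; i >= 0; i--)
        v = (v << 4) | (nib[i] & 0xF);
    return v;
}

/* Interpret the same n nibbles as BCD, one decimal digit per nibble. */
uint64_t nibbles_as_bcd(const uint8_t *nib, int n) {
    uint64_t v = 0;
    for (int i = n - 1; i >= 0; i--)
        v = v * 10 + (nib[i] & 0xF);
    return v;
}
```

The same two nibbles {2, 1} read as 0x12 (18) in binary but 12 in BCD, which is the kind of dual interpretation COBOL-style decimal arithmetic wants.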
On Thu, 22 May 2025 18:03:34 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:
What is Concertina 2?
Roughly speaking, it is a design where most of the non-power of 2
data types are being supported {36-bits, 48-bits, 60-bits} along
with the standard power of 2 lengths {8, 16, 32, 64}.
Both sets are congruent to zero modulo 4.
Restricted because it does not support 28-bits, 40-bits, 44-bits,...
Since the basis of the ISA is a RISC-like ISA,
3) Use only four base registers instead of eight.
4) Use only three index registers instead of seven.
5) Use only six index registers instead of seven, and use only four base registers instead of eight when indexing is used.
On 6/11/2025 12:56 AM, Thomas Koenig wrote:
quadibloc <quadibloc@gmail.com> schrieb:
Since the basis of the ISA is a RISC-like ISA,
[...]
3) Use only four base registers instead of eight.
4) Use only three index registers instead of seven.
5) Use only six index registers instead of seven, and use only four base registers instead of eight when indexing is used.
Having different classes of base and index registers is very
un-RISCy, and not generally a good idea. General purpose registers
is one of the great things that the /360 got right, as the VAX
later did, and the 68000 didn't.
Agreed.
Ideally, one has an ISA where nearly all registers are the same:
No distinction between base/index/data registers;
No distinction between integer and floating point registers;
No distinction between general registers and SIMD registers;
...
Though, there are tradeoffs. For example, SPRs can be, by definition,
not the same as GPRs. Say, if you have an SP or LR, almost by
definition, you will not be using it as a GPR.
So, if ZR/LR/SP/GP are "not GPR", this is fine.
Pretty much everything else is best served by being a GPR or suitably
GPR like.
....
On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:
Having different classes of base and index registers is very
un-RISCy, and not generally a good idea. General purpose registers
is one of the great things that the /360 got right, as the VAX
later did, and the 68000 didn't.
This is true.
However, if the memory reference instructions had 5 bits for the
destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
bit left for the opcode.
As I required 5 bits for the opcode to allow both loads and stores for several sizes each of integer and floating-point operands, I had to save
bits somewhere.
Therefore, I reduced the index register and base register fields to
three bits each, using only some of the 32 integer registers for those purposes.
A standard RISC would not have an index register field, only a base
register field, meaning array accesses would require multiple
instructions.
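What "multiple instructions" means in practice can be sketched like this: without base+index addressing, loading a[i] (8-byte elements) decomposes into separate address arithmetic, shown here with a MIPS-like sequence in the comments. Illustrative only.

```c
#include <stdint.h>

/* Loading a[i] on a RISC with only base+displacement addressing:
   the scaling and the add become their own instructions. */
uint64_t load_indexed(const uint64_t *base, uint64_t index) {
    uintptr_t t0 = (uintptr_t)index << 3;           /* sll t0, index, 3 */
    const uint64_t *t1 =
        (const uint64_t *)((uintptr_t)base + t0);   /* add t1, base, t0 */
    return *t1;                                     /* ld  dest, 0(t1)  */
}
```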
The 68000 only had base-index addressing with an 8-bit displacement;
true base-index addressing with a normal displacement arrived in the
68020, but the instructions using it took up 48 bits.
I'll agree the 68000 architecture did have a serious mistake. It was
CISC, so it didn't need to be RISC-like, but the special address
registers should only have been used as base registers; the regular arithmetic registers should have been the ones used as index registers,
since one has to do arithmetic to produce valid index values.
The separate address registers would then have been useful, by allowing
those of the eight (rather than 16 or 32) general registers that would
have been used up holding static base register values to be freed up.
John Savard
On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:
Having different classes of base and index registers is very
un-RISCy, and not generally a good idea. General purpose registers
is one of the great things that the /360 got right, as the VAX
later did, and the 68000 didn't.
This is true.
However, if the memory reference instructions had 5 bits for the
destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
bit left for the opcode.
For example, SPRs can be, by definition,
not the same as GPRs. Say, if you have an SP or LR, almost by
definition, you will not be using it as a GPR.
So, if ZR/LR/SP/GP are "not GPR", this is fine.
However, if the memory reference instructions had 5 bits for the
destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
bit left for the opcode.
BGB <cr88192@gmail.com> writes:
For example, SPRs can be, by definition,
not the same as GPRs. Say, if you have an SP or LR, almost by
definition, you will not be using it as a GPR.
So, if ZR/LR/SP/GP are "not GPR", this is fine.
I assume you mean Zero register, link register, stack pointer, global pointer. On most register architectures (those with GPRs) all of them
are addressed as GPRs in most instructions. Specifically:
Zero register: The CISCs (S/360, PDP-11, VAX, IA-32, AMD64) don't have
a zero register, but use immediate 0 instead. Most RISCs have a
register (register 0 or 31) that is addressed like a GPR, but really
is a special-purpose register: It reads as 0 and writing to it has no
effect. Power has some instructions that treat register 0 as zero
register and others that treat it as GPR.
Link register: On some architectures there is a register that is a GPR
as far as most instructions are concerned. But the call instruction
with immediate (relative) target uses that register as implicit target
for the return address. MIPS is an example of that. Power has LR as
a special-purpose register.
Stack pointer: That's just software-defined on many register
architectures, i.e., one could change the ABI to use a different stack pointer, and the resulting code would have the same size and speed.
An interesting case is RISC-V. In RV64G it's just software-defined,
but the C (compressed) extension defines some instructions that
provide smaller instructions for a specific assignment of SP to the
GPRs; I expect that similar things happen for other compressed
instruction set extensions.
Global pointer: That's just software-defined on all register
architectures I am aware of.
Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
have the PC addressed like a GPR, although it clearly is a
special-purpose register. Most RISCs don't have this, and don't even
have a PC-relative addressing mode or somesuch. Instead, they use
ABIs where global pointers play a big role.
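The zero-register convention described above can be modelled in a few lines; this is just the observable behavior, not any real ISA's definition:

```c
#include <stdint.h>

/* Toy model: register 0 reads as 0, and writes to it are discarded. */
enum { NREGS = 32 };

static uint64_t regfile[NREGS];

uint64_t reg_read(int r) {
    return r == 0 ? 0 : regfile[r];
}

void reg_write(int r, uint64_t v) {
    if (r != 0)                 /* writes to register 0 have no effect */
        regfile[r] = v;
}
```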
- anton
On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:
Having different classes of base and index registers is very
un-RISCy, and not generally a good idea. General purpose registers
is one of the great things that the /360 got right, as the VAX
later did, and the 68000 didn't.
This is true.
However, if the memory reference instructions had 5 bits for the
destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
bit left for the opcode.
As I required 5 bits for the opcode to allow both loads and stores for several sizes each of integer and floating-point operands, I had to save
bits somewhere.
On Wed, 11 Jun 2025 16:49:06 +0000, MitchAlsup1 wrote:
On Wed, 11 Jun 2025 14:12:04 +0000, quadibloc wrote:
Therefore, I reduced the index register and base register fields to
three bits each, using only some of the 32 integer registers for those
purposes.
This is going to hurt register allocation.
Yes. It will. Unfortunately.
Basically, as should be apparent by now, my overriding goal in defining
the Concertina II architecture - and its predecessor as well - was to
make it "just as good", or at least "just _about_ as good", as both the
68020 and the IBM System/360.
This meant that I had to be able to fit base plus index plus
displacement into 32 bits, since the System/360 did that, and I had to
have 16-bit displacements because the 68020, and indeed x86 and most other microprocessors, did that.
And I had to have register-to-register operate instructions that fit
into only 16 bits. Because the System/360 had them, and indeed so do
many microprocessors.
Otherwise, my ISA would be clearly and demonstrably inferior. Where I couldn't attain a full match, I tried to be at least "almost" as good.
So either my 16-bit operate instructions have to come in pairs, and have
a very restricted set of operations, or they require the overhead of a
block header. I couldn't attain the goal of matching the S/360
completely, but at least I stayed close.
So while having 32 registers like a RISC, I ended up having some
purposes for which I could only use a set of eight registers. Not great,
but it was the tradeoff that was left to me given the choice I made.
So here it is - an ISA that offers RISC-like simplicity of decoding, but
an instruction set that approaches CISC in code compactness - and which offers a choice of RISC, CISC, or VLIW programming styles, which may
lead to VLIW speed and efficiency on suitable implementations.
John Savard
quadibloc wrote:
On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:
Having different classes of base and index registers is very
un-RISCy, and not generally a good idea. General purpose registers
is one of the great things that the /360 got right, as the VAX
later did, and the 68000 didn't.
This is true.
However, if the memory reference instructions had 5 bits for the
destination register, 5 bits for the index register, 5 bits for the base
register, and 16 bits for the displacement, then there would only be one
bit left for the opcode.
Plus 3 bits for the load/store operand type & size,
plus 2 or 3 bits for the index scaling (I use 3).
It all won't fit into a 32-bit fixed length instruction.
A separate LEA Load Effective Address instruction to calculate rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an
alternative.
Then rDest is used as a base in the LD or ST.
One or two constant prefix instruction(s) I mentioned before
(6 bit opcode, 26 bit constant) could extend the immediate value
to imm26<<13+imm13 = sign extended 39 bits,
or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).
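The pasting arithmetic above can be checked in a few lines. The field widths are from the post; the helper names are mine, and I read the "+" as simple concatenation of non-overlapping fields:

```c
#include <stdint.h>

static int64_t sext(uint64_t v, int bits) {     /* sign-extend low bits */
    uint64_t m = 1ULL << (bits - 1);
    return (int64_t)((v ^ m) - m);
}

/* One prefix: imm26<<13 | imm13, sign-extended from 39 bits. */
int64_t paste_one(uint32_t imm26, uint32_t imm13) {
    return sext(((uint64_t)imm26 << 13) | imm13, 39);
}

/* Two prefixes: imm26<<39 | imm26<<13 | imm13, truncated to 64 bits. */
int64_t paste_two(uint32_t hi26, uint32_t lo26, uint32_t imm13) {
    return (int64_t)(((uint64_t)hi26 << 39) |
                     ((uint64_t)lo26 << 13) | imm13);
}
```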
As I required 5 bits for the opcode to allow both loads and stores for
several sizes each of integer and floating-point operands, I had to save
bits somewhere.
The problem is that there are 7 integer data types,
signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,
and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
There might also be special (non-ieee) float formats for AI support.
Plus one might also want some register pair operations
(eg load a complex fp32 value into a pair of fp registers,
store a pair of integer registers as a single (atomic) int128).
So 3 bits for the data type for loads and stores,
which if you
put that in the opcode field uses up almost all your opcodes.
So you take the data types out of the disp16 field and now your
offset range is 13 bits +/- 4kB.
And a constant prefix instruction can extend the disp13 field
to 26+13=39 or 26+26+13=65(64) bits.
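The revised layout adds up to exactly 32 bits: 6 (opcode) + 5 (dest) + 5 (base) + 3 (type) + 13 (disp). One plausible packing, with only the field widths taken from the post (the field order is my assumption):

```c
#include <stdint.h>

/* Pack a load: 6-bit opcode, 5-bit dest, 5-bit base, 3-bit data type,
   13-bit displacement = 32 bits. Field order is illustrative. */
uint32_t encode_load(unsigned op6, unsigned rd, unsigned rb,
                     unsigned type3, unsigned disp13) {
    return ((uint32_t)(op6   & 0x3F) << 26) |
           ((uint32_t)(rd    & 0x1F) << 21) |
           ((uint32_t)(rb    & 0x1F) << 16) |
           ((uint32_t)(type3 & 0x7)  << 13) |
            (uint32_t)(disp13 & 0x1FFF);
}
```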
On Wed, 11 Jun 2025 17:33:35 +0000, Anton Ertl wrote:
quadibloc <quadibloc@gmail.com> writes:
However, if the memory reference instructions had 5 bits for the destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one bit left for the opcode.
The solution of RISC architectures has been to not have displacement
and index registers at the same time (MIPS and its descendants do not
have base+index addressing at all). The solution of CISC
architectures has been to allow bigger instructions, and possibly
different displacement sizes (e.g., 8 bits and 32 bits for IA-32 and
AMD64).
And what I've chosen is...
- to have an architecture which superficially resembles RISC,
- but which offers all the capabilities of CISC
- and which tries to approach achieving the same code density as such
classic CISC machines as the System/360
- and in addition which offers VLIW features as well
To look like RISC, and yet to have the code density of CISC is to
attempt to achieve two goals which seem to be in profound conflict with
each other. So it shouldn't be surprising that in order to do this, I've
had to break a few rules and sacrifice some elegance.
As I've striven to achieve what seemed impossible - even if some may say
I'm tilting at windmills, as no one really cares that much about code
density any more
John Savard
On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:
and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
There might also be special (non-ieee) float formats for AI support.
Plus one might also want some register pair operations
(eg load a complex fp32 value into a pair of fp registers,
store a pair of integer registers as a single (atomic) int128).
Store a pair of registers into two different memory locations
atomically is more powerful.
On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:
Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
have the PC addressed like a GPR, although it clearly is a
special-purpose register. Most RISCs don't have this, and don't even
have a PC-relative addressing mode or somesuch. Instead, they use
ABIs where global pointers play a big role.
I consider IP as a GPR a mistake--I think the PDP-11 and VAX
people figured this out as Alpha did not repeat this mistake.
Maybe in a 16-bit machine it makes some sense but once you have
8-wide fetch-decode-execute it no longer does.
MitchAlsup1 wrote:
On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:
Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
have the PC addressed like a GPR, although it clearly is a
special-purpose register. Most RISCs don't have this, and don't even
have a PC-relative addressing mode or somesuch. Instead, they use
ABIs where global pointers play a big role.
I consider IP as a GPR a mistake--I think the PDP-11 and VAX
people figured this out as Alpha did not repeat this mistake.
Maybe in a 16-bit machine it makes some sense but once you have
8-wide fetch-decode-execute it no longer does.
It had the advantage of meaningfully reusing some address modes
rather than having to add new opcode formats:
PC & autoincrement => immediate value (opcode data type gives size)
PC & autoincrement deferred => absolute address of data
PC & B/W/L relative => PC relative address of data
PC & B/W/L relative deferred => PC relative address of address of data
On the con side, for all PC-relative addressing the offset is
relative to the incremented PC after *each* operand specifier.
So two PC-rel references to the same location within a single
instruction will have different offsets.
(Not a big deal but yet another thing one has to deal with in
Decode and carry with you in any uOps.)
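The con can be put in numbers: each PC-relative displacement is taken from the PC after *that* operand specifier, so two references to the same target inside one instruction encode different offsets. The byte positions below are made up for illustration:

```c
#include <stdint.h>

/* Offset encoded for a VAX PC-relative specifier: the displacement is
   measured from the PC value after that specifier has been consumed. */
int32_t vax_offset(uint32_t pc_after_specifier, uint32_t target) {
    return (int32_t)(target - pc_after_specifier);
}
```

If the first specifier ends at 0x106 and the second at 0x10B, two references to 0x1000 in the same instruction encode 0xEFA and 0xEF5 respectively.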
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:
and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
There might also be special (non-ieee) float formats for AI support.
Plus one might also want some register pair operations
(eg load a complex fp32 value into a pair of fp registers,
store a pair of integer registers as a single (atomic) int128).
Store a pair of registers into two different memory locations
atomically is more powerful.
And more costly, given the potential need for two
TLB lookups (which could access or dirty fault (restartable))
and the potential cache coherency latency.
Holding exclusive access to two cache lines at once
as an atomic unit would complicate the coherency protocol,
particularly with respect to deadlock prevention, nicht wahr?
That's one reason multiword atomics are generally required
to be naturally aligned; to avoid crossing a cache-line or page
boundary.
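The alignment argument above is mechanical: a power-of-two-sized access that is naturally aligned can never straddle a cache line, since line sizes are larger powers of two. A sketch, assuming a 64-byte line:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Natural alignment check; size must be a power of two. */
bool naturally_aligned(uintptr_t addr, size_t size) {
    return (addr & (size - 1)) == 0;
}

/* Does an access of `size` bytes cross a 64-byte line boundary? */
bool crosses_line(uintptr_t addr, size_t size) {
    return (addr / 64) != ((addr + size - 1) / 64);
}
```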
To look like RISC, and yet to have the code density of CISC
On Wed, 11 Jun 2025 22:00:33 +0000, EricP wrote:
MitchAlsup1 wrote:
On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:
Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
have the PC addressed like a GPR, although it clearly is a
special-purpose register. Most RISCs don't have this, and don't even
have a PC-relative addressing mode or somesuch. Instead, they use
ABIs where global pointers play a big role.
I consider IP as a GPR a mistake--I think the PDP-11 and VAX
people figured this out as Alpha did not repeat this mistake.
Maybe in a 16-bit machine it makes some sense but once you have
8-wide fetch-decode-execute it no longer does.
On the con side, for all PC-relative addressing the offset is
relative to the incremented PC after *each* operand specifier.
So two PC-rel references to the same location within a single
instruction will have different offsets.
This is exactly what made wide VAXs so hard to pipeline.
On 6/11/2025 11:51 AM, Anton Ertl wrote:
Link register: On some architectures there is a register that is a GPR
as far as most instructions are concerned. But the call instruction
with immediate (relative) target uses that register as implicit target
for the return address. MIPS is an example of that. Power has LR as
a special-purpose register.
It is GPR like, but in terms of role, I don't consider it as such.
In RV64, in theory, JAL and JALR could use any register. But, the C ABI effectively limits this choice to X1.
Implicitly, the 'C' extension and some other (less standardized)
extensions also tend to hard-code the assumption of X1 being the link register.
Well, it is more a case here of, "try to put something other than the
stack pointer in SP and see how far you get with that".
There are multiple levels of systems (ISA design, OS, ...)
I am not saying they don't look like GPRs in the ISA, but rather that
they aren't really GPRs in terms of roles or behavior, but rather they
are essentially SPRs that just so happen to live in the GPR numbering space.
It might be even due to things as simple as "well, the OS kernel and
program launcher assume that stack is in X2, and system calls assume
stack is in X2, ...". You have little real choice but to put the stack
in X2, and if you try putting something else there, and a system call or interrupt happens, ..., there is little to say that things won't "go sideways", so, not really a GPR.
Global Pointer is assumed as such by the ABI, and OS may care about it,
so not really a GPR.
I decided to classify X5/TP as a GPR as its usage is roughly up to the discretion of the ABI and C runtime library (at least in RISC-V, there
are no hard-coded ISA level assumptions about TP, nor does it cross into
the OS kernel's realm of concern).
On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:
quadibloc wrote:
On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:
Having different classes of base and index registers is very
un-RISCy, and not generally a good idea. General purpose registers
is one of the great things that the /360 got right, as the VAX
later did, and the 68000 didn't.
This is true.
However, if the memory reference instructions had 5 bits for the
destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one bit left for the opcode.
Plus 3 bits for the load/store operand type & size,
plus 2 or 3 bits for the index scaling (I use 3).
It all won't fit into a 32-bit fixed length instruction.
A separate LEA Load Effective Address instruction to calculate
rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an
alternative.
For my part, LEA is the other form of LDD (since the signed/unsigned
notation is unused, as there is no register distinction between a signed
64-bit LD and an unsigned 64-bit LD).
Then rDest is used as a base in the LD or ST.
One or two constant prefix instruction(s) I mentioned before
(6 bit opcode, 26 bit constant) could extend the immediate value
to imm26<<13+imm13 = sign extended 39 bits,
or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).
My Mantra is to never use instructions to paste constants together.
As I required 5 bits for the opcode to allow both loads and stores for
several sizes each of integer and floating-point operands, I had to save bits somewhere.
The problem is that there are 7 integer data types,
signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,
There are 8 if you want to detect overflow differently between
signed and unsigned 64-bit values; 99.44% of programs don't care.
Which is why one "cooperates" with signedness in LDs, ignores
signedness in STs, and does exception detection only in calculation instructions.
and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
There might also be special (non-ieee) float formats for AI support.
Plus one might also want some register pair operations
(eg load a complex fp32 value into a pair of fp registers,
store a pair of integer registers as a single (atomic) int128).
Store a pair of registers into two different memory locations
atomically is more powerful.
So 3 bits for the data type for loads and stores,
3-bits for LDs, 2-bits for STs.
which if you
put that in the opcode field uses up almost all your opcodes.
With a Major OpCode size of 6-bits, the LDs + STs with DISP16
uses 3/8ths of the OpCode space, a far cry from "almost all";
but this is under the guise of a machine where GPRs and FPRs
coexist in the same file.
By using 1 Major OpCode to access another 6-bit encoding space
(called XOP1) one then has another 6-bit encoding space where
the typical LDs and STs consume 3/8ths leaving room to encode
indexing, scaling, locking behavior, and ST #value,[address]
which then avoids constant pasting instructions and waste of
registers.
On Wed, 11 Jun 2025 21:35:43 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:
and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
There might also be special (non-ieee) float formats for AI support.
Plus one might also want some register pair operations
(eg load a complex fp32 value into a pair of fp registers,
store a pair of integer registers as a single (atomic) int128).
Store a pair of registers into two different memory locations
atomically is more powerful.
And more costly, given the potential need for two
TLB lookups (which could access or dirty fault (restartable))
and the potential cache coherency latency.
There are optional ways around those problems::
{
a) if you take a TLB fault on either access--just assume the
. atomic event fails and restart after TLB is repaired.
b) if both cache lines are not writeable--just assume the
. atomic event fails and restart after both lines have
. arrived in a writeable condition.
}
which simplify the problem space.
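The shape of that optimistic policy, as control flow: if either line is not yet held writeable, the event fails and the caller restarts once the lines arrive. The coherence-state check is stubbed out; this is a sketch of the policy, not of real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for "line present in a writeable condition". */
static bool line_writeable(volatile void *p) {
    (void)p;
    return true;
}

/* Two-location atomic store, fail-and-retry style. */
bool atomic_store_pair(volatile uint64_t *a, volatile uint64_t *b,
                       uint64_t va, uint64_t vb) {
    if (!line_writeable(a) || !line_writeable(b))
        return false;       /* atomic event fails; restart later */
    *a = va;                /* both lines held: commit both stores */
    *b = vb;
    return true;
}
```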
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:
and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
There might also be special (non-ieee) float formats for AI support.
Plus one might also want some register pair operations
(eg load a complex fp32 value into a pair of fp registers,
store a pair of integer registers as a single (atomic) int128).
Store a pair of registers into two different memory locations
atomically is more powerful.
And more costly, given the potential need for two
TLB lookups (which could access or dirty fault (restartable))
and the potential cache coherency latency.
Holding exclusive access to two cache lines at once
as an atomic unit would complicate the coherency protocol,
particularly with respect to deadlock prevention, nicht wahr?
That's one reason multiword atomics are generally required
to be naturally aligned; to avoid crossing a cache-line or page boundary.
quadibloc <quadibloc@gmail.com> writes:
To look like RISC, and yet to have the code density of CISC
I have repeatedly posted measurements of code sizes of several
programs on various architectures, most recently in
<2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.
Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
packages:
bash grep gzip
595204 107636 46744 armhf
599832 101102 46898 riscv64
796501 144926 57729 amd64
829776 134784 56868 arm64
853892 152068 61124 i386
891128 158544 68500 armel
892688 168816 64664 s390x
1020720 170736 71088 mips64el
1168104 194900 83332 ppc64el
On Wed, 11 Jun 2025 17:05:13 +0000, Thomas Koenig wrote:
What is the use case for having base and index register and a
16-bit displacement?
The IBM System/360 had a base and index register and a 12-bit
displacement.
Most microprocessors have a base register and a 16-bit displacement.
So this lets my architecture be a superset of both of them.
Great selling point, and thus I didn't think too hard about whether it is
"needed", because an architecture that instead tries to provide only
genuinely necessary capabilities...
now forces programmers, used to other systems that were more generous in
one way or another, to change their habits!
That would presumably spoil sales or ruin the popularity of the ISA.
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 11 Jun 2025 22:00:33 +0000, EricP wrote:
MitchAlsup1 wrote:
On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:
Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX) have the PC addressed like a GPR, although it clearly is a
special-purpose register. Most RISCs don't have this, and don't even have a PC-relative addressing mode or somesuch. Instead, they use
ABIs where global pointers play a big role.
I consider IP as a GPR a mistake--I think the PDP-11 and VAX
people figured this out as Alpha did not repeat this mistake.
Maybe in a 16-bit machine it makes some sense but once you have
8-wide fetch-decode-execute it no longer does.
Actually, it seems to me that for the first RISC generation, it made
the least sense. You could not afford the transistors to do special
handling of the PC.
Nowadays, you can afford it, but the question still is whether it is cost-effective. Looking at recent architectures, none has PC
addressable like a GPR, so no, it does not look to be cost-effective
to have the PC addressable like a GPR. AMD64 and ARM A64 have
PC-relative addressing modes, while RISC-V does not.
On the con side, for all PC-relative addressing the offset is
relative to the incremented PC after *each* operand specifier.
So two PC-rel references to the same location within a single
instruction will have different offsets.
This is exactly what made wide VAXs so hard to pipeline.
I don't think that would cause a real problem for decoder designers
these days. It might cost some additional transistors, though.
This
design choice in VAX was very likely due to the implementation choices (sequential decoding of instruction parts) they had in mind, and these
days one would probably make a different choice even if one decided to
design an otherwise VAX-style instruction set. How did the NS32k
designers choose in this respect?
That being said, how does the design choice to include PC-relative
addressing in AMD64 and ARM A64 come out in the long run? When AMD64
and ARM A64 was designed, the data was still delivered in the microinstruction in most microarchitectures, and in that context,
PC-relative addressing does not cost extra; you just fill in the data
from the start.
But Intel switched to having separate rename registers in Sandy Bridge (around the time when ARM A64 appeared), and others did the same, so
now there is no space in the microinstructions for including the value
of the PC when the instruction was decoded.
I guess that this value
is stored in a physical register on decoding, and each use of
PC-relative addressing reduces the amount of available physical
registers from the time when the register renamer processes the
instruction until the time when the instruction is processed by the
ROB; can someone confirm this, or is it done in some other way?
- anton
MitchAlsup1 wrote:
On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:
quadibloc wrote:
On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:
Having different classes of base and index registers is very
un-RISCy, and not generally a good idea. General purpose registers
is one of the great things that the /360 got right, as the VAX
later did, and the 68000 didn't.
This is true.
However, if the memory reference instructions had 5 bits for the
destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one bit left for the opcode.
Plus 3 bits for the load/store operand type & size,
plus 2 or 3 bits for the index scaling (I use 3).
It all won't fit into a 32-bit fixed length instruction.
A separate LEA Load Effective Address instruction to calculate
rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an
alternative.
For my part, LEA is the other form of LDD (since the signed/unsigned
notation is unused as there is no register distinction between signed
64-bit LD and an unsigned 64-bit LD.
LEA doesn't need the 3 bits for data type/size.
We can allocate them to index scaling which does need them.
If fp128 is to be (someday) supported then the index scaling must be
at least 0..4 (> 2 bits). A scaling of 0..7 or 3 bits would support
single instruction array index calculations up to fp128 octonions.
The index scaling selects the octonion array element and the
displacement selects a coefficient in it.
Not that I have a use for octonions myself,
just thinking of the kids out there.
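The address arithmetic implied here: with a 3-bit scale (0..7) one calculation reaches element sizes up to 1<<7 = 128 bytes, i.e. an fp128 octonion (8 coefficients x 16 bytes), and the displacement then picks the coefficient inside the selected element. A sketch:

```c
#include <stdint.h>

/* base + (index << scale) + disp: scale selects the element stride,
   displacement selects the coefficient within the element. */
uintptr_t effective_address(uintptr_t base, uint64_t index,
                            unsigned scale, int32_t disp) {
    return base + ((uintptr_t)index << scale) + (uintptr_t)disp;
}
```

Coefficient 3 of fp128-octonion element i would be effective_address(base, i, 7, 3*16).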
Then rDest is used as a base in the LD or ST.
One or two constant prefix instruction(s) I mentioned before
(6 bit opcode, 26 bit constant) could extend the immediate value
to imm26<<13+imm13 = sign extended 39 bits,
or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).
I've had a look at the H&P graph of displacement % usage and see
that there is no significant difference between 12 and 13 bits.
As a second cut at the design, I'd make the immediate 12 bits.
So the immediate constants are either 12, 26+12=38 or 26+26+12=64 bits.
And that leaves 4 bits for function codes or data types.
My Mantra is to never use instructions to paste constants together.
John's scenario chose fixed length 32-bit instructions
so I'm just playing the cards dealt.
This allows an operation with up to 64 bits of immediate to be
defined in just 12 bytes of instruction space (same as My 66000).
It is spread over 3 instruction slots, but those CONST instructions are defined as fused in Decode, so it's similar to a variable length ISA
in that it requires no extra execute clocks.
As I required 5 bits for the opcode to allow both loads and stores for several sizes each of integer and floating-point operands, I had to save bits somewhere.
The problem is that there are 7 integer data types,
signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,
There are 8 if you want to detect overflow differently between
signed and unsigned 64-bit values; 99.44% of programs don't care.
Which is why one "cooperates" with signedness in LDs, ignores
signedness in STs, and does exception detection only in calculation
instructions.
I have various instructions to check integer down-casts ranges too
and fault on overflow. For checked languages, most overflow range
checks require one extra instruction before the ST to a smaller type.
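The extra range check before a narrowing store might look like the following; this is a hypothetical helper for illustration, not an actual instruction:

```python
def checked_store_i8(value):
    # Range-check a 64-bit signed value before storing it into an int8 slot;
    # checked languages need one such test per narrowing store.
    if not (-128 <= value <= 127):
        raise OverflowError("value does not fit in int8")
    return value & 0xFF   # the byte actually written to memory
```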
and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
There might also be special (non-ieee) float formats for AI support.
Plus one might also want some register pair operations
(eg load a complex fp32 value into a pair of fp registers,
store a pair of integer registers as a single (atomic) int128).
Store a pair of registers into two different memory locations
atomically is more powerful.
I want double wide instructions for atomic swap and compare-and-swap.
Those are restricted to naturally aligned addresses and trap if not.
To support these I also need to be able to load and store wide values.
The load and store register pair instructions accept any address but if
you want an atomic guarantee then the address must be naturally aligned.
If not naturally aligned then LD or ST could use 2 separate memory
accesses.
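A sketch of that guarantee for a 16-byte pair; the half-width split point is arbitrary here, chosen only to show that a misaligned access may legally become two:

```python
def pair_access_plan(addr, width=16):
    # A pair LD/ST is a single (atomic) access only when naturally aligned;
    # otherwise it may legally be split into two separate accesses.
    if addr % width == 0:
        return [(addr, width)]                                  # one atomic access
    return [(addr, width // 2), (addr + width // 2, width // 2)]  # two accesses
```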
So 3 bits for the data type for loads and stores,
3-bits for LDs, 2-bits for STs.
For integers, the register pair makes ST sizes 1, 2, 4, 8, 16.
I potentially have 5 float types from 1 to 16 bytes.
If I include float register pairs (for complex numbers) then
it could be load and store of 10 float data types.
So 3 bits for data type.
which if you
put that in the opcode field uses up almost all your opcodes.
With a Major OpCode size of 6-bits, the LDs + STs with DISP16
uses 3/8ths of the OpCode space, a far cry from "almost all";
but this is under the guise of a machine where GPRs and FPRs
coexist in the same file.
John said he had a 5-bit opcode.
He also said he wants separate integer and float register files
so that means separate LD, ST and FLD, FST.
For the data types I listed above, but NOT including the float pairs,
it would use opcodes for 8 LD, 5 ST, 5 FLD, 5 FST = 23 of 32 opcodes.
If I include some float pairs for complex fp32, fp64 and fp128 then
that uses up 29 or 32 opcodes. And that is just for loads and stores.
So yes, "almost all".
Either it:
(a) moves the type/size bits somewhere else (the offset field), as I
did,
(b) or drop support for some sizes and require an extra sign or zero
extend instruction to handle the others, as Alpha did.
By using 1 Major OpCode to access another 6-bit encoding space
(called XOP1) one then has another 6-bit encoding space where
the typical LDs and STs consume 3/8ths leaving room to encode
indexing, scaling, locking behavior, and ST #value,[address]
which then avoids constant pasting instructions and waste of
registers.
But you have variable length instructions. I would too.
John's premise assumes they are fixed 32-bits.
I'm running *that* scenario forward to see if we can get a better
result than the RISC-V folks got, where they need 6 instructions
and 24 bytes to do a LD or ST with 64 bit offset.
The CONST prefix instruction approach shows it can be done in
3 instructions of 12 bytes which are fused in Decode so require
no extra working register and no execute clocks for pasting.
On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:
The problem is that there are 7 integer data types,
signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,
and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
There might also be special (non-ieee) float formats for AI support.
Plus one might also want some register pair operations
(eg load a complex fp32 value into a pair of fp registers,
store a pair of integer registers as a single (atomic) int128).
Even I have refused to contend with all of this, at least for my basic
32-bit instruction set. Some exotic types that I do intend to support
will just have to make do with 48-bit or longer instructions instead.
But signed and unsigned integers aren't _quite_ the same as different
types for load and store. I may have separate integer and floating
registers, but I don't have separate signed and unsigned registers.
Instead, I've followed the System/360. When it comes to load and store,
for integers I have two additional operations - unsigned load and
insert. But only for integers shorter than the register.
Load sign extends. Unsigned Load zero extends. Insert leaves bits in the register preceding what is loaded untouched.
Since arithmetic is two's complement, there is only one add instruction,
and there is only one store instruction, for each length. If we were
really dealing with different types, we would need additional
instructions of those kinds as well.
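The three variants can be modeled directly, treating register and memory values as plain integers (a sketch of the semantics described above, not of any encoding):

```python
MASK64 = (1 << 64) - 1

def load(mem_val, nbytes):
    # Load: sign-extend the nbytes-wide memory value to the full register.
    bits = 8 * nbytes
    v = mem_val & ((1 << bits) - 1)
    if v & (1 << (bits - 1)):
        v -= 1 << bits
    return v & MASK64

def load_unsigned(mem_val, nbytes):
    # Unsigned Load: zero-extend.
    return mem_val & ((1 << (8 * nbytes)) - 1)

def insert(reg, mem_val, nbytes):
    # Insert: replace only the low nbytes, leave the rest of the register.
    mask = (1 << (8 * nbytes)) - 1
    return (reg & ~mask & MASK64) | (mem_val & mask)
```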
For floats, I deal with fp32, fp48, fp64, and fp128 only as the primary floating-point types.
John Savard
Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:
and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
There might also be special (non-ieee) float formats for AI support.
Plus one might also want some register pair operations
(eg load a complex fp32 value into a pair of fp registers,
store a pair of integer registers as a single (atomic) int128).
Store a pair of registers into two different memory locations
atomically is more powerful.
And more costly, given the potential need for two
TLB lookups (which could access or dirty fault (restartable))
and the potential cache coherency latency.
Holding exclusive access to two cache lines at once
as an atomic unit would complicate the coherency protocol,
particularly with respect to deadlock prevention, wouldn't it?
That's one reason multiword atomics are generally required
to be naturally aligned; to avoid crossing a cache-line or page
boundary.
I don't need it to be atomic for any alignment.
The spec for LD and ST register pair would say that IF the address is
16-byte aligned THEN the operation is guaranteed to be done atomically.
If the address is not aligned it may use two separate operations.
This is the same guarantee as 2, 4 and 8 byte LD or ST.
As to whether the register pair is specified as one field with
an implied increment or two separate fields, I have cases for both.
Once I started adding double-wide operate instructions I found
usages where assuming the register pairs were contiguous
(eg only even numbered registers) was too constraining.
It forces many extraneous MOV's to create the even numbered pairs.
On the other hand, having register pairs specified by two fields
quickly winds up with instructions that have 5 or 6 register fields
(2 dest and 2+1 source, or 2 dest and 2+2 source).
But this only affects Decode as the uOp formats require 6 fields.
On 6/11/2025 11:37 AM, MitchAlsup1 wrote:
On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:
------------------
LR: Functionally, in most ways the same as a GPR, but is assigned a
special role and is assumed to have that role. Pretty much no one uses
it as a base register though, with the partial exception of potential
JALR wonk.
JALR X0, X1, 16 //not technically disallowed...
If one uses the 'C' extension, assumptions about LR and SP are pretty
solidly baked in to the ISA design.
ZR: Always reads as 0, assignments are ignored; this behavior is very un-GPR-like.
GP: Similar situation to LR, as it mostly looks like a GPR.
In my CPU core and JX2VM, the high bits of GP were aliased to FPSR, so saving/restoring GP will also implicitly save/restore the dynamic
rounding mode and similar (as opposed to proper RISC-V which has this
stuff in a CSR).
Though, this isn't practically too much different from using the HOB's
of captured LR values to hold the CPU ISA mode and similar (which my
newer X3VM retains, though I am still on the fence about the "put FPSR
bits into HOBs of GP" thing).
Does mean that either dynamic rounding mode is lost every time a GP
reload is done (though, only for the callee), or that setting the
rounding mode also needs to update the corresponding PBO GP pointer
(which would effectively make it semi-global but tied to each PE image).
The traditional assumption though was that dynamic rounding mode is
fully global, and I had been trying to make it dynamically scoped.
So, it may be that having FPSR as its own thing, and then explicitly saving/restoring FPSR in functions that modify the rounding mode, may be
a better option.
Though, OTOH, Quake has stuff like:
typedef float vec3_t[3];
vec3_t v0, v1, v2;
...
VectorCopy(v0, v1);
Where VectorCopy is a macro that expands it out to something like, IIRC,
do { v1[0]=v0[0]; v1[1]=v0[1]; v1[2]=v0[2]; } while(0);
Where BGBCC will naively load each value, widen it to double, narrow it
back to float, and store the result.
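The wasted widen/narrow is at least harmless for a plain copy, since converting an fp32 value to fp64 and back is exact (fp64 can represent every fp32 value). A quick check using `struct` to round-trip raw bit patterns:

```python
import struct

def via_double(f32_bits):
    # Take a raw fp32 bit pattern, widen it to fp64 (a Python float),
    # narrow it back to fp32, and return the resulting bit pattern.
    x = struct.unpack('<f', struct.pack('<I', f32_bits))[0]  # fp32 -> fp64
    return struct.unpack('<I', struct.pack('<f', x))[0]      # fp64 -> fp32 bits
```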
I thought I saw a post in this thread which asked why I included VLIW capabilities in my ISA.
Perhaps that post was deleted, or I saw it in another thread and misremembered.
However, I thought it was worth a reply, in case anyone had forgotten
what VLIW was "good for".
Today's microprocessors achieve considerably improved performance
through the use of Out-of-Order Execution. Compare the 486, which
doesn't have it, to the Pentium II, which does have it. Intel's Atom processors originally did not have OoO in order to be small and
inexpensive, but their low performance, and smaller transistors making
more complex chips more easily possible led to even the Atom going OoO.
OoO comes with a cost, though. It increases transistor costs
considerably. Also, it comes with vulnerabilities like Spectre and
Meltdown.
VLIW, in the sense of the Itanium or the TMS 320C6000, offers the
promise of achieving OoO level performance without the costs of OoO.
This is because it lets the pipeline achieve high efficiency by directly indicating within the code itself when succeeding instructions may be executed in parallel, without requiring the computer to make the effort
of determining when this is possible.
Actually, it seems to me that for the first RISC generation, it made
the least sense. You could not afford the transistors to do special
handling of the PC.
Nowadays, you can afford it, but the question still is whether it is cost-effective. Looking at recent architectures, none has PC
addressable like a GPR, so no, it does not look to be cost-effective
to have the PC addressable like a GPR. AMD64 and ARM A64 have
PC-relative addressing modes, while RISC-V does not.
If fp128 is to be (someday) supported then the index scaling must be
at least 0..4 (> 2 bits). A scaling of 0..7 or 3 bits would support
single instruction array index calculations up to fp128 octonions.
The index scaling selects the octonion array element and the
displacement selects a coefficient in it.
On 6/12/2025 2:13 PM, MitchAlsup1 wrote:
GP: Similar situation to LR, as it mostly looks like a GPR.
In my CPU core and JX2VM, the high bits of GP were aliased to FPSR, so
saving/restoring GP will also implicitly save/restore the dynamic
rounding mode and similar (as opposed to proper RISC-V which has this
stuff in a CSR).
With universal constants, you get this register back.
Well, if using an ABI that either allows absolute addressing or PC-rel
access to globals.
The ABI designs I am using in BGBCC and TestKern use a global pointer
for accessing globals, and allocate the storage for ".data"/".bss"
separately from ".text". In this ABI design, the pointer is unavoidable.
Does allow multiple process instances in a single address space with non-duplicated ".text" though (and is more friendly towards NOMMU
operation).
Though, this isn't practically too much different from using the HOB's
of captured LR values to hold the CPU ISA mode and similar (which my
newer X3VM retains, though I am still on the fence about the "put FPSR
bits into HOBs of GP" thing).
Does mean that either dynamic rounding mode is lost every time a GP
reload is done (though, only for the callee), or that setting the
rounding mode also needs to update the corresponding PBO GP pointer
(which would effectively make it semi-global but tied to each PE image).
The traditional assumption though was that dynamic rounding mode is
fully global, and I had been trying to make it dynamically scoped.
The modern interpretation is that the dynamic rounding mode can be set
prior to any FP instruction. So, you better be able to set it rapidly
and without pipeline drain, and you need to mark the downstream FP
instructions as dependent on this.
Errm, there is likely to be a delay here, otherwise one will get a stale rounding mode.
So, setting the rounding mode might be something like:
MOV .L0, R14
MOVTT GP, 0x8001, GP //Set to rounding mode 1, clear flag bits
JMP R14 //use branch to flush pipeline
.L0: //updated FPSR now ready
FADDG R11, R12, R10 //FADD, dynamic mode
Or, use an encoding with an explicit (static) rounding mode:
FADD R11, R12, 1, R10
On Thu, 12 Jun 2025 15:44:14 +0000, Stephen Fuld wrote:
On 6/12/2025 8:00 AM, quadibloc wrote:
The IBM System/360 had a base and index register and a 12-bit
displacement.
Yes, but as I have argued before, this was a mistake, and in any event
base registers became obsolete when virtual memory became available
(though, of course, IBM kept it for backwards compatibility).
I hadn't thought about it that way.
It does make sense that on a timesharing system, virtual memory meant
that different users would not have to share the same memory space, so programs wouldn't have to be relocatable.
But if you drop base registers for that reason, suddenly you are forced
to always use virtual memory.
Of course, then why did the 68020 support it, I could ask.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
quadibloc <quadibloc@gmail.com> writes:
To look like RISC, and yet to have the code density of CISC
I have repeatedly posted measurements of code sizes of several
programs on various architectures, most recently in
<2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.
Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
packages:
bash grep gzip
595204 107636 46744 armhf
599832 101102 46898 riscv64
796501 144926 57729 amd64
829776 134784 56868 arm64
853892 152068 61124 i386
891128 158544 68500 armel
892688 168816 64664 s390x
1020720 170736 71088 mips64el
1168104 194900 83332 ppc64el
Seems to me that the text size is not the interesting
metric here - rather the typical working set size is
far more important.
It may be that text size isn't a particularly
good metric for judging instruction set effectiveness.
Compiler people H A T E pairing {LoB = 0 and 1} and sharing
{Rsecond = Rfirst+1}, they want to be able to allocate any
value into any register without such constraints. After all,
register allocation is already NP-complete; pairing and sharing
move the needle toward NP-hard.
OoO comes with a cost, though. It increases transistor costs
considerably. Also, it comes with vulnerabilities like Spectre and
Meltdown.
VLIW, in the sense of the Itanium
This is because it lets the pipeline achieve high efficiency by directly
indicating within the code itself when succeeding instructions may be
executed in parallel, without requiring the computer to make the effort
of determining when this is possible.
On 6/12/2025 8:09 PM, quadibloc wrote:
On Thu, 12 Jun 2025 15:44:14 +0000, Stephen Fuld wrote:
On 6/12/2025 8:00 AM, quadibloc wrote:
The IBM System/360 had a base and index register and a 12-bit
displacement.
Yes, but as I have argued before, this was a mistake, and in any event
base registers became obsolete when virtual memory became available
(though, of course, IBM kept it for backwards compatibility).
I hadn't thought about it that way.
It does make sense that on a timesharing system, virtual memory meant
that different users would not have to share the same memory space, so
programs wouldn't have to be relocatable.
But if you drop base registers for that reason, suddenly you are forced
to always use virtual memory.
No. Other systems in the S/360 time frame (i.e. before virtual memory)
used a system "base register", that was hidden from the user (but was in
its context), that was set by the OS when the program was loaded, or if
it was swapped out, when it was swapped in again. It was reloaded
whenever the program gained control of the CPU. Besides the advantage
of not requiring a user register for that purpose, it allowed a program
to be swapped in to a different memory address than it was swapped out
from, a feature the S/360 didn't enjoy.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
quadibloc <quadibloc@gmail.com> writes:
To look like RISC, and yet to have the code density of CISC
I have repeatedly posted measurements of code sizes of several
programs on various architectures, most recently in
<2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.
Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
packages:
bash grep gzip
595204 107636 46744 armhf
599832 101102 46898 riscv64
796501 144926 57729 amd64
829776 134784 56868 arm64
853892 152068 61124 i386
891128 158544 68500 armel
892688 168816 64664 s390x
1020720 170736 71088 mips64el
1168104 194900 83332 ppc64el
Seems to me that the text size is not the interesting
metric here - rather the typical working set size is
far more important.
Yes, something like it. But how do you measure it? And do you think
that the text size of binaries for different architectures are not
correlated to the working set sizes of these architectures?
It may be that text size isn't a particularly
good metric for judging instruction set effectiveness.
Why would it not be a good predictor, and what would you use instead?
On Thu, 12 Jun 2025 15:44:14 +0000, Stephen Fuld wrote:
On 6/12/2025 8:00 AM, quadibloc wrote:
On Wed, 11 Jun 2025 17:05:13 +0000, Thomas Koenig wrote:
What is the use case for having base and index register and a
16-bit displacement?
The IBM System/360 had a base and index register and a 12-bit
displacement.
Yes, but as I have argued before, this was a mistake, and in any event
base registers became obsolete when virtual memory became available
(though, of course, IBM kept it for backwards compatibility).
I have been thinking about this, and I don't think that base registers
only existed to allow program relocation in a crude form that virtual
memory superseded. They also existed simply to avoid having to have a displacement field large enough to address all of memory in every instruction.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
No. Other systems in the S/360 time frame (i.e. before virtual memory)
used a system "base register", that was hidden from the user (but was in
its context), that was set by the OS when the program was loaded, or if
it was swapped out, when it was swapped in again. It was reloaded
whenever the program gained control of the CPU. Besides the advantage
of not requiring a user register for that purpose, it allowed a program
to be swapped in to a different memory address than it was swapped out
from, a feature the S/360 didn't enjoy.
It was supposed to, but I believe that was one of the earliest
failures that they noticed, and should have realized before:
their memory to memory instructions did not have base+offset+index.
Also, how was storing a pointer to somewhere supposed to work
for swapping out/swapping in?
However, some comp.arch regulars seem to consider it quite important,
and they regularly make claims about the code density of various
instruction sets. I have started measuring the text size of programs
in order to provide empirical counterevidence to this wishful
thinking. This apparently has made little impression on those making
such claims, but maybe the rest of you will gain something from these
data.
Without data, it's all speculation. Given, however, that there
doesn't seem to be a rush to replace x86 or arm64 with
armhf or riscv64, I don't believe that the text size is
particularly interesting to the general user.
I'm not convinced that "instruction set effectiveness" is a
useful metric for modern systems.
It
has an impact on Icache, certainly, but can you quantify it
quadibloc <quadibloc@gmail.com> writes:
OoO comes with a cost, though. It increases transistor costs
considerably. Also, it comes with vulnerabilities like Spectre and
Meltdown.
No. AMD's OoO CPUs never had Meltdown AFAIK. As for Spectre, that
can be fixed (not mitigated) at moderate hardware and performance cost
with Invisible Speculation; I wrote an overview paper about Spectre
and how to fix it <http://www.euroforth.org/ef23/papers/ertl.pdf>, but
the actual Invisible Speculation research was done by others. It's
just that the hardware designers don't want to; apparently the
customers are not interested enough and prefer to pay the performance
and software development cost of Spectre mitigations (or are too
indifferent to care about it at all).
VLIW, in the sense of the Itanium
IA-64 is EPIC, not VLIW.
And IA-64 gives you Spectre, too, in a way that cannot be fixed
by Invisible Speculation, because the speculation is architectural,
and there is no explicit "commit" that turns speculation into
non-speculation.
- anton
On 6/12/2025 11:03 PM, Thomas Koenig wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
No. Other systems in the S/360 time frame (i.e. before virtual memory)
used a system "base register", that was hidden from the user (but was in
its context), that was set by the OS when the program was loaded, or if
it was swapped out, when it was swapped in again. It was reloaded
whenever the program gained control of the CPU. Besides the advantage
of not requiring a user register for that purpose, it allowed a program
to be swapped in to a different memory address than it was swapped out
from, a feature the S/360 didn't enjoy.
It was supposed to, but I believe that was one of the earliest
failures that they noticed, and should have realized before:
their memory to memory instructions did not have base+offset+index.
Also, how was storing a pointer to somewhere supposed to work
for swapping out/swapping in?
Since the "base" pointer was known to the OS, if the program was swapped
out and swapped back in to a different location in memory, the OS just changed the value in the base register to reflect the new location.
This didn't work in the S/360 because the OS didn't know what
register(s) the user program was using as base register(s) so it
couldn't change the values in them if the program was to be relocated.
I have been thinking about this, and I don't think that base registers
only existed to allow program relocation in a crude form that virtual
memory superseded. They also existed simply to avoid having to have a displacement field large enough to address all of memory in every instruction.
In fact, I think this was the primary reason, and using
them to relocate code and data was a nice idea that came after.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 6/12/2025 11:03 PM, Thomas Koenig wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
No. Other systems in the S/360 time frame (i.e. before virtual memory)
used a system "base register", that was hidden from the user (but was in
its context), that was set by the OS when the program was loaded, or if
it was swapped out, when it was swapped in again. It was reloaded
whenever the program gained control of the CPU. Besides the advantage
of not requiring a user register for that purpose, it allowed a program
to be swapped in to a different memory address than it was swapped out
from, a feature the S/360 didn't enjoy.
It was supposed to, but I belive that was one of the earliest
failures that they noticed, and should have realized before:
their memory to memory instructions did not have base+offset+index.
Also, how was storing a pointer to somewhere supposed to work
for swapping out/swapping in?
Since the "base" pointer was known to the OS, if the program was swapped
out and swapped back in to a different location in memory, the OS just
changed the value in the base register to reflect the new location.
Suppose you're passing an argument from a COMMON block, a common
occurrence back then (pun intended).
SUBROUTINE FOO
REAL A,B
COMMON /COM/ A,B
REAL C
CALL BAR(B)
C ....
END
SUBROUTINE BAR(A)
REAL A
A = A + 1
END
What should FOO pass to BAR? A straight pointer to B is the
obvious choice, but there is no base register in sight that the OS
can know about.
This didn't work in the S/360 because the OS didn't know what
register(s) the user program was using as base register(s) so it
couldn't change the values in them if the program was to be relocated.
Even with that provision, it would not have worked.
quadibloc <quadibloc@gmail.com> schrieb:
I have been thinking about this, and I don't think that base registers
only existed to allow program relocation in a crude form that virtual
memory superseded. They also existed simply to avoid having to have a
displacement field large enough to address all of memory in every
instruction.
Registers do that, and you only need a single one.
Even with that provision, it would not have worked.
All I can say is that it worked in several other contemporaneous architectures. Scott gave one example, the Burroughs Medium systems.
The Univac 1108 and its follow-ons are another. There may be others, perhaps
some of the CDC systems, as they and the Univac systems shared a common
designer (Seymour Cray).
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
Even with that provision, it would not have worked.
All I can say is that it worked in several other contemporaneous
architectures. Scott gave one example, the Burroughs Medium systems.
The Univac 1108 and its follow-ons are another. There may be others, perhaps
some of the CDC systems, as they and the Univac systems shared a common
designer (Seymour Cray).
The problem of the /360 was that they put their base registers in
user space.
The other machines made it invisible from user space
and added its contents to every memory access. This does not take
up opcode space and allows swapping in and out of the whole process,
which was a much better solution for the early 1960s.
(Actually, I
believe the UNIVAC had two, one for program and one for data).
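A toy model of the hidden-base scheme, assuming (as described above) that every program-visible address is relative and the hardware adds the invisible base on each access; class and method names are mine. Because stored "pointers" are all relative, the OS relocates the program by moving the image and changing one register, with no pointer fix-up:

```python
class HiddenBaseMachine:
    def __init__(self, physical_size):
        self.mem = [0] * physical_size
        self.base = 0   # invisible to user code, set by the OS

    def load(self, addr):            # user code sees only relative addresses
        return self.mem[self.base + addr]

    def store(self, addr, val):
        self.mem[self.base + addr] = val

    def os_relocate(self, new_base, size=16):
        # Swap the program image to a new physical location; stored
        # pointers (all relative) remain valid without fix-up.
        self.mem[new_base:new_base + size] = self.mem[self.base:self.base + size]
        self.base = new_base
```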
On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
On 6/13/2025 4:52 AM, quadibloc wrote:
I have been thinking about this, and I don't think that base registers
only existed to allow program relocation in a crude form that virtual
memory superseded. They also existed simply to avoid having to have a
displacement field large enough to address all of memory in every
instruction.
No. If you wanted to address larger than the displacement field, you
still had the index register. And remember that the need for that is
reduced because you could have a 16 bit displacement by using the four
bits free'd up by eliminating the base register field.
Certainly, you can use the index register to address an area larger than
the displacement field. Otherwise, RISC CPUs wouldn't work. However,
then if you want to do an array access in that wider range, once more
you need extra instructions to calculate the index value.
And "reduced" is not the same as "completely eliminated", and so I fail
to see how that makes base registers unnecessary.
On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 6/12/2025 11:03 PM, Thomas Koenig wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
The term "base register" is being used in different ways in this thread.
a) We have the base register of base-and-bounds program relocation
{that the user is not allowed to see or know}
b) we have a pointing register (/360) that can have indexing and
offsetting applied to form a virtual address.
Since base-and-bounds slipped into history circa 1980 after even
On Thu, 12 Jun 2025 19:01:52 +0000, MitchAlsup1 wrote:
What code is produced from::
uint32_t function( uint32_t u )
{
int32_t i[99];
return i[u];
}
That wouldn't even compile. The array i is not initialized.
However, I'll assume that this is a fragment of a larger program.
You've stated that this is for a 64-bit machine.
So it takes an index variable as an argument, and returns an element
from an array.
The array is declared as signed 32 bit integers, but the function
returns an unsigned 32 bit integer.
Well, the answer is that it doesn't matter if I use a "load" or an
"unsigned load", since what the function returns is a pointer to a *32-bit-long* value in memory. Which the calling program will interpret
as unsigned.
Or maybe the function return is in register zero. In that case, I will
indeed generate a "load" rather than an "unsigned load" inside the
program. The caller, however, will presumably extract the least
significant bits of that register into a 32-bit long variable before
use, so my "error" will not have disastrous consequences.
John Savard
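The signedness point above can be checked with a small C sketch (illustrative only): sign- and zero-extension leave different upper bits in a 64-bit register, but the low 32 bits the caller extracts are identical either way.

```c
#include <stdint.h>

/* Model of a signed load: the 32-bit value is sign-extended to 64 bits. */
uint64_t load_signed(int32_t v)   { return (uint64_t)(int64_t)v; }

/* Model of an unsigned load: the value is zero-extended instead. */
uint64_t load_unsigned(int32_t v) { return (uint64_t)(uint32_t)v; }

/* The caller truncates the register to 32 bits before use. */
uint32_t caller_view(uint64_t reg) { return (uint32_t)reg; }
```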
On 6/13/2025 12:50 PM, quadibloc wrote:
On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
On 6/13/2025 4:52 AM, quadibloc wrote:
I have been thinking about this, and I don't think that base registers only existed to allow program relocation in a crude form that virtual
memory superseded. They also existed simply to avoid having to have a
displacement field large enough to address all of memory in every
instruction.
No. If you wanted to address larger than the displacement field, you
still had the index register. And remember that the need for that is
reduced because you could have a 16 bit displacement by using the four
bits freed up by eliminating the base register field.
Certainly, you can use the index register to address an area larger than
the displacement field. Otherwise, RISC CPUs wouldn't work. However,
then if you want to do an array access in that wider range, once more
you need extra instructions to calculate the index value.
And "reduced" is not the same as "completely eliminated", and so I fail
to see how that makes base registers unnecessary.
It's all a tradeoff. Yes, occasionally you need an extra instruction,
but you gain four bits for a larger displacement (or something else if
you want). And don't forget, you need the "extra" BALR instructions, or
other ones to load the base register for every 4K chunk of data or instructions, and the loss of an otherwise available register.
Everyone else who has evaluated the tradeoff chose not to use the extra register.
On Fri, 13 Jun 2025 20:50:24 +0000, Stephen Fuld wrote:
On 6/13/2025 12:50 PM, quadibloc wrote:
On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
On 6/13/2025 4:52 AM, quadibloc wrote:
I have been thinking about this, and I don't think that base registers only existed to allow program relocation in a crude form that virtual memory superseded. They also existed simply to avoid having to have a displacement field large enough to address all of memory in every
instruction.
No. If you wanted to address larger than the displacement field, you still had the index register. And remember that the need for that is reduced because you could have a 16-bit displacement by using the four bits freed up by eliminating the base register field.
Certainly, you can use the index register to address an area larger than the displacement field. Otherwise, RISC CPUs wouldn't work. However,
then if you want to do an array access in that wider range, once more
you need extra instructions to calculate the index value.
And "reduced" is not the same as "completely eliminated", and so I fail
to see how that makes base registers unnecessary.
It's all a tradeoff. Yes, occasionally you need an extra instruction,
but you gain four bits for a larger displacement (or something else if
you want). And don't forget, you need the "extra" BALR instructions, or
other ones to load the base register for every 4K chunk of data or
instructions, and the loss of an otherwise available register.
Everyone else who has evaluated the tradeoff chose not to use the extra
register.
Can you restate what you intended to mean in the last sentence/paragraph
but use different words ??
Certainly the /360 designers, the VAX designers, the x86 designers, and others looked at the problem and allowed [Rpointer+Rindex+displacement] addressing. So, it is not everyone.
As we have discussed, the S/360 designers needed some mechanism to
allow a program to be loaded at an arbitrary location in memory.
As for the X86, I freely confess to not knowing the constraints its
designers were operating under, so I can't really comment.
On 6/12/2025 10:00 AM, quadibloc wrote:
On Wed, 11 Jun 2025 17:05:13 +0000, Thomas Koenig wrote:
What is the use case for having base and index register and a
16-bit displacement?
The IBM System/360 had a base and index register and a 12-bit
displacement.
Most microprocessors have a base register and a 16-bit displacement.
Serious overkill...
For [Rb+Disp] with a 32-bit encoding, 9 or 10 bits is sufficient if scaled
by the element size, 12 otherwise.
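The arithmetic behind that claim, as a quick sketch: scaling the displacement by the element size multiplies its reach, so a 10-bit field scaled by 4 covers the same 4 KiB as an unscaled 12-bit field (for aligned 32-bit elements).

```c
/* Reach in bytes of a displacement field of `bits` bits when the
   hardware scales it by `scale` (the element size). */
unsigned reach(unsigned bits, unsigned scale) {
    return (1u << bits) * scale;
}
```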
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
As we have discussed, the S/360 designers needed some mechanism to
allow a program to be loaded at an arbitrary location in memory.
Did they? Why?
On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:
Suppose you're passing an argument from a COMMON block, a common
occurrence back then (pun intended).
SUBROUTINE FOO
REAL A,B
COMMON /COM/ A,B
REAL C
CALL BAR(B)
C ....
END
SUBROUTINE BAR(A)
REAL A
A = A + 1
END
What should FOO pass to BAR? A straight pointer to B is the
obvious choice, but there is no base register in sight that the OS
can know about.
FOO passes a straight 32-bit pointer to B to BAR, using a load address instruction to calculate the effective address.
BAR then uses an instruction which chooses a register with that pointer placed in it as its base register to get at the value.
No attempt is made to pass addresses in the short base plus offset form between routines, because they knew even then that it would never work.
At least not when subroutines are *compiled separately*, which was the
normal practice with System/360 FORTRAN.
IA-32 has both the segmentation
mechanism inherited from the 80286 (and extended to 32 bit segments)
and paging, so using the addressing modes for any form of virtual
memory was not the intention for providing this addressing mode.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
As we have discussed, the S/360 designers needed some mechanism to
allow a program to be loaded at an arbitrary location in memory.
Did they? Why?
"Architecture of the IBM System/360" by Amdahl, Blaauw and
Brooks
# Now the question was: How much capacity was to be made directly
# addressable, and how much addressable only via base registers? Some
# early uses of base register techniques had been fairly unsuccessful,
# principally because of awkward transitions between direct and
# base addressing. It was decided to commit the system completely
# to a base-register technique; the direct part of the address,
# the displacement, was made so small (12 bits, or 4096 characters)
# that direct addressing is a practical programming technique only
# on very small models.
# This commitment implies that all programs
# are location-independent, except for constants used to load the
# base registers.
That they got wrong, egregiously so, as the example with passing
a pointer to something from a COMMON block shows.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
For the 8086 architecture, the effective addresses are reg, reg+const
or reg+reg (with severe restrictions on the registers usable for that;
the 8086 does not have GPRs).
quadibloc <quadibloc@gmail.com> schrieb:
On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:
Suppose you're passing an argument from a COMMON block, a common
occurrence back then (pun intended).
SUBROUTINE FOO
REAL A,B
COMMON /COM/ A,B
REAL C
CALL BAR(B)
C ....
END
SUBROUTINE BAR(A)
REAL A
A = A + 1
END
What should FOO pass to BAR? A straight pointer to B is the
obvious choice, but there is no base register in sight that the OS
can know about.
FOO passes a straight 32-bit pointer to B to BAR, using a load address
instruction to calculate the effective address.
BAR then uses an instruction which chooses a register with that pointer
placed in it as its base register to get at the value.
No attempt is made to pass addresses in the short base plus offset form
between routines, because they knew even then that it would never work.
At least not when subroutines are *compiled separately*, which was the
normal practice with System/360 FORTRAN.
Correct.
Which made nonsense the concept of making data relocatable by
always using base registers.
I think the IA-64 has a lot of interesting features.
It looks like a
processor that was designed a while ago, before the current batch of superscalar machines became popular.
If the register file were used as a flat register file, instead
of one that rotates the registers it might be simpler to use.
I have been working with m68k code recently. The issues with it become apparent when looking at the output of compiled code. A lot of memory-to-memory moves. I see that it has great code density, but I wonder how
that correlates to performance, given all the memory ops. A RISC
architecture may have 30% worse code density, but it might run 5x as fast.
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
As we have discussed, the S/360 designers needed some mechanism to
allow a program to be loaded at an arbitrary location in memory.
Did they? Why?
"Architecture of the IBM System/360" by Amdahl, Blaauw and
Brooks
# Now the question was: How much capacity was to be made directly
# addressable, and how much addressable only via base registers? Some
# early uses of base register techniques had been fairly unsuccessful,
# principally because of awkward transitions between direct and
# base addressing. It was decided to commit the system completely
# to a base-register technique; the direct part of the address,
# the displacement, was made so small (12 bits, or 4096 characters)
# that direct addressing is a practical programming technique only
# on very small models.
Up to that point I was thinking of MIPS, Alpha, and RISC-V with their reg+const addressing, and I thought: Ok, these machines actually
support absolute (aka direct) addressing by using the zero register as
reg. But of course nobody ever uses the zero register for addressing,
and absolute addressing is not used.
# This commitment implies that all programs
# are location-independent, except for constants used to load the
# base registers.
And here it becomes obvious that they had a completely different usage
in mind than what these addressing modes are used for on s390x. And I
guess that already on S/370 and probably even on S/360 they were
usually not used as this sentence suggests: load constants in some
registers at the start, never change them, and use only those
registers as base registers.
On 6/14/2025 8:40 AM, Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
As we have discussed, the S/360 designers needed some mechanism to
allow a program to be loaded at an arbitrary location in memory.
Did they? Why?
"Architecture of the IBM System/ 360" by Amdahl, Blaauw and
Brooks
# Now the question was: How much capacity was to be made directly
# addressable, and how much addressable only via base registers? Some
# early uses of base register techniques had been fairly unsuccessful,
# principally because of awkward transitions between direct and
# base addressing. It was decided to commit the system completely
# to a base-register technique; the direct part of the address,
# the displacement, was made so small (12 bits, or 4096 characters)
# that direct addressing is a practical programming technique only
# on very small models.
Up to that point I was thinking of MIPS, Alpha, and RISC-V with their
reg+const addressing, and I thought: Ok, these machines actually
support absolute (aka direct) addressing by using the zero register as
reg. But of course nobody ever uses the zero register for addressing,
and absolute addressing is not used.
# This commitment implies that all programs
# are location-independent, except for constants used to load the
# base registers.
And here it becomes obvious that they had a completely different usage
in mind than what these addressing modes are used for on s390x. And I
guess that already on S/370 and probably even on S/360 they were
usually not used as this sentence suggests: load constants in some
registers at the start, never change them, and use only those
registers as base registers.
On S/360, that is exactly what you did. The first instruction in an assembler program was typically BALR (Branch And Link Register), which
is essentially a subroutine call. You would BALR to the next
instruction which put that instruction's address into a register. The
next thing was an assembler directive USING with the register number
as an argument. This told the assembler to use that register, which now contained the address of the first "useful" instruction as the base
register in future instructions.
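A toy C model of what BALR + USING buy (this is not S/360 code; names and sizes are mine): every access is base-relative, so the same displacement reaches the same datum no matter where the image was loaded.

```c
#include <stdint.h>

static uint8_t memory[1 << 16];   /* toy flat "real memory" */

/* Store and load through a base register plus a 12-bit displacement,
   as an S/360 RX-format access does. */
void store_b_d(uint32_t base, uint32_t disp12, uint8_t v) {
    memory[base + (disp12 & 0xFFF)] = v;
}
uint8_t load_b_d(uint32_t base, uint32_t disp12) {
    return memory[base + (disp12 & 0xFFF)];
}
```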
Thomas Koenig <tkoenig@netcologne.de> writes:
That they got wrong, egregiously so, as the example with passing
a pointer to something from a COMMON block shows.
I missed or did not understand that example. What's the issue?
On Sat, 14 Jun 2025 16:04:42 +0000, quadibloc wrote:
so let's go to "Annie Get Your Gun" for the other song... "Anything You
Can Do".
Although, in my case, it's more like anything almost any other computer
can do, Concertina II can do _almost_ as well, rather than better. Its
level of versatility means that it loses a little in code density.
So an all-out implementation would presumably have a lot of cache in
addition to a lot of pins, to support a wide data path. Unfortunately,
while chips can put their floating-point ALU to sleep during integer
code, there's probably no practical way to put OoO circuitry to sleep
during VLIW code, because it's too intimately tied into everything - but maybe one could have two control units sharing the same ALUs so that
this could be managed.
But then, today's microprocessors have thousands of pins, and yet they
don't have enormously wide data paths. Apparently their control
interfaces had to get way more complex than, say, what worked back in
the Socket 7 days.
John Savard
On 6/13/2025 3:01 PM, MitchAlsup1 wrote:
On Fri, 13 Jun 2025 20:50:24 +0000, Stephen Fuld wrote:
--------------------
The VAX was in a different situation. Being a virtual memory design,
they didn't need it for the reason that the S/360 did. I am not an
expert, but ISTM that the VAX designers wanted to include almost
anything in the ISA to close the "semantic gap", and certainly didn't
feel constrained to keep instructions within 32 bits, so adding the 3
input address calculation, with potentially large offsets seemed
reasonable to them. For various reasons, this all proved not to be a
good choice eventually.
As for the X86, I freely confess to not knowing the constraints its
designers were operating under, so I can't really comment.
There is no X86.
On 6/14/2025 8:40 AM, Anton Ertl wrote:...
Thomas Koenig <tkoenig@netcologne.de> writes:
"Architecture of the IBM System/360" by Amdahl, Blaauw and
Brooks
# This commitment implies that all programs
# are location-independent, except for constants used to load the
# base registers.
And here it becomes obvious that they had a completely different usage
in mind than what these addressing modes are used for on s390x. And I
guess that already on S/370 and probably even on S/360 they were
usually not used as this sentence suggests: load constants in some
registers at the start, never change them, and use only those
registers as base registers.
On S/360, that is exactly what you did. The first instruction in an assembler program was typically BALR (Branch And Link Register), which
is essentially a subroutine call. You would BALR to the next
instruction which put that instruction's address into a register.
The
next thing was an assembler directive "Using" with the register number
as an argument. This told the assembler to use that register, which now contained the address of the first "useful" instruction as the base
register in future instructions. This allowed the OS to load the
program to any address in real memory, thus to have more than one
program resident in real memory at the same time and the CPU could
switch among them. By the time virtual memory came along with the S/370
(and OK, the 360/67) this was, of course, no longer needed, but it was
kept for upward compatibility.
On Sat, 14 Jun 2025 8:35:31 +0000, Robert Finch wrote:
Packing and unpacking decimal floats can be done inexpensively and fast
relative to the size, speed of the decimal float operations. For my own
implementation I just unpack and repack for all ops and then registers
do not need any more than 128-bits.
I also unpack the hidden first bit on IEEE-754 floats.
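Restoring the hidden bit is indeed cheap; here is a minimal C sketch for IEEE-754 single precision (the struct and field names are my own, not from any particular implementation):

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t sign, exp, frac; } Unpacked;

/* Split a float into sign/exponent/fraction fields and make the hidden
   leading 1 explicit for normal numbers (zero and subnormals have no
   hidden bit, so they are left as-is). */
Unpacked unpack(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    Unpacked u = { bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF };
    if (u.exp != 0)
        u.frac |= 1u << 23;   /* the hidden first bit, unpacked */
    return u;
}
```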
On 6/14/2025 3:48 AM, Thomas Koenig wrote:
quadibloc <quadibloc@gmail.com> schrieb:
On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:
Suppose you're passing an argument from a COMMON block, a common
occurrence back then (pun intended).
SUBROUTINE FOO
REAL A,B
COMMON /COM/ A,B
REAL C
CALL BAR(B)
C ....
END
SUBROUTINE BAR(A)
REAL A
A = A + 1
END
What should FOO pass to BAR? A straight pointer to B is the
obvious choice, but there is no base register in sight that the OS
can know about.
FOO passes a straight 32-bit pointer to B to BAR, using a load address
instruction to calculate the effective address.
BAR then uses an instruction which chooses a register with that pointer
placed in it as its base register to get at the value.
No attempt is made to pass addresses in the short base plus offset form
between routines, because they knew even then that it would never work.
At least not when subroutines are *compiled separately*, which was the
normal practice with System/360 FORTRAN.
Correct.
Which made nonsense the concept of making data relocatable by
always using base registers.
Forgive me, but I don't see why. When the program is linked, the COMMON block is at some fixed displacement from the start of the program. So
the program can "compute" the real address of the data in common blocks
from the address in its base register.
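In C terms (a sketch with invented names), that computation is just the base register plus the link-time offset, so it yields the right address for any load location:

```c
#include <stdint.h>

/* The linker fixed the COMMON block's offset from the program origin;
   at run time its address is the program's base plus that offset,
   wherever the program was loaded. */
uintptr_t common_addr(uintptr_t load_base, uintptr_t link_offset) {
    return load_base + link_offset;
}
```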
quadibloc <quadibloc@gmail.com> schrieb:
On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
On 6/13/2025 4:52 AM, quadibloc wrote:
I have been thinking about this, and I don't think that base registers >>>> only existed to allow program relocation in a crude form that virtual
memory superseded. They also existed simply to avoid having to have a
displacement field large enough to address all of memory in every
instruction.
No. If you wanted to address larger than the displacement field, you
still had the index register. And remember that the need for that is
reduced because you could have a 16 bit displacement by using the four
bits freed up by eliminating the base register field.
Certainly, you can use the index register to address an area larger than
the displacement field. Otherwise, RISC CPUs wouldn't work. However,
then if you want to do an array access in that wider range, once more
you need extra instructions to calculate the index value.
There is _nothing_ wrong with having base + (scaled) index register instructions. That is just 15 bits for three registers (assuming
32 GPRs), which leaves ample space for opcodes, scaling and
maybe, if you feel so inclined, a small offset.
There is _everything_ wrong with mandating a base register for
every load/store operation, and trying to cram in a large offset
as well.
If you want to step through arrays, you can also use something
like POWER's "load or store with update". ldu puts the effective
address of the memory instruction into the address register,
so you can use that with arbitrary step sizes.
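A C model of the update form (illustrative; POWER's ldu does this in one instruction): the load returns the datum and writes the stepped effective address back into the base register, so no separate add is needed per iteration.

```c
#include <stddef.h>

/* Load the value at *base, then advance the "address register" by an
   arbitrary byte step, as a load-with-update instruction does. */
long load_update(long **base, ptrdiff_t step_bytes) {
    long v = **base;
    *base = (long *)((char *)*base + step_bytes);
    return v;
}
```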
Stephen Fuld wrote:
On 6/14/2025 3:48 AM, Thomas Koenig wrote:
quadibloc <quadibloc@gmail.com> schrieb:
On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:
Suppose you're passing an argument from a COMMON block, a common
occurrence back then (pun intended).
      SUBROUTINE FOO
      REAL A,B
      COMMON /COM/ A,B
      REAL C
      CALL BAR(B)
C ....
      END
      SUBROUTINE BAR(A)
      REAL A
      A = A + 1
      END
What should FOO pass to BAR? A straight pointer to B is the
obvious choice, but there is no base register in sight that the OS
can know about.
FOO passes a straight 32-bit pointer to B to BAR, using a load address instruction to calculate the effective address.
BAR then uses an instruction which chooses a register with that pointer placed in it as its base register to get at the value.
No attempt is made to pass addresses in the short base plus offset form between routines, because they knew even then that it would never work. At least not when subroutines are *compiled separately*, which was the normal practice with System/360 FORTRAN.
Correct.
Which made nonsense the concept of making data relocatable by
always using base registers.
Forgive me, but I don't see why. When the program is linked, the
COMMON block is at some fixed displacement from the start of the
program. So the program can "compute" the real address of the data in
common blocks from the address in its base register.
If the program was relocated after the call to BAR but before using
the reference to access argument A then it reads the wrong location.
On 6/14/2025 11:51 AM, Thomas Koenig wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 6/14/2025 3:48 AM, Thomas Koenig wrote:
quadibloc <quadibloc@gmail.com> schrieb:
On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:
Suppose you're passing an argument from a COMMON block, a common
occurrence back then (pun intended).
SUBROUTINE FOO
REAL A,B
COMMON /COM/ A,B
REAL C
CALL BAR(B)
C ....
END
SUBROUTINE BAR(A)
REAL A
A = A + 1
END
What should FOO pass to BAR? A straight pointer to B is the
obvious choice, but there is no base register in sight that the OS
can know about.
FOO passes a straight 32-bit pointer to B to BAR, using a load address instruction to calculate the effective address.
BAR then uses an instruction which chooses a register with that pointer placed in it as its base register to get at the value.
No attempt is made to pass addresses in the short base plus offset form between routines, because they knew even then that it would never work. At least not when subroutines are *compiled separately*, which was the normal practice with System/360 FORTRAN.
Correct.
Which made nonsense the concept of making data relocatable by
always using base registers.
Forgive me, but I don't see why. When the program is linked, the COMMON block is at some fixed displacement from the start of the program. So
the program can "compute" the real address of the data in common blocks
from the address in its base register.
Guaranteed, with a 12-bit offset?
First let me say that I may have misinterpreted your recent comments.
The visible base register mechanism IBM chose prevents any relocation of
the program once it is first loaded.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
On 6/14/2025 3:48 AM, Thomas Koenig wrote:
quadibloc <quadibloc@gmail.com> schrieb:
On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:
Suppose you're passing an argument from a COMMON block, a common
occurrence back then (pun intended).
SUBROUTINE FOO
REAL A,B
COMMON /COM/ A,B
REAL C
CALL BAR(B)
C ....
END
SUBROUTINE BAR(A)
REAL A
A = A + 1
END
What should FOO pass to BAR? A straight pointer to B is the
obvious choice, but there is no base register in sight that the OS
can know about.
FOO passes a straight 32-bit pointer to B to BAR, using a load address instruction to calculate the effective address.
BAR then uses an instruction which chooses a register with that pointer placed in it as its base register to get at the value.
No attempt is made to pass addresses in the short base plus offset form between routines, because they knew even then that it would never work. At least not when subroutines are *compiled separately*, which was the normal practice with System/360 FORTRAN.
Correct.
Which made nonsense the concept of making data relocatable by
always using base registers.
Forgive me, but I don't see why. When the program is linked, the COMMON
block is at some fixed displacement from the start of the program. So
the program can "compute" the real address of the data in common blocks
from the address in its base register.
Guaranteed, with a 12-bit offset?
On Sat, 14 Jun 2025 19:23:59 +0000, Stephen Fuld wrote:
That is precisely my point. The mechanism that IBM chose effectively
*prevents* program relocation. That is why I believe it was a mistake
to choose that mechanism.
It prevents relocation of programs currently in use that are already in memory.
It facilitates loading programs from object files on disk into any
desired part of memory, which is the usual meaning of "program
relocation" among System/360 programmers, perhaps because they had no
other type of it available.
Implementing the 360 architecture with the addition of a base and bounds mechanism instead of full-blown virtual memory was perfectly possible. However, the System/360 was originally conceived as a computer for use
in batch processing.
Hence, TSS/360 was a kludge and ran slowly, and it
took the 360/67 with special hardware to facilitate timesharing for IBM
to have something that addressed that function effectively.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 6/14/2025 8:40 AM, Anton Ertl wrote:...
Thomas Koenig <tkoenig@netcologne.de> writes:
"Architecture of the IBM System/360" by Amdahl, Blaauw and
Brooks
# This commitment implies that all programs
# are location-independent, except for constants used to load the
# base registers.
And here it becomes obvious that they had a completely different usage
in mind than what these addressing modes are used for on s390x. And I
guess that already on S/370 and probably even on S/360 they were
usually not used as this sentence suggests: load constants in some
registers at the start, never change them, and use only those
registers as base registers.
On S/360, that is exactly what you did. The first instruction in an assembler program was typically BALR (Branch And Link Register), which
is essentially a subroutine call. You would BALR to the next
instruction which put that instruction's address into a register.
That's not loading once and leaving it alone, but yes, that can work,
too, as shown in modern dynamic-linking ABIs.
The
next thing was an assembler directive "Using" with the register number
as an argument. This told the assembler to use that register, which now contained the address of the first "useful" instruction as the base register in future instructions. This allowed the OS to load the
program to any address in real memory, thus to have more than one
program resident in real memory at the same time and the CPU could
switch among them. By the time virtual memory came along with the S/370 (and OK, the 360/67) this was, of course, no longer needed, but it was
kept for upward compatibility.
An interesting development is that, e.g., on Ultrix on DecStations
programs were statically linked for a specific address. Then dynamic
linking became fashionable; on Linux at first dynamically-linked
On Fri, 13 Jun 2025 18:11:18 +0000, Thomas Koenig wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
Even with that provision, it would not have worked.
All I can say is that it worked in several other contemporaneous
architectures. Scott gave one example, the Burroughs Medium systems.
The Univac 1108 and follow-ons is another. There may be others, perhaps some of the CDC systems, as they and the Univac systems shared a common
designer (Seymour Cray).
The problem of the /360 was that they put their base registers in
user space.
A base register not part of GPRs is a descriptor (or segment).
And we don't want to go there.
The other machines made it invisible from user space and added its contents to every memory access. This does not take
up opcode space and allows swapping in and out of the whole process,
It also fails when one has 137 different things to track with those descriptors.
which was a much better solution for the early 1960s. (Actually, I
believe the UNIVAC had two, one for program and one for data).
Still insufficient for modern use cases.
In any case, it's no problem to add a virtual-memory mechanism that is
not visible to user-level, or maybe even kernel-level (does the
original S/360 have that?) programs, whether it's paged virtual memory
or a simple base+range mechanism.
On Sat, 14 Jun 2025 21:26:10 +0000, quadibloc wrote:
On Sat, 14 Jun 2025 19:23:59 +0000, Stephen Fuld wrote:
That is precisely my point. The mechanism that IBM chose effectively
*prevents* program relocation. That is why I believe it was a mistake
to choose that mechanism.
It prevents relocation of programs currently in use that are already in
memory.
Actually, to be more precise, it prevents doing this _in a manner that
is fully transparent to the programmer_.
So IBM could have created a time-sharing operating system that ran on
models of the System/360 other than the model 67 with its Dynamic
Address Translation hardware as follows:
Require that programs only use one set of static base registers for
their entire run;
Require that programs describe the base registers they use in a standard header;
Require that programs set a flag when they have finished initializing
those base registers (and do so very quickly after being started).
If those conditions are met, then a program in memory can indeed be
moved to somewhere else in memory, as the operating system will know
which base registers to adjust.
Well, sort of. Such programs would not be able to use flat addresses to
pass pointers between routines, because they would not be valid between relocations. A workaround for this issue may be possible, requiring
changes to calling conventions; for example, all routines in a program
might need to share a common area for data values, and always use the
same base register to point to it.
So you would have special time-sharing versions of all the compilers.
On Sun, 15 Jun 2025 4:04:04 +0000, quadibloc wrote:
I have found out that I was mistaken in my earlier posting.
TSS/360 may have been a slow, inefficient, and poorly received
time-sharing operating system for the System/360 by IBM.
However, it only ran on the System/360 Model 67, and so it did *not*
attempt the kind of kludge I described as a desperate way of working
without the availability of address translation. Its poor performance
must have been the result of other causes.
IBM also had something called TSO, for Time-Sharing Option, and that did
run on System/360 models other than the Model 67, and so IBM may
actually have used the kind of kludge I had described after all.
On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:
The project started in earnest in 1994, when IBM had sold OoO
mainframes for several years and the Pentium Pro and HP PA 8000 were
pretty far along the way (both were released in November 1995), so
it's not as if the two involved companies did not have in-house
knowledge of the capabilities of OoO. They apparently chose to ignore
it.
Surely that is an unfair characterization.
After all, as Ivan Godard has reminded us on several occasions, out of
order execution has a very large cost in transistors. So, while it is a
way of achieving high performance, it comes at a cost both in die size
and in power consumption.
If the same benefits could be obtained through VLIW techniques without
those costs - but with an overhead cost of extra bits in the
instructions - that would be a very promising technology. So their
problem wasn't that they forgot what they knew about OoO, but rather
perhaps that their knowledge of the limitations of VLIW was
insufficient.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
An interesting development is that, e.g., on Ultrix on DecStations
programs were statically linked for a specific address. Then dynamic
linking became fashionable; on Linux at first dynamically-linked
s/linux/svr3/. It was SVR3 unix that first had static libraries linked
at a specific address.
Care was required to ensure that libraries which
were used in the same application were statically linked at unique
and non-overlapping addresses. Difficult when you only had half
the 386 linear address space available.
quadibloc <quadibloc@gmail.com> schrieb:
On Sun, 15 Jun 2025 4:04:04 +0000, quadibloc wrote:
I have found out that I was mistaken in my earlier posting.
TSS/360 may have been a slow, inefficient, and poorly received
time-sharing operating system for the System/360 by IBM.
However, it only ran on the System/360 Model 67, and so it did *not*
attempt the kind of kludge I described as a desperate way of working
without the availability of address translation. Its poor performance
must have been the result of other causes.
IBM also had something called TSO, for Time-Sharing Option, and that did
run on System/360 models other than the Model 67, and so IBM may
actually have used the kind of kludge I had described after all.
IIRC, TSO came later.
On Sat, 14 Jun 2025 21:26:10 +0000, quadibloc wrote:
On Sat, 14 Jun 2025 19:23:59 +0000, Stephen Fuld wrote:
That is precisely my point. The mechanism that IBM chose effectively
*prevents* program relocation. That is why I believe it was a mistake
to choose that mechanism.
It prevents relocation of programs currently in use that are already in
memory.
Actually, to be more precise, it prevents doing this _in a manner that
is fully transparent to the programmer_.
So IBM could have created a time-sharing operating system that ran on
models of the System/360 other than the model 67 with its Dynamic
Address Translation hardware as follows:
Require that programs only use one set of static base registers for
their entire run;
Require that programs describe the base registers they use in a standard header;
Require that programs set a flag when they have finished initializing
those base registers (and do so very quickly after being started).
If those conditions are met, then a program in memory can indeed be
moved to somewhere else in memory, as the operating system will know
which base registers to adjust.
Well, sort of. Such programs would not be able to use flat addresses to
pass pointers between routines, because they would not be valid between relocations. A workaround for this issue may be possible, requiring
changes to calling conventions; for example, all routines in a program
might need to share a common area for data values, and always use the
same base register to point to it.
So you would have special time-sharing versions of all the compilers.
John Savard
quadibloc <quadibloc@gmail.com> schrieb:...
On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:
The project started in earnest in 1994, when IBM had sold OoO
mainframes for several years and the Pentium Pro and HP PA 8000 were
pretty far along the way (both were released in November 1995), so
it's not as if the two involved companies did not have in-house
knowledge of the capabilities of OoO. They apparently chose to ignore
it.
It always surprises me that people think of corporations as
monolithic entities, when they are in fact comprised of very
different groups with very different tasks and very different
agendas and interests.
Also, for development projects, there are always differences in
opinion if choosing path A or B is the right way, because both will
have advantages and disadvantages, and people will have different
opinions of what is likely to succeed. Even after termination of
a project, you will in all likelihood find people who say "But it
could have succeeded, we should have tried this or that".
Thomas Koenig <tkoenig@netcologne.de> writes:
quadibloc <quadibloc@gmail.com> schrieb:...
On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:
The project started in earnest in 1994, when IBM had sold OoO
mainframes for several years and the Pentium Pro and HP PA 8000 were
pretty far along the way (both were released in November 1995), so
it's not as if the two involved companies did not have in-house
knowledge of the capabilities of OoO. They apparently chose to ignore it.
It always surprises me that people think of corporations as
monolithic entities, when they are in fact comprised of very
different groups with very different tasks and very different
agendas and interests.
Corporations are organized hierarchically.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
An interesting development is that, e.g., on Ultrix on DecStations
programs were statically linked for a specific address. Then dynamic
linking became fashionable; on Linux at first dynamically-linked
s/linux/svr3/. It was SVR3 unix that first had static libraries linked
at a specific address.
I guess you mean dynamically linked libraries.
Anyway, it also happened in Linux.
Care was required to ensure that libraries which
were used in the same application were statically link at unique
and non-overlapping addresses. Difficult when you only had half
the 386 linear address space available.
Address space was not the problem. HDD sizes in the early 1990s were
well below 1GB, so all the libraries plus all the executables
installed on one system (or available in one Linux distribution) could
easily fit in that address space with ample address space left for
data. The problem was that it required a lot of coordination, at
least in the way that was used on Linux, don't know about SVR3.
Every library binary was linked for a specific address, so those
producing the library binaries had to coordinate which addresses they
could use.
On 6/13/2025 11:42 AM, MitchAlsup1 wrote:
On Fri, 13 Jun 2025 18:11:18 +0000, Thomas Koenig wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
Even with that provision, it would not have worked.
All I can say is that it worked in several other contemporaneous
architectures. Scott gave one example, the Burroughs Medium systems.
The Univac 1108 and its follow-ons are another. There may be others, perhaps
some of the CDC systems, as they and the Univac systems shared a common
designer (Seymour Cray).
The problem of the /360 was that they put their base registers in
user space.
A base register not part of GPRs is a descriptor (or segment).
And we don't want to go there.
The other machines made it invisible from user space and added its contents to every memory access. This does not take
up opcode space and allows swapping in and out of the whole process,
It also fails when one has 137 different things to track with those
descriptors.
Fails isn't the correct word, but more awkward certainly is. I can't
speak to the Burroughs machines (I am sure Scott can), but on the Univac
1100 series it was a single instruction to change the base register to
any other entry from a table of them that was set up at link time. The
table could contain a lot of entries (it varied over time), but
certainly many more than 137 (could be thousands).
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
quadibloc <quadibloc@gmail.com> schrieb:...
On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:
The project started in earnest in 1994, when IBM had sold OoO
mainframes for several years and the Pentium Pro and HP PA 8000 were
pretty far along the way (both were released in November 1995), so
it's not as if the two involved companies did not have in-house
knowledge of the capabilities of OoO. They apparently chose to ignore
it.
It always surprises me that people think of corporations as
monolithic entities, when they are in fact comprised of very
different groups with very different tasks and very different
agendas and interests.
Corporations are organized hierarchically.
Have you ever worked in a large corporation? (Just asking).
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
An interesting development is that, e.g., on Ultrix on DecStations
programs were statically linked for a specific address. Then dynamic
linking became fashionable; on Linux at first dynamically-linked
s/linux/svr3/. It was SVR3 unix that first had static libraries linked
at a specific address.
I guess you mean dynamically linked libraries.
No, I meant statically linked libraries.
I may have
misunderstood your statement about linux vis-a-vis static shared libraries.
Yes, this was the same problem in SVR3.2. SVR4 showed up around
1990 with shared objects and static libraries went the way of
the Dodo bird.
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
quadibloc <quadibloc@gmail.com> schrieb:...
On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:
The project started in earnest in 1994, when IBM had sold OoO
mainframes for several years and the Pentium Pro and HP PA 8000 were
pretty far along the way (both were released in November 1995), so
it's not as if the two involved companies did not have in-house
knowledge of the capabilities of OoO. They apparently chose to ignore
it.
It always surprises me that people think of corporations as
monolithic entities, when they are in fact comprised of very
different groups with very different tasks and very different
agendas and interests.
Corporations are organized hierarchically.
Have you ever worked in a large corporation? (Just asking).
Define large.
When Burroughs bought Sperry in 1986, Unisys had 120,000 employees.
(A decade later, that was down to 20,000; three decades after
that, it's up to 22,000.)
On 6/12/2025 7:00 PM, MitchAlsup1 wrote:
On Thu, 12 Jun 2025 21:30:39 +0000, BGB wrote:
On 6/12/2025 2:13 PM, MitchAlsup1 wrote:
The modern interpretation is that the dynamic rounding mode can be set
prior to any FP instruction. So, you better be able to set it rapidly
and without pipeline drain, and you need to mark the downstream FP
instructions as dependent on this.
Errm, there is likely to be a delay here, otherwise one will get a stale
rounding mode.
RM is "just 3-bits" that get read from control register and piped
through instruction queue to function unit. Think of the problem
one would have if a hyperthreaded core had to stutter step through
changing RM ...
To do it more quickly, one would likely need special logic in the
pipeline for getting the updated RM to the FPU in a more timely manner.
If done (as-is) in a lax way: Held in the HOBs of GP/GBR or similar,
which is handled as an SPR that gets broadcast out of the regfile.
Then one has the latency issue:
The new value needs to reach the regfile (WB stage);
The value then needs to make its way to the relevant ID2/RF stage (next
cycle after WB).
A lazy option would be to add an interlock so that any dynamic rounding
mode instruction would generate pipeline stalls for any in-flight modifications to GBR (as opposed to using a branch or a series of NOPs).
This was not done in my existing implementation.
But, IME, the "fenv.h" stuff, and FENV_ACCESS, is rarely used.
So, making "fesetround()" or similar faster doesn't seem like a high priority.
If having "fesetround()" as a function call, one can also ensure the needed
delay as-is by using a non-default register during the return (mostly to hinder the branch predictor).
So, setting the rounding mode might be something like:
  MOV .L0, R14
  MOVTT GP, 0x8001, GP //Set to rounding mode 1, clear flag bits
  JMP R14        //use branch to flush pipeline
  .L0:           //updated FPSR now ready
  FADDG R11, R12, R10 //FADD, dynamic mode
Setting RM to a constant (known) value::
   HRW rd,RM,#imm3   // rd gets old value
It is possible.
Could almost alias the bits to part of SR, where SR does generally have
a more timely update process (could reduce latency to 2 cycles).
At present, the RM field is held in GBR(51:48), with fast update options either being a MOVTT (can replace the high 16 bits, *1) or BITMOV,
*1: There is a MOVTT Imm5/Imm6 variant, currently can only modify
(63:60) though.
Though, this strategy is only directly usable in XG3 (where GBR is
mapped to R3/X3), N/A in XG1 or XG2, where GBR is in CR space and so
would require 3 instructions.
Implicitly, the fragment assumed XG3, but then this leaves open the
issue of whether to use my former ASM syntax or RISC-V style ASM syntax (BGBCC can sorta accept either, with my newer X3VM experiment defaulting
to RISC-V syntax).
Can note that the RISC-V F/D instructions define a fixed rounding mode
in the instruction, with one rm encoding selecting the dynamic rounding mode
(though, IIRC, no way to update the dynamic RM within the scope of the
base ISA; so one needs Zicsr and similar to pull it off).
scott@slp53.sl.home (Scott Lurndal) writes:
Yes, this was the same problem in SVR3.2. SVR4 showed up around 1990
with shared objects and static libraries went the way of the Dodo bird.
In Linux, the transition was in 1995. Solaris (Sun's port of SVR4)
appeared in 1992.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
An interesting development is that, e.g., on Ultrix on DecStations
programs were statically linked for a specific address. Then dynamic
linking became fashionable; on Linux at first dynamically-linked
s/linux/svr3/. It was SVR3 unix that first had static libraries linked
at a specific address.
I guess you mean dynamically linked libraries.
No, I meant statically linked libraries.
Static linking does not require any coordination. Every executable
gets its own copy of the library parts it uses linked to fit into the
executable's address space, i.e., with static linking libraries are
not shared.
Therefore, I reduced the index register and base register fields to
three bits each, using only some of the 32 integer registers for those
purposes.
This is going to hurt register allocation.
On 6/12/2025 7:11 PM, quadibloc wrote:
On Thu, 12 Jun 2025 19:24:36 +0000, MitchAlsup1 wrote:
On Thu, 12 Jun 2025 15:38:30 +0000, quadibloc wrote:
But, this sort of thing is semi-common in VLIWs, along with things like
not having either pipeline interlocks or register forwarding (so, say,
every instruction has a 3 cycle latency, so you need to wait 2 bundles
to use a result else you get a stale value, etc).
Contrast, spending 1 or 2 bits per instruction word or similar (to daisy
chain groups of instructions), and still having things like forwarding
and interlocks, does not result in the same severe hit to code density.
However, register forwarding does have a dark side: It has a fairly
steep cost curve. So, with forwarding, once you try to cross ~ 2 or 3
wide, the costs here grow out of control.
So, as the core gets wider, the cost of the register file will exceed
that of the function units (and one may find that it is cheaper to go multi-core than to make the core wider).
Or, to try to go wider while keeping cost under control, give up on
niceties like register forwarding.
One of the dominant use-cases for VLIW is in GPUs and similar.
But, then seemingly battles for control against "lots of in-order
superscalar RISC cores".
So, for more traditional 3D rendering tasks, VLIW did well, but for
things like GPGPU or ray-tracing, the "crapton of in-order cores"
strategy works well.
The main merit of OoO is when the overriding priority is maximizing per-thread performance. But, in other cases, cramming more cores on the
die may offer more performance than one can get from a smaller number of faster cores.
Well, and all the battles over things like memory coherence.
For a small number of fast cores, coherence makes sense.
For large number of cores, weaker models may be preferable (or, say, essentially treating parts of the memory map as read-only to most of the cores).
Where, LIW (in a partial contrast to VLIW) can have merit if the goal is
to optimize for per-core cheapness. The per-core cost for a LIW can be
lower than that of an in-order superscalar, but with the drawback that
the compiler will need to be aware of pipeline specifics.
Say one could have cores designed like, say:
2 or 3 wide;
Explicit parallelism;
No register forwarding;
Maybe optional interlocks;
Weak memory coherence;
...
And then trying to optimize for fitting as many cores as possible on the chip, even if per-thread performance is relatively low, and trying to prioritize having very high memory bandwidth.
Therefore, I reduced the index register and base register fields to
three bits each, using only some of the 32 integer registers for those
purposes.
This is going to hurt register allocation.
I vaguely remember reading somewhere that it doesn't have to be too bad:
e.g. if you reduce register-specifiers to just 4bits for a 32-register architecture and kind of "randomize" which of the 16 values refer to
which of the 32 registers for each instruction, it's fairly easy to
adjust a register allocator to handle this correctly (assuming you
choose your instructions beforehand, you simply mark, for each
instructions, the unusable registers as "interfering"), and the end
result is often almost as good as if you had 5bits to specify
the registers.
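[Editor's aside: the "mark the unusable registers as interfering" idea above can be sketched as a toy greedy allocator. This is a hedged illustration, not anyone's actual compiler; `encodable_set` is a hypothetical stand-in for whichever 16-of-32 window a given instruction encoding permits.]

```python
import random

NUM_REGS = 32
FIELD_BITS = 4  # each instruction names registers with only 4 bits

def encodable_set(instr_id):
    # Hypothetical per-instruction mapping: the 16 physical registers
    # this instruction's 4-bit field can actually name.
    rng = random.Random(instr_id)
    return frozenset(rng.sample(range(NUM_REGS), 1 << FIELD_BITS))

def allocate(virtual_regs, interferes, instr_of):
    # Greedy coloring: registers an instruction cannot encode are
    # treated exactly like interference edges, as suggested above.
    assignment = {}
    for v in virtual_regs:
        banned = {assignment[u] for u in interferes.get(v, ())
                  if u in assignment}
        allowed = encodable_set(instr_of[v]) - banned
        if not allowed:
            return None  # would force a spill in a real allocator
        assignment[v] = min(allowed)
    return assignment
```

With 16 candidates per instruction and only a handful of live interferences, `allowed` rarely empties, which matches the claim that the result is often almost as good as a full 5-bit field.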
scott@slp53.sl.home (Scott Lurndal) writes:
Without data, it's all speculation. Given, however, that there
doesn't seem to be a rush to replace x86 or arm64 with
armhf or riscv64, I don't believe that the text size is
particularly interesting to the general user.
Probably not, but I don't think the reason is that "working set size"
would produce significantly different results.
However, apparently code size is important enough in some markets that
ARM introduced not just Thumb, but Thumb2 (aka ARM T32), and MIPS
followed with MIPS16 (repeating the Thumb mistake) and later MicroMIPS
(which allows mixing 16-bit and 32-bit encodings); Power specified VLE
(are there any implementations of that?); and RISC-V specified the C extension, which is implemented widely AFAICT.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
scott@slp53.sl.home (Scott Lurndal) writes:
Without data, it's all speculation. Given, however, that there
doesn't seem to be a rush to replace x86 or arm64 with
armhf or riscv64, I don't believe that the text size is
particularly interesting to the general user.
Probably not, but I don't think the reason is that "working set size"
would produce significantly different results.
However, apparently code size is important enough in some markets that
ARM introduced not just Thumb, but Thumb2 (aka ARM T32), and MIPS
followed with MIPS16 (repeating the Thumb mistake) and later MicroMIPS
(which allows mixing 16-bit and 32-bit encodings); Power specified VLE
(are there any implementations of that?); and RISC-V specified the C
extension, which is implemented widely AFAICT.
AFAICS main target of those are small embedded microcontrollers
running code mostly from flash. Apparently flash size have important
impact on microcontroller cost. Also, biggest available flash
was limited and if program exceeded available flash one would have
to switch to different (possibly significantly more expensive)
hardware.
On Mon, 16 Jun 2025 23:37:05 +0000, MitchAlsup1 wrote:
Then again, SW over the last
15 years has demonstrated no ability to use "lots" of small cheap cores.
And as long as that remains true, out-of-order execution will continue
to be popular, and there will also be strong pressure to find exotic materials that can be used to make faster transistors - and faster interconnects between them.
While I am willing to agree that we can do better in using multiple
cores, I also think that even after we do all that we can in that area,
a single core that is N times faster will still be better than N cores.
But on the other hand, no matter what new technologies we discover to
make cores faster, there will still be a hard limit to how fast a core
can be.
So both faster cores, and more efficient ways to use multiple cores,
will always be important.
John Savard
On 6/16/2025 9:17 AM, Stefan Monnier wrote:
Therefore, I reduced the index register and base register fields to
three bits each, using only some of the 32 integer registers for those
purposes.
This is going to hurt register allocation.
I vaguely remember reading somewhere that it doesn't have to be too bad:
e.g. if you reduce register-specifiers to just 4bits for a 32-register
architecture and kind of "randomize" which of the 16 values refer to
which of the 32 registers for each instruction, it's fairly easy to
adjust a register allocator to handle this correctly (assuming you
choose your instructions beforehand, you simply mark, for each
instructions, the unusable registers as "interfering"), and the end
result is often almost as good as if you had 5bits to specify
the registers.
I can see that it isn't too hard on the logic for the register
allocator,
but I suspect it will lead to more register saving and
restoring.
Consider a two instruction sequence where the output of the
first instruction is an input to the second. The first instruction has
only a choice of 16 registers, not 32. And the second instruction also
has 16 registers, but on average only half of them will be in the 16
included in the first instruction. So instead of 32 registers to chose
from you only have 8. So the odds increase that you must save one of
those 8 and perhaps restore it after the two instructions have
completed.
It sure seems ugly to me.
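[Editor's aside: the arithmetic above (16 × 16 / 32 = 8 registers nameable by both instructions, on average) can be checked with a quick Monte Carlo run, assuming independent random 16-register windows per instruction; that is the worst case, since a real encoding would overlap the windows deliberately.]

```python
import random

NUM_REGS, WINDOW = 32, 16

def expected_overlap(trials=10_000, seed=42):
    # Average count of registers nameable by BOTH of two instructions,
    # each drawing an independent random 16-of-32 encodable subset.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = set(rng.sample(range(NUM_REGS), WINDOW))
        b = set(rng.sample(range(NUM_REGS), WINDOW))
        total += len(a & b)
    return total / trials  # hovers around 16 * 16 / 32 = 8
```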
On Tue, 17 Jun 2025 12:58:06 +0000, quadibloc wrote:
On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
On 6/16/2025 6:37 PM, MitchAlsup1 wrote:
The cost of register forwarding is:: Results × Operands or sometimes
SUM( Results[type] × Operands[type] ); hardly more than quadratic.
Quadratic is still a lot worse than linear.
You don't have to go very far in a quadratic curve before the cost of
the forwarding exceeds that of the function units (such as an
additional ALU or similar).
Yes, but that's not necessarily the point at which forwarding stops
making sense.
Out of order execution can cost more than the functional units in a CPU.
But since a faster CPU - or, more specifically, a CPU with faster
single-thread performance - is much more useful than more CPUs, it's
still well worth the cost.
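[Editor's aside: the Results × Operands growth quoted above can be made concrete with illustrative numbers; one result and two operand slots per instruction are assumed here, and real pipelines differ.]

```python
def forwarding_paths(width, results_per_instr=1, operands_per_instr=2):
    # Bypass paths for a machine issuing `width` instructions per cycle:
    # every in-flight result must be compared against every operand slot.
    return (width * results_per_instr) * (width * operands_per_instr)

# Doubling the width quadruples the path count:
# width 1 -> 2, width 2 -> 8, width 4 -> 32, width 8 -> 128
```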
In fact, just as I've included VLIW as a basic feature in the Concertina
II, as a way to explicitly do what OoO does transparently for the
programmer, one advanced feature I wish to include in the Concertina II
- I think I took a stab at it in the original Concertina - is dataflow computing.
Dataflow computing is where the program explicitly states how arithmetic units are to be connected together to perform multiple operations in a chained fashion, usually taking vectors as input and producing vectors
as output.
The ENIAC, "before von Neumann ruined it", worked that way. So data
isn't even being put in registers, let alone memory, between several
operations, thus making the computer faster.
In Concertina, unlike Concertina II, I didn't worry about having
instructions that had awkward and special rules for length decoding,
though. A dataflow instruction would involve a chain of operations with
an arbitrary length up to some limit.
However, the solution suggests itself.
I used pointers to pseudo-immediates to prevent the variation in length
of immediate values from making length decoding for instructions
complicated.
In some early iterations of Concertina II, I used a similar pointer
mechanism as my method of allowing instructions longer than 32 bits. The pointers were four bits long for that, instead of five, since now they
were halfword addresses instead of byte addresses. This could be brought
back - but just for dataflow instructions and any similar exotic cases.
John Savard
On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:
On 6/16/2025 9:17 AM, Stefan Monnier wrote:
Therefore, I reduced the index register and base register fields to
three bits each, using only some of the 32 integer registers for those
purposes.
This is going to hurt register allocation.
I vaguely remember reading somewhere that it doesn't have to be too bad:
e.g. if you reduce register-specifiers to just 4bits for a 32-register
architecture and kind of "randomize" which of the 16 values refer to
which of the 32 registers for each instruction, it's fairly easy to
adjust a register allocator to handle this correctly (assuming you
choose your instructions beforehand, you simply mark, for each
instructions, the unusable registers as "interfering"), and the end
result is often almost as good as if you had 5bits to specify
the registers.
I can see that it isn't too hard on the logic for the register
allocator,
You are missing the BIG problem::
Register allocator allocated Rk for calculation j and later allocates
Rm for instruction p, then a few instructions later the code generator
notices that Rk and Rm need to be paired or shared and they were not
originally. How does one fix this kind of problem without adding more
passes over the intermediate representation ??
but I suspect it will lead to more register saving and restoring.
And reg-reg MOVment.
On Tue, 17 Jun 2025 13:12:27 +0000, quadibloc wrote:
On Tue, 17 Jun 2025 12:58:06 +0000, quadibloc wrote:
On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
On 6/16/2025 6:37 PM, MitchAlsup1 wrote:
The cost of register forwarding is:: Results × Operands or sometimes
SUM( Results[type] × Operands[type] ); hardly more than quadratic.
Quadratic is still a lot worse than linear.
You don't have to go very far in a quadratic curve before the cost of
the forwarding exceeds that of the function units (such as an
additional ALU or similar).
Yes, but that's not necessarily the point at which forwarding stops
making sense.
Out of order execution can cost more than the functional units in a CPU.
But since a faster CPU - or, more specifically, a CPU with faster
single-thread performance - is much more useful than more CPUs, it's
still well worth the cost.
In fact, just as I've included VLIW as a basic feature in the Concertina
II, as a way to explicitly do what OoO does transparently for the
programmer, one advanced feature I wish to include in the Concertina II
- I think I took a stab at it in the original Concertina - is dataflow
computing.
Dataflow computing is where the program explicitly states how arithmetic
units are to be connected together to perform multiple operations in a
chained fashion, usually taking vectors as input and producing vectors
as output.
Do you remember WHY data-flow failed ???
It failed because it exposed TOO MUCH ILP and then this, in turn,
required too much logic to manage efficiently, often running into queue
overflow problems (reservation station entries) that could cause lock
up if not managed correctly.
On 6/17/2025 7:58 AM, quadibloc wrote:
On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
On 6/16/2025 6:37 PM, MitchAlsup1 wrote:
The cost of register forwarding is:: Results × Operands or sometimes
SUM( Results[type] × Operands[type] ); hardly more than quadratic.
Quadratic is still a lot worse than linear.
You don't have to go very far in a quadratic curve before the cost of
the forwarding exceeds that of the function units (such as an
additional ALU or similar).
Yes, but that's not necessarily the point at which forwarding stops
making sense.
Out of order execution can cost more than the functional units in a CPU.
But since a faster CPU - or, more specifically, a CPU with faster
single-thread performance - is much more useful than more CPUs, it's
still well worth the cost.
It depends on the use case.
But, at least in my experience with ARM hardware, in-order actually
seems to hold up pretty well here.
Like, if there were a 200% or more difference for OoO performance vs
in-order performance relative to clock speed; maybe...
But, seemingly the delta is often modest enough that one can still make
a case for in-order in cases where you don't actually need maximum single-thread performance.
Often, as noted, each population member (or test member, or whatever it
is called) would usually be represented in some bit-redundant format
(such as each bit expanded out to a full byte for majority-8, or 3
parallel copies for majority-3).
Majority-8 was usually lookup table driven.
Majority-3 was usually (A&B)|(B&C)|(A&C).
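[Editor's aside: the majority-3 expression can be written out directly; the `recover` helper below is a hypothetical illustration of the triple-copy voting idea, not anyone's actual implementation.]

```python
def majority3(a, b, c):
    # Bitwise 2-of-3 vote: a result bit is set iff at least two of
    # the three corresponding input bits are set.
    return (a & b) | (b & c) | (a & c)

def recover(value, flip_mask=0):
    # Keep three copies, corrupt one copy (flip_mask), and recover
    # the original word by majority vote.
    copies = (value, value ^ flip_mask, value)
    return majority3(*copies)
```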
Can note, the Zen+ in my main PC seemingly has an odd property:
Under 25% CPU load, per thread performance is at maximum;
Around 25-50%, per-thread drops, but still often positive benefit;
Over 50%, per-thread drops notably,
so 100% isn't much better than 50%.
Granted, the 50-100% domain is mostly hyperthreading territory.
But, it seems like there is some shared resource that becomes a
bottleneck by around the time one hits 4 threads.
Had noted that it seems to apply mostly to memory-medium and
memory-heavy use-cases, where:
memory-medium: ~ 10 to 100MB of working data;
memory-heavy: over 100MB of working data.
Where, most of the data is touched continuously.
If the task is primarily bound by things like branching or ALU/FPU,
there does not seem to be a fall-off.
John Savard
On 6/17/2025 7:58 AM, quadibloc wrote:
On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
On 6/16/2025 6:37 PM, MitchAlsup1 wrote:
The cost of register forwarding is:: Results × Operands or
sometimes SUM( Results[type] × Operands[type] ); hardly more than
quadratic.
Quadratic sill a lot worse than linear.
You don't have to go very far in a quadratic curve before the cost
of the forwarding exceeds that of the function units (such
as an additional ALU or similar).
Yes, but that's not necessarily the point at which forwarding stops
making sense.
Out of order execution can cost more than the functional units in a
CPU. But since a faster CPU - or, more specifically, a CPU with
faster single-thread performance - is much more useful than more
CPUs, it's still well worth the cost.
It depends on the use case.
But, at least in my experience with ARM hardware, in-order actually
seems to hold up pretty well here.
Like, if there were a 200% or more difference for OoO performance vs in-order performance relative to clock speed; maybe...
But, seemingly the delta is often modest enough that one can still
make a case for in-order in cases where you don't actually need
maximum single-thread performance.
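The "Results × Operands" cost quoted above can be made concrete: the forwarding network needs one tag comparator per (result bus, operand port) pair, so widening the machine grows the count with the product. A trivial sketch (names are illustrative):

```c
#include <assert.h>

/* One comparator per (result bus, operand port) pair: the
   "Results x Operands" forwarding cost. Doubling both the
   number of result buses and operand ports quadruples it. */
static int forwarding_comparators(int result_buses, int operand_ports)
{
    return result_buses * operand_ports;
}
```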
On 6/17/2025 1:58 PM, Michael S wrote:
On Tue, 17 Jun 2025 13:34:20 -0500
BGB <cr88192@gmail.com> wrote:
On 6/17/2025 7:58 AM, quadibloc wrote:
On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
On 6/16/2025 6:37 PM, MitchAlsup1 wrote:
The cost of register forwarding is:: Results × Operands or
sometimes SUM( Results[type] × Operands[type] ); hardly more
than quadratic.
Quadratic is still a lot worse than linear.
You don't have to go very far in a quadratic curve before the
cost of the forwarding exceeds that of the function
units (such as an additional ALU or similar).
Yes, but that's not necessarily the point at which forwarding
stops making sense.
Out of order execution can cost more than the functional units in
a CPU. But since a faster CPU - or, more specifically, a CPU with
faster single-thread performance - is much more useful than more
CPUs, it's still well worth the cost.
It depends on the use case.
But, at least in my experience with ARM hardware, in-order actually
seems to hold up pretty well here.
Like, if there were a 200% or more difference for OoO performance
vs in-order performance relative to clock speed; maybe...
But, seemingly the delta is often modest enough that one can still
make a case for in-order in cases where you don't actually need
maximum single-thread performance.
For the Arm architecture, the difference in single-thread performance
between the fastest available in-order cores (ARM Cortex-A520) and
the fastest available OoO cores (Apple M4, Qualcomm Oryon) is huge:
probably over 5x. Even Arm's own Cortex-X925 is several times
faster than the A520.
For ARM, main reference points I had was A53 vs A72.
A72 was faster, but not drastically...
A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.^^
On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:
On 6/16/2025 9:17 AM, Stefan Monnier wrote:
I vaguely remember reading somewhere that it doesn't have to be too bad:
e.g. if you reduce register-specifiers to just 4bits for a 32-register
architecture and kind of "randomize" which of the 16 values refer to
which of the 32 registers for each instruction, it's fairly easy to
adjust a register allocator to handle this correctly (assuming you
choose your instructions beforehand, you simply mark, for each
instruction, the unusable registers as "interfering"), and the end
result is often almost as good as if you had 5bits to specify
the registers.
I can see that it isn't too hard on the logic for the register
allocator, but I suspect it will lead to more register saving and
restoring.
You are missing the BIG problem::
The register allocator allocates Rk for calculation j and later allocates
Rm for instruction p; then, a few instructions later, the code generator
notices that Rk and Rm need to be paired or shared and they were not
originally.
And reg-reg MOVment.
Consider a two instruction sequence where the output of the
first instruction is an input to the second. The first instruction has
only a choice of 16 registers, not 32. And the second instruction also
has 16 registers, but on average only half of them will be in the 16
included in the first instruction. So instead of 32 registers to choose
from you only have 8.
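The "only 8" figure above follows from linearity of expectation: each of the second instruction's 16 encodable registers lands in the first instruction's window with probability 16/32. A one-line sketch (illustrative, not from the thread):

```c
#include <assert.h>

/* Expected number of common registers between two independent random
   k1- and k2-register windows drawn from an n-register file:
   each of the k2 registers is in the first window with probability
   k1/n, so the expectation is k1*k2/n (e.g. 16*16/32 = 8). */
static double expected_overlap(double k1, double k2, double n)
{
    return k1 * k2 / n;
}
```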
A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.^^
still?
I understand that was your sales pitch, and I assume you had good
reasons to think it was indeed true, but is it still the case now?
AFAICT (see for example Anton's benchmarks in this regard) with current
CPUs, "LBIO cores" are not terribly more power-efficient than big OoO
cores.
Or at least, it seems that the big OoO cores are not significantly less power-efficient when they are computing at the same speed as LBIO cores.
Stefan
On 6/17/2025 12:59 PM, quadibloc wrote:
On Tue, 17 Jun 2025 17:41:10 +0000, MitchAlsup1 wrote:
But that is NOT the arithmetic you are looking at::
A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
So the arithmetic is: can your OoO core be 10× faster than the LBIO core?
And the answer is NO.
But my code runs faster on the OoO core than on ten LBIO cores, because
nobody knows how to make effective use of ten cores to solve the
problem.
So the fact that it uses 10x the electrical power, while only having 2x
the raw power - for an embarrassingly parallel problem, which doesn't
happen to be the one I need to solve - doesn't matter.
That's why OoO chips sell so well.
Errm, this doesn't agree with my experience.
More like the OoO chips are around 20-40% faster, but depending on
workload.
John Savard
On Tue, 17 Jun 2025 20:43:00 +0000, Stefan Monnier wrote:
What do you mean by "a few instructions later"? The above was stated in
the context of a register allocator based on something like Chaitin's
algorithm, which does not proceed "instruction by instruction" but
instead takes a whole function (or basic block), builds an interference
graph from it, then chooses registers for the vars looking only at that
interference graph.
I am regurgitating conversations I have had with compiler people over
the last 40 years. Nothing I have seen in ISA design has moderated
these problems--but I, personally, have not been inside a compiler
for 41 years, either (1983). So, find a compiler writer to set this
record straight. I continue to be told: it is enough harder that you
should design ISA so you don't need pairing or sharing, ever.
MitchAlsup1 [2025-06-17 17:45:23] wrote:
On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:
On 6/16/2025 9:17 AM, Stefan Monnier wrote:
I vaguely remember reading somewhere that it doesn't have to be too bad:
e.g. if you reduce register-specifiers to just 4bits for a 32-register
architecture and kind of "randomize" which of the 16 values refer to
which of the 32 registers for each instruction, it's fairly easy to
adjust a register allocator to handle this correctly (assuming you
choose your instructions beforehand, you simply mark, for each
instruction, the unusable registers as "interfering"), and the end
result is often almost as good as if you had 5bits to specify
the registers.
I can see that it isn't too hard on the logic for the register
allocator,
You are missing the BIG problem::
The register allocator allocates Rk for calculation j and later allocates
Rm for instruction p; then, a few instructions later, the code generator
notices that Rk and Rm need to be paired or shared and they were not
originally.
What do you mean by "a few instructions later"? The above was stated in
the context of a register allocator based on something like Chaitin's
algorithm, which does not proceed "instruction by instruction" but
instead takes a whole function (or basic block), builds an interference
graph from it, then chooses registers for the vars looking only at that
interference graph.
but I suspect it will lead to more register saving and
restoring.
And reg-reg MOVment.
Of course. The point is simply that in practice (for some particular compiler at least), the cost of restricting register access by using
only 4bits despite the existence of 32 registers was found to be small.
Note also that you can reduce this cost by relaxing the constraint and
using 5bit for those instructions where there's enough encoding space.
(or inversely, increase the cost by using yet fewer bits for those instructions where the encoding space is really tight).
There's also a good chance that you can further reduce the cost by using
a sensible mapping from 4bit specifiers instead of a randomized one.
IOW, the point is that just because you have chosen to have 2^N
registers in your architecture doesn't mean you have to offer access to
all 2^N registers in every instruction that can access registers.
It's clearly more convenient if you can offer that access, but if needed
you can steal a bit here and there without having too serious an impact
on performance.
Consider a two instruction sequence where the output of the
first instruction is an input to the second. The first instruction has
only a choice of 16 registers, not 32. And the second instruction also
has 16 registers, but on average only half of them will be in the 16
included in the first instruction. So instead of 32 registers to choose
from you only have 8.
Right. But in practice, the register allocator can often choose the
rest of the register assignment such that one of those 8 is available.
Stefan
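The scheme Stefan describes, treating registers outside an instruction's encodable window exactly like interference, can be sketched as a toy greedy colorer. This is an illustrative sketch under that one assumption, not any real compiler's allocator:

```c
#include <assert.h>
#include <stdint.h>

#define MAXREG 32

/* Per virtual register: the architectural registers its encoding can
   name (the 4-bit window), its interference edges, and the chosen color. */
typedef struct {
    uint32_t allowed;     /* bit r set => architectural reg r encodable  */
    uint32_t interferes;  /* bit j set => conflicts with virtual reg j   */
    int      color;       /* assigned architectural register, -1 if none */
} VReg;

/* Greedy coloring: a restricted encoding window needs no special
   machinery -- registers outside 'allowed' are simply never usable,
   just as if every out-of-window register were an interference edge. */
static int color_vregs(VReg *v, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t busy = 0;
        for (int j = 0; j < i; j++)          /* colors of colored neighbors */
            if (v[i].interferes & (1u << j))
                busy |= 1u << v[j].color;
        uint32_t usable = v[i].allowed & ~busy;
        v[i].color = -1;
        for (int r = 0; r < MAXREG; r++)     /* pick lowest usable register */
            if (usable & (1u << r)) { v[i].color = r; break; }
        if (v[i].color < 0) return -1;       /* would have to spill */
    }
    return 0;
}
```

A Chaitin-style allocator does the same thing less naively (simplify/select on the whole graph), but the window-as-interference trick carries over unchanged.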
Yes, but as I have argued before, this was a mistake, and in any event
base registers became obsolete when virtual memory became available
(though, of course, IBM kept it for backwards compatibility).
On 6/17/2025 4:19 PM, MitchAlsup1 wrote:
On Tue, 17 Jun 2025 19:04:49 +0000, BGB wrote:
On 6/17/2025 12:59 PM, quadibloc wrote:
On Tue, 17 Jun 2025 17:41:10 +0000, MitchAlsup1 wrote:
But that is NOT the arithmetic you are looking at::
A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the
power.
So the arithmetic is: can your OoO core be 10× faster than the LBIO
core?
And the answer is NO.
But my code runs faster on the OoO core than on ten LBIO cores, because
nobody knows how to make effective use of ten cores to solve the
problem.
So the fact that it uses 10x the electrical power, while only having 2x
the raw power - for an embarrassingly parallel problem, which doesn't
happen to be the one I need to solve - doesn't matter.
That's why OoO chips sell so well.
Errm, this doesn't agree with my experience.
More like the OoO chips are around 20-40% faster, but depending on
workload.
Then you are latency bound, not compute bound.
Possibly...
A lot of the code doesn't do that much math or dense logic on the data.
But, a whole lot of mostly shoveling data around, often through lookup
tables or similar.
But, if the usual claim is that it is N times faster, this would imply
it is N times faster across the board, rather than "N times faster, but
only if the logic happens to have lots of complex math expressions and
similar."
A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.^^
still?
I understand that was your sales pitch, and I assume you had good
reasons to think it was indeed true, but is it still the case now?
AFAICT (see for example Anton's benchmarks in this regard) with current
CPUs, "LBIO cores" are not terribly more power-efficient than big OoO cores.
Or at least, it seems that the big OoO cores are not significantly less
power-efficient when they are computing at the same speed as LBIO cores.
Ah, well, pairing is a different problem than the "incomplete register
specifiers" I'm talking about. Indeed, it can be much more difficult to
adapt a Chaitin-style allocator to handle pairing because it can't be
expressed simply in the interference graph.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Yes, but as I have argued before, this was a mistake, and in any event
base registers became obsolete when virtual memory became available
(though, of course, IBM kept it for backwards compatibility).
As systems got larger they needed to run more than 15 concurrent regions
(storage protect key=0 for kernel, 1-15 for regions).
On Tue, 17 Jun 2025 12:58:06 +0000, quadibloc wrote:
On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
On 6/16/2025 6:37 PM, MitchAlsup1 wrote:
The cost of register forwarding is:: Results × Operands or sometimes
SUM( Results[type] × Operands[type] ); hardly more than quadratic.
Quadratic is still a lot worse than linear.
You don't have to go very far in a quadratic curve before the cost of
the forwarding exceeds that of the function units (such as an
additional ALU or similar).
Yes, but that's not necessarily the point at which forwarding stops
making sense.
Out of order execution can cost more than the functional units in a CPU.
But since a faster CPU - or, more specifically, a CPU with faster
single-thread performance - is much more useful than more CPUs, it's
still well worth the cost.
In fact, just as I've included VLIW as a basic feature in the Concertina
II, as a way to explicitly do what OoO does transparently for the
programmer,
On Tue, 17 Jun 2025 1:20:40 +0000, quadibloc wrote:
On Mon, 16 Jun 2025 23:37:05 +0000, MitchAlsup1 wrote:
Then again, SW over the last
15 years has demonstrated no ability to use "lots" of small cheap cores.
And as long as that remains true, out-of-order execution will continue
to be popular, and there will also be strong pressure to find exotic
materials that can be used to make faster transistors - and faster
interconnects between them.
While I am willing to agree that we can do better in using multiple
cores, I also think that even after we do all that we can in that area,
a single core that is N times faster will still be better than N cores.
But that is NOT the arithmetic you are looking at::
A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
So the arithmetic is: can your OoO core be 10× faster than the LBIO core?
??
And the answer is NO.
The highest performing OoO is M4 right now and it is 2× faster than
Opteron Rev F (after normalizing frequency)--perhaps I should say
2× more instructions per clock. If M4 area was equal to Opteron area
(highly doubtful after normalizing) it would still be a factor of 5-6×
more area than 12 LBIO cores.
On 6/17/2025 1:43 PM, Stefan Monnier wrote:
What do you mean by "a few instructions later"? The above was stated in
the context of a register allocator based on something like Chaitin's
algorithm, which does not proceed "instruction by instruction" but
[...]
Fwiw, here is some old code of mine, a region allocator in C that should still work today... Sorry for butting in:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
Ah, well, pairing is a different problem than the "incomplete register
specifiers" I'm talking about. Indeed, it can be much more difficult to
adapt a Chaitin-style allocator to handle pairing because it can't be
expressed simply in the interference graph.
I remember reading a paper about register allocation for register
pairs, but don't find that paper right now. Anyway, what I read was
based on graph-colouring IIRC and it looked pretty plausible and not
too complicated. But the devil is in the details.
On Sat, 14 Jun 2025 16:45:23 +0000, Stephen Fuld wrote:
On 6/14/2025 3:48 AM, Thomas Koenig wrote:
Which made nonsense the concept of making data relocatable by
always using base registers.
Forgive me, but I don't see why. When the program is linked, the COMMON
block is at some fixed displacement from the start of the program. So
the program can "compute" the real address of the data in common blocks
from the address in its base register.
The purpose of a COMMON block is to share variables between the main
program and subroutines.
On the System/360, a FORTRAN compiler typically compiled each subroutine
in a program separately from every other subroutine. They just got
linked together by the linking loader in order to run.
So no subroutine would know where a COMMON block created by loader for
the main program would be unless that information was given to it - and
the loader would give it that information, in the form of a full 24-bit address constant, so it didn't have to be passed as a parameter.
On Wed, 18 Jun 2025 14:10:42 +0000, Thomas Koenig wrote:
quadibloc <quadibloc@gmail.com> schrieb:
In fact, just as I've included VLIW as a basic feature in the Concertina
II, as a way to explicitly do what OoO does transparently for the
programmer,
Have enough instructions in the queue to deal with memory delays which
cannot be determined by the compiler in a reasonable way? How does
it do that?
VLIW only deals with one of the things OoO solves; stuff like read-after-write pipeline hazards. It doesn't address cache misses in
any way.
So that's total bad news, right?
It proves VLIW is useless?
John Savard
On Wed, 18 Jun 2025 18:16:37 +0000, MitchAlsup1 wrote:
On Wed, 18 Jun 2025 15:14:06 +0000, quadibloc wrote:
So that's total bad news, right?
Grim, maybe, Bad, not necessarily.
It proves VLIW is useless?
Not at all--it demonstrates the VLIW is less than ideal when dealing
with unpredictable latencies.
I'm surprised, though, that you did not continue onwards, and comment on
the part where I blamed you for finding a resolution to this problem.
Because, unless my memory is very faulty, you noted that the OoO implementation of the 6600 _is_ adequate for dealing with unpredictable latencies, such as those from cache misses (even if the 6600 didn't have
a cache; instead, it had extra memory under program control)... and so
it seemed to me that since VLIW can theoretically handle register
hazards almost as well as Tomasulo, it could complement a 6600-style
pipeline to provide a match for the resource hog OoO style in common use today.
John Savard
Anton Ertl [2025-06-18 07:31:55] wrote:
I remember reading a paper about register allocation for register
pairs, but don't find that paper right now. Anyway, what I read was
based on graph-colouring IIRC and it looked pretty plausible and not
too complicated. But the devil is in the details.
Preston Briggs (who used to be a regular here) discusses such an
allocator in his PhD thesis
(https://repository.rice.edu/items/2ea2032a-0872-43a1-90c0-564c1dd2275f).
On Thu, 19 Jun 2025 1:03:37 +0000, MitchAlsup1 wrote:
The resolution to the problem means the VLIW-ness of that ISA is no
longer necessary.
That may be.
But because I'm not the expert on things like this that you are, I don't
feel that I can dispute the conventional wisdom. The conventional
wisdom, as practiced by Intel and AMD and pretty much the whole CPU
industry is that the dynamic scheduling design as used in the Control
Data 6600 is inadequate, and one has to go to register rename or the equivalent Tomasulo Algorithm in order to achieve acceptable
performance.
The 6600 doesn't cover all register hazards.
VLIW can deal with register hazards, but it doesn't help at all with cache misses - as I have no
reason to doubt your claim that the 6600 mechanism is adequate to deal
with cache misses, though, that's why I noted combining the two as an
option.
Maybe you are right that this is useless, but I'm not in a position to
dispute what Patterson and Hennessy have proclaimed and the industry
has accepted.
But I'm saying that even if Patterson and Hennessy _are_ right, adding
VLIW provides a method by which your goal - getting rid of the bulk of
the transistor and power overhead of OoO by going to the 6600 design -
would _still_ be achievable, since adding VLIW is essentially trivial.
Sure, I could be wrong - and 6600 by itself is plenty good enough. But
given all the naysayers out there, a way out of the GBOoO rut that
people might be willing to believe could work has got to have some
value.
John Savard
I've decided that this would be a good time to review the difference
between the 6600 scoreboard and modern OoO.
Having refreshed my memory, I see the issue is that when there is a WAR hazard, a 6600-style scoreboard simply stalls the functional unit.
Tomasulo or register renaming provides extra storage, either in the reservation stations or in rename registers, so that if the desired
result register is not yet available, the result can just go in an extra place.
This suggests that, just as some caches are designed in a very simple
fashion, one could have a "stupid" form of register rename - say, each
register has its own rename register - that could be added to a
scoreboard. I would have thought that people are already doing this, but
they're calling it register renaming and full OoO rather than
"scoreboarding plus", because that's better marketing-wise if nothing else.
RISC mitigates WAR hazards by having 32 registers instead of, say, 8 or
16.
VLIW marks out groups of instructions that don't have RAW or WAR
hazards. A scoreboard keeps track of dependencies, so it can delay only
those instructions affected by a cache miss. Since a 6600 scoreboard
does have to _detect_ WAR hazards, even if it doesn't handle them as
well as Tomasulo, indeed putting a bit in to indicate one is present
isn't needed, so you are right there... at least for an older-style
computer.
But a lot of computers these days have multiple copies of each
functional unit; that is, they're superscalar. So indicating that
several instructions can be executed together with no need for any
thought would seem to make things go faster.
Except, of course, when there's a chance one is trying to execute instructions at a time when all dependencies are not resolved - some registers aren't loaded yet with the data some of those instructions
will need. They all have to go through the scoreboard to check that. But
the instructions in a group are guaranteed not to depend on *each
other*, so they can be checked against the scoreboard _in parallel_.
That's what the VLIW bits can help with.
John Savard
On Thu, 12 Jun 2025 8:38:06 +0000, Anton Ertl wrote:
[program counter as GPR]
Nowadays, you can afford it, but the question still is whether it is
cost-effective. Looking at recent architectures, none has PC
addressable like a GPR, so no, it does not look to be cost-effective
to have the PC addressable like a GPR. AMD64 and ARM A64 have
PC-relative addressing modes, while RISC-V does not.
Consider that in an 8-wide machine, IP gets added to 8 times per cycle,
whereas no GPR has a property anything like that.
On Sat, 14 Jun 2025 8:35:31 +0000, Robert Finch wrote:
Packing and unpacking decimal floats can be done inexpensively and fast
relative to the size, speed of the decimal float operations. For my own
implementation I just unpack and repack for all ops and then registers
do not need any more than 128-bits.
I also unpack the hidden first bit on IEEE-754 floats.
The idea is that the ISA may be used for a wide variety of
implementations, and on at least some of them, anything that takes an
amount of time above zero may make a difference.
On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:
Ideally, one has an ISA where nearly all registers are the same:
No distinction between base/index/data registers;
No distinction between integer and floating point registers;
No distinction between general registers and SIMD registers;
...
But I felt that this was OK, since as everybody knows, strings really
only have to be able to be at least 80 characters long. Hmm... wait a
moment, aren't 132-character strings sometimes needed?
Oh, well.
John Savard
On Fri, 20 Jun 2025 15:30:59 +0000, quadibloc wrote:
On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:
Ideally, one has an ISA where nearly all registers are the same:
  No distinction between base/index/data registers;
  No distinction between integer and floating point registers;
  No distinction between general registers and SIMD registers;
  ...
But I felt that this was OK, since as everybody knows, strings really
only have to be able to be at least 80 characters long. Hmm... wait a
moment, aren't 132-character strings sometimes needed?
Line printers are/were 132 characters wide.
I also unpack the hidden first bit on IEEE-754 floats.
The idea is that the ISA may be used for a wide variety of
implementations, and on at least some of them, anything that takes an
amount of time above zero may make a difference.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
quadibloc <quadibloc@gmail.com> schrieb:...
On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:
The project started in earnest in 1994, when IBM had sold OoO
mainframes for several years and the Pentium Pro and HP PA 8000 were
pretty far along the way (both were released in November 1995), so
it's not as if the two involved companies did not have in-house
knowledge of the capabilities of OoO. They apparently chose to ignore
it.
It always surprises me that people think of corporations as
monolithic entities, when they are in fact comprised of very
different groups with very different tasks and very different
agendas and interests.
Corporations are organized hierarchically.
Have you ever worked in a large corporation? (Just asking).
quadibloc [2025-06-14 15:22:20] wrote:
I also unpack the hidden first bit on IEEE-754 floats.
The idea is that the ISA may be used for a wide variety of
implementations, and on at least some of them, anything that takes an
amount of time above zero may make a difference.
Do you have any evidence that hiding the leading 1 bit takes more time
than not hiding? I can think of reasons why either of the two options
could be marginally cheaper than the other, but in all cases I can think
of, it would make *very little* difference, if any.
Stefan
Creating the hidden bit is 2-gates of delay (H,F,D,Q).
Thomas Koenig wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
quadibloc <quadibloc@gmail.com> schrieb:...
On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:
The project started in earnest in 1994, when IBM had sold OoO
mainframes for several years and the Pentium Pro and HP PA 8000 were
pretty far along the way (both were released in November 1995), so
it's not as if the two involved companies did not have in-house
knowledge of the capabilities of OoO. They apparently chose to ignore
it.
It always surprises me that people think of corporations as
monolithic entities, when they are in fact comprised of very
different groups with very different tasks and very different
agendas and interests.
Corporations are organized hierarchically.
Have you ever worked in a large corporation? (Just asking).
Hydro had 77K employees in 130 countries; there was no such thing as a
simple hierarchical setup.
Rather more like a loose federation across varying local environments.
Creating the hidden bit is 2-gates of delay (H,F,D,Q).
How come it's not free in hardware?
Is it only because of denormalized?
Stefan
On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:
Creating the hidden bit is 2-gates of delay (H,F,D,Q).
How come it's not free in hardware?
Is it only because of denormalized?
hidden = operand.exponent != 0
You DO end up special casing Infinities and NaNs; anyway.
Special = operand.exponent == 0b11111111111
Which is an 11-input AND gate.
On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:
Creating the hidden bit is 2-gates of delay (H,F,D,Q).
How come it's not free in hardware?
Is it only because of denormalized?
hidden = operand.exponent != 0
Which is an 11-input NAND gate. I suspect you could assume it is 1
and special case the result, but even special casing the result
cannot be less than 1-gate (a multiplexer).
You DO end up special casing Infinities and NaNs; anyway.
Special = operand.exponent == 0b11111111111
Which is an 11-input AND gate.
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:
Creating the hidden bit is 2-gates of delay (H,F,D,Q).
How come it's not free in hardware?
Is it only because of denormalized?
hidden = operand.exponent != 0
Which is an 11-input NAND gate. I suspect you could assume it is 1
and special case the result, but even special casing the result
cannot be less than 1-gate (a multiplexer).
You DO end up special casing Infinities and NaNs; anyway.
Special = operand.exponent == 0b11111111111
Which is an 11-input AND gate.
But does an explicit bit lead to a difference? IIUC the FPU needs to
special-case things anyway. I would guess that a flag normal/special
could save some time, but once the FPU knows that it deals with
normal numbers the hidden bit should be effectively free.
On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:
Creating the hidden bit is 2-gates of delay (H,F,D,Q).
How come it's not free in hardware?
Is it only because of denormalized?
   hidden = operand.exponent != 0
Which is an 11-input NAND gate. I suspect you could assume it is 1
and special case the result, but even special casing the result
cannot be less than 1-gate (a multiplexer).
You DO end up special casing Infinities and NaNs; anyway.
  Special = operand.exponent == 0b11111111111
Which is an 11-input AND gate.
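The `hidden = operand.exponent != 0` rule quoted above can be checked in software against IEEE-754 binary64; this is a sketch for illustration (real hardware works on the raw fields directly, with nothing like memcpy involved):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Recover the 53-bit significand of an IEEE-754 binary64 value with
   the hidden bit made explicit: bit 52 is 1 for normal numbers
   (exponent field != 0) and 0 for zeros and subnormals. */
static uint64_t explicit_significand(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);          /* well-defined type pun */
    uint64_t frac   = bits & ((1ull << 52) - 1);
    uint64_t exp    = (bits >> 52) & 0x7FF;
    uint64_t hidden = (exp != 0);            /* the rule quoted above */
    return (hidden << 52) | frac;
}
```

Infinities and NaNs (exponent field all ones) also get a 1 in bit 52 here; as noted in the thread, they need their own special-case test regardless.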
BGB <cr88192@gmail.com> wrote:
But, if the usual claim is that it is N times faster, this would imply
it is N times faster across the board, rather than "N times faster, but
only if the logic happens to have lots of complex math expressions and
similar."
I have a contrived program which on machines from about 2010 peaked
at about 10 MIPS; on earlier machines, starting from about 1990, it
was closer to 2 MIPS. Basically the program is doing pointer chasing
in somewhat irregular pattern covering whole memory. AFAICS it
needed 2 RAM accesses per instruction (one for second level page
table entry, one for actual data), on modern machines with multilevel
page tables it may be more (but modern machines tend to have quite
large caches and few top levels of page tables may fit in the on
chip cache).
While this is a very unnatural program it clearly shows that modern
machines are fast only when caching/prefetching works as expected
and badly behaving programs may be much slower than execution
speed of the core. And of course, within the core there are
more factors that can cause slowdown.
So any speed claims are probabilistic and implicitly or explicitly
assume some program behaviour.
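A sketch in the spirit of Waldek's pointer-chasing program (the permutation and sizes here are illustrative, not his original benchmark):

```c
#include <assert.h>
#include <stddef.h>

/* Build a single cycle over n slots: next[i] = (i + stride) % n.
   With stride coprime to n, the walk visits every slot, and each
   load depends on the result of the previous one, so the core can
   neither overlap nor usefully prefetch the accesses. */
static void build_cycle(size_t *next, size_t n, size_t stride)
{
    for (size_t i = 0; i < n; i++)
        next[i] = (i + stride) % n;
}

/* Latency-bound walk: every iteration is a dependent load. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    while (steps--)
        p = next[p];
    return p;
}
```

Sized far beyond the last-level cache (and with a less regular permutation than this one), nearly every step misses, which is why the observed rate collapses to a few MIPS regardless of how wide or OoO the core is.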
On Thu, 12 Jun 2025 15:38:30 +0000, quadibloc wrote:
VLIW, in the sense of the Itanium or the TMS 320C6000, offers the
promise of achieving OoO level performance without the costs of OoO.
Pick a VLIW that was successful like x86 or ARM in the marketplace.
Include pairs of short instructions as part of the ISA, but make the
short instructions 14 bits long instead of 15 so they get only 1/16 of
the opcode space. This way, the compromise is placed in something that's
less important. In the CISC mode, 17-bit short instructions will still
be present, after all.
However, try as I may, it may well be that the cost of this will turn
out to be too great. But if I can manage it, a significant restructuring
of the opcodes of this iteration of Concertina II may be coming soon.
More importantly, I need 256-character strings if I'm using them as
translate tables. Fine, I can use a pair of registers for a long string.
On S/360, that is exactly what you did. The first instruction in an
assembler program was typically BALR (Branch And Link Register), which
is essentially a subroutine call. You would BALR to the next
instruction, which put that instruction's address into a register.
On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:
What is Concertina 2?
Roughly speaking, it is a design where most of the non-power of 2 data
types are being supported {36-bits, 48-bits, 60-bits} along with the
standard power of 2 lengths {8, 16, 32, 64}.
This creates "interesting" situations with respect to instruction
formatting and to the requirements of constants in support of those instructions; and interesting requirements in other areas of ISA.
VAX tried too hard in my opinion to close the semantic gap.
Any operand could be accessed with any address mode. Now while this
makes the puny 16-register file seem larger,
what the VAX designers forgot is that each address mode was an instruction
in its own right.
So, VAX shot at minimum instruction count, and purposely miscounted
address modes not equal to %k as free.
If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
cases, one gets most of the benefit with fewer issues.
On 6/17/2025 10:59 AM, quadibloc wrote:
So the fact that it uses 10x the electrical power, while only having 2x
the raw power - for an embarrassingly parallel problem, which doesn't
happen to be the one I need to solve - doesn't matter.
Can you break your processing down into units that can be executed in parallel, or do you get into an interesting issue where step B cannot
proceed until step A is finished?
On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:
If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
cases, one gets most of the benefit with fewer issues.
That's certainly a way to do it. But then you either need to dedicate
one base register to each array - perhaps easier if there's opcode
space to use all 32 registers as base registers, which this would allow -
or you would have to load the base register with the address of the
array.
On 7/28/2025 6:18 PM, John Savard wrote:
On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:
VAX tried too hard in my opinion to close the semantic gap.
Any operand could be accessed with any address mode. Now while this
makes the puny 16-register file seem larger,
what the VAX designers forgot is that each address mode was an instruction
in its own right.
So, VAX shot at minimum instruction count, and purposely miscounted
address modes not equal to %k as free.
Fancy addressing modes certainly aren't _free_. However, they are,
in my opinion, often cheaper than achieving the same thing with an
extra instruction.
So it makes sense to add an addressing mode _if_ what that addressing
mode does is pretty common.
The use of addressing modes drops off pretty sharply though.
Like, if one could stat it out, one might see a static-use pattern
something like:
80%: [Rb+disp]
15%: [Rb+Ri*Sc]
3%: (Rb)+ / -(Rb)
1%: [Rb+Ri*Sc+Disp]
<1%: Everything else
Though, I am counting [PC+Disp] and [GP+Disp] as part of [Rb+Disp] here.
Granted, the dominance of [Rb+Disp] does drop off slightly when
considering dynamic instruction use. Part of it is due to the
prolog/epilog sequences.
If one had instead used (SP)+ and -(SP) addressing for prologs and
epilogs, then one might see around 20% or so going to these instead.
Or, if one had PUSH/POP, to PUSH/POP.
The discrepancy between static and dynamic instruction counts is then
mostly due to things like loops and similar.
Estimating the effect of loops in a compiler is hard, but I had noted that
assuming a scale factor of around 1.5^D for loop nesting depth (D)
seemed to be in the right area. Many loops end up never being reached, or
only running a few times, so, possibly counter-intuitively, it is often
faster to assume that a loop body will likely only cycle 2 or 3 times
rather than 100s or 1000s; trying to aggressively optimize loops by
assuming large N tends to be detrimental to performance.
Well, and at least thus far, profile-guided optimization isn't really a thing in my case.
One could maybe argue for some LoadOp instructions, but even this is debatable. If the compiler is designed mostly for Load/Store, and the
ISA has a lot of registers, the relative benefit of LoadOp is reduced.
LoadOp being mostly a benefit if the value is loaded exactly once, and
there is some other ALU operation or similar that can be fused with it.
Practically, it limits the usefulness of LoadOp mostly to saving an instruction for things like:
z=arr[i]+x;
But, the relative incidence of things like this is low enough as to not
save that much.
The other thing is that one has to implement it in a way that does not increase pipeline length,
since if one makes the pipeline longer for the
sake of LoadOp or OpStore, then this is likely to be a net negative for performance versus prioritizing Load/Store, unless the pipeline already needed to be lengthened for other reasons.
One can be like, "But what if the local variables are not in registers?"
but on a machine with 32 or 64 registers, most likely your local
variable is already going to be in a register.
So, the main potential merit of LoadOp is that it "doesn't hurt as badly on a register-starved machine".
That being said, though, designing a new machine today like the VAX
would be a huge mistake.
But the VAX, in its day, was very successful. And I don't think that
this was just a result of riding on the coattails of the huge popularity
of the PDP-11. It was a good match to the technology *of its time*,
that being machines that were implemented using microcode.
Yeah.
There are some living descendants of that family, but pretty much
everything now is Reg/Mem or Load/Store with a greatly reduced set of addressing modes.
John Savard