• Re: Why I've Dropped In

    From MitchAlsup1@21:1/5 to David Chmelik on Thu May 22 17:42:14 2025
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:


    What is Concertina 2?

    Roughly speaking, it is a design that supports most of the
    non-power-of-2 data types {36-bit, 48-bit, 60-bit} along with the
    standard power-of-2 lengths {8, 16, 32, 64}.

    This creates "interesting" situations with respect to instruction
    formatting and to the constants required in support of those
    instructions; and interesting requirements in other areas of the ISA.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu May 22 18:03:34 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:


    What is Concertina 2?

    Roughly speaking, it is a design that supports most of the
    non-power-of-2 data types {36-bit, 48-bit, 60-bit} along with the
    standard power-of-2 lengths {8, 16, 32, 64}.

    Both sets are congruent to zero modulo 4. Therefore, the
    only proper solution becomes that modulo value, which amounts
    in this case to a 4-bit digit/nibble. Any size data type can
    be constructed from a variable number of nibbles up
    to some architectural max (e.g. 400 bits for a 100 nibble
    operand). The processor can treat them as binary or BCD
    depending on the requirements of the application (e.g. BCD
    fits COBOL well).
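
    As a minimal C sketch of the nibble idea (an illustration, not from
    any poster's ISA; the names here are invented), a decimal operand is
    just a count of 4-bit digits, so a 7-digit (28-bit) value packs so:

        #include <stdint.h>
        #include <stdio.h>

        /* Pack up to 16 decimal digits into 4-bit BCD nibbles,
           least-significant digit in the low nibble. */
        static uint64_t pack_bcd(const uint8_t *digits, int ndigits)
        {
            uint64_t v = 0;
            for (int i = ndigits - 1; i >= 0; i--)
                v = (v << 4) | (digits[i] & 0xF);
            return v;
        }

        int main(void)
        {
            uint8_t d[7] = {9, 0, 4, 2, 5, 3, 1};  /* 1352409, LSD first */
            /* In BCD the hex digits read as the decimal digits: 1352409 */
            printf("%llx\n", (unsigned long long)pack_bcd(d, 7));
            return 0;
        }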

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri May 23 12:37:38 2025
    On Thu, 22 May 2025 18:03:34 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:


    What is Concertina 2?

    Roughly speaking, it is a design that supports most of the
    non-power-of-2 data types {36-bit, 48-bit, 60-bit} along with the
    standard power-of-2 lengths {8, 16, 32, 64}.

    Both sets are congruent to zero modulo 4.

    Restricted because it does not support 28-bits, 40-bits, 44-bits,...

    Therefore, the
    only proper solution becomes that modulo value, which amounts
    in this case to a 4-bit digit/nibble. Any size data type can
    be constructed from a variable number of nibbles up
    to some architectural max (e.g. 400 bits for a 100 nibble
    operand). The processor can treat them as binary or BCD
    depending on the requirements of the application (e.g. BCD
    fits COBOL well).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri May 23 13:24:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 22 May 2025 18:03:34 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:


    What is Concertina 2?

    Roughly speaking, it is a design that supports most of the
    non-power-of-2 data types {36-bit, 48-bit, 60-bit} along with the
    standard power-of-2 lengths {8, 16, 32, 64}.

    Both sets are congruent to zero modulo 4.

    Restricted because it does not support 28-bits, 40-bits, 44-bits,...

    28/4 = 7 digits, 40/4 = 10 digits. Works just fine.

    The Burroughs medium systems loaded 10 digits at a time from
    memory when processing an operand.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Wed Jun 11 05:56:33 2025
    quadibloc <quadibloc@gmail.com> schrieb:

    Since the basis of the ISA is a RISC-like ISA,

    [...]

    3) Use only four base registers instead of eight.
    4) Use only three index registers instead of seven.
    5) Use only six index registers instead of seven, and use only four base registers instead of eight when indexing is used.

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Jun 11 16:37:27 2025
    On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:

    On 6/11/2025 12:56 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:

    Since the basis of the ISA is a RISC-like ISA,

    [...]

    3) Use only four base registers instead of eight.
    4) Use only three index registers instead of seven.
    5) Use only six index registers instead of seven, and use only four base registers instead of eight when indexing is used.

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.


    Agreed.

    Ideally, one has an ISA where nearly all registers are the same:
    No distinction between base/index/data registers;
    No distinction between integer and floating point registers;
    No distinction between general registers and SIMD registers;
    ...

    Agreed:: But most architectures get the FP registers wrong under that distinction, and apparently everyone gets the SIMD registers wrong.

    Maybe it should be stated:: There is one register file of k bits per
    register (where k = 32, 64, 128) and that there is no distinction between
    what kind of data can go in what register.

    Though, there are tradeoffs. For example, SPRs can be, by definition,
    not the same as GPRs. Say, if you have an SP or LR, almost by
    definition, you will not be using it as a GPR.

    Disagree:: One uses the SP as a base register "all the time",
    one uses LR as a JMP source "every subroutine return".
    Either is generally done using GPRs, and thus the problem
    is to guarantee that you don't have so many of them that
    you can't use them naturally in your ISA.

    So, if ZR/LR/SP/GP are "not GPR", this is fine.
    Pretty much everything else is best served by being a GPR or suitably
    GPR like.

    ....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Wed Jun 11 16:49:06 2025
    On Wed, 11 Jun 2025 14:12:04 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    And that is why you don't do it that way.

    We can all agree that [Rbase+Rindex<<scale+Displacement] is (IS) the
    proper way to abstract address-generation. The problem is how does
    one "get there". Another way to look at this is that the AGEN unit
    is built to do index scaling AND DISPlacement addition as its
    primitives. The rest is routing of operands to AGEN--and in this case
    we KNOW that DISP is a constant at DECODE time and can arrange its
    instruction-queueing appropriately.

    In my case I broke it into 2 sets of patterns::

    MEM Rd,[Rbase+DISP16]
    and
    MEM Rd,[Rbase+Rindex<<scale]

    both of which fit in 32-bits. With 6-bit Major OpCode, this eats
    up 3/8ths of the OpCode space (There are 2× as many LDs as STs)
    in the Major OpCode repository. THEN one finds a way to add
    DISP32 and DISP64 (or ABS64) constants to the second form.

    DISP16 covers 70% of memory references, base+index covers
    another 20%, so one needs [b+i<<s+DISP] only 5%-10% of the
    time. But every time you can use it, it saves executing another
    instruction (sometimes 2).
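
    A minimal C sketch of that AGEN primitive (illustrative only; the
    names are invented): the displacement is a decode-time constant, the
    index is scaled, and everything else is operand routing.

        #include <stdint.h>

        /* EA = Rbase + (Rindex << scale) + DISP. DISP is known at
           DECODE time; index scaling and displacement addition are
           the AGEN primitives. */
        static uint64_t agen(uint64_t rbase, uint64_t rindex,
                             unsigned scale, int64_t disp)
        {
            return rbase + (rindex << scale) + (uint64_t)disp;
        }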

    As I required 5 bits for the opcode to allow both loads and stores for several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    My LDs are content free (LDs don't care if they are loading
    integer, floating point, or SIMD data, ...).

    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those purposes.

    This is going to hurt register allocation.

    A standard RISC would not have an index register field, only a base
    register field, meaning array accesses would require multiple
    instructions.

    The 68000 only had base-index addressing with an 8-bit displacement;
    true base-index addressing with a normal displacement arrived in the
    68020, but the instructions using it took up 48 bits.

    I'll agree the 68000 architecture did have a serious mistake. It was
    CISC, so it didn't need to be RISC-like, but the special address
    registers should only have been used as base registers; the regular arithmetic registers should have been the ones used as index registers,
    since one has to do arithmetic to produce valid index values.

    The separate address registers would then have been useful, by allowing
    those of the eight (rather than 16 or 32) general registers that would
    have been used up holding static base register values to be freed up.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Wed Jun 11 17:05:13 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    What is the use case for having base and index register and a
    16-bit displacement?

    16 bits is (usually) large enough to address data relative to a stack
    or frame pointer. It is rarely needed to address members of a struct,
    whose offsets are usually much smaller.

    The use case for base + index exists, for things like

    for (i=0; i<n; i++)
    a[i] = b[i] + c[i]

    you only need four registers instead of six.

    If you want to have base + index + offset, it would probably be
    wise to restrict yourself to a smaller offset, or go big and
    allow a 32-bit offset.
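
    A C sketch of the register-count point (illustrative only): with
    base+index addressing, the loop above keeps the three array bases
    plus i live; without it, a compiler typically strength-reduces to
    three running pointers, which is where the extra registers go.

        /* Base+index form: a, b, c, and i suffice; each access is one
           load/store of the form [base + i<<3]. */
        void add_indexed(double *a, const double *b, const double *c,
                         long n)
        {
            for (long i = 0; i < n; i++)
                a[i] = b[i] + c[i];
        }

        /* Base-only form: strength-reduced to three running pointers,
           consuming extra registers for the bumped copies. */
        void add_bumped(double *a, const double *b, const double *c,
                        long n)
        {
            for (double *end = a + n; a < end; a++, b++, c++)
                *a = *b + *c;
        }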

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Wed Jun 11 16:51:29 2025
    BGB <cr88192@gmail.com> writes:
    For example, SPRs can be, by definition,
    not the same as GPRs. Say, if you have an SP or LR, almost by
    definition, you will not be using it as a GPR.

    So, if ZR/LR/SP/GP are "not GPR", this is fine.

    I assume you mean Zero register, link register, stack pointer, global
    pointer. On most register architectures (those with GPRs) all of them
    are addressed as GPRs in most instructions. Specifically:

    Zero register: The CISCs (S/360, PDP-11, VAX, IA-32, AMD64) don't have
    a zero register, but use immediate 0 instead. Most RISCs have a
    register (register 0 or 31) that is addressed like a GPR, but really
    is a special-purpose register: It reads as 0 and writing to it has no
    effect. Power has some instructions that treat register 0 as zero
    register and others that treat it as GPR.

    Link register: On some architectures there is a register that is a GPR
    as far as most instructions are concerned. But the call instruction
    with immediate (relative) target uses that register as implicit target
    for the return address. MIPS is an example of that. Power has LR as
    a special-purpose register.

    Stack pointer: That's just software-defined on many register
    architectures, i.e., one could change the ABI to use a different stack
    pointer, and the resulting code would have the same size and speed.
    An interesting case is RISC-V. In RV64G it's just software-defined,
    but the C (compressed) extension defines some instructions that
    provide smaller instructions for a specific assignment of SP to the
    GPRs; I expect that similar things happen for other compressed
    instruction set extensions.

    Global pointer: That's just software-defined on all register
    architectures I am aware of.

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to quadibloc on Wed Jun 11 17:33:35 2025
    quadibloc <quadibloc@gmail.com> writes:
    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base
    register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    The solution of RISC architectures has been to not have displacement
    and index registers at the same time (MIPS and its descendants do not
    have base+index addressing at all). The solution of CISC
    architectures has been to allow bigger instructions, and possibly
    different displacement sizes (e.g., 8 bits and 32 bits for IA-32 and
    AMD64).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Wed Jun 11 19:08:06 2025
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    BGB <cr88192@gmail.com> writes:
    For example, SPRs can be, by definition,
    not the same as GPRs. Say, if you have an SP or LR, almost by
    definition, you will not be using it as a GPR.

    So, if ZR/LR/SP/GP are "not GPR", this is fine.

    I assume you mean Zero register, link register, stack pointer, global pointer. On most register architectures (those with GPRs) all of them
    are addressed as GPRs in most instructions. Specifically:

    Zero register: The CISCs (S/360, PDP-11, VAX, IA-32, AMD64) don't have
    a zero register, but use immediate 0 instead. Most RISCs have a
    register (register 0 or 31) that is addressed like a GPR, but really
    is a special-purpose register: It reads as 0 and writing to it has no
    effect. Power has some instructions that treat register 0 as zero
    register and others that treat it as GPR.

    My 66000 is a RISC architecture that does NOT have a zero register.
    Most instructions have the ability to use the 5-bit register
    specifier as a 5-bit immediate, and for these instructions #0
    signifies zero in both integer and floating point senses. #1
    signifies 0x0000000000000001 or 0x3FF0000000000000, ... so that
    one can do FADD R7,R19,#7 as a single 32-bit instruction word,
    saving instructions and code space. {Brian gets credit for this}

    Over on the memory side:: Rbase = 0 implies IP is the Base register;
    Rindex = 0 implies no indexing (but still having access to DISP32
    and DISP64 constants).

    Over on the call/return side:: When safe stack is in use, RETaddr
    goes on the top of CSP and R0 is not modified, but when safe stack
    is not in use, R0 <= RETaddr. The RET instruction, then, does the
    right thing based on the status of the Safe-Stack-in-use flag.
    CSP (call stack pointer) is used to hold RETaddr and preserved
    registers in a way the called program can neither read nor write,
    adding safety against actual attacks and bad programming.

    And then there is the CALX instruction--which is a LDD IP,[address]--
    which transfers control through a table in memory to an entry point
    in the table. Good for external linkage and method calls. An
    interesting point, here, is that this is only for CALL/RET and not
    for branching--thus, it can be predicted better than with typical
    jump-predict tables, because you are not predicting at the time of
    the JMP but at the time of the LDD; so, you can use the low-order
    bits (LOBs) of the address to help with the prediction.
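
    In C terms, CALX behaves like an indirect call through a table of
    entry points; a rough sketch (illustrative, not the actual encoding):

        /* Transfer control through a table in memory: one load of the
           entry point from [table + method*8], then enter it. The load
           is what gets predicted, not a later JMP. */
        typedef void (*entry_t)(void);

        static void call_through_table(const entry_t *table, int method)
        {
            table[method]();
        }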

    Link register: On some architectures there is a register that is a GPR
    as far as most instructions are concerned. But the call instruction
    with immediate (relative) target uses that register as implicit target
    for the return address. MIPS is an example of that. Power has LR as
    a special-purpose register.

    You could describe my use of safe-stack as putting LR in a
    "more-better" place than a GPR.

    Stack pointer: That's just software-defined on many register
    architectures, i.e., one could change the ABI to use a different stack pointer, and the resulting code would have the same size and speed.
    An interesting case is RISC-V. In RV64G it's just software-defined,
    but the C (compressed) extension defines some instructions that
    provide smaller instructions for a specific assignment of SP to the
    GPRs; I expect that similar things happen for other compressed
    instruction set extensions.

    My 66000 did something similar:: The ENTER and EXIT instructions
    use SP == R31 (or CSP) implicitly; values needing preservation are
    placed in memory the callee can neither LD nor ST. Other than ENTER,
    EXIT, and RET, SP could be any register.

    Interesting point:: the compiled code is not sensitive to the
    setting of the safe-stack flag--only the thread control regs are.
    The only pieces of SW that need cognition of safe-stack are
    longjmp() and the stack-walk-back used by TRY-THROW-CATCH.
    Both use the EXIT instruction in a "special" way to peel back
    layers on the stack.

    Global pointer: That's just software-defined on all register
    architectures I am aware of.

    In My 66000, it is simply an address constant. There is no rationale
    for consuming a register to get at something one can access with a
    longer address constant.

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to quadibloc on Wed Jun 11 14:56:56 2025
    quadibloc wrote:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    Plus 3 bits for the load/store operand type & size,
    plus 2 or 3 bits for the index scaling (I use 3).
    It all won't fit into a 32-bit fixed length instruction.

    A separate LEA Load Effective Address instruction to calculate rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an alternative.
    Then rDest is used as a base in the LD or ST.

    One or two constant prefix instruction(s) I mentioned before
    (6 bit opcode, 26 bit constant) could extend the immediate value
    to imm26<<13+imm13 = sign extended 39 bits,
    or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).
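
    The pasting arithmetic in C (a sketch of the widths only; function
    names are invented):

        #include <stdint.h>

        /* Sign-extend the low 'bits' of x. */
        static int64_t sext(uint64_t x, unsigned bits)
        {
            return (int64_t)(x << (64 - bits)) >> (64 - bits);
        }

        /* One prefix: imm26<<13 + imm13, sign-extended to 39 bits. */
        static int64_t paste1(uint32_t imm26, uint32_t imm13)
        {
            return sext(((uint64_t)imm26 << 13) + imm13, 39);
        }

        /* Two prefixes: 65 bits, truncated to 64 by the arithmetic. */
        static int64_t paste2(uint32_t hi26, uint32_t mid26,
                              uint32_t imm13)
        {
            return (int64_t)(((uint64_t)hi26 << 39) +
                             ((uint64_t)mid26 << 13) + imm13);
        }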

    As I required 5 bits for the opcode to allow both loads and stores for several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,
    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    So 3 bits for the data type for loads and stores, which if you
    put that in the opcode field uses up almost all your opcodes.
    So you take the data types out of the disp16 field and now your
    offset range is 13 bits +/- 4kB.

    And a constant prefix instruction can extend the disp13 field
    to 26+13=39 or 26+26+13=65(64) bits.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Wed Jun 11 19:16:29 2025
    On Wed, 11 Jun 2025 17:34:54 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 16:49:06 +0000, MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 14:12:04 +0000, quadibloc wrote:

    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.

    This is going to hurt register allocation.

    Yes. It will. Unfortunately.

    Basically, as should be apparent by now, my overriding goal in defining
    the Concertina II architecture - and its predecessor as well - was to
    make it "just as good", or at least "just _about_ as good", as both the
    68020 and the IBM System/360.

    This meant that I had to be able to fit base plus index plus
    displacement into 32 bits, since the System/360 did that, and I had to
    have 16-bit displacements because the 68020, and indeed x86 and most microprocessors did that.

    There is enough evidence that a 12-bit positive displacement (/360
    model) is insufficient for modern applications that I was surprised
    RISC-V went in that direction. EMBench has many subroutines with more
    than 4K of stack variables that cause RISC-V to emit a LUI just to set
    the 12th or 13th bit and perform the access. SPARC had enough problems
    with 13 bits that anyone with their ear to the rail should have heard
    the consternation.
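
    The cost in C terms (an illustrative sketch of the usual hi/lo
    constant split, not from the post): once a frame offset stops fitting
    in 12 signed bits, the high part must be materialized separately
    (the LUI).

        #include <stdint.h>
        #include <assert.h>

        /* Split a 32-bit offset into a LUI-style hi20 and a signed
           12-bit lo part, rounding so lo lands in [-2048, 2047]. */
        static void split_hi_lo(int32_t off, int32_t *hi20, int32_t *lo12)
        {
            *hi20 = (off + 0x800) >> 12;
            *lo12 = off - (*hi20 << 12);
        }

        int main(void)
        {
            int32_t hi, lo;
            split_hi_lo(0x1010, &hi, &lo);  /* 4KB+16 of frame: needs LUI */
            assert(hi == 1 && lo == 0x10);
            split_hi_lo(0x7f0, &hi, &lo);   /* small frame: hi == 0 */
            assert(hi == 0 && lo == 0x7f0);
            return 0;
        }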

    And I had to have register-to-register operate instructions that fit
    into only 16 bits. Because the System/360 had them, and indeed so do
    many microprocessors.

    Otherwise, my ISA would be clearly and demonstrably inferior. Where I couldn't attain a full match, I tried to be at least "almost" as good.
    So either my 16-bit operate instructions have to come in pairs, and have
    a very restricted set of operations, or they require the overhead of a
    block header. I couldn't attain the goal of matching the S/360
    completely, but at least I stayed close.

    So while having 32 registers like a RISC, I ended up having some
    purposes for which I could only use a set of eight registers. Not great,
    but it was the tradeoff that was left to me given the choice I made.

    So here it is - an ISA that offers RISC-like simplicity of decoding, but
    an instruction set that approaches CISC in code compactness - and which offers a choice of RISC, CISC, or VLIW programming styles. Which may
    lead to VLIW speed and efficiency on suitable implementations.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Wed Jun 11 21:17:46 2025
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:

    quadibloc wrote:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base
    register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    Plus 3 bits for the load/store operand type & size,
    plus 2 or 3 bits for the index scaling (I use 3).
    It all won't fit into a 32-bit fixed length instruction.

    A separate LEA Load Effective Address instruction to calculate rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an
    alternative.

    For my part, LEA is the other form of LDD (since the signed/unsigned
    notation is unused, as there is no register distinction between a
    signed 64-bit LD and an unsigned 64-bit LD).

    Then rDest is used as a base in the LD or ST.

    One or two constant prefix instruction(s) I mentioned before
    (6 bit opcode, 26 bit constant) could extend the immediate value
    to imm26<<13+imm13 = sign extended 39 bits,
    or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).

    My Mantra is to never use instructions to paste constants together.

    As I required 5 bits for the opcode to allow both loads and stores for
    several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,

    There are 8 if you want to detect overflow differently between
    signed and unsigned 64-bit values, but 99.44% of programs don't care.
    Which is why one "cooperates" with signedness in LDs, ignores
    signedness in STs, and does exception detection only in calculation
    instructions.

    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    So 3 bits for the data type for loads and stores,

    3-bits for LDs, 2-bits for STs.

    which if you
    put that in the opcode field uses up almost all your opcodes.

    With a Major OpCode size of 6-bits, the LDs + STs with DISP16
    use 3/8ths of the OpCode space, a far cry from "almost all";
    but this is under the premise of a machine where GPRs and FPRs
    coexist in the same file.

    By using 1 Major OpCode to access another 6-bit encoding space
    (called XOP1), one then has another 6-bit encoding space where
    the typical LDs and STs consume 3/8ths, leaving room to encode
    indexing, scaling, locking behavior, and ST #value,[address],
    which then avoids constant-pasting instructions and waste of
    registers.

    I also snuck in CALX--which is simply a LDD IP,[address]--saving
    either the LDD or the JMP, depending on how you look at it.

    So you take the data types out of the disp16 field and now your
    offset range is 13 bits +/- 4kB.

    The S.E.L. machines that did this only supported signed
    partial-word LDs (saving a bit)--probably not the best choice in
    today's analysis and language uses.

    Secondarily, one had LDbyte and LDsized instruction variants.

    And a constant prefix instruction can extend the disp13 field
    to 26+13=39 or 26+26+13=65(64) bits.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Wed Jun 11 21:26:02 2025
    On Wed, 11 Jun 2025 18:14:34 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 17:33:35 +0000, Anton Ertl wrote:
    quadibloc <quadibloc@gmail.com> writes:

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the
    base register, and 16 bits for the displacement, then there would only
    be one bit left for the opcode.

    The solution of RISC architectures has been to not have displacement
    and index registers at the same time (MIPS and its descendants do not
    have base+index addressing at all). The solution of CISC
    architectures has been to allow bigger instructions, and possibly
    different displacement sizes (e.g., 8 bits and 32 bits for IA-32 and
    AMD64).

    And what I've chosen is...

    - to have an architecture which superficially resembles RISC,

    yes

    - but which offers all the capabilities of CISC

    I chose only "some" of the CISC characteristics

    - and which tries to approach achieving the same code density as such
    classic CISC machines as the System/360

    I am getting 1.1× VAX instruction counts compared to MIPS* getting
    1.5× VAX instruction counts. (*) R3000 and most other RISCs

    - and in addition which offers VLIW features as well

    Which I never saw any purpose for. VLIW ties one to multiples of a
    particular width. We now have access to machines which are 1-wide,
    3-wide, 4-wide, 6-wide, 8-wide, and now 10-wide. There seems to be
    no least common multiple or greatest common divisor.

    To look like RISC, and yet to have the code density of CISC is to
    attempt to achieve two goals which seem to be in profound conflict with
    each other. So it shouldn't be surprising that in order to do this, I've
    had to break a few rules and sacrifice some elegance.

    As I've striven to achieve what seemed impossible - even if some may say
    I'm tilting at windmills, as no one really cares that much about code
    density any more

    I care--getting rid of instructions that just paste constants together
    is 1/4 of the way from MIPS's 1.5× VAX to My 66000's 1.1× VAX.

    The x86 memory references are another 1/4,
    the CISC ENTER and EXIT instructions are another 1/4,
    leaving 3 other things for the last 1/4 ...


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jun 11 21:35:43 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    Holding exclusive access to two cache lines at once
    as an atomic unit would complicate the coherency protocol,
    particularly with respect to deadlock prevention, isn't that so?

    That's one reason multiword atomics are generally required
    to be naturally aligned; to avoid crossing a cache-line or page boundary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Wed Jun 11 18:00:33 2025
    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    It had the advantage of meaningfully reusing some address modes
    rather than having to add new opcode formats:
    PC & autoincrement => immediate value (opcode data type gives size)
    PC & autoincrement deferred => absolute address of data
    PC & B/W/L relative => PC relative address of data
    PC & B/W/L relative deferred => PC relative address of address of data

    On the con side, for all PC-relative addressing the offset is
    relative to the incremented PC after *each* operand specifier.
    So two PC-rel references to the same location within a single
    instruction will have different offsets.
    (Not a big deal but yet another thing one has to deal with in
    Decode and carry with you in any uOps.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Wed Jun 11 23:01:04 2025
    On Wed, 11 Jun 2025 22:00:33 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    It had the advantage of meaningfully reusing some address modes
    rather than having to add new opcode formats:
    PC & autoincrement => immediate value (opcode data type gives size)
    PC & autoincrement deferred => absolute address of data
    PC & B/W/L relative => PC relative address of data
    PC & B/W/L relative deferred => PC relative address of address of data

    I can argue that there are other ways to encode each of the above
    without using "address modes".

    On the con side, for all PC-relative addressing the offset is
    relative to the incremented PC after *each* operand specifier.
    So two PC-rel references to the same location within a single
    instruction will have different offsets.

    This is exactly what made wide VAXs so hard to pipeline. At least
    when I use IP as a means to access something in my ISA, the IP used
    is the IP of the first word of the instruction {rather than a
    running copy of IP}.

    (Not a big deal but yet another thing one has to deal with in
    Decode and carry with you in any uOps.)

    It becomes quadratically harder as instruction width increases,
    cubically if the accessed operands have variable widths.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jun 11 23:13:00 2025
    On Wed, 11 Jun 2025 21:35:43 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    There are optional ways around those problems::
    {
    a) if you take a TLB fault on either access--just assume the
       atomic event fails and restart after the TLB is repaired.
    b) if the two cache lines are not both writeable--just assume the
       atomic event fails and restart after both lines have
       arrived in a writeable condition.
    }
    which simplify the problem space.

    But it is (IS) the critical problem to be solved--how does
    one appear to hold onto {not a cache line, but its} write permission
    in the face of uncertain delay (of related memory references)
    and data access through the cache hierarchy, and all the coherence
    traffic that transpires under this delay.

    Holding exclusive access to two cache lines at once
    as an atomic unit would complicate the coherency protocol,
    particularly with respect to deadlock prevention, isn't that so?

    a) you don't try, because
    b) you can't under all conditions.
    What you do do is see if both are writeable, and if so
    proceed to perform both; otherwise fail both.

    My 66000 cache coherence protocol has a NAK but this feature can
    only be used under very tight HW restrictions, and the feature
    is under control of thread priority (so higher priority wins
    any conflicts). The tight restrictions would take 1000-2000
    words to adequately explain the subtle nuances that must be
    avoided.

    That's one reason multiword atomics are generally required
    to be naturally aligned; to avoid crossing a cache-line or page
    boundary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to quadibloc on Thu Jun 12 06:30:31 2025
    quadibloc <quadibloc@gmail.com> writes:
    To look like RISC, and yet to have the code density of CISC

    I have repeatedly posted measurements of code sizes of several
    programs on various architectures, most recently in <2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
    posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.


    Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
    packages:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Data from <2025Mar3.174417@mips.complang.tuwien.ac.at> for NetBSD
    packages.

    bash grep xz
    710838 42236 m68k
    748354 159304 40930 vax
    829077 176836 42840 amd64
    855400 164188 aarch64
    877284 186924 48032 sparc
    882847 187203 49866 i386
    898532 179844 earmv7hf
    962128 205776 54704 powerpc
    1004864 192256 53632 sparc64
    1025136 51160 mips64eb
    1147664 232688 63456 alpha
    1172692 mipsel

    Unfortunately, Debian does not have m68k or vax ports (the
    architectures with the smallest code size on NetBSD) in the regular distribution, and NetBSD does not have ARM T32 (armhf) or RV64GC
    (riscv64) (the architectures with the smallest code size on Debian) in
    its regular distribution, so one cannot compare them directly.
    However, taking the bash numbers and computing the relations of these
    four architectures to AMD64, we get:

    0.747 armhf/amd64 (debian)
    0.753 riscv64/amd64 (debian)
    0.857 m68k/amd64 (NetBSD)
    0.903 vax/amd64 (NetBSD)
    1.000 amd64

    So it seems that if you want code density, the way to go is to
    implement compressed RISC instruction sets. And compile with -Os
    (while I have posted cases where it is counterproductive, it tends
    to rein in loop unrolling and inlining).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Jun 12 08:38:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 22:00:33 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    Actually, it seems to me that for the first RISC generation, it made
    the least sense. You could not afford the transistors to do special
    handling of the PC.

    Nowadays, you can afford it, but the question still is whether it is cost-effective. Looking at recent architectures, none has PC
    addressable like a GPR, so no, it does not look to be cost-effective
    to have the PC addressable like a GPR. AMD64 and ARM A64 have
    PC-relative addressing modes, while RISC-V does not.

    On the con side, for all PC-relative addressing the offset is
    relative to the incremented PC after *each* operand specifier.
    So two PC-rel references to the same location within a single
    instruction will have different offsets.

    This is exactly what made wide VAXs so hard to pipeline.

    I don't think that would cause a real problem for decoder designers
    these days. It might cost some additional transistors, though. This
    design choice in VAX was very likely due to the implementation choices (sequential decoding of instruction parts) they had in mind, and these
    days one would probably make a different choice even if one decided to
    design an otherwise VAX-style instruction set. How did the NS32k
    designers choose in this respect?

    That being said, how does the design choice to include PC-relative
    addressing in AMD64 and ARM A64 come out in the long run? When AMD64
    and ARM A64 were designed, the data was still delivered in the
    microinstruction in most microarchitectures, and in that context,
    PC-relative addressing does not cost extra; you just fill in the data
    from the start.

    But Intel switched to having separate rename registers in Sandy Bridge
    (around the time when ARM A64 appeared), and others did the same, so
    now there is no space in the microinstructions for including the value
    of the PC when the instruction was decoded. I guess that this value
    is stored in a physical register on decoding, and each use of
    PC-relative addressing reduces the amount of available physical
    registers from the time when the register renamer processes the
    instruction until the time when the instruction is processed by the
    ROB; can someone confirm this, or is it done in some other way?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Thu Jun 12 07:05:22 2025
    BGB <cr88192@gmail.com> writes:
    On 6/11/2025 11:51 AM, Anton Ertl wrote:
    Link register: On some architectures there is a register that is a GPR
    as far as most instructions are concerned. But the call instruction
    with immediate (relative) target uses that register as implicit target
    for the return address. MIPS is an example of that. Power has LR as
    a special-purpose register.


    It is GPR like, but in terms of role, I don't consider it as such.

    In RV64, in theory, JAL and JALR could use any register. But, the C
    ABI effectively limits this choice to X1.

    So what? The architecture does not. Also, given that static linking
    and maybe even whole-program optimization are on the rise, the forces
    that coerce you to use the ABI are getting smaller.

    Implicitly, the 'C' extension and some other (less standardized)
    extensions also tend to hard-code the assumption of X1 being the
    link register.

    Yes, the C extension is designed for minimizing the code size of
    common code, and assumes that the code follows the ABI. But nothing
    in the architecture forces you to use compressed instructions.
    Whenever it is more advantageous to use an instruction in a way that
    cannot be compressed, you just do it. E.g., if you do whole-program optimization, and you have functions

    A (called from 10 sites)
    B calls A (called from 2 sites, address not taken for indirect calling)
    C calls B

    Then one can use JAL or JALR with the target X1 to call A, and these instructions may be compressible. And one can use JAL or JALR with a
    different target to call B. The benefit of that is that B does not
    need to save and restore the address it returns to, eliminating the
    code needed for that, and the time needed to perform this saving and
    restoring.

    However, the architecture specification says that x1 and x5 are considered to be link registers for branch prediction purposes, so
    ideally one will use x5 as target for calls to B, and for further
    levels the question is if it is good enough to just use plain
    indirect-branch prediction for those calls, or if one invests into the
    saving and restoring in order to use the return-address stack for
    branch prediction.

    Well, it is more a case here of, "try to put something other than the
    stack pointer in SP and see how far you get with that".

    There are multiple levels of systems (ISA design, OS, ...)

    Neither the ISA nor the system call interfaces I have looked at would
    cause any problems if I use x2 (sp) on RISC-V for something else.
    Maybe in case of a handled signal the OS would write to a place
    pointed to by x2, but that requires 1) installing a signal handler and
    2) not using sigaltstack() to tell the OS where to write in such a
    case. Of course the signal handler (if any) will see x2 set to point
    to the alternative stack, but all the regular user code can use x2 for
    whatever purpose seems appropriate. Some programming languages are
    designed to work without stack (e.g., early Fortran), some to use
    multiple stacks (e.g., Forth).

    I am not saying they don't look like GPRs in the ISA, but rather that
    they aren't really GPRs in terms of roles or behavior; they are
    essentially SPRs that just so happen to live in the GPR numbering space.

    As far as the architecture is concerned, they are GPRs. Yes, an ABI
    specifies a special role for some of them, but the ABI is software,
    not architecture. E.g., in early MIPS ABIs (in particular, on
    Ultrix), there was no GP, in later MIPS ABIs, there was.

    One can nicely see the role of the ABI in Table 25.1 of <http://staff.ustc.edu.cn/~comparch/reference/riscv-spec%EF%BC%880305%EF%BC%89.pdf>
    (page 137); it has a column called "Register" (with names like "x1")
    and a column called "ABI name" (with names like "ra"). The caption
    says: "Assembler mnemonics for RISC-V integer and floating-point
    registers, and their role in the first standard calling convention."

    So the architects expect the architecture to live longer than this
    ABI.

    It might even be due to things as simple as "well, the OS kernel and
    program launcher assume that the stack is in X2, and system calls assume
    the stack is in X2, ...". You have little real choice but to put the
    stack in X2, and if you try putting something else there, and a system
    call or interrupt happens, ..., there is little to say that things won't
    "go sideways", so, not really a GPR.

    I don't know what OS you have in mind, but in any OS where there is a
    boundary between user space and system space, the system does not use
    what may be the user-space stack pointer for storing its data, not on
    system calls, and certainly not on interrupts. And when I last looked
    at Linux system calls, the actual system call interface (not the C
    wrapper around it) passed parameters to system calls in registers, not
    on the user-level stack.

    The only case where a stack pointer register may come into play is
    when the OS calls a signal handler, but I have not looked at the
    machine-level interface there, so I cannot say for sure. In any case,
    that does not affect all the code that is not a signal handler.

    Global Pointer is assumed as such by the ABI, and OS may care about it,
    so not really a GPR.

    Why should the OS kernel care about the global pointer of a user-level
    program?

    I decided to classify X5/TP as a GPR as its usage is roughly up to
    the discretion of the ABI and C runtime library (at least in RISC-V, there
    are no hard-coded ISA level assumptions about TP, nor does it cross into
    the OS kernel's realm of concern).

    Table 25.1 (mentioned above) gives tp as ABI name for x4, and t0 as
    ABI name for x5.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Thu Jun 12 09:12:59 2025
    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:

    quadibloc wrote:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base
    register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    Plus 3 bits for the load/store operand type & size,
    plus 2 or 3 bits for the index scaling (I use 3).
    It all won't fit into a 32-bit fixed length instruction.

    A separate LEA Load Effective Address instruction to calculate
    rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an
    alternative.

    For my part, LEA is the other form of LDD (since the signed/unsigned
    notation is unused, as there is no register distinction between a
    signed 64-bit LD and an unsigned 64-bit LD).

    LEA doesn't need the 3 bits for data type/size.
    We can allocate them to index scaling which does need them.

    If fp128 is to be (someday) supported then the index scaling must be
    at least 0..4 (> 2 bits). A scaling of 0..7 or 3 bits would support
    single instruction array index calculations up to fp128 octonions.
    The index scaling selects the octonion array element and the
    displacement selects a coefficient in it.

    Not that I have a use for octonions myself,
    just thinking of the kids out there.

    Then rDest is used as a base in the LD or ST.

    One or two constant prefix instruction(s) I mentioned before
    (6 bit opcode, 26 bit constant) could extend the immediate value
    to imm26<<13+imm13 = sign extended 39 bits,
    or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).

    I've had a look at the H&P graph of displacement % usage and see
    that there is no significant difference between 12 and 13 bits.

    As a second cut at the design, I'd make the immediate 12 bits.
    So the immediate constants are either 12, 26+12=38 or 26+26+12=64 bits.
    And that leaves 4 bits for function codes or data types.

    My Mantra is to never use instructions to paste constants together.

    John's scenario chose fixed length 32-bit instructions
    so I'm just playing the cards dealt.

    This allows an operation with up to 64 bits of immediate to be
    defined in just 12 bytes of instruction space (same as My 66000).
    It is spread over 3 instruction slots, but those CONST instructions are
    defined as fused in Decode, so it's similar to a variable length ISA
    in that it requires no extra execute clocks.

    As I required 5 bits for the opcode to allow both loads and stores for
    several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,

    There are 8 if you want to detect overflow differently between
    signed and unsigned 64-bit values, but 99.44% of programs don't care.
    Which is why one "cooperates" with signedness in LDs, ignores
    signedness in STs, and does exception detection only in calculation
    instructions.

    I have various instructions to check integer down-cast ranges too
    and fault on overflow. For checked languages, most overflow range
    checks require one extra instruction before the ST to a smaller type.
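
    A C sketch of that pattern (illustrative; a checked ISA would fault
    rather than return a flag):

        #include <stdint.h>

        /* Checked down-cast before a narrower store: one range check,
           then the ST of the smaller type. */
        static int store_i8_checked(int8_t *dst, int64_t v)
        {
            if (v < INT8_MIN || v > INT8_MAX)
                return 0;           /* the overflow fault in HW */
            *dst = (int8_t)v;
            return 1;
        }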

    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    I want double wide instructions for atomic swap and compare-and-swap.
    Those are restricted to naturally aligned addresses and trap if not.

    To support these I also need to be able to load and store wide values.
    The load and store register pair instructions accept any address but if
    you want an atomic guarantee then the address must be naturally aligned.
    If not naturally aligned then LD or ST could use 2 separate memory accesses.
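
    A sketch of that aligned-pair guarantee using the GCC/Clang __atomic
    builtins on a 16-byte-aligned __int128 (assumes a target with a
    double-wide atomic, e.g. -mcx16 on x86-64; otherwise the compiler
    falls back to libatomic):

        #include <stdint.h>
        #include <stdalign.h>

        /* A register pair held as one naturally aligned 16-byte unit. */
        typedef struct { alignas(16) unsigned __int128 v; } pair128;

        static void store_pair(pair128 *p, uint64_t lo, uint64_t hi)
        {
            unsigned __int128 v = ((unsigned __int128)hi << 64) | lo;
            __atomic_store_n(&p->v, v, __ATOMIC_SEQ_CST);
        }

        static int cas_pair(pair128 *p, unsigned __int128 *expected,
                            unsigned __int128 desired)
        {
            return __atomic_compare_exchange_n(&p->v, expected, desired,
                                               0, __ATOMIC_SEQ_CST,
                                               __ATOMIC_SEQ_CST);
        }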

    So 3 bits for the data type for loads and stores,

    3-bits for LDs, 2-bits for STs.

    For integers, the register pair makes ST sizes 1, 2, 4, 8, 16.
    I potentially have 5 float types from 1 to 16 bytes.
    If I include float register pairs (for complex numbers) then
    it could be load and store of 10 float data types.
    So 3 bits for data type.

    which if you
    put that in the opcode field uses up almost all your opcodes.

    With a Major OpCode size of 6-bits, the LDs + STs with DISP16
    use 3/8ths of the OpCode space, a far cry from "almost all";
    but this is under the premise of a machine where GPRs and FPRs
    coexist in the same file.

    John said he had a 5-bit opcode.
    He also said he wants separate integer and float register files
    so that means separate LD, ST and FLD, FST.

    For the data types I listed above, but NOT including the float pairs,
    it would use opcodes for 8 LD, 5 ST, 5 FLD, 5 FST = 23 of 32 opcodes.
    If I include some float pairs for complex fp32, fp64 and fp128 then
    that uses up 29 of 32 opcodes. And that is just for loads and stores.

    So yes, "almost all".

    Either it:
    (a) moves the type/size bits somewhere else (the offset field), as I did,
    (b) or drops support for some sizes and requires an extra sign or zero
    extend instruction to handle the others, as Alpha did.

    By using 1 Major OpCode to access another 6-bit encoding space
    (called XOP1), one then has another 6-bit encoding space where
    the typical LDs and STs consume 3/8ths, leaving room to encode
    indexing, scaling, locking behavior, and ST #value,[address],
    which then avoids constant-pasting instructions and waste of
    registers.

    But you have variable length instructions. I would too.

    John's premise assumes they are fixed 32-bits.
    I'm running *that* scenario forward to see if we can get a better
    result than the RISC-V folks got, where they need 6 instructions
    and 24 bytes to do a LD or ST with 64 bit offset.

    The CONST prefix instruction approach shows it can be done in
    3 instructions of 12 bytes which are fused in Decode so require
    no extra working register and no execute clocks for pasting.
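    (For reference, one plausible RV64 shape of that 6-instruction,
    24-byte sequence; the field splits are illustrative, and the
    sign-extension fix-ups a real assembler performs are glossed over:)

        lui   t0, 0x12345     # offset bits 63..44
        addi  t0, t0, 0x678   # offset bits 43..32
        slli  t0, t0, 32      # move the upper half into place
        lui   t1, 0x9abcd     # offset bits 31..12
        add   t0, t0, t1      # paste the halves together
        ld    a0, 0x2f0(t0)   # low 12 bits ride in the load itself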

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Jun 12 13:41:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 21:35:43 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    There are optional ways around those problems::
    {
    a) if you take a TLB fault on either access--just assume the
    . atomic event fails and restart after TLB is repaired.
    b) if both cache lines are not writeable--just assume the
    . atomic event fails and restart after both lines have
    . arrived in a writeable condition.
    }
    which simplify the problem space.

    Although it does require some other fallback to ensure
    fairness and prevent starvation. A big hammer like
    the x86 system bus lock, perhaps, if the atomic can't
    complete in some period of time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Thu Jun 12 09:38:31 2025
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).
    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    Holding exclusive access to two cache lines at once
    as an atomic unit would complicate the coherency protocol,
    particularly with respect to deadlock prevention, would it not?

    That's one reason multiword atomics are generally required
    to be naturally aligned; to avoid crossing a cache-line or page boundary.

    I don't need it to be atomic for any alignment.
    The spec for LD and ST register pair would say that IF the address is
    16-byte aligned THEN the operation is guaranteed to be done atomically.
    If the address is not aligned it may use two separate operations.
    This is the same guarantee as 2, 4 and 8 byte LD or ST.
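    (A C reference model of that contract; memcpy stands in for the bus
    transactions, so this documents only the access pattern, not real
    atomicity:)

        #include <stdint.h>
        #include <string.h>

        /* LD-pair model: 16-byte aligned means one 16-byte access
           (the guaranteed-atomic case); otherwise two independent
           8-byte accesses */
        void ld_pair(const void *p, uint64_t out[2])
        {
            if (((uintptr_t)p & 15) == 0) {
                memcpy(out, p, 16);
            } else {
                memcpy(&out[0], p, 8);
                memcpy(&out[1], (const char *)p + 8, 8);
            }
        }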

    As to whether the register pair is specified as one field with
    an implied increment or two separate fields, I have cases for both.
    Once I started adding double-wide operate instructions I found
    usages where assuming the register pairs were contiguous
    (eg only even numbered registers) was too constraining.
    It forces many extraneous MOV's to create the even numbered pairs.

    On the other hand, having register pairs specified by two fields
    quickly winds up with instructions that have 5 or 6 register fields
    (2 dest and 2+1 source, or 2 dest and 2+2 source).
    But this only affects Decode as the uOp formats require 6 fields.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Jun 12 13:43:59 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    quadibloc <quadibloc@gmail.com> writes:
    To look like RISC, and yet to have the code density of CISC

    I have repeatedly posted measurements of code sizes of several
    programs on various architectures, most recently in
    <2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
    posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.


    Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
    packages:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Seems to me that the text size is not the interesting
    metric here - rather the typical working set size is
    far more important.

    Take bash, for instance; in typical operation I
    would not expect it to use more than a small fraction of
    the total text.

    It may be that text size isn't a particularly
    good metric for judging instruction set effectiveness.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Thu Jun 12 08:44:14 2025
    On 6/12/2025 8:00 AM, quadibloc wrote:
    On Wed, 11 Jun 2025 17:05:13 +0000, Thomas Koenig wrote:

    What is the use case for having base and index register and a
    16-bit displacement?

    The IBM System/360 had a base and index register and a 12-bit
    displacement.

    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).


    Most microprocessors have a base register and a 16-bit displacement.

    So this lets my architecture be a superset of both of them.

    But your different ISA format, etc. means that it is not a true
    superset. That is, any S/360 program would have to be recompiled to run
    on your architecture. So it is only for some sort of "conceptual", but
    not actual compatibility that is only for assembler language programmers
    (and compiler writers).


    Great
    selling point,

    I suspect that the number of S/360 assembler programs being written
    these days is asymptotic to zero, so not so much.

    and thus I didn't think too hard about whether it is
    "needed", because an architecture that instead tries to only provide genuinely necessary capabilities...

    now forces programmers, used to other systems that were more generous in
    one way or another, to change their habits!

    That would presumably spoil sales or ruin the popularity of the ISA.

    I don't think so. And consider how much the other problems that you
    have been struggling with would become so much simpler if you eliminated
    the base registers and used those for bits in the instructions for other things.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jun 12 18:44:20 2025
    On Thu, 12 Jun 2025 8:38:06 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 22:00:33 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 16:51:29 +0000, Anton Ertl wrote:

    Program Counter: Some instruction sets (ARM A32, IIRC PDP-11 and VAX)
    have the PC addressed like a GPR, although it clearly is a
    special-purpose register. Most RISCs don't have this, and don't even
    have a PC-relative addressing mode or somesuch. Instead, they use
    ABIs where global pointers play a big role.

    I consider IP as a GPR a mistake--I think the PDP-11 and VAX
    people figured this out as Alpha did not repeat this mistake.
    Maybe in a 16-bit machine it makes some sense but once you have
    8-wide fetch-decode-execute it no longer does.

    Actually, it seems to me that for the first RISC generation, it made
    the least sense. You could not afford the transistors to do special
    handling of the PC.

    Instead, you built the increment loop around the IP itself. Basically,
    you have a 4-input multiplexer: 1 leg feeds the current IP back to the
    adder, which is then flopped into the IP register; the other 3 inputs are
    for branch displacement, interrupt vector, and JUMP register input.
    It is basically a degenerate ALU+forwarding path.
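    (As a C sketch, with illustrative names:)

        #include <stdint.h>

        enum ip_sel { SEQ, BRANCH, INTERRUPT, JUMP };

        /* the 4-input next-IP multiplexer: one leg is the sequential
           increment feedback, the other three are branch displacement,
           interrupt vector, and jump-register input */
        uint64_t next_ip(enum ip_sel sel, uint64_t ip, unsigned len,
                         int64_t disp, uint64_t vec, uint64_t jreg)
        {
            switch (sel) {
            case SEQ:       return ip + len;
            case BRANCH:    return ip + (uint64_t)disp;
            case INTERRUPT: return vec;
            default:        return jreg;
            }
        }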

    Nowadays, you can afford it, but the question still is whether it is cost-effective. Looking at recent architectures, none has PC
    addressable like a GPR, so no, it does not look to be cost-effective
    to have the PC addressable like a GPR. AMD64 and ARM A64 have
    PC-relative addressing modes, while RISC-V does not.

    Consider that in a 8-wide machine, IP gets added to 8 times per cycle,
    whereas no GPR has a property anything like that.

    On the con side, for all PC-relative addressing the offset is
    relative to the incremented PC after *each* operand specifier.
    So two PC-rel references to the same location within a single
    instruction will have different offsets.

    This is exactly what made wide VAXs so hard to pipeline.

    I don't think that would cause a real problem for decoder designers
    these days. It might cost some additional transistors, though.

    I disagree; the way VLE is implemented in My 66000 allows instruction
    boundary determination to be tree-ified. The way VAX (and PDP-11) did
    it does not allow tree-ification. My 66000 is quadratic whereas VAX
    is higher than cubic when you consider the large operand instructions.
    If you don't do the wide operand instructions, VAX is only a little
    harder than cubic.

    This
    design choice in VAX was very likely due to the implementation choices (sequential decoding of instruction parts) they had in mind, and these
    days one would probably make a different choice even if one decided to
    design an otherwise VAX-style instruction set. How did the NS32k
    designers choose in this respect?

    That being said, how does the design choice to include PC-relative
    addressing in AMD64 and ARM A64 come out in the long run? When AMD64
    and ARM A64 was designed, the data was still delivered in the microinstruction in most microarchitectures, and in that context,
    PC-relative addressing does not cost extra; you just fill in the data
    from the start.

    My 66000 also has this property, but also the property that any IP
    needed as an operand to any instruction is the virtual address of
    the instruction itself (not incremented); and is thus easy to synthesize
    in the DECODE pipeline.

    But Intel switched to having separate rename registers in Sandy Bridge
    (around the time when ARM A64 appeared), and others did the same, so
    now there is no space in the microinstructions for including the value
    of the PC when the instruction was decoded.

    K9 was going to unify x86-64, x87, and MMX/SSE into a single register
    file, too. These are decisions based on how the microarchitecture takes shape.
    In K9's case, the unified file was 1/2 the size of the 3 separate files.

    I guess that this value
    is stored in a physical register on decoding, and each use of
    PC-relative addressing reduces the amount of available physical
    registers from the time when the register renamer processes the
    instruction until the time when the instruction is processed by the
    ROB; can someone confirm this, or is it done in some other way?

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 12 18:55:49 2025
    On Thu, 12 Jun 2025 13:12:59 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:

    quadibloc wrote:
    On Wed, 11 Jun 2025 5:56:33 +0000, Thomas Koenig wrote:

    Having different classes of base and index registers is very
    un-RISCy, and not generally a good idea. General-purpose registers
    are one of the great things that the /360 got right, as the VAX
    later did, and the 68000 didn't.

    This is true.

    However, if the memory reference instructions had 5 bits for the
    destination register, 5 bits for the index register, 5 bits for the base
    register, and 16 bits for the displacement, then there would only be one
    bit left for the opcode.

    Plus 3 bits for the load/store operand type & size,
    plus 2 or 3 bits for the index scaling (I use 3).
    It all won't fit into a 32-bit fixed length instruction.

    A separate LEA Load Effective Address instruction to calculate
    rDest=[rBase+rIndex<<scale+offset13] indexed addresses is an
    alternative.

    For my part, LEA is the other form of LDD (since the signed/unsigned
    notation is unused, as there is no distinction in the register between
    a signed 64-bit LD and an unsigned 64-bit LD).

    LEA doesn't need the 3 bits for data type/size.
    We can allocate them to index scaling which does need them.

    If fp128 is to be (someday) supported then the index scaling must be
    at least 0..4 (> 2 bits). A scaling of 0..7 or 3 bits would support
    single instruction array index calculations up to fp128 octonions.
    The index scaling selects the octonion array element and the
    displacement selects a coefficient in it.

    Not that I have a use for octonions myself,
    just thinking of the kids out there.
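    (The address arithmetic, as a C sketch; an fp128 octonion is
    8 coefficients of 16 bytes, i.e. a 128-byte element:)

        #include <stdint.h>

        /* rBase + rIndex<<scale + displacement: scale 7 selects the
           octonion array element, the displacement selects the
           coefficient within it */
        uintptr_t coeff_addr(uintptr_t base, uintptr_t i, unsigned coeff)
        {
            return base + (i << 7) + (uintptr_t)coeff * 16;
        }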

    An architecture is defined as much by what gets left out as by what
    gets left in. You have to draw the line somewhere.

    Then rDest is used as a base in the LD or ST.

    One or two constant prefix instruction(s) I mentioned before
    (6 bit opcode, 26 bit constant) could extend the immediate value
    to imm26<<13+imm13 = sign extended 39 bits,
    or imm26<<39+imm26<<13+imm13 = 65 bits (truncated to 64).

    I've had a look at the H&P graph of displacement % usage and see
    that there is no significant difference between 12 and 13 bits.

    As a second cut at the design, I'd make the immediate 12 bits.
    So the immediate constants are either 12, 26+12=38 or 26+26+12=64 bits.
    And that leaves 4 bits for function codes or data types.
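    (The pasting rule for the two-prefix case, written out as a C sketch;
    the function and parameter names are illustrative, and sign extension
    is taken from the topmost field present:)

        #include <stdint.h>

        /* 0, 1 or 2 CONST prefixes (imm26 each) over a 12-bit
           immediate: 12, 26+12 = 38, or 26+26+12 = 64 bits */
        int64_t fuse_imm(int32_t hi26, uint32_t mid26, uint32_t lo12)
        {
            return (int64_t)(((uint64_t)(int64_t)hi26 << 38)
                           | ((uint64_t)(mid26 & 0x3ffffffu) << 12)
                           |  (lo12 & 0xfffu));
        }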

    My Mantra is to never use instructions to paste constants together.

    John's scenario chose fixed length 32-bit instructions
    so I'm just playing the cards dealt.

    I suggest a new deck is in order.

    This allows an operation with up to 64 bits of immediate to be
    defined in just 12 bytes of instruction space (same as My 66000).
    It is spread over 3 instruction slots, but those CONST instructions are
    defined as fused in Decode, so it's similar to a variable-length ISA
    in that it requires no extra execute clocks.

    For that statement to be always true, you would have to have the
    final 12-bit constant on all FP calculation instructions.

    And I do not believe that you have addressed the universal placement
    idea of My 66000::

    FDIV R7,#3.141592653589216,R19

    at least for the non-commutative calculations.

    As I required 5 bits for the opcode to allow both loads and stores for
    several sizes each of integer and floating-point operands, I had to save
    bits somewhere.

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,

    There are 8 if you want to detect overflow differently between
    signed and unsigned 64-bit values, but 99.44% of programs don't care.
    Which is why one "cooperates" with signedness in LDs, ignores
    signedness in STs, and does exception detection only in calculation
    instructions.

    I have various instructions to check integer down-cast ranges too,
    and fault on overflow. For checked languages, most overflow range
    checks require one extra instruction before the ST to a smaller type.

    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Store a pair of registers into two different memory locations
    atomically is more powerful.

    I want double wide instructions for atomic swap and compare-and-swap.
    Those are restricted to naturally aligned addresses and trap if not.

    My point was:: 2 address atomics are more powerful than 2-wide single
    address atomics.

    To support these I also need to be able to load and store wide values.
    The load and store register pair instructions accept any address but if
    you want an atomic guarantee then the address must be naturally aligned.
    If not naturally aligned then LD or ST could use 2 separate memory
    accesses.

    So 3 bits for the data type for loads and stores,

    3-bits for LDs, 2-bits for STs.

    For integers, the register pair makes ST sizes 1, 2, 4, 8, 16.
    I potentially have 5 float types from 1 to 16 bytes.
    If I include float register pairs (for complex numbers) then
    it could be load and store of 10 float data types.
    So 3 bits for data type.

    Go ahead and shoot yourself in the foot.

    which if you
    put that in the opcode field uses up almost all your opcodes.

    With a Major OpCode size of 6-bits, the LDs + STs with DISP16
    uses 3/8ths of the OpCode space, a far cry from "almost all";
    but this is under the guise of a machine where GPRs and FPRs
    coexist in the same file.

    John said he had a 5-bit opcode.

    Which makes 3/8ths into 3/4ths, or from quite reasonable to completely unreasonable.

    He also said he wants separate integer and float register files
    so that means separate LD, ST and FLD, FST.

    I have found this to be a burden, not an enhancement.

    For the data types I listed above, but NOT including the float pairs,
    it would use opcodes for 8 LD, 5 ST, 5 FLD, 5 FST = 23 of 32 opcodes.
    If I include some float pairs for complex fp32, fp64 and fp128 then
    that uses up 29 of 32 opcodes. And that is just for loads and stores.

    So yes, "almost all".

    Either it:
    (a) moves the type/size bits somewhere else (the offset field), as I
    did,
    (b) or drops support for some sizes and requires an extra sign or zero
    extend instruction to handle the others, as Alpha did.

    By using 1 Major OpCode to access another 6-bit encoding space
    (called XOP1) one then has another 6-bit encoding space where
    the typical LDs and STs consume 3/8ths leaving room to encode
    indexing, scaling, locking behavior, and ST #value,[address]
    which then avoids constant pasting instructions and waste of
    registers.

    But you have variable length instructions. I would too.

    Everything necessary for decoding, determining operands, and routing
    operands to a function unit is contained in the first word of the
    VLE. Only constants follow this first word.

    John's premise assumes they are fixed 32-bits.
    I'm running *that* scenario forward to see if we can get a better
    result than the RISC-V folks got, where they need 6 instructions
    and 24 bytes to do a LD or ST with a 64-bit offset.

    Whereas My 66000 needs only 3 words and only 1 instruction.

    The CONST prefix instruction approach shows it can be done in
    3 instructions of 12 bytes which are fused in Decode so require
    no extra working register and no execute clocks for pasting.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 12 19:01:52 2025
    On Wed, 11 Jun 2025 19:37:03 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:

    The problem is that there are 7 integer data types,
    signed and unsigned (zero extended) 1, 2, 4 and 8 bytes,
    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).

    Even I have refused to contend with all of this, at least for my basic
    32-bit instruction set. Some exotic types that I do intend to support
    will just have to make do with 48-bit or longer instructions instead.

    But signed and unsigned integers aren't _quite_ the same as different
    types for load and store. I may have separate integer and floating
    registers, but I don't have separate signed and unsigned registers.

    But you DO HAVE signed and unsigned versions of LD {B,H,W} don't you ??

    Instead, I've followed the System/360. When it comes to load and store,
    for integers I have two additional operations - unsigned load and
    insert. But only for integers shorter than the register.

    What code is produced from::

    uint32_t function( uint32_t u )
    {
        int32_t i[99];
        return i[u];
    }

    a) are you going to signed-word-load the i array and then
    zero-extend at the word boundary, or
    b) do you propagate the unsignedness into the load of the i array so
    you can return the value directly
    ??
    {{Notice I am using 32-bit data in a 64-bit machine}}

    Load sign-extends. Unsigned Load zero-extends. Insert leaves the bits of the register above the loaded field untouched.
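    (Modeled in C for a 16-bit operand in a 64-bit register:)

        #include <stdint.h>

        int64_t  load16 (const int16_t  *p) { return *p; } /* sign extend */
        uint64_t uload16(const uint16_t *p) { return *p; } /* zero extend */

        /* Insert: replace only the low 16 bits; upper bits untouched */
        uint64_t insert16(uint64_t reg, const uint16_t *p)
        {
            return (reg & ~(uint64_t)0xffff) | *p;
        }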

    Since arithmetic is two's complement, there is only one add instruction,
    and there is only one store instruction, for each length. If we were
    really dealing with different types, we would need additional
    instructions of those kinds as well.

    For floats, I deal with fp32, fp48, fp64, and fp128 only as the primary floating-point types.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 12 19:19:18 2025
    On Thu, 12 Jun 2025 13:38:31 +0000, EricP wrote:

    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 11 Jun 2025 18:56:56 +0000, EricP wrote:


    and potentially 5 float, fp8, fp16, fp32, fp64, fp128.
    There might also be special (non-ieee) float formats for AI support.
    Plus one might also want some register pair operations
    (eg load a complex fp32 value into a pair of fp registers,
    store a pair of integer registers as a single (atomic) int128).
    Store a pair of registers into two different memory locations
    atomically is more powerful.

    And more costly, given the potential need for two
    TLB lookups (which could access or dirty fault (restartable))
    and the potential cache coherency latency.

    Holding exclusive access to two cache lines at once
    as an atomic unit would complicate the coherency protocol,
    particularly with respect to deadlock prevention, would it not?

    That's one reason multiword atomics are generally required
    to be naturally aligned; to avoid crossing a cache-line or page
    boundary.

    I don't need it to be atomic for any alignment.
    The spec for LD and ST register pair would say that IF the address is
    16-byte aligned THEN the operation is guaranteed to be done atomically.

    If the container does not cross a cache line boundary it can be
    performed atomically, and otherwise it cannot.

    If the address is not aligned it may use two separate operations.
    This is the same guarantee as 2, 4 and 8 byte LD or ST.

    As to whether the register pair is specified as one field with
    an implied increment or two separate fields, I have cases for both.
    Once I started adding double-wide operate instructions I found
    usages where assuming the register pairs were contiguous
    (eg only even numbered registers) was too constraining.

    Compiler people H A T E pairing {LoB = 0 and 1} and sharing
    {Rsecond = Rfirst+1}, they want to be able to allocate any
    value into any register without such constraints. After all
    register allocation is already NP, pairing and sharing moves
    the needle to NP-hard.

    It forces many extraneous MOV's to create the even numbered pairs.

    In my Samsung GPU I invented a DBLE instruction. Its only job was to
    supply register operands to another instruction which would then be
    performed double wide. This gets around all the pairing and sharing
    problems.

    On the other hand, having register pairs specified by two fields
    quickly winds up with instructions that have 5 or 6 register fields
    (2 dest and 2+1 source, or 2 dest and 2+2 source).

    DBLE simply supplies the extra register specification fields;
    obviating the problem in expressing the unit-of-work.

    But this only affects Decode as the uOp formats require 6 fields.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Thu Jun 12 19:13:10 2025
    On Wed, 11 Jun 2025 19:47:42 +0000, BGB wrote:

    On 6/11/2025 11:37 AM, MitchAlsup1 wrote:
    On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:
    ------------------

    LR: Functionally, in most ways the same as a GPR, but is assigned a
    special role and is assumed to have that role. Pretty much no one uses
    it as a base register though, with the partial exception of potential
    JALR wonk.

    One can use JALR to call special subroutines that store multiple
    registers on the stack (or restore them later), wrapping prologue and
    epilogue into little subroutine calls that use a separate LR and thus
    have lower overhead than a full-blown call. Other than this use and
    some PDP-11-style co-routines, the explicit specification of LR is
    completely unnecessary.

    JALR X0, X1, 16 //not technically disallowed...

    If one uses the 'C' extension, assumptions about LR and SP are pretty
    solidly baked in to the ISA design.


    ZR: Always reads as 0, assignments are ignored; this behavior is very un-GPR-like.

    GP: Similar situation to LR, as it mostly looks like a GPR.
    In my CPU core and JX2VM, the high bits of GP were aliased to FPSR, so saving/restoring GP will also implicitly save/restore the dynamic
    rounding mode and similar (as opposed to proper RISC-V which has this
    stuff in a CSR).

    With universal constants, you get this register back.



    Though, this isn't practically too much different from using the HOB's
    of captured LR values to hold the CPU ISA mode and similar (which my
    newer X3VM retains, though I am still on the fence about the "put FPSR
    bits into HOBs of GP" thing).

    Does mean that either dynamic rounding mode is lost every time a GP
    reload is done (though, only for the callee), or that setting the
    rounding mode also needs to update the corresponding PBO GP pointer
    (which would effectively make it semi-global but tied to each PE image).

    The traditional assumption though was that dynamic rounding mode is
    fully global, and I had been trying to make it dynamically scoped.

    The modern interpretation is that the dynamic rounding mode can be set
    prior to any FP instruction. So, you better be able to set it rapidly
    and without pipeline drain, and you need to mark the downstream FP
    instructions as dependent on this.

    So, it may be that having FPSR as its own thing, and then explicitly
    saving/restoring FPSR in functions that modify the rounding mode, is
    a better option.

    RM is separate from FPSR in My 66000, and uniquely accessible.
    -----------------------
    Though, OTOH, Quake has stuff like:
    typedef float vec3_t[3];
    vec3_t v0, v1, v2;
    ...
    VectorCopy(v0, v1);
    Where VectorCopy is a macro that expands it out to something like, IIRC,
    do { v1[0]=v0[0]; v1[1]=v0[1]; v1[2]=v0[2]; } while(0);

    Where BGBCC will naively load each value, widen it to double, narrow it
    back to float, and store the result.

    Sounds like you should be working on the compiler instead of microarchitectures.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 12 19:24:36 2025
    On Thu, 12 Jun 2025 15:38:30 +0000, quadibloc wrote:

    I thought I saw a post in this thread which asked why I included VLIW capabilities in my ISA.
    Perhaps that post was deleted, or I saw it in another thread and misremembered.
    However, I thought it was worth a reply, in case anyone had forgotten
    what VLIW was "good for".

    Today's microprocessors achieve considerably improved performance
    through the use of Out-of-Order Execution. Compare the 486, which
    doesn't have it, to the Pentium II, which does have it. Intel's Atom
    processors originally did not have OoO in order to be small and
    inexpensive, but their low performance, plus smaller transistors making
    more complex chips more easily possible, led to even the Atom going OoO.

    OoO comes with a cost, though. It increases transistor costs
    considerably. Also, it comes with vulnerabilities like Spectre and
    Meltdown.

    VLIW, in the sense of the Itanium or the TMS 320C6000, offers the
    promise of achieving OoO level performance without the costs of OoO.

    Pick a VLIW that was as successful as x86 or ARM in the marketplace.

    This is because it lets the pipeline achieve high efficiency by directly indicating within the code itself when succeeding instructions may be executed in parallel, without requiring the computer to make the effort
    of determining when this is possible.

    That is the theory. But theory works better in theory than in practice.
    Itanic made a run at it, but ultimately failed, as OoO was found to be
    the better tool. Itanic held some leads for a while, while it had 2×
    the number of pins on the memory side and 2× the function units on the
    calculation side. When it had an equal number of pins, it was never
    ahead.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Thu Jun 12 19:55:06 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    Actually, it seems to me that for the first RISC generation, it made
    the least sense. You could not afford the transistors to do special
    handling of the PC.

    Nowadays, you can afford it, but the question still is whether it is cost-effective. Looking at recent architectures, none has PC
    addressable like a GPR, so no, it does not look to be cost-effective
    to have the PC addressable like a GPR. AMD64 and ARM A64 have
    PC-relative addressing modes, while RISC-V does not.

    Power has added this as an extended instruction with the v3.1
    version of their ISA (Power 10); you can now do loads and stores
    relative to the PC with a 34-bit signed offset.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Thu Jun 12 20:50:51 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    If fp128 is to be (someday) supported then the index scaling must be
    at least 0..4 (> 2 bits). A scaling of 0..7 or 3 bits would support
    single instruction array index calculations up to fp128 octonions.
    The index scaling selects the octonion array element and the
    displacement selects a coefficient in it.

    Index scaling is very nice to have when you add, let's say, the
    elements of a real array to the real part of a complex array -
    you only need one register for the index variable.

    For just doing

    for (i=0; i<n; i++)
        a[i] = b[i] + c[i];

    where a, b and c are all of the same type, you can use
    a non-scaled single index register and increment it by
    the size of the type.
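    (E.g., after strength reduction the loop needs one byte-offset
    induction variable; a sketch assuming double elements:)

        /* one unscaled index, bumped by the element size */
        void add_arrays(double *a, const double *b, const double *c,
                        long n)
        {
            long end = n * (long)sizeof(double);
            for (long off = 0; off != end; off += sizeof(double))
                *(double *)((char *)a + off) =
                    *(const double *)((const char *)b + off) +
                    *(const double *)((const char *)c + off);
        }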

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Jun 13 00:00:57 2025
    On Thu, 12 Jun 2025 21:30:39 +0000, BGB wrote:

    On 6/12/2025 2:13 PM, MitchAlsup1 wrote:
    ------------------------------

    GP: Similar situation to LR, as it mostly looks like a GPR.
    In my CPU core and JX2VM, the high bits of GP were aliased to FPSR, so
    saving/restoring GP will also implicitly save/restore the dynamic
    rounding mode and similar (as opposed to proper RISC-V which has this
    stuff in a CSR).

    With universal constants, you get this register back.


    Well, if using an ABI that either allows absolute addressing or PC-rel
    access to globals.

    It is the ISA that directly supports access to globals.


    The ABI designs I am using in BGBCC and TestKern use a global pointer
    for accessing globals, and allocate the storage for ".data"/".bss"
    separately from ".text". In this ABI design, the pointer is unavoidable.

    I want a system where .data and .bss can be > 1TB away from each other,
    so that .data grows when ld.so loads another dynamic library, and .bss
    grows for the same reasons.

    Does allow multiple process instances in a single address space with non-duplicated ".text" though (and is more friendly towards NOMMU
    operation).




    Though, this isn't practically too much different from using the HOB's
    of captured LR values to hold the CPU ISA mode and similar (which my
    newer X3VM retains, though I am still on the fence about the "put FPSR
    bits into HOBs of GP" thing).

    Does mean that either dynamic rounding mode is lost every time a GP
    reload is done (though, only for the callee), or that setting the
    rounding mode also needs to update the corresponding PBO GP pointer
    (which would effectively make it semi-global but tied to each PE image).
    The traditional assumption though was that dynamic rounding mode is
    fully global, and I had been trying to make it dynamically scoped.

    The modern interpretation is that the dynamic rounding mode can be set
    prior to any FP instruction. So, you better be able to set it rapidly
    and without pipeline drain, and you need to mark the downstream FP
    instructions as dependent on this.

    Errm, there is likely to be a delay here, otherwise one will get a stale rounding mode.

    RM is "just 3-bits" that get read from control register and piped
    through instruction queue to function unit. Think of the problem
    one would have if a hyperthreaded core had to stutter step through
    changing RM ...


    So, setting the rounding mode might be something like:
    MOV .L0, R14
    MOVTT GP, 0x8001, GP //Set to rounding mode 1, clear flag bits
    JMP R14 //use branch to flush pipeline
    .L0: //updated FPSR now ready
    FADDG R11, R12, R10 //FADD, dynamic mode

    Setting RM to a constant (known) value::

    HRW rd,RM,#imm3 // rd gets old value

    Or, use an encoding with an explicit (static) rounding mode:
    FADD R11, R12, 1, R10


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Thu Jun 12 20:36:07 2025
    On 6/12/2025 8:09 PM, quadibloc wrote:
    On Thu, 12 Jun 2025 15:44:14 +0000, Stephen Fuld wrote:

    On 6/12/2025 8:00 AM, quadibloc wrote:

    The IBM System/360 had a base and index register and a 12-bit
    displacement.

    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).

    I hadn't thought about it that way.

    It does make sense that on a timesharing system, virtual memory meant
    that different users would not have to share the same memory space, so programs wouldn't have to be relocatable.

    But if you drop base registers for that reason, suddenly you are forced
    to always use virtual memory.

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    snip


    Of course, then why did the 68020 support it, I could ask.

    Someone more familiar with the 68020 would have to answer that.

    But in any case, the answer to Thomas's original question is that there
    is no use case for it now, and the cost in instruction bits is too large
    to consider using them.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Fri Jun 13 06:03:02 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory-to-memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    You would only have to store it as an offset to a base pointer,
    so basically a "fat pointer" containing both base and index
    register. Of course, nobody did that.
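    (Purely hypothetically, such a fat pointer might have looked like:)

        #include <stdint.h>

        /* the base half names an OS-managed region, not a raw address,
           so a swap-in at a new location invalidates nothing */
        struct fat_ptr {
            uint16_t region;   /* which hidden base register / region */
            uint32_t offset;   /* displacement within that region */
        };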

    They really didn't think that one through.

    Which brings me to one of my favorite musings... how would a /360
    have looked with the benefit of things that could/should have been
    seen at the time? PC-relative branches with a 16-bit offset and
    ARM-style condition codes come to mind (introduced with the /390,
    I believe), as do binary floats (don't save those few gates).

    Also, just discussed: Throw out the base registers and put in
    memory operations with 16-bit offset.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Fri Jun 13 07:31:20 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    quadibloc <quadibloc@gmail.com> writes:
    To look like RISC, and yet to have the code density of CISC

    I have repeatedly posted measurements of code sizes of several
    programs on various architectures, most recently in
    <2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
    posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.


    Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
    packages:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Seems to me that the text size is not the interesting
    metric here - rather the typical working set size is
    far more important.

    Yes, something like it. But how do you measure it? And do you think
    that the text sizes of binaries for different architectures are not
    correlated to the working set sizes of these architectures?

    It may be that text size isn't a particularly
    good metric for judging instruction set effectiveness.

    Why would it not be a good predictor, and what would you use instead?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Fri Jun 13 07:03:07 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    Compiler people H A T E pairing {LoB = 0 and 1} and sharing
    {Rsecond = Rfirst+1}, they want to be able to allocate any
    value into any register without such constraints. After all
    register allocation is already NP, pairing and sharing moves
    the needle to NP-hard.

    Register allocation is NP-complete, and thus also NP-hard (every
    NP-complete problem is NP-hard).

    I don't know if the additional condition for NP-completeness does not
    hold for register allocation with pairing (I doubt it), but the reason
    why compiler people dislike it is not this practically irrelevant
    theoretical difference. We don't solve even NP-complete problems in
    compilers. Instead, we use heuristics to solve a different problem:
    produce a good, but not necessarily optimal register allocation;
    going for optimality would be NP-complete.

    The reason why compiler people dislike pairing is that it introduces
    another complication in an already complicated and bug-fraught part of
    the compiler. It gets especially complicated if you want to produce a
    good solution.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to quadibloc on Fri Jun 13 07:39:55 2025
    quadibloc <quadibloc@gmail.com> writes:
    OoO comes with a cost, though. It increases transistor costs
    considerably. Also, it comes with vulnerabilities like Spectre and
    Meltdown.

    No. AMD's OoO CPUs never had Meltdown AFAIK. As for Spectre, that
    can be fixed (not mitigated) at moderate hardware and performance cost
    with Invisible Speculation; I wrote an overview paper about Spectre
    and how to fix it <http://www.euroforth.org/ef23/papers/ertl.pdf>, but
    the actual Invisible Speculation research was done by others. It's
    just that the hardware designers don't want to; apparently the
    customers are not interested enough and prefer to pay the performance
    and software development cost of Spectre mitigations (or are too
    indifferent to care about it at all).

    VLIW, in the sense of the Itanium

    IA-64 is EPIC, not VLIW. And IA-64 gives you Spectre, too, in a way
    that cannot be fixed by Invisible Speculation, because the
    speculation is architectural, and there is no explicit "commit" that
    turns speculation into non-speculation.

    You can avoid Spectre in IA-64 by avoiding the use of the speculation
    features of the architecture (in particular control-speculative or data-speculative loads) or at least avoiding further loads based on
    the loaded data while that is still speculative (you probably need to
    avoid other things as well). That's the IA-64 equivalent of the
    Speculative Load Hardening mitigation which costs more than a factor
    of 2 in performance on OoO CPUs. I expect that it costs less
    performance on IA-64, but the end result will still be less
    performance on IA-64 than on OoO CPUs with the same transistor and
    power budget.

    This is because it lets the pipeline achieve high efficiency by
    directly indicating within the code itself when succeeding instructions
    may be executed in parallel, without requiring the computer to make the
    effort of determining when this is possible.

    Looking at the end result, IA-64 implementations consumed more
    transistors and more power than contemporary in-order CPUs that
    produced better SPECint results.

    For code that spends a lot of time in software-pipelinable loops
    (SPECfp), IA-64 looked competitive for a while, but SIMD reduces the
    overheads even more (there's a reason why Cray went for it and why
    Cray's customers went for his products), and the additional
    flexibility of EPIC apparently does not provide a benefit over SIMD in
    enough cases.

    Concerning the TMS 320C6000, that's designed exactly for the kinds of
    loops where EPIC and VLIW are competitive, but even for that, I have
    not heard anything about it in recent years (which may or may not mean something).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Fri Jun 13 13:14:29 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 6/12/2025 8:09 PM, quadibloc wrote:
    On Thu, 12 Jun 2025 15:44:14 +0000, Stephen Fuld wrote:

    On 6/12/2025 8:00 AM, quadibloc wrote:

    The IBM System/360 had a base and index register and a 12-bit
    displacement.

    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).

    I hadn't thought about it that way.

    It does make sense that on a timesharing system, virtual memory meant
    that different users would not have to share the same memory space, so
    programs wouldn't have to be relocatable.

    But if you drop base registers for that reason, suddenly you are forced
    to always use virtual memory.

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    Note that the B5500 had a form of virtual memory before the 360
    was released. The B6500 (1969) added paging.

    The B3500 operated as you describe, with a hidden base register;
    until 1983 when the architecture was enhanced to support
    eight hidden base registers (supporting 8 active regions).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Jun 13 14:48:20 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    quadibloc <quadibloc@gmail.com> writes:
    To look like RISC, and yet to have the code density of CISC

    I have repeatedly posted measurements of code sizes of several
    programs on various architectures, most recently in
    <2024Jan4.101941@mips.complang.tuwien.ac.at> (in an answer to a
    posting from you) and in <2025Mar3.174417@mips.complang.tuwien.ac.at>.


    Data from <2024Jan4.101941@mips.complang.tuwien.ac.at> for Debian
    packages:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Seems to me that the text size is not the interesting
    metric here - rather the typical working set size is
    far more important.

    Yes, something like it. But how do you measure it? And do you think
    that the text sizes of binaries for different architectures are not
    correlated to the working set sizes of these architectures?

    Without data, it's all speculation. Given, however, that there
    doesn't seem to be a rush to replace x86 or arm64 with
    armhf or riscv64, I don't believe that the text size is
    particularly interesting to the general user. There's only
    a 17% difference between riscv and arm64, after all, and
    arm64 is far more mature.


    It may be that text size isn't a particularly
    good metric for judging instruction set effectiveness.

    Why would it not be a good predictor, and what would you use instead?

    I'm not convinced that "instruction set effectiveness" is a
    useful metric for modern systems. Having been involved with
    ARM64 from 2012 (including a stint on the technical advisory board),
    I've watched the architecture evolve over the last fourteen years
    into a rather complicated behemoth - due primarily to the evolving
    requirements of the customer base and the desire to target additional application classes. Whether or not the architecture supports
    in-line 64-bit constants with a simple instruction encoding doesn't
    seem particularly interesting to anyone other than compiler
    code generator developers.

    I can't say that 'text size' has been
    a major consideration for the general end-user community outside of
    a small subset of the embedded system development community. It
    has an impact on Icache, certainly, but can you quantify it
    vis-a-vis the other architectural trade-offs between the competing
    processor families?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Fri Jun 13 08:15:55 2025
    On 6/13/2025 4:52 AM, quadibloc wrote:
    On Thu, 12 Jun 2025 15:44:14 +0000, Stephen Fuld wrote:

    On 6/12/2025 8:00 AM, quadibloc wrote:
    On Wed, 11 Jun 2025 17:05:13 +0000, Thomas Koenig wrote:

    What is the use case for having base and index register and a
    16-bit displacement?

    The IBM System/360 had a base and index register and a 12-bit
    displacement.

    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a displacement field large enough to address all of memory in every instruction.

    No. If you wanted to address beyond the reach of the displacement
    field, you still had the index register. And remember that the need for
    that is reduced because you could have a 16-bit displacement by using
    the four bits freed up by eliminating the base register field.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 13 08:23:17 2025
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory-to-memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    Since the "base" pointer was known to the OS, if the program was swapped
    out and swapped back in to a different location in memory, the OS just
    changed the value in the base register to reflect the new location.
    This didn't work in the S/360 because the OS didn't know what
    register(s) the user program was using as base register(s) so it
    couldn't change the values in them if the program was to be relocated.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Fri Jun 13 17:10:09 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    However, some comp.arch regulars seem to consider it quite important,
    and they regularly make claims about the code density of various
    instruction sets. I have started measuring the text size of programs
    in order to provide empirical counterevidence to this wishful
    thinking. This apparently has made little impression on those making
    such claims, but maybe the rest of you will gain something from these
    data.

    One problem is that different architectures may make different
    decisions about such things as inlining, cloning and loop unrolling.
    While your numbers can be indicative, a comparison with -Os would
    give a better overview of achievable code size.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Fri Jun 13 15:38:43 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    Without data, it's all speculation. Given, however, that there
    doesn't seem to be a rush to replace x86 or arm64 with
    armhf or riscv64, I don't believe that the text size is
    particularly interesting to the general user.

    Probably not, but I don't think the reason is that "working set size"
    would produce significantly different results.

    However, apparently code size is important enough in some markets that
    ARM introduced not just Thumb, but Thumb2 (aka ARM T32), and MIPS
    followed with MIPS16 (repeating the Thumb mistake) and later MicroMIPS
    (which allows mixing 16-bit and 32-bit encodings); Power specified VLE
    (are there any implementations of that?); and RISC-V specified the C
    extension, which is implemented widely AFAICT.

    I'm not convinced that "instruction set effectiveness" is a
    useful metric for modern systems.

    One would have to define that first.

    As for code density (however measured), yes, I think that in the
    markets that ARM A64 was designed for, that's probably not a top
    consideration when selecting an instruction set.

    However, some comp.arch regulars seem to consider it quite important,
    and they regularly make claims about the code density of various
    instruction sets. I have started measuring the text size of programs
    in order to provide empirical counterevidence to this wishful
    thinking. This apparently has made little impression on those making
    such claims, but maybe the rest of you will gain something from these
    data.

    It
    has an impact on Icache, certainly, but can you quantify it

    One way of quantifying it would be to take (or simulate)
    implementations with the same I-cache organization (size, cache line
    length, associativity, replacement policy and lack of an uop cache)
    and measure the number of I-cache misses for running a specific
    program. Actually, simulation is better; running would include
    differences in prefetching, which are probably influenced by
    considerations other than code density.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jun 13 17:22:12 2025
    On Fri, 13 Jun 2025 7:39:55 +0000, Anton Ertl wrote:

    quadibloc <quadibloc@gmail.com> writes:
    OoO comes with a cost, though. It increases transistor costs
    considerably. Also, it comes with vulnerabilities like Spectre and Meltdown.

    No. AMD's OoO CPUs never had Meltdown AFAIK. As for Spectre, that
    can be fixed (not mitigated) at moderate hardware and performance cost
    with Invisible Speculation; I wrote an overview paper about Spectre
    and how to fix it <http://www.euroforth.org/ef23/papers/ertl.pdf>, but
    the actual Invisible Speculation research was done by others. It's
    just that the hardware designers don't want to; apparently the
    customers are not interested enough and prefer to pay the performance
    and software development cost of Spectre mitigations (or are too
    indifferent to care about it at all).

    VLIW, in the sense of the Itanium

    IA-64 is EPIC, not VLIW.

    IA-64 is an EPIC failure, as are all other VLIW-like architectures.

    And IA-64 gives you Spectre, too, in a way
    that cannot be fixed by Invisible Speculation, because the
    speculation is architectural, and there is no explicit "commit" that
    turns speculation into non-speculation.
    -------------------

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Fri Jun 13 17:40:44 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory to memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    Since the "base" pointer was known to the OS, if the program was swapped
    out and swapped back in to a different location in memory, the OS just changed the value in the base register to reflect the new location.

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    This didn't work in the S/360 because the OS didn't know what
    register(s) the user program was using as base register(s) so it
    couldn't change the values in them if the program was to be relocated.

    Even with that provision, it would not have worked.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Fri Jun 13 17:42:03 2025
    quadibloc <quadibloc@gmail.com> schrieb:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a displacement field large enough to address all of memory in every instruction.

    Registers do that, and you only need a single one.

    In fact, I think this was the primary reason, and using
    them to relocate code and data was a nice idea that came after.

    The literature says otherwise.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 13 10:57:13 2025
    On 6/13/2025 10:40 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory to memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    Since the "base" pointer was known to the OS, if the program was swapped
    out and swapped back in to a different location in memory, the OS just
    changed the value in the base register to reflect the new location.

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    Got it :-)


    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    Short answer. I don't know. Perhaps someone who does can provide the
    answer. I agree that a straight pointer would work; it can be computed
    in the calling routine, since that routine knows the base address of
    the common block. And since it is a passed argument, the compiler would
    know not to need a base register to reference it from the subroutine.
    But that doesn't negate the use of base registers in the more common
    case of typical code.



    This didn't work in the S/360 because the OS didn't know what
    register(s) the user program was using as base register(s) so it
    couldn't change the values in them if the program was to be relocated.

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 13 11:00:59 2025
    On 6/13/2025 10:42 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    Registers do that, and you only need a single one.

    Yup!

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Fri Jun 13 18:11:18 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared
    a common designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space. The other machines made it invisible from user space
    and added its contents to every memory access. This does not take
    up opcode space and allows swapping in and out of the whole process,
    which was a much better solution for the early 1960s. (Actually, I
    believe the UNIVAC had two, one for program and one for data).
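
    In code terms, the check those machines applied on every access is
    tiny (a C sketch of base-and-bounds, mine, with invented names):

    #include <stdint.h>
    #include <stdlib.h>

    /* hidden base+limit relocation, invisible to the user program */
    static uint32_t translate(uint32_t vaddr, uint32_t base, uint32_t limit)
    {
        if (vaddr >= limit)     /* outside the process's allocation */
            abort();            /* stand-in for a protection fault */
        return base + vaddr;    /* the OS can change base on swap-in */
    }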

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jun 13 18:37:29 2025
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    No. Other systems in the S/360 time frame (i.e. before virtual memory)
    used a system "base register", that was hidden from the user (but was in
    its context), that was set by the OS when the program was loaded, or if
    it was swapped out, when it was swapped in again. It was reloaded
    whenever the program gained control of the CPU. Besides the advantage
    of not requiring a user register for that purpose, it allowed a program
    to be swapped in to a different memory address than it was swapped out
    from, a feature the S/360 didn't enjoy.

    It was supposed to, but I believe that was one of the earliest
    failures that they noticed, and should have realized before:
    their memory to memory instructions did not have base+offset+index.

    Also, how was storing a pointer to somewhere supposed to work
    for swapping out/swapping in?

    Since the "base" pointer was known to the OS, if the program was swapped
    out and swapped back in to a different location in memory, the OS just
    changed the value in the base register to reflect the new location.

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    The term "base register" is being used in different ways in this thread.

    a) We have the base register of base-and-bounds program relocation
    {that the user is not allowed to see or know}
    b) we have a pointing register (/360) that can have indexing and
    offsetting applied to form a virtual address.

    Since base-and-bounds slipped into history circa 1980 after even microprocessors got TLBs, I suggest we use the /360 terminology
    for address generation and some kind of MMU/TLB terminology for
    relocation and protection.

    Back to the question: What FOO should pass to BAR is a pointer to B
    if arguments are passed by reference, or the actual value of B if
    arguments are passed by copy-in-copy-out.

    The former:

    LDA R1,[IP,,#COMMON.COM.B-.]
    CALL BAR

    BAR:
    LDD R2,[R1]
    ADD R2,R2,#1
    STD R2,[R1]
    RET

    the latter:

    LDD R1,[IP,,#COMMON.COM.B-.]
    CALL BAR
    STD R1,[IP,,#COMMON.COM.B-.]

    BAR:
    ADD R1,R1,#1
    RET

    Both means work rather well in practice.
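
    In C terms, the two conventions look like this (a sketch of the same
    idea, mine rather than code from the post):

    /* pass by reference: BAR gets a pointer to B */
    void bar_ref(float *a)  { *a += 1.0f; }

    /* copy-in-copy-out: BAR gets the value, FOO stores the result */
    float bar_cico(float a) { return a + 1.0f; }

    void foo(float *com_b)
    {
        bar_ref(com_b);               /* the former */
        *com_b = bar_cico(*com_b);    /* the latter */
    }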

    This didn't work in the S/360 because the OS didn't know what
    register(s) the user program was using as base register(s) so it
    couldn't change the values in them if the program was to be relocated.

    Even with that provision, it would not have worked.

    Agreed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 13 11:55:51 2025
    On 6/13/2025 11:11 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space.

    Yes. That is the point I was making. Now I am lost as to why you said
    above "even with that provision, it would not have worked."


    The other machines made it invisible from user space
    and added its contents to every memory access. This does not take
    up opcode space and allows swapping in and out of the whole process,
    which was a much better solution for the early 1960s.

    You and I are in violent agreement! Only John seems to disagree.


    (Actually, I
    believe the UNIVAC had two, one for program and one for data).

    Correct. I didn't want to complicate the discussion. The separation of
    instructions from data allowed the OS to put them in different memory
    modules, allowing simultaneous access to the current instruction's
    operand and to the next instruction fetch, thus dramatically improving
    performance. (Remember, no cache.) And later follow-on systems such as
    the 1110 actually had two sets of Instruction and Data bases, and later
    machines went to 16 base registers. This is similar to the progression
    Scott talked about for the Burroughs medium systems.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jun 13 18:42:40 2025
    On Fri, 13 Jun 2025 18:11:18 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space.

    A base register not part of GPRs is a descriptor (or segment).
    And we don't want to go there.

    The other machines made it invisible from user space
    and added its contents to every memory access. This does not take
    up opcode space and allows swapping in and out of the whole process,

    It also fails when one has 137 different things to track with those descriptors.

    which was a much better solution for the early 1960s. (Actually, I
    believe the UNIVAC had two, one for program and one for data).

    Still insufficient for modern use cases.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Fri Jun 13 18:18:09 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space. The other machines made it invisible from user space
    and added its contents to every memory access. This does not take
    up opcode space and allows swapping in and out of the whole process,
    which was a much better solution for the early 1960s. (Actually, I
    believe the UNIVAC had two, one for program and one for data).

    The Burroughs systems had two 'states'. Control state
    (which today is called kernel or supervisor mode) was defined by a
    BASE register with the value 0. The MCP executed with
    BASE=0 and had access to all of memory (directly to the
    first 500KB, indirectly for the rest).

    Normal state was defined by a non-zero BASE register and
    privileged instructions would fault.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lars Poulsen@21:1/5 to Stephen Fuld on Fri Jun 13 19:49:58 2025
    On 2025-06-13, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    All I can say is that it worked in several other contemporaneous architectures. Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and its follow-ons are another. There may be others,
    perhaps some of the CDC systems, as they and the Univac systems shared
    a common designer (Seymour Cray).

    While Seymour Cray worked on the 1103, he was off to CDC long before the
    1108 was designed. I don't think the idea of multiprogramming and
    swapping (and hence the need for a base/limit register pair) had entered anyone's mind back in the days of the 1103.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Fri Jun 13 13:50:24 2025
    On 6/13/2025 12:50 PM, quadibloc wrote:
    On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
    On 6/13/2025 4:52 AM, quadibloc wrote:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    No.  If you wanted to address larger than the displacement field, you
    still had the index register.  And remember that the need for that is
    reduced because you could have a 16 bit displacement by using the four
    bits freed up by eliminating the base register field.

    Certainly, you can use the index register to address an area larger than
    the displacement field. Otherwise, RISC CPUs wouldn't work. However,
    then if you want to do an array access in that wider range, once more
    you need extra instructions to calculate the index value.

    And "reduced" is not the same as "completely eliminated", and so I fail
    to see how that makes base registers unnecessary.

    It's all a tradeoff. Yes, occasionally you need an extra instruction,
    but you gain four bits for a larger displacement (or something else if
    you want). And don't forget, you need the "extra" BALR instructions, or
    other ones to load the base register for every 4K chunk of data or instructions, and the loss of an otherwise available register.

    Everyone else who has evaluated the tradeoff chose not to use the extra register.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Jun 13 21:09:06 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/12/2025 11:03 PM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:


    The term "base register" is being used in different ways in this thread.

    a) We have the base register of base-and-bounds program relocation
    {that the user is not allowed to see or know}
    b) we have a pointing register (/360) that can have indexing and
    offsetting applied to form a virtual address.

    Since base-and-bounds slipped into history circa 1980 after even

    Actually, closer to 2010, as the B3500 descendants were still
    running production then (some of the systems were 25+ years old
    when they were retired and replaced by dozens of windows server
    boxes).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Fri Jun 13 20:50:49 2025
    On Fri, 13 Jun 2025 20:12:07 +0000, quadibloc wrote:

    On Thu, 12 Jun 2025 19:01:52 +0000, MitchAlsup1 wrote:

    What code is produced from::

    uint32_t function( uint32_t u )
    {
        int32_t i[99];
        return i[u];
    }

    That wouldn't even compile. The array i is not initialized.

    However, I'll assume that this is a fragment of a larger program.

    You've stated that this is for a 64-bit machine.

    So it takes an index variable as an argument, and returns an element
    from an array.

    The array is declared as signed 32 bit integers, but the function
    returns an unsigned 32 bit integer.

    Well, the answer is that it doesn't matter if I use a "load" or an
    "unsigned load", since what the function returns is a pointer to a *32-bit-long* value in memory. Which the calling program will interpret
    as unsigned.

    No, the function returns (unsigned) i[u]; the value itself, not a
    pointer to it. The question is how CII deals with signed/unsigned
    mismatches expressly written into the code. Seems to me that having
    both signed and unsigned LDs, and a trifling of pattern recognition,
    solves the problem.
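
    A quick C illustration of the point (my sketch, assuming a
    two's-complement target): sign- and zero-extending loads of the same
    32-bit value agree in their low 32 bits, so a 32-bit result cannot
    tell them apart; only bits 63..32 of the 64-bit register differ.

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        int32_t  v  = -5;
        int64_t  sx = v;              /* what a signed LD produces */
        uint64_t zx = (uint32_t)v;    /* what an unsigned LD produces */
        /* the function's uint32_t result keeps only the low 32 bits */
        assert((uint32_t)sx == (uint32_t)zx);
        return 0;
    }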

    Or maybe the function return is in register zero. In that case, I will
    indeed generate a "load" rather than an "unsigned load" inside the
    program. The caller, however, will presumably extract the least
    significant bits of that register into a 32-bit long variable before
    use, so my "error" will not have disastrous consequences.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Fri Jun 13 22:01:54 2025
    On Fri, 13 Jun 2025 20:50:24 +0000, Stephen Fuld wrote:

    On 6/13/2025 12:50 PM, quadibloc wrote:
    On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
    On 6/13/2025 4:52 AM, quadibloc wrote:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    No.  If you wanted to address larger than the displacement field, you
    still had the index register.  And remember that the need for that is
    reduced because you could have a 16 bit displacement by using the four
    bits freed up by eliminating the base register field.

    Certainly, you can use the index register to address an area larger than
    the displacement field. Otherwise, RISC CPUs wouldn't work. However,
    then if you want to do an array access in that wider range, once more
    you need extra instructions to calculate the index value.

    And "reduced" is not the same as "completely eliminated", and so I fail
    to see how that makes base registers unnecessary.

    It's all a tradeoff. Yes, occasionally you need an extra instruction,
    but you gain four bits for a larger displacement (or something else if
    you want). And don't forget, you need the "extra" BALR instructions, or
    other ones to load the base register for every 4K chunk of data or instructions, and the loss of an otherwise available register.

    Everyone else who has evaluated the tradeoff chose not to use the extra register.

    Can you restate what you intended to mean in the last sentence/paragraph
    but use different words ??

    Certainly the /360 designers, the VAX designers, the x86 designers,
    and others looked at the problem and allowed
    [Rpointer+Rindex+displacement] addressing. So, it is not everyone.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Fri Jun 13 23:10:56 2025
    On 6/13/2025 3:01 PM, MitchAlsup1 wrote:
    On Fri, 13 Jun 2025 20:50:24 +0000, Stephen Fuld wrote:

    On 6/13/2025 12:50 PM, quadibloc wrote:
    On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
    On 6/13/2025 4:52 AM, quadibloc wrote:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    No.  If you wanted to address larger than the displacement field, you
    still had the index register.  And remember that the need for that is
    reduced because you could have a 16 bit displacement by using the four
    bits freed up by eliminating the base register field.

    Certainly, you can use the index register to address an area larger than
    the displacement field. Otherwise, RISC CPUs wouldn't work. However,
    then if you want to do an array access in that wider range, once more
    you need extra instructions to calculate the index value.

    And "reduced" is not the same as "completely eliminated", and so I fail
    to see how that makes base registers unnecessary.

    It's all a tradeoff.  Yes, occasionally you need an extra instruction,
    but you gain four bits for a larger displacement (or something else if
    you want). And don't forget, you need the "extra" BALR instructions, or
    other ones to load the base register for every 4K chunk of data or
    instructions, and the loss of an otherwise available register.

    Everyone else who has evaluated the tradeoff chose not to use the extra
    register.

    Can you restate what you intended to mean in the last sentence/paragraph
    but use different words ??

    Certainly the /360 designers, the VAX designers, the x86 designers, and
    others looked at the problem and allowed [Rpointer+Rindex+displacement]
    addressing. So, it is not everyone.

    OK. As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory. They
    chose the "visible" (to the program) base register instead of, say, the
    "invisible" base register, which was, as I have said, IMO a mistake.

    The VAX was in a different situation. Being a virtual memory design,
    they didn't need it for the reason that the S/360 did. I am not an
    expert, but ISTM that the VAX designers wanted to include almost
    anything in the ISA to close the "semantic gap", and certainly didn't
    feel constrained to keep instructions within 32 bits, so adding the 3
    input address calculation, with potentially large offsets seemed
    reasonable to them. For various reasons, this all proved not to be a
    good choice eventually.

    As for the X86, I freely confess to not knowing the constraints its
    designers were operating under, so I can't really comment.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Sat Jun 14 09:26:04 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why? I remember reading that the systems software people
    spent a lot of work on an overlay mechanism, so the thinking at IBM at
    the time was apparently not about keeping several programs in RAM at
    the same time, but about running one program at one time, and finding
    ways to make that program fit into available RAM.

    In any case, it's no problem to add a virtual-memory mechanism that is
    not visible to user-level, or maybe even kernel-level (does the
    original S/360 have that?) programs, whether it's paged virtual memory
    or a simple base+range mechanism.

    As for the X86, I freely confess to not knowing the constraints its
    designers were operating under, so I can't really comment.

    There is no X86.

    For the 8086 architecture, the effective addresses are reg, reg+const
    or reg+reg (with severe restrictions on the registers usable for that;
    the 8086 does not have GPRs).

    For IA-32, the addresses can be reg+reg*1/2/4/8+const, using any
    registers (i.e., IA-32 has GPRs); this addressing mode was probably
    inspired by the VAX, which was in full reign when IA-32 was designed
    (the 386 was released in 1985). IA-32 has both the segmentation
    mechanism inherited from the 80286 (and extended to 32 bit segments)
    and paging, so using the addressing modes for any form of virtual
    memory was not the intention for providing this addressing mode.
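
    For illustration (my example, not from the post): with a typical SysV
    compiler, a single IA-32/AMD64 memory operand folds base, scaled
    index, and displacement, so this access compiles to one load using a
    reg+reg*4+const effective address:

    #include <stdint.h>
    #include <stdio.h>

    int32_t elem(int32_t *a, long i)
    {
        return a[i + 3];    /* one load: mov eax,[rdi+rsi*4+12] on AMD64 */
    }

    int main(void)
    {
        int32_t a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        printf("%d\n", elem(a, 2));     /* prints 5 */
        return 0;
    }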

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to BGB on Sat Jun 14 10:45:57 2025
    BGB <cr88192@gmail.com> schrieb:
    On 6/12/2025 10:00 AM, quadibloc wrote:
    On Wed, 11 Jun 2025 17:05:13 +0000, Thomas Koenig wrote:

    What is the use case for having base and index register and a
    16-bit displacement?

    The IBM System/360 had a base and index register and a 12-bit
    displacement.
    Most microprocessors have a base register and a 16-bit displacement.


    Serious overkill...

    For [Rb+Disp] with a 32-bit encoding, 9 or 10 is sufficient, if scaled
    by the element size, 12 otherwise.

    Maybe for the code you like to compile, but a lot of software
    has other needs.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Jun 14 10:44:16 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why?

    I don't have a BibTeX entry like you usually do, but you can
    find "Architecture of the IBM System/360" by Amdahl, Blaauw and
    Brooks easily.

    A quote:

    # Now the question was: How much capacity was to be made directly
    # addressable, and how much addressable only via base registers? Some
    # early uses of base register techniques had been fairly unsuccessful,
    # principally because of awkward transitions between direct and
    # base addressing. It was decided to commit the system completely
    # to a base-register technique; the direct part of the address,
    # the displacement, was made so small (12 bits, or 4096 characters)
    # that direct addressing is a practical programming technique only
    # on very small models. This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    That they got wrong, egregiously so, as the example with passing
    a pointer to something from a COMMON block shows.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Sat Jun 14 10:48:01 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lars Poulsen@21:1/5 to Anton Ertl on Sat Jun 14 12:42:19 2025
    On 2025-06-14, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    IA-32 has both the segmentation
    mechanism inherited from the 80286 (and extended to 32 bit segments)
    and paging, so using the addressing modes for any form of virtual
    memory was not the intention for providing this addressing mode.

    I see this as clearly a case of wanting to have it both ways.
    The segmentation mechanism was never loved by anyone. The 8086
    rudiments were laughable. The 286 was a better try, but still
    a far cry from 1960s experiments. 386 fixed most of the holes,
    but by then, everybody just wanted IBM370-style paging in a big,
    beautiful flat memory space. But they needed to support a lot of
    legacy code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sat Jun 14 15:40:28 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why?

    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks

    # Now the question was: How much capacity was to be made directly
    # addressable, and how much addressable only via base registers? Some
    # early uses of base register techniques had been fairly unsuccessful,
    # principally because of awkward transitions between direct and
    # base addressing. It was decided to commit the system completely
    # to a base-register technique; the direct part of the address,
    # the displacement, was made so small (12 bits, or 4096 characters)
    # that direct addressing is a practical programming technique only
    # on very small models.

    Up to that point I was thinking of MIPS, Alpha, and RISC-V with their
    reg+const addressing, and I thought: Ok, these machines actually
    support absolute (aka direct) addressing by using the zero register as
    reg. But of course nobody ever uses the zero register for addressing,
    and absolute addressing is not used.

    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    That they got wrong, egregiously so, as the example with passing
    a pointer to something from a COMMON block shows.

    I missed or did not understand that example. What's the issue?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Jun 14 15:53:52 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    For the 8086 architecture, the effective addresses are reg, reg+const
    or reg+reg (with severe restrictions on the registers usable for that;
    the 8086 does not have GPRs).

    There is also absolute addressing on the 8086 and IA-32 (that
    encoding was repurposed for RIP-relative addressing on AMD64 IIRC).

    And I completely ignored the segment registers, which were intended
    to provide a virtual-memory/relocation mechanism, but in MS-DOS
    programs were used as a cumbersome way to access more than 64KB of
    memory. Only in stuff like PC/IX was it used as intended AFAICT.

    The tragedy continued with the 80286, which now supported protected
    segments with up to 64KB, well suited for the kind of usage in PC/IX
    (but apparently that never had a 286 port) and actually used in Xenix,
    but which was against the grain for the kind of usage in MS-DOS.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Sat Jun 14 09:45:23 2025
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON
    block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Robert Finch on Sat Jun 14 16:05:53 2025
    Robert Finch <robfi680@gmail.com> writes:
    I think the IA-64 has a lot of interesting features.

    Certainly. But it's interesting how OoO makes each of them
    unnecessary.

    It looks like a
    processor that was designed a while ago, before the current batch of
    superscalar machines became popular.

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.

    If the register file were used as a flat register file, instead
    of one that rotates the registers it might be simpler to use.

    Do the register rotation at the front end, and the OoO engine just
    sees flat register names. The register rename table will be big (and
    you will probably want to keep it with every branch or potentially
    trapping instruction), but Oracle eventually found a way to deal with
    that (but our measurements show that even the fastest SPARCs are still
    slow).
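
    A sketch of what "rotation at the front end" amounts to (mine, with
    made-up names, and only an approximation of the IA-64 register file):
    the renamer maps an architectural register through the rotating base
    before the usual flat lookup, so everything downstream sees flat
    names.

    /* map an IA-64-style architectural register to a flat name;
       rrb is the rotating-register base, rot_size the rotating region */
    unsigned flat_name(unsigned areg, unsigned rrb, unsigned rot_size)
    {
        if (areg < 32)                              /* static registers */
            return areg;
        return 32 + (areg - 32 + rrb) % rot_size;   /* rotated region */
    }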

    I have been working with m68k code recently. The issues with it become
    apparent when looking at the output of compiled code. A lot of
    memory-to-memory moves. I see that it has great code density, but I wonder how
    that correlates to performance, given all the memory ops. A RISC
    architecture may have 30% worse code density, but it might run 5x as fast.

    If you compare a MIPS R3000 to a 68020, possibly yes. If you compare
    an Onyx or M4 to a Zen5 or Lion Cove (CISC stand-ins for the 68K,
    which does not have a modern implementation), the code density is
    similar, and the performance, too.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Anton Ertl on Sat Jun 14 09:24:02 2025
    On 6/14/2025 8:40 AM, Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why?

    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks

    # Now the question was: How much capacity was to be made directly
    # addressable, and how much addressable only via base registers? Some
    # early uses of base register techniques had been fairly unsuccessful,
    # principally because of awkward transitions between direct and
    # base addressing. It was decided to commit the system completely
    # to a base-register technique; the direct part of the address,
    # the displacement, was made so small (12 bits, or 4096 characters)
    # that direct addressing is a practical programming technique only
    # on very small models.

    Up to that point I was thinking of MIPS, Alpha, and RISC-V with their reg+const addressing, and I thought: Ok, these machines actually
    support absolute (aka direct) addressing by using the zero register as
    reg. But of course nobody ever uses the zero register for addressing,
    and absolute addressing is not used.

    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register. The
    next thing was an assembler directive "Using" with the register number
    as an argument. This told the assembler to use that register, which now contained the address of the first "useful" instruction as the base
    register in future instructions. This allowed the OS to load the
    program to any address in real memory, thus to have more than one
    program resident in real memory at the same time and the CPU could
    switch among them. By the time virtual memory came along with the S/370
    (and OK, the 360/67) this was, of course, no longer needed, but it was
    kept for upward compatibility.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Sat Jun 14 16:49:14 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 8:40 AM, Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why?

    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks

    # Now the question was: How much capacity was to be made directly
    # addressable, and how much addressable only via base registers? Some
    # early uses of base register techniques had been fairly unsuccessful,
    # principally because of awkward transitions between direct and
    # base addressing. It was decided to commit the system completely
    # to a base-register technique; the direct part of the address,
    # the displacement, was made so small (12 bits, or 4096 characters)
    # that direct addressing is a practical programming technique only
    # on very small models.

    Up to that point I was thinking of MIPS, Alpha, and RISC-V with their
    reg+const addressing, and I thought: Ok, these machines actually
    support absolute (aka direct) addressing by using the zero register as
    reg. But of course nobody ever uses the zero register for addressing,
    and absolute addressing is not used.

    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register. The
    next thing was an assembler directive "Using" with the register number
    as an argument. This told the assembler to use that register, which now contained the address of the first "useful" instruction as the base
    register in future instructions.

    And woe betide the programmer who got this wrong: the assembler
    would then generate wrong offsets.

    From the S/360 assembler manual:

    The USING instruction indicates that one or more general
    registers are available for use as base registers. This
    instruction also states the base address values that the
    assembler may assume will be in the registers at object time.
    Note that a USING instruction does not load the registers
    specified. It is the programmer's responsibility to see that
    the specified base address values are placed into the
    registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Jun 14 16:56:24 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    That they got wrong, egregiously so, as the example with passing
    a pointer to something from a COMMON block shows.

    I missed or did not understand that example. What's the issue?

    <102hnqs$3hv4m$3@dont-email.me>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Sat Jun 14 16:50:09 2025
    On Sat, 14 Jun 2025 16:12:22 +0000, quadibloc wrote:

    On Sat, 14 Jun 2025 16:04:42 +0000, quadibloc wrote:

    so let's go to "Annie Get Your Gun" for the other song... "Anything You
    Can Do".

    Although, in my case, it's more like anything almost any other computer
    can do, Concertina II can do _almost_ as well, rather than better. Its
    level of versatility means that it loses a little in code density.

    So an all-out implementation would presumably have a lot of cache in
    addition to a lot of pins, to support a wide data path. Unfortunately,
    while chips can put their floating-point ALU to sleep during integer
    code, there's probably no practical way to put OoO circuitry to sleep
    during VLIW code, because it's too intimately tied into everything - but maybe one could have two control units sharing the same ALUs so that
    this could be managed.

    Special Note:: When a vVM loop is running, FETCH and DECODE are
    quiescent. The loaded Reservation Station fires off the instructions
    multiple times at possibly multiple lanes of width. FETCH-DECODE
    remains primed with the instructions that follow the loop, and is
    re-enabled when the loop terminates.

    So, while you are unlikely to de-power the integer section, you can
    depower FETCH-DECODE and save a bunch of power (~1/3rd).

    But then, today's microprocessors have thousands of pins, and yet they
    don't have enormously wide data paths. Apparently their control
    interfaces had to get way more complex than, say, what worked back in
    the Socket 7 days.

    GPUs have ~1024 "pins" in and another 1024 "pins" out PER shader core.
    If GPUs can afford this pin count, so can CPUs.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Sat Jun 14 17:00:08 2025
    On Sat, 14 Jun 2025 6:10:56 +0000, Stephen Fuld wrote:

    On 6/13/2025 3:01 PM, MitchAlsup1 wrote:
    On Fri, 13 Jun 2025 20:50:24 +0000, Stephen Fuld wrote:
    --------------------

    The VAX was in a different situation. Being a virtual memory design,
    they didn't need it for the reason that the S/360 did. I am not an
    expert, but ISTM that the VAX designers wanted to include almost
    anything in the ISA to close the "semantic gap", and certainly didn't
    feel constrained to keep instructions within 32 bits, so adding the 3
    input address calculation, with potentially large offsets seemed
    reasonable to them. For various reasons, this all proved not to be a
    good choice eventually.

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now
    while this makes the puny 16-register file seem larger,
    what VAX designers forgot is that each address mode was
    an instruction in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    As for the X86, I freely confess to not knowing the constraints its
    designers were operating under, so I can't really comment.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sat Jun 14 17:02:23 2025
    On Sat, 14 Jun 2025 9:26:04 +0000, Anton Ertl wrote:


    There is no X86.

    Certainly not with a capital X

    But they sell 100 M with a small x86 every year.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Sat Jun 14 16:39:08 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 6/14/2025 8:40 AM, Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks
    ...
    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register.

    That's not loading once and leaving it alone, but yes, that can work,
    too, as shown in modern dynamic-linking ABIs.

    The
    next thing was an assembler directive "Using" with the register number
    as an argument. This told the assembler to use that register, which now
    contained the address of the first "useful" instruction as the base
    register in future instructions. This allowed the OS to load the
    program to any address in real memory, thus to have more than one
    program resident in real memory at the same time and the CPU could
    switch among them. By the time virtual memory came along with the S/370
    (and OK, the 360/67) this was, of course no longer needed, but it was
    kept for upward compatibility.

    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked
    libraries were compiled for specific addresses, which required quite a
    bit of coordination, so they got rid of that (IIRC in the transition
    to libc5) and the libraries had to be position-independent, but the
    binaries still were fixed-address. Finally, one wanted to make life
    harder for attackers with address-space layout randomization (ASLR), so
    everything should become position-independent, and different pieces
    would start at random offsets relative to other pieces as well.

    So all these techniques got a new life, with ABIs on MIPS and Alpha
    where a global pointer is loaded from the link register at the start
    of a function and after each "far" call (at the very least for calls
    to a different dynamically linked library). The MIPS instruction set
    certainly was not designed for that kind of environment, and Alpha has
    no new features in that respect AFAICT, but on both such ABIs could be implemented; it took some additional instructions.

    AMD64 and ARM A64 were designed when these requirements were already
    there, and they added PC-relative addressing, which results in reduced instruction counts (but, as mentioned elsewhere, possibly increased
    hardware implementation headaches on modern cores).

    Anyway, even in modern paged virtual-memory architectures, we
    recreated the need for having several program pieces, each with position-independent code, with calls between them. And the ISAs look
    much more like S/360 than like ISAs with "hidden" base registers.
    IA-32 has the segment registers with could serve as semi-hidden base
    registers, but AMD64 de-emphasized segment registers; AFAIK they are
    only used for thread-local data these days, if at all.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Sat Jun 14 17:30:36 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 8:35:31 +0000, Robert Finch wrote:

    Packing and unpacking decimal floats can be done inexpensively and fast
    relative to the size and speed of the decimal float operations. For my own
    implementation I just unpack and repack for all ops and then registers
    do not need any more than 128 bits.

    I also unpack the hidden first bit on IEEE-754 floats.

    Do you have 65-bit registers, then?
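
    For reference, unpacking the hidden bit of a binary64 looks like this
    (a C sketch of the generic technique, not Robert's implementation);
    the explicit fraction becomes 53 bits, hence the question: sign plus
    11-bit exponent plus 53-bit fraction is 65 bits.

    #include <stdint.h>
    #include <string.h>

    struct unpacked { unsigned sign; unsigned exp; uint64_t frac; };

    static struct unpacked unpack(double d)
    {
        uint64_t bits;
        struct unpacked u;
        memcpy(&bits, &d, sizeof bits);   /* bit-copy, no aliasing UB */
        u.sign = (unsigned)(bits >> 63);
        u.exp  = (bits >> 52) & 0x7FF;
        u.frac = bits & ((1ULL << 52) - 1);
        if (u.exp != 0)                   /* normal numbers: */
            u.frac |= 1ULL << 52;         /* make the hidden 1 explicit */
        return u;
    }

    int main(void)
    {
        struct unpacked u = unpack(1.0);
        /* 1.0 = +1.0 * 2^0: biased exp 1023, hidden bit now explicit */
        return !(u.sign == 0 && u.exp == 1023 && u.frac == (1ULL << 52));
    }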

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Stephen Fuld on Sat Jun 14 13:56:02 2025
    Stephen Fuld wrote:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    If the program was relocated after the call to BAR but before using
    the reference to access argument A then it reads the wrong location.
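
    A small C analog of the FORTRAN above (names and types are my own
    illustration) makes that hazard concrete: FOO hands BAR a flat
    pointer into the COMMON block, and that pointer silently bakes in
    the load address that was current when it was taken.

        #include <stdio.h>

        /* COMMON /COM/ A,B rendered as a global block */
        struct { float a, b; } com = { 0.0f, 2.0f };

        static void bar(float *a)       /* SUBROUTINE BAR(A) */
        {
            *a = *a + 1.0f;             /* A = A + 1 */
        }

        static void foo(void)           /* CALL BAR(B) */
        {
            /* a flat pointer; were the program moved in memory between
               here and the store in bar(), the store would hit the old
               location */
            bar(&com.b);
        }

        int main(void)
        {
            foo();
            printf("%f\n", com.b);      /* prints 3.000000 */
            return 0;
        }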

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Sat Jun 14 18:51:44 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    Guaranteed, with a 12-bit offset?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Jun 14 18:39:42 2025
    On Sat, 14 Jun 2025 17:30:00 +0000, Thomas Koenig wrote:

    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 15:15:55 +0000, Stephen Fuld wrote:
    On 6/13/2025 4:52 AM, quadibloc wrote:

    I have been thinking about this, and I don't think that base registers
    only existed to allow program relocation in a crude form that virtual
    memory superseded. They also existed simply to avoid having to have a
    displacement field large enough to address all of memory in every
    instruction.

    No. If you wanted to address larger than the displacement field, you
    still had the index register. And remember that the need for that is
    reduced because you could have a 16 bit displacement by using the four
    bits free'd up by eliminating the base register field.

    Certainly, you can use the index register to address an area larger than
    the displacement field. Otherwise, RISC CPUs wouldn't work. However,
    then if you want to do an array access in that wider range, once more
    you need extra instructions to calculate the index value.

    There is _nothing_ wrong with having base + (scaled) index register
    instructions. That is just 15 bits for three registers (assuming
    32 GPRs), which leaves ample space for opcodes, scaling and
    maybe, if you feel so inclined, a small offset.

    Clarifying:: 15 bits is 3×5 bits of Rdest, Rbase, and Rindex.
    and 2-bits of <constant> scale.

    There is _everything_ wrong with mandating a base register for
    every load/store operation, and trying to cram in a large offset
    as well.

    My 66000 mandates a base (i.e., pointing) register for each access.
    However, R0 is a proxy for IP, so while R0 cannot be a base (point)
    register, there is a point register nonetheless.

    Note:: address constants are optional when scaled index is in play.

    If you want to step through arrays, you can also use something
    like POWER's "load or store with update". ldu puts the effective
    address of the memory instruction into the address register,
    so you can use that with arbitrary step sizes.

    You CAN do this, but if your memory references have scaled indexing
    you generally SAVE looping (induction) ADD instructions.

    for( i = 0; i < max; i++ )
    doubleword[i] = word[i]+halfword[i];

    With scaled indexing::

    MOV Ri,#0
    top:
    LDH R3,[IP,Ri<<1,halfword-.]
    LDW R4,[IP,Ri<<2,word-.]
    ADD R5,R3,R4
    STD R5,[IP,Ri<<3,doubleword-.]
    ADD Ri,Ri,#1
    CMP Rt,Ri,Rmax
    BLT Rt,top

    without scaled indexing::

    MOV Ri,#0
    MOV Rt,Ri
    MOV Rs,Ri
    MOV Rq,Ri
    SL Rmax,Rmax,#3 // could be #1,#2,#3
    // depending on R{t,s,q}
    top:
    LDH R3,[IP,Rt,halfword-.]
    LDW R4,[IP,Rs,word-.]
    ADD R5,R3,R4
    STD R5,[IP,Rq,doubleword-.]
    ADD Rt,Rt,#2
    ADD Rs,Rs,#4
    ADD Rq,Rq,#8
    CMP Rw,Rq,Rmax // Rq is compared to Rmax<<3
    BLT Rw,top

    4 more setup instructions, 2 more loop ADD instructions, and 4 more
    registers in use--and that is without getting LUI or AUIPC for the
    address constants.
    And for the coup de grâce:

    With VVM::

    MOV Ri,#0
    VEC R6,{}
    top:
    LDH R3,[IP,Ri<<1,halfword-.]
    LDW R4,[IP,Ri<<2,word-.]
    ADD R5,R3,R4
    STD R5,[IP,Ri<<3,doubleword-.]
    LOOP LT,Ri,#1,Rmax

    One can perform ADD-CMP-BC in 2-3 gate delays longer than the ADD instruction--since a 3-input add is a 3-2 compressor (1 gate) longer
    than a 2-input ADD--the first ADD is real, the second ADD is a
    subtract of comparand--and you are looking (mainly) at carry out
    and sign bit to determine whether the loop continues or terminates.

    VEC {} is telling the HW that none of the loop registers is "live"
    out of the loop; so, in this case, R[3..5] need not be written
    in the loop or exiting the loop. SW programmed write-elision !!

    So, we have a 9 instruction loop being compared to a 5 instruction
    loop which does the same amount of work, but writes fewer registers.
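
    In C terms (a sketch of my own, with element types picked for
    illustration), the two code shapes above are one scaled induction
    variable versus three strength-reduced byte offsets:

        #include <stdint.h>
        #include <stddef.h>

        /* with scaled indexing: one induction variable, scaled inside
           the address mode (i<<1, i<<2, i<<3) */
        void sum_scaled(int64_t *dw, const int32_t *w,
                        const int16_t *hw, size_t max)
        {
            for (size_t i = 0; i < max; i++)
                dw[i] = (int64_t)w[i] + hw[i];
        }

        /* without scaled indexing: three separately stepped byte
           offsets, mirroring the extra ADDs in the second listing */
        void sum_unscaled(int64_t *dw, const int32_t *w,
                          const int16_t *hw, size_t max)
        {
            size_t t = 0, s = 0, q = 0, end = max << 3;
            while (q < end) {
                *(int64_t *)((char *)dw + q) =
                      (int64_t)*(const int32_t *)((const char *)w + s)
                    + *(const int16_t *)((const char *)hw + t);
                t += 2;         /* halfword step   */
                s += 4;         /* word step       */
                q += 8;         /* doubleword step */
            }
        }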

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to EricP on Sat Jun 14 12:23:59 2025
    On 6/14/2025 10:56 AM, EricP wrote:
    Stephen Fuld wrote:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

           SUBROUTINE FOO
           REAL A,B
           COMMON /COM/ A,B
           REAL C
           CALL BAR(B)
    C ....
           END

           SUBROUTINE BAR(A)
           REAL A
           A = A + 1
           END

    What should FOO pass to BAR?  A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why.  When the program is linked, the
    COMMON block is at some fixed displacement from the start of the
    program.  So the program can "compute" the real address of the data in
    common blocks from the address in its base register.

    If the program was relocated after the call to BAR but before using
    the reference to access argument A then it reads the wrong location.

    That is precisely my point. The mechanism that IBM chose effectively *prevents* program relocation. That is why I believe it was a mistake
    to choose that mechanism.




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Sat Jun 14 20:06:11 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 11:51 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON
    block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    Guaranteed, with a 12-bit offset?

    First let me say that I may have misinterpreted your recent comments.
    The visible base register mechanism IBM chose prevents any relocation of
    the program once it is first loaded.

    Then we're in agreement. Good :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Sat Jun 14 12:33:21 2025
    On 6/14/2025 11:51 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Fri, 13 Jun 2025 17:40:44 +0000, Thomas Koenig wrote:

    Suppose you're passing an argument from a COMMON block, a common
    occurrence back then (pun intended).

    SUBROUTINE FOO
    REAL A,B
    COMMON /COM/ A,B
    REAL C
    CALL BAR(B)
    C ....
    END

    SUBROUTINE BAR(A)
    REAL A
    A = A + 1
    END

    What should FOO pass to BAR? A straight pointer to B is the
    obvious choice, but there is no base register in sight that the OS
    can know about.

    FOO passes a straight 32-bit pointer to B to BAR, using a load address
    instruction to calculate the effective address.

    BAR then uses an instruction which chooses a register with that pointer
    placed in it as its base register to get at the value.

    No attempt is made to pass addresses in the short base plus offset form
    between routines, because they knew even then that it would never work.
    At least not when subroutines are *compiled separately*, which was the
    normal practice with System/360 FORTRAN.

    Correct.

    Which made nonsense of the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why. When the program is linked, the COMMON
    block is at some fixed displacement from the start of the program. So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    Guaranteed, with a 12-bit offset?

    First let me say that I may have misinterpreted your recent comments.
    The visible base register mechanism IBM chose prevents any relocation of
    the program once it is first loaded.

    As for the common area, the program can compute, using normal 32 bit
    arithmetic registers, the starting address of the common block. It
    knows the starting address of where the program was loaded from the
    BALR instruction executed as the first instruction, to which it can add
    the offset of the common block from the program start as given by the
    linker. If it then puts that address in a register, subsequent
    references to at least the first 4K bytes of the block can be referenced
    using that register as the base. Blocks larger than 4K require either
    saving then changing the base register contents, or using another base register.
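
    As a toy model of that reach limit (plain C of my own, not S/360
    code): a 12-bit displacement covers 4 KiB past whatever a base
    register holds, so a field further into the block needs the base
    re-pointed first.

        #include <stdint.h>

        #define DISP_MAX 4096u          /* 12-bit displacement */

        /* effective address = base register + 12-bit displacement */
        static uint32_t ea(uint32_t base_reg, uint32_t disp)
        {
            return base_reg + (disp & (DISP_MAX - 1));
        }

        /* BALR leaves the program's load address in a register; the
           linker supplied COMMON's offset from the program start */
        static uint32_t common_base(uint32_t program_base,
                                    uint32_t common_offset)
        {
            return program_base + common_offset;  /* 32-bit arithmetic */
        }

        /* a field more than 4 KiB into the block: re-point a base
           register at the enclosing 4 KiB chunk, then address it */
        static uint32_t field_addr(uint32_t common_base_reg, uint32_t off)
        {
            uint32_t base = common_base_reg + (off & ~(DISP_MAX - 1));
            return ea(base, off & (DISP_MAX - 1));
        }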



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Sat Jun 14 14:37:39 2025
    On 6/14/2025 2:26 PM, quadibloc wrote:
    On Sat, 14 Jun 2025 19:23:59 +0000, Stephen Fuld wrote:

    That is precisely my point.  The mechanism that IBM chose effectively
    *prevents* program relocation.  That is why I believe it was a mistake
    to choose that mechanism.

    It prevents relocation of programs currently in use that are already in memory.

    Yes.

    It facilitates loading programs from object files on disk into any
    desired part of memory, which is the usual meaning of "program
    relocation" among System/360 programmers, perhaps because they had no
    other type of it available.

    OK. But note that they did have the concept of unloading an active
    program from memory, called Rollout/Rollin, but you had to roll the
    program back in to the same memory location that it was rolled out from.


    Implementing the 360 architecture with the addition of a base and bounds mechanism instead of full-blown virtual memory was perfectly possible. However, the System/360 was originally conceived as a computer for use
    in batch processing.

    But batch processing didn't mean only one program at a time loaded into
    memory. If only one program was loaded in memory at a time, you
    wouldn't need any base registers, as all addresses could be zero
    relative. Even IBM's two major operating systems, DOS and OS, supported
    multi-programming.


    Hence, TSS/360 was a kludge and ran slowly, and it
    took the 360/67 with special hardware to facilitate timesharing for IBM
    to have something that addressed that function effectively.

    True, but irrelevant.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Jun 15 01:07:27 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 6/14/2025 8:40 AM, Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    "Architecture of the IBM System/ 360" by Amdahl, Blaauw and
    Brooks
    ...
    # This commitment implies that all programs
    # are location-independent, except for constants used to load the
    # base registers.

    And here it becomes obvious that they had a completely different usage
    in mind than what these addressing modes are used for on s390x. And I
    guess that already on S/370 and probably even on S/360 they were
    usually not used as this sentence suggests: load constants in some
    registers at the start, never change them, and use only those
    registers as base registers.

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register.

    That's not loading once and leaving it alone, but yes, that can work,
    too, as shown in modern dynamic-linking ABIs.

    The
    next thing was an assembler directive "Using" with the register number
    as an argument. This told the assembler to use that register, which now
    contained the address of the first "useful" instruction, as the base
    register in future instructions. This allowed the OS to load the
    program to any address in real memory, thus to have more than one
    program resident in real memory at the same time and the CPU could
    switch among them. By the time virtual memory came along with the S/370
    (and OK, the 360/67) this was, of course, no longer needed, but it was
    kept for upward compatibility.

    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address. Care was required to ensure that libraries which
    were used in the same application were statically linked at unique
    and non-overlapping addresses. Difficult when you only had half
    the 386 linear address space available.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Sat Jun 14 20:31:35 2025
    On 6/13/2025 11:42 AM, MitchAlsup1 wrote:
    On Fri, 13 Jun 2025 18:11:18 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures.  Scott gave one example, the Burroughs Medium systems.
    The Univac 1108 and follow-ons are another.  There may be others, perhaps
    some of the CDC systems, as they and the Univac systems shared a common
    designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space.

    A base register not part of GPRs is a descriptor (or segment).
    And we don't want to go there.

    The other machines made it invisible from user space
    and added its contents to every memory access.  This does not take
    up opcode space and allows swapping in and out of the whole process,

    It also fails when one has 137 different things to track with those descriptors.

    Fails isn't the correct word, but more awkward certainly is. I can't
    speak to the Burroughs machines (I am sure Scott can), but on the Univac
    1100 series it was a single instruction to change the base register to
    any other entry from a table of them that was set up at link time. The
    table could contain a lot of entries (it varied over time), but
    certainly many more than 137 (could be thousands).


    which was a much better solution for the early 1960s. (Actually, I
    believe the UNIVAC had two, one for program and one for data).

    Still insufficient for modern use cases.

    See above. The current (now emulated) systems can have up to 16
    descriptors currently active, with thousands an instruction or two away.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Anton Ertl on Sat Jun 14 20:51:34 2025
    On 6/14/2025 2:26 AM, Anton Ertl wrote:

    snip

    In any case, it's no problem to add a virtual-memory mechanism that is
    not visible to user-level, or maybe even kernel-level (does the
    original S/360 have that?) programs, whether it's paged virtual memory
    or a simple base+range mechanism.

    Virtual memory is no problem for you to say now, but this was the early
    1960s and no, S/360 didn't have anything like that. The CPU was
    implemented in

    https://en.wikipedia.org/wiki/Solid_Logic_Technology

    and the memory was real cores. Paging would have just been too much to
    ask at the time. One of the big changes from S/360 to the S/370 was the addition of paged virtual memory.

    As for the base and range mechanism, that is what much of this
    discussion is about. S/360 used arbitrary GPRs for the base registers,
    which prevented programs from being moved once they were initially
    loaded, whereas other contemporaneous systems used hidden base registers visible only to the OS. That is precisely what I regard, and have
    stated before, as a mistake.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Sat Jun 14 20:34:37 2025
    On 6/14/2025 2:49 PM, quadibloc wrote:
    On Sat, 14 Jun 2025 21:26:10 +0000, quadibloc wrote:

    On Sat, 14 Jun 2025 19:23:59 +0000, Stephen Fuld wrote:

    That is precisely my point.  The mechanism that IBM chose effectively
    *prevents* program relocation.  That is why I believe it was a mistake
    to choose that mechanism.

    It prevents relocation of programs currently in use that are already in
    memory.

    Actually, to be more precise, it prevents doing this _in a manner that
    is fully transparent to the programmer_.

    So IBM could have created a time-sharing operating system that ran on
    models of the System/360 other than the model 67 with its Dynamic
    Address Translation hardware as follows:

    Require that programs only use one set of static base registers for
    their entire run;

    Require that programs describe the base registers they use in a standard header;

    Require that programs set a flag when they have finished initializing
    those base registers (and do so very quickly after being started).

    If those conditions are met, then a program in memory can indeed be
    moved to somewhere else in memory, as the operating system will know
    which base registers to adjust.

    Well, sort of. Such programs would not be able to use flat addresses to
    pass pointers between routines, because they would not be valid between relocations. A workaround for this issue may be possible, requiring
    changes to calling conventions; for example, all routines in a program
    might need to share a common area for data values, and always use the
    same base register to point to it.

    So you would have special time-sharing versions of all the compilers.

    And this is better than what I proposed, and what other vendors did??????


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Sun Jun 15 07:37:44 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sun, 15 Jun 2025 4:04:04 +0000, quadibloc wrote:

    I have found out that I was mistaken in my earlier posting.

    TSS/360 may have been a slow, inefficient, and poorly received
    time-sharing operating system for the System/360 by IBM.

    However, it only ran on the System/360 Model 67, and so it did *not*
    attempt the kind of kludge I described as a desperate way of working
    without the availability of address translation. Its poor performance
    must have been the result of other causes.

    IBM also had something called TSO, for Time-Sharing Option, and that did
    run on System/360 models other than the Model 67, and so IBM may
    actually have used the kind of kludge I had described after all.

    IIRC, TSO came later.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Sun Jun 15 08:01:32 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.

    Surely that is an unfair characterization.

    After all, as Ivan Godard has reminded us on several occasions, out of
    order execution has a very large cost in transistors. So, while it is a
    way of achieving high performance, it comes at a cost both in die size
    and in power consumption.

    If the same benefits could be obtained through VLIW techniques without
    those costs - but with an overhead cost of extra bits in the
    instructions - that would be a very promising technology. So their
    problem wasn't that they forgot what they knew about OoO, but rather
    perhaps that their knowledge of the limitations of VLIW was
    insufficient.

    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Also, for development projects, there are always differences in
    opinion if choosing path A or B is the right way, because both will
    have advantages and disadvantages, and people will have different
    opinions of what is likely to succeed. Even after termination of
    a project, you will in all likelihood find people who say "But it
    could have succeeded, we should have tried this or that".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sun Jun 15 07:10:39 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic >>linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address.

    I guess you mean dynamically linked libraries. My contact with SVR3
    has not been intimate enough to experience that. I guess that DG/UX,
    which I worked with in 1990 and 1991 was a SVR3 derivative, but I
    never noticed that the libraries were dynamically linked (were they?).

    Anyway, it also happened in Linux.

    Care was required to ensure that libraries which
    were used in the same application were statically link at unique
    and non-overlapping addresses. Difficult when you only had half
    the 386 linear address space available.

    Address space was not the problem. HDD sizes in the early 1990s were
    well below 1GB, so all the libraries plus all the executables
    installed on one system (or available in one Linux distribution) could
    easily fit in that address space with ample address space left for
    data. The problem was that it required a lot of coordination, at
    least in the way that was used on Linux, don't know about SVR3.

    Every library binary was linked for a specific address, so those
    producing the library binaries had to coordinate which addresses they
    could use. When a library grew to need more space than allocated for
    it, my guess is that this resulted in work for a lot of people; when a
    new library was to be added, its maintainer needed to ask the
    coordinator for address space. That approach did not scale, so they
    switched from a.out to ELF and either position-independent code or
    relocation at library-loading time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Sun Jun 15 07:00:56 2025
    On 6/15/2025 12:37 AM, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sun, 15 Jun 2025 4:04:04 +0000, quadibloc wrote:

    I have found out that I was mistaken in my earlier posting.

    TSS/360 may have been a slow, inefficient, and poorly received
    time-sharing operating system for the System/360 by IBM.

    However, it only ran on the System/360 Model 67, and so it did *not*
    attempt the kind of kludge I described as a desperate way of working
    without the availability of address translation. Its poor performance
    must have been the result of other causes.

    IBM also had something called TSO, for Time-Sharing Option, and that did
    run on System/360 models other than the Model 67, and so IBM may
    actually have used the kind of kludge I had described after all.

    IIRC, TSO came later.

    I used TSO on an S/360 model 65 in 1972. It was a dog. In contrast to
    TSO on the later virtual systems, it was a separate batch program that
    ran under the OS. That program controlled the terminals and swapped
    user programs to/from its own memory. So it added another layer of "OS
    like" functionality and the resulting overhead. IIRC the site specified
    the size and number of user areas within TSO, and users competed for one
    of those. It may be (I don't remember) that once a user program was
    assigned to a user slot, if it got swapped out (by the TSO program), it
    had to be swapped back into the same user area. This eliminated the
    relocation problem we have been discussing. It was very slow with even
    a few users.

    Later, when virtual memory came along with the S/370s, they kept the
    same name for a totally different implementation, one totally integrated
    into the OS.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to quadibloc on Sun Jun 15 09:48:24 2025
    quadibloc wrote:
    On Sat, 14 Jun 2025 21:26:10 +0000, quadibloc wrote:

    On Sat, 14 Jun 2025 19:23:59 +0000, Stephen Fuld wrote:

    That is precisely my point. The mechanism that IBM chose effectively
    *prevents* program relocation. That is why I believe it was a mistake
    to choose that mechanism.

    It prevents relocation of programs currently in use that are already in
    memory.

    Actually, to be more precise, it prevents doing this _in a manner that
    is fully transparent to the programmer_.

    So IBM could have created a time-sharing operating system that ran on
    models of the System/360 other than the model 67 with its Dynamic
    Address Translation hardware as follows:

    Require that programs only use one set of static base registers for
    their entire run;

    Require that programs describe the base registers they use in a standard header;

    Require that programs set a flag when they have finished initializing
    those base registers (and do so very quickly after being started).

    If those conditions are met, then a program in memory can indeed be
    moved to somewhere else in memory, as the operating system will know
    which base registers to adjust.

    Well, sort of. Such programs would not be able to use flat addresses to
    pass pointers between routines, because they would not be valid between relocations. A workaround for this issue may be possible, requiring
    changes to calling conventions; for example, all routines in a program
    might need to share a common area for data values, and always use the
    same base register to point to it.

    So you would have special time-sharing versions of all the compilers.

    John Savard

    Won't work.
    The 360 program counter contains an absolute physical address that
    already includes the program base offset. Also any BAL link registers
    (there could be many, different for each subroutine) plus any spilled
    link registers (which may be saved at static addresses,
    or could be saved at runtime allocated locations).
    To relocate, all those links would have to be found and patched.

    To be easily relocatable there must be a clear distinction between
    program's logical addresses and physical ones, and physical addresses
    are only generated during the actual memory access (as MMU's do).
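
    A toy model of that distinction (plain C, everything invented for
    illustration): the program only ever holds small logical offsets, a
    hidden base is applied at each access, and "relocation" is just a
    change of that base.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        typedef struct { uint8_t *base; size_t bound; } Mapping;

        /* the physical address is formed only at the access itself */
        static uint8_t *translate(Mapping *m, size_t logical)
        {
            if (logical >= m->bound) { fprintf(stderr, "fault\n"); exit(1); }
            return m->base + logical;
        }

        int main(void)
        {
            Mapping m = { malloc(4096), 4096 };
            *translate(&m, 100) = 7;

            /* "relocate": move the region and change only the hidden
               base; logical address 100 is still valid */
            uint8_t *newmem = malloc(4096);
            memcpy(newmem, m.base, 4096);
            free(m.base);
            m.base = newmem;

            printf("%d\n", *translate(&m, 100));   /* still 7 */
            return 0;
        }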

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sun Jun 15 13:51:32 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically. In particular, for
    high-budget decisions that also involve many other groups, in
    particular marketing (which did a perfect job of hyping IA-64), but
    also chipset and board groups, and also involve decisions not to
    pursue other projects (such as canceling the other P7 project, and not
    to pursue an AMD64-like architecture inside Intel), and also involve high-budget decisions from other corporations, that's something that
    is decided at the top of the hierarchy.

    I have read about meetings about IA-64 where top management and people
    from different groups were present. Accoring to what I read (from one
    of the IA-32 implementors), the IA-64 people showed hand-optimized
    assembly code for some inner loops, and the account gave the
    impression that that's the only performance results that Intel
    management used to decide for IA-64 and against extending IA-32 to 64
    bits (and other IA-32 projects, such as the original P7).

    Also, for development projects, there are always differences in
    opinion if choosing path A or B is the right way, because both will
    have advantages and disadvantages, and people will have different
    opinions of what is likely to succeed. Even after termination of
    a project, you will in all likelihood find people who say "But it
    could have succeeded, we should have tried this or that".

    Certainly. Even for the EPIC ideas which in the case of IA-64
    certainly did not fail for lack of funding or marketing, some people
    just fail to accept that they just produce worse performance for many
    programs than OoO.

    For projects that were killed at an earlier stage rather than pushed
    through and failed in the marketplace, there are many more "but what
    if"s. Also for projects that are pushed through and fail in the
    marketplace, but look technically superior (e.g., Alpha).

    But for IA-64, the verdict is pretty clear: For the kind of market it
    targets, OoO is superior in performance. And this means that Intel
    could not pull people over from IA-32 with better performance, so the
    market disadvantage of introducing a new software ecosystem also came
    in full force.

    Could Intel and HP management have known this at the time? They
    certainly knew about the difficulties of introducing a new software
    ecosystem, especially Intel, which had enjoyed the benefits of
    compatibility for so many years.

    Could they have known about OoO performance benefits? I think that
    with their in-house OoO projects, they could, pretty early, and
    certainly when the Pentium Pro was released; it showed that OoO
    does not slow down the clock, on the contrary (1.5 times faster clock
    than the P54C at the same time); it also showed that IA-32 can compete
    with RISCs. As for the EPIC side, if they really only showed
    hand-optimized kernels, the management should have been pretty
    sceptical. If they showed something more real-worldish, the results
    should also have made the management sceptical.

    HP certainly hedged their bets and did not cancel Onyx (PA-8000) or
    disable the 64-bitness of Onyx, but then that was pretty far along at
    the time, and the competitive pressure to have a 64-bit architecture
    was too large in Unix space to disable the 64-bitness. But they also
    followed it with a long sequence of PA-8x00 machines until PA-8800/8900 with
    up to 1/1.1GHz in 2004/2005 (up from the PA-8000 with up to 180MHz).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sun Jun 15 15:07:38 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Jun 15 16:01:41 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address.

    I guess you mean dynamically linked libraries.

    No, I meant statically linked libraries. Shared objects were
    introduced in SVR4 with the SUN Solaris collaboration. I may have
    misunderstood your statement about linux vis-a-vis static shared libraries.


    Anyway, it also happened in Linux.

    Care was required to ensure that libraries which
    were used in the same application were statically link at unique
    and non-overlapping addresses. Difficult when you only had half
    the 386 linear address space available.

    Address space was not the problem. HDD sizes in the early 1990s were
    well below 1GB, so all the libraries plus all the executables
    installed on one system (or available in one Linux distribution) could
    easily fit in that address space with ample address space left for
    data. The problem was that it required a lot of coordination, at
    least in the way that was used on Linux, don't know about SVR3.

    Yes, it took a lot of coordination. For some applications (e.g.
    X11), all the X11 libraries would be combined into a single static
    library rather than trying to independently load them.


    Every library binary was linked for a specific address, so those
    producing the library binaries had to coordinate which addresses they
    could use.

    Yes, this was the same problem in SVR3.2. SVR4 showed up around
    1990 with shared objects and static libraries went the way of
    the Dodo bird.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Sun Jun 15 15:55:24 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 6/13/2025 11:42 AM, MitchAlsup1 wrote:
    On Fri, 13 Jun 2025 18:11:18 +0000, Thomas Koenig wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Even with that provision, it would not have worked.

    All I can say is that it worked in several other contemporaneous
    architectures.  Scott gave one example, the Burroughs Medium systems. >>>> The Univac 1108 and followons is another.  There may be others, perhaps >>>> the some of CDC systems as they and the Univac systems shared a common >>>> designer (Seymour Cray).

    The problem of the /360 was that they put their base registers in
    user space.

    A base register not part of GPRs is a descriptor (or segment).
    And we don't want to go there.

    The other machines made it invisible from user space
    and added its contents to every memory access.  This does not take
    up opcode space and allows swapping in and out of the whole process,

    It also fails when one has 137 different things to track with those
    descriptors.

    Fails isn't the correct word, but more awkward certainly is. I can't
    speak to the Burroughs machines (I am sure Scott can), but on the Univac
    1100 series it was a single instruction to change the base register to
    any other entry from a table of them that was set up at link time. The
    table could contain a lot of entries (it varied over time), but
    certainly many more than 137 (could be thousands).

    The Burroughs medium systems supported 8 active base registers
    backed by a set of in-memory translation tables. A task
    (thread in modern parlance) could have up to a million environments,
    each of which could have from two to eight memory areas[**]. When
    the MCP dispatched the task, it used a special instruction called
    'branch reinstate virtual (BRV)' which would load the eight base registers
    from a selected environment (the hardware task table entry for
    the task had a field which described the current environment for
    the task). The virtual enter (VEN) instruction would call
    a function/subroutine in either the same environment (within
    the code segment) or in a new environment (after loading the
    appropriate set of base registers[*]). Negative environment
    numbers (indicated by a 0b1101 in the most significant digit)
    were reserved for the MCP. The HCL (Hypercall) instruction
    was used to request service from the MCP.

    [*] and rolling the segment in from disk if it had been
    rolled out.

    [**] up to 500 kilobytes each (1 million digits). This was
    to support legacy B3500/B4700/B4800 compatibility.

    The original B3500 had one base register, loaded by the
    MCP when dispatching a task. This limited the application
    to 500Kbytes in total memory size (which at the time was
    plenty, but by the late 70's it prompted the design of an
    architecture that allowed more addressability while maintaining
    binary compatibility with existing applications).

    I'm not as familiar with the Burroughs large systems, but as
    it was a stack-based machine (48-bit) using protected descriptors
    for all data items with the descriptors stored on the stack
    (in specially tagged words), there was no real concept of a
    base register, other than the root of the current stack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Sun Jun 15 16:05:06 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    Define large.

    When Burroughs bought Sperry in 1986, Unisys had 120,000 employees.

    (a decade later, that was down to 20,000, three decades after
    that, it's up to 22,000).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Sun Jun 15 16:53:03 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address.

    I guess you mean dynamically linked libraries.

    No, I meant statically linked libraries.

    Static linking does not require any coordination. Every executable
    gets its own copy of the library parts it uses linked to fit into the executable's address space, i.e., with static linking libraries are
    not shared.

    I may have
    misunderstood your statement about linux vis-a-vis static shared libraries.

    I have never heard "static shared libraries". When I search for
    "static shared libraries" (without quotes) in duckduckgo, the first
    ten links I get treat "static" and "shared" as separate terms and
    usually as opposites. When I search for the term with quotes, it
    redirects me to google, which actually gives me a link:

    https://people.cs.nycu.edu.tw/~shieyuan/course/spb/lectures/sp15.ppt

    and a snippet:

    |With static shared libraries, symbols are still bound to addresses at
    |link time, but library code is not bound to the executable code until
    |run time. [...]

    Certainly in Linux the terminology was that they were all shared
    libraries (which was synonymous to dynamic linking), and that there
    was a transition from a.out to ELF (also often called the transition
    from libc4 to libc5).

    I had contact with various proprietary Unixes (HP/UX, DG/UX, Ultrix,
    and very little with SunOS), but I only remember that Ultrix did not
    support dynamic linking.

    Yes, this was the same problem in SVR3.2. SVR4 showed up around
    1990 with shared objects and static libraries went the way of
    the Dodo bird.

    In Linux, the transition was in 1995. Solaris (Sun's port of SVR4)
    appeared in 1992.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Sun Jun 15 17:59:29 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were
    pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore
    it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    Define large.

    Let's say > 10000 employees, for the sake of a definition.

    When Burroughs bought Sperry in 1986, Unisys had 120,000 employees.

    (a decade later, that was down to 20,000, three decades after
    that, it's up to 22,000).

    Both would qualify, I think. Intel had ~ 32500 employees at year's
    end in 1994, so it would also qualify.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sun Jun 15 18:42:42 2025
    On Sun, 15 Jun 2025 18:21:48 +0000, BGB wrote:

    On 6/12/2025 7:00 PM, MitchAlsup1 wrote:
    On Thu, 12 Jun 2025 21:30:39 +0000, BGB wrote:

    On 6/12/2025 2:13 PM, MitchAlsup1 wrote:
    ------------------------------
    The modern interpretation is that the dynamic rounding mode can be set
    prior to any FP instruction. So, you better be able to set it rapidly
    and without pipeline drain, and you need to mark the downstream FP
    instructions as dependent on this.

    Errm, there is likely to be a delay here, otherwise one will get a stale
    rounding mode.

    RM is "just 3-bits" that get read from control register and piped
    through instruction queue to function unit. Think of the problem
    one would have if a hyperthreaded core had to stutter step through
    changing RM ...


    To do it more quickly, one would likely need special logic in the
    pipeline for getting the updated RM to the FPU in a more timely manner.

    Realistically, it (RM) is no different than a condition code;
    except that RM is main effect of an instruction instead of a
    side effect of performing an instruction.

    If done (as-is) in a lax way: Held in the HOBs of GP/GBR or similar,
    which is handled as an SPR that gets broadcast out of the regfile.

    Then one has the latency issue:
    The new value needs to reach the regfile (WB stage);
    The value then needs to make its way to the relevant ID2/RF stage (next
    cycle after WB).

    Once again, this is no different than a condition code.

    A lazy option would be to add an interlock so that any dynamic rounding
    mode instruction would generate pipeline stalls for any in-flight modifications to GBR (as opposed to using a branch or a series of NOPs).
    This was not done in my existing implementation.

    Just track RM as if it were a condition code.

    But, IME, the "fenv.h" stuff, and FENV_ACCESS, is rarely used.
    So, making "fesetround()" or similar faster doesn't seem like a high priority.

    If having "fsetround()" as a function call, can also ensure the needed
    delay as-is by using a non-default register during the return (mostly to hinder the branch predictor).



    So, setting the rounding mode might be something like:
       MOV .L0, R14
       MOVTT GP, 0x8001, GP  //Set to rounding mode 1, clear flag bits
       JMP R14         //use branch to flush pipeline
       .L0:            //updated FPSR now ready
       FADDG R11, R12, R10  //FADD, dynamic mode

    Setting RM to a constant (known) value::

        HRW  rd,RM,#imm3    // rd gets old value


    It is possible,

    Of course it is possible, especially if you make it work like it
    is supposed to work and not as a hard-to-do thing.

    Could almost alias the bits to part of SR, where SR does generally have
    a more timely update process (could reduce latency to 2 cycles).

    I can see constant writes to RM as taking ZERO cycles.

    At present, the RM field is held in GBR(51:48), with fast update options either being a MOVTT (can replace the high 16 bits, *1) or BITMOV,

    It does not matter where it is--what matters is that it can be overwritten
    every execution cycle, and that instructions dependent on its current
    value are properly sequenced. Reservation stations do that for register
    and memory data, why not add RM (and carry) to them ??

    *1: There is a MOVTT Imm5/Imm6 variant, currently can only modify
    (63:60) though.


    Though, this strategy is only directly usable in XG3 (where GBR is
    mapped to R3/X3), N/A in XG1 or XG2, where GBR is in CR space and so
    would require 3 instructions.

    Implicitly, the fragment assumed XG3, but then this leaves open the
    issue of whether to use my former ASM syntax or RISC-V style ASM syntax (BGBCC can sorta accept either, with my newer X3VM experiment defaulting
    to RISC-V syntax).


    Can note that the RISC-V F/D instructions encode a fixed rounding mode
    in the instruction, with one encoding selecting the dynamic rounding mode
    (though, IIRC, no way to update the dynamic RM within the scope of the
    base ISA; so one needs Zicsr or similar to pull it off).

    The IEEE 754-2019 specification causes languages that adopt 754
    semantics to follow 754--which has a rather kludgy means to modify
    RM.
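
    For reference, the C-level face of that kludge is <fenv.h> (standard
    C99; a minimal sketch, with the caveat that compilers honor
    FENV_ACCESS to varying degrees, and glibc wants -lm):

        #include <fenv.h>
        #include <stdio.h>

        #pragma STDC FENV_ACCESS ON

        int main(void)
        {
            volatile float one = 1.0f, three = 3.0f;
            int old = fegetround();      /* save the dynamic RM */
            fesetround(FE_UPWARD);
            float third = one / three;   /* rounded up under FE_UPWARD */
            fesetround(old);             /* restore the dynamic RM */
            printf("%.9f\n", third);
            return 0;
        }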

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to Anton Ertl on Sun Jun 15 19:24:27 2025
    On Sun, 15 Jun 2025 16:53:03 GMT, Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    Yes, this was the same problem in SVR3.2. SVR4 showed up around 1990
    with shared objects and static libraries went the way of the Dodo bird.

    In Linux, the transition was in 1995. Solaris (Sun's port of SVR4)
    appeared in 1992.

    There is a section on the SVR3.2 shared library implementation in this:

    <https://www.mirrorservice.org/sites/www.bitsavers.org/pdf/att/unix/System_V_386_Release_3.2/UNIX_System_V_386_Release_3.2_Programmers_Guide_Vol1_1989.pdf>

    It was using COFF object files.

    I think that a lot of the design of current dynamic shared libraries came
    from pre-SVR4 SunOS using a.out object files.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Mon Jun 16 14:15:07 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    An interesting development is that, e.g., on Ultrix on DecStations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 unix that first had static libraries linked
    at a specific address.

    I guess you mean dynamically linked libraries.

    No, I meant statically linked libraries.

    Static linking does not require any coordination. Every executable
    gets its own copy of the library parts it uses linked to fit into the
    executable's address space, i.e., with static linking libraries are
    not shared.

    They were shared between multiple processes on SVR3.2.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Jun 16 12:17:21 2025
    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.
    This is going to hurt register allocation.

    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register specifiers to just 4 bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.
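
    A minimal sketch of the adjustment described above (names and the toy
    adjacency matrix are mine, not from any particular allocator): every
    register a given instruction's shortened specifier cannot encode is
    added as an interference edge for the value it defines, and ordinary
    graph coloring then avoids those registers automatically.

        #include <stdio.h>

        #define NUM_REGS  32
        #define MAX_NODES 64   /* nodes 0..31 = registers, 32.. = values */

        /* Toy interference graph as an adjacency matrix. */
        static unsigned char adj[MAX_NODES][MAX_NODES];

        static void add_edge(int a, int b)
        {
            adj[a][b] = adj[b][a] = 1;
        }

        /* encodable: bit r set iff register r is reachable through this
           instruction's (randomized) 4-bit specifier. */
        static void restrict_def(int value_node, unsigned encodable)
        {
            for (int r = 0; r < NUM_REGS; r++)
                if (!(encodable & (1u << r)))
                    add_edge(value_node, r);  /* value can't live in r */
        }

        int main(void)
        {
            restrict_def(32, 0x0000FFFFu);  /* value 32 limited to r0..r15 */
            printf("value 32 interferes with r16? %d\n", adj[32][16]);
            return 0;
        }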


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Mon Jun 16 23:37:05 2025
    On Mon, 16 Jun 2025 22:14:14 +0000, BGB wrote:

    On 6/12/2025 7:11 PM, quadibloc wrote:
    On Thu, 12 Jun 2025 19:24:36 +0000, MitchAlsup1 wrote:
    On Thu, 12 Jun 2025 15:38:30 +0000, quadibloc wrote:
    -----------------------

    But, this sort of thing is semi-common in VLIWs, along with things like
    not having either pipeline interlocks or register forwarding (so, say,
    every instruction has a 3 cycle latency, so you need to wait 2 bundles
    to use a result else you get a stale value, etc).


    Contrast, spending 1 or 2 bits per instruction word or similar (to daisy
    chain groups of instructions), and still having things like forwarding
    and interlocks, does not result in the same severe hit to code density.

    General purpose ISAs use zero bits to daisy chain instructions; they
    use register specifiers, and a memory ordering logic block.


    However, register forwarding does have a dark side: It has a fairly
    steep cost curve. So, with forwarding, once you try to cross ~ 2 or 3
    wide, the costs here grow out of control.

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.
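
    Plugging assumed per-instruction numbers into that formula makes the
    curve concrete (a throwaway calculation; 1 result and 2 operands per
    instruction assumed):

        #include <stdio.h>

        int main(void)
        {
            /* bypass paths ~ Results x Operands = (W)(2W) = 2W^2 */
            for (int w = 1; w <= 8; w++)
                printf("%d-wide: %3d forwarding paths\n",
                       w, (w * 1) * (w * 2));
            return 0;
        }

    which prints 2, 8, 18, 32, 50, 72, 98, 128: past 2-3 wide the network
    is growing much faster than the number of function units.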

    So, as the core gets wider, the cost of the register file will exceed
    that of the function units (and one may find that it is cheaper to go multi-core than to make the core wider).

    Now, guess what happens with an execution window of size 300
    instructions
    and a GPR file, an FPR file, and an SIMD file ??? Do you provision the
    files for 300 instructions each, 100 instructions each, or something in
    between ???

    Or, to try to go wider while keeping cost under control, give up on
    niceties like register forwarding.

    Apple has shown it is not just possible but can be power efficient.
    --------------------

    One of the dominant use-cases for VLIW is in GPUs and similar.

    But, then seemingly battles for control against "lots of in-order
    superscalar RISC cores".

    So, for more traditional 3D rendering tasks, VLIW did well, but for
    things like GPGPU or ray-tracing, the "crapton of in-order cores"
    strategy works well.

    There are intermediate choices, too::

    Blocks not quite worthy of CPU status, but perform workloads
    nonetheless::

    A) texture
    B) interpolation
    C) rasterization
    D) WARP initialization
    E) WARP rebalancing
    F) Transcendental calculations

    The main merit of OoO is when the overriding priority is maximizing per-thread performance. But, in other cases, cramming more cores on the
    die may offer more performance than one can get from a smaller number of faster cores.

    G) Crypto engines
    H) programmable DMA engines

    Well, and all the battles over things like memory coherence.
    For a small number of fast cores, coherence makes sense.
    For large number of cores, weaker models may be preferable (or, say, essentially treating parts of the memory map as read-only to most of the cores).

    Why not BOTH !! and switch between coherent and incoherent as a side
    effect of a memory instruction touching "something other than DRAM".

    Where, LIW (in a partial contrast to VLIW) can have merit if the goal is
    to optimize for per-core cheapness. The per-core cost for a LIW can be
    lower than that of an in-order superscalar, but with the drawback that
    the compiler will need to be aware of pipeline specifics.

    Say one could have cores designed like, say:
    2 or 3 wide;
    Explicit parallelism;
    No register forwarding;
    Maybe optional interlocks;
    Weak memory coherence;
    ...

    6 to 10 wide
    Explicit Concurrency
    Standard Register dependence order
    Standard Memory dependence order
    Lamport Atomic dependence order
    Lessened Memory consistency When reasonable with
    Sequential consistency When required and
    Strong consistency When absolutely required

    And then trying to optimize for fitting as many cores as possible on the chip, even if per-thread performance is relatively low, and trying to prioritize having very high memory bandwidth.

    Currently, inter-core signals/messages are too expensive: especially
    when a HyperVisor has to get in the way. Then again, SW over the last
    15 years has demonstrated no ability to use "lots" of small cheap cores.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Stefan Monnier on Mon Jun 16 18:26:01 2025
    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.
    This is going to hurt register allocation.

    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.

    I can see that it isn't too hard on the logic for the register
    allocator, but I suspect it will lead to more register saving and
    restoring. Consider a two instruction sequence where the output of the
    first instruction is an input to the second. The first instruction has
    only a choice of 16 registers, not 32. And the second instruction also
    has 16 registers, but on average only half of them will be in the 16
    included in the first instruction. So instead of 32 registers to choose
    from you only have 8. So the odds increase that you must save one of
    those 8 and perhaps restore it after the two instructions have completed.

    It sure seems ugly to me.
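
    The "only 8" figure is right in expectation for independently
    randomized specifier maps: each of the second instruction's 16
    encodable registers has a 16/32 chance of also being encodable by the
    first, giving 16 x 16/32 = 8. A quick check (C; __builtin_popcount is
    a GCC/Clang extension):

        #include <stdio.h>
        #include <stdlib.h>

        /* Random 16-of-32 register subset as a bitmask (Fisher-Yates). */
        static unsigned pick16(void)
        {
            int pool[32];
            unsigned mask = 0;
            for (int i = 0; i < 32; i++) pool[i] = i;
            for (int i = 0; i < 16; i++) {
                int j = i + rand() % (32 - i);
                int t = pool[i]; pool[i] = pool[j]; pool[j] = t;
                mask |= 1u << pool[i];
            }
            return mask;
        }

        int main(void)
        {
            const int trials = 1000000;
            long long total = 0;
            for (int i = 0; i < trials; i++)
                total += __builtin_popcount(pick16() & pick16());
            printf("average overlap: %.3f (analytic: 8)\n",
                   (double)total / trials);
            return 0;
        }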


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Tue Jun 17 11:44:33 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Without data, it's all speculation. Given, however, that there
    doesn't seem to be a rush to replace x86 or arm64 with
    armhf or riscv64, I don't believe that the text size is
    particularly interesting to the general user.

    Probably not, but I don't think the reason is that "working set size"
    would produce significantly different results.

    However, apparently code size is important enough in some markets that
    ARM introduced not just Thumb, but Thumb2 (aka ARM T32), and MIPS
    followed with MIPS16 (repeating the Thumb mistake) and later MicroMIPS
    (which allows mixing 16-bit and 32-bit encodings); Power specified VLE
    (are there any implementations of that?); and RISC-V specified the C extension, which is implemented widely AFAICT.

    AFAICS the main target of those are small embedded microcontrollers
    running code mostly from flash. Apparently flash size has an important
    impact on microcontroller cost. Also, the biggest available flash
    was limited, and if a program exceeded available flash one would have
    to switch to different (possibly significantly more expensive)
    hardware.

    Some Linux distributions took advantage of smaller instructions
    and compiled a lot of programs to save space. But I doubt that
    the possibility of such a saving would be enough to motivate
    development of a more space-efficient encoding. IIUC 64-bit
    ARM dropped most of the space-saving features of Thumb2, so
    apparently they did not consider them important enough for
    bigger machines.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Waldek Hebisch on Tue Jun 17 16:00:34 2025
    On 17/06/2025 13:44, Waldek Hebisch wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Without data, it's all speculation. Given, however, that there
    doesn't seem to be a rush to replace x86 or arm64 with
    armhf or riscv64, I don't believe that the text size is
    particularly interesting to the general user.

    Probably not, but I don't think the reason is that "working set size"
    would produce significantly different results.

    However, apparently code size is important enough in some markets that
    ARM introduced not just Thumb, but Thumb2 (aka ARM T32), and MIPS
    followed with MIPS16 (repeating the Thumb mistake) and later MicroMIPS
    (which allows mixing 16-bit and 32-bit encodings); Power specified VLE
    (are there any implementations of that?); and RISC-V specified the C
    extension, which is implemented widely AFAICT.

    Power PC cores with VLE have definitely been implemented - I used one
    many years ago. You find them in PPC-based microcontrollers such as the MPC5534 with the e200z3 core, made by Freescale (now part of NXP). The
    PPC core microcontrollers are popular in the automotive industry, but I
    don't know off-hand if any of the modern families have VLE.


    AFAICS the main target of those are small embedded microcontrollers
    running code mostly from flash. Apparently flash size has an important
    impact on microcontroller cost. Also, the biggest available flash
    was limited, and if a program exceeded available flash one would have
    to switch to different (possibly significantly more expensive)
    hardware.


    Flash size can be a very real part of the cost of small
    microcontrollers. There are ARM (typically Cortex-M0+) core
    microcontrollers down to 4 KB flash, though for current devices, there
    is rarely much money to save going below 16 KB. Flash size is also a
    limiting factor for future development - typically, you can get the same microcontroller in the same package with different flash and RAM sizes,
    for relatively cheap upgrades. But once you exceed the largest flash
    size in the family, you face a costly board redesign.

    Small microcontrollers also don't have normal caches - but they might
    have a very small amount of buffer attached to the flash memory
    controller. Code density is definitely significant for these devices.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Tue Jun 17 17:41:10 2025
    On Tue, 17 Jun 2025 1:20:40 +0000, quadibloc wrote:

    On Mon, 16 Jun 2025 23:37:05 +0000, MitchAlsup1 wrote:

    Then again, SW over the last
    15 years has demonstrated no ability to use "lots" of small cheap cores.

    And as long as that remains true, out-of-order execution will continue
    to be popular, and there will also be strong pressure to find exotic materials that can be used to make faster transistors - and faster interconnects between them.

    While I am willing to agree that we can do better in using multiple
    cores, I also think that even after we do all that we can in that area,
    a single core that is N times faster will still be better than N cores.

    But that is NOT the arithmetic you are looking at::

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.

    So the arithmetic is: can your OoO core be 10× faster than the LBIO core
    ??
    And the answer is NO.

    The highest performing OoO is M4 right now and it is 2× faster than
    Opteron Rev F (after normalizing frequency)--perhaps I should say
    2× more instructions per clock. If M4 area was equal to Opteron area
    (highly doubtful after normalizing) it would still be a factor of 5-6×
    more area than 12 LBIO cores.

    But on the other hand, no matter what new technologies we discover to
    make cores faster, there will still be a hard limit to how fast a core
    can be.

    Right now, you could make transistors infinitely fast, and clock speeds
    would not move much due to the dominance of wire delay in CPU design.

    So both faster cores, and more efficient ways to use multiple cores,
    will always be important.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Jun 17 17:45:23 2025
    On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:

    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.
    This is going to hurt register allocation.

    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.

    I can see that it isn't too hard on the logic for the register
    allocator,

    You are missing the BIG problem::

    The register allocator allocates Rk for calculation j and later allocates
    Rm for instruction p; then a few instructions later the code generator
    notices that Rk and Rm need to be paired or shared and they were not
    originally. How does one fix this kind of problem without adding more
    passes over the intermediate representation ??

    but I suspect it will lead to more register saving and
    restoring.

    And reg-reg MOVment.

    Consider a two instruction sequence where the output of the
    first instruction is an input to the second. The first instruction has
    only a choice of 16 registers, not 32. And the second instruction also
    has 16 registers, but on average only half of them will be in the 16
    included in the first instruction. So instead of 32 registers to choose
    from you only have 8. So the odds increase that you must save one of
    those 8 and perhaps restore it after the two instructions have
    completed.

    It sure seems ugly to me.

    It has been under study by compiler people since at least 1963 without
    much forward progress.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Tue Jun 17 17:52:50 2025
    On Tue, 17 Jun 2025 13:12:27 +0000, quadibloc wrote:

    On Tue, 17 Jun 2025 12:58:06 +0000, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost of
    the forwarding exceeds that of the function units (such as an
    additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a CPU.
    But since a faster CPU - or, more specifically, a CPU with faster
    single-thread performance - is much more useful than more CPUs, it's
    still well worth the cost.

    In fact, just as I've included VLIW as a basic feature in the Concertina
    II, as a way to explicitly do what OoO does transparently for the
    programmer, one advanced feature I wish to include in the Concertina II
    - I think I took a stab at it in the original Concertina - is dataflow computing.

    Dataflow computing is where the program explicitly states how arithmetic units are to be connected together to perform multiple operations in a chained fashion, usually taking vectors as input and producing vectors
    as output.
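
    As a loose software analogue of such a chained instruction (purely
    illustrative; the chain shape and names are invented, not Concertina's
    encoding), the intermediate value flows directly from one operation to
    the next rather than through architectural registers or memory:

        #include <stddef.h>
        #include <stdio.h>

        /* One fixed chain: multiply -> add -> clamp, applied elementwise. */
        static void chain(const float *a, const float *b, const float *c,
                          float *out, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                float t = a[i] * b[i];          /* unit 1: multiply        */
                t = t + c[i];                   /* unit 2: add, fed by 1   */
                out[i] = (t > 0.0f) ? t : 0.0f; /* unit 3: clamp, fed by 2 */
            }
        }

        int main(void)
        {
            float a[4] = {1, 2, 3, 4}, b[4] = {2, 2, 2, 2},
                  c[4] = {-10, 0, 1, 2}, out[4];
            chain(a, b, c, out, 4);
            for (int i = 0; i < 4; i++) printf("%g ", out[i]);
            printf("\n");
            return 0;
        }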

    Do you remember WHY data-flow failed ???

    It failed because it exposed TOO MUCH ILP and then this, in turn,
    required
    too much logic to manage efficiently--often running into queue overflow problems (Reservation station entries) that could cause lock up if not
    managed correctly.

    GBOoO machines don't have this problem since DECODE will stall if the instruction queues overflow.

    The ENIAC, "before von Neumann ruined it", worked that way. So data
    isn't even being put in registers, let alone memory, between several
    operations, thus making the computer faster.

    In Concertina, unlike Concertina II, I didn't worry about having
    instructions that had awkward and special rules for length decoding,
    though. A dataflow instruction would involve a chain of operations with
    an arbitrary length up to some limit.

    However, the solution suggests itself.

    I used pointers to pseudo-immediates to prevent the variation in length
    of immediate values from making length decoding for instructions
    complicated.

    In some early iterations of Concertina II, I used a similar pointer
    mechanism as my method of allowing instructions longer than 32 bits. The pointers were four bits long for that, instead of five, since now they
    were halfword addresses instead of byte addresses. This could be brought
    back - but just for dataflow instructions and any similar exotic cases.
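
    A hypothetical decode helper showing why the halfword granularity
    saves a bit (the block layout here is my assumption for illustration,
    not Concertina II's actual definition): a 4-bit field selects one of
    16 halfword slots, where byte granularity would have needed 5 bits.

        #include <stdint.h>
        #include <stdio.h>

        /* Byte address of a pseudo-immediate: base of the current
           instruction block plus 2 bytes per halfword slot. */
        static uint32_t pseudo_imm_addr(uint32_t block_base, unsigned ptr4)
        {
            return block_base + 2u * (ptr4 & 0xFu);
        }

        int main(void)
        {
            printf("slot 5 -> byte address %#x\n",
                   pseudo_imm_addr(0x1000, 5));
            return 0;
        }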

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Jun 17 11:09:33 2025
    On 6/17/2025 10:45 AM, MitchAlsup1 wrote:
    On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:

    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    Therefore, I reduced the index register and base register fields to
    three bits each, using only some of the 32 integer registers for those
    purposes.
    This is going to hurt register allocation.

    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.

    I can see that it isn't too hard on the logic for the register
    allocator,

    You are missing the BIG problem::

    The register allocator allocates Rk for calculation j and later allocates
    Rm for instruction p; then a few instructions later the code generator
    notices that Rk and Rm need to be paired or shared and they were not
    originally. How does one fix this kind of problem without adding more
    passes over the intermediate representation ??

    Good point. Thanks.



    but I suspect it will lead to more register saving and
    restoring.

    And reg-reg MOVment.

    Yes. I should have mentioned that as well.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jun 17 18:14:19 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 17 Jun 2025 13:12:27 +0000, quadibloc wrote:

    On Tue, 17 Jun 2025 12:58:06 +0000, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost of
    the forwarding exceeds that of the function units (such as an
    additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a CPU.
    But since a faster CPU - or, more specifically, a CPU with faster
    single-thread performance - is much more useful than more CPUs, it's
    still well worth the cost.

    In fact, just as I've included VLIW as a basic feature in the Concertina
    II, as a way to explicitly do what OoO does transparently for the
    programmer, one advanced feature I wish to include in the Concertina II
    - I think I took a stab at it in the original Concertina - is dataflow
    computing.

    Dataflow computing is where the program explicitly states how arithmetic
    units are to be connected together to perform multiple operations in a
    chained fashion, usually taking vectors as input and producing vectors
    as output.

    Do you remember WHY data-flow failed ???

    It failed because it exposed TOO MUCH ILP and then this, in turn,
    required too much logic to manage efficiently--often running into queue
    overflow problems (Reservation station entries) that could cause lock
    up if not managed correctly.

    Yes. Dataflow (something my advisor specialized in in the late 70's
    and 80s) works best with macro operations, not micro operations.

    Was part of a startup in 2000 that did dataflow using XML datagrams
    as the unit of transport (instead of a bit). An XML document would
    be received by the "system" and routed through a series of
    transformations using a dataflow engine resulting in an XML document
    (in degenerate form, HTML) as output (with side effects such as
    database updates along the way).

    The transformations included applying XSL stylesheets, making
    database accesses and updating fields in the XML with the results,
    and a few other macro-operations.

    There was a nice GUI to create the flow graph for the engine.

    Was eventually purchased by Verisign at the end of the dot-bomb.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Jun 17 18:52:19 2025
    On Tue, 17 Jun 2025 18:34:20 +0000, BGB wrote:

    On 6/17/2025 7:58 AM, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost of
    the forwarding exceeds that of the function units (such as an
    additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a CPU.
    But since a faster CPU - or, more specifically, a CPU with faster
    single-thread performance - is much more useful than more CPUs, it's
    still well worth the cost.


    It depends on the use case.

    But, at least in my experience with ARM hardware, in-order actually
    seems to hold up pretty well here.

    Like, if there were a 200% or more difference for OoO performance vs
    in-order performance relative to clock speed; maybe...

    GBOoO like Opteron is actually 2× faster than LBIO
    GBOoO like M4 is actually 4× faster than LBIO

    But, seemingly the delta is often modest enough that one can still make
    a case for in-order in cases where you don't actually need maximum single-thread performance.

    Forwarding decreases latency. Lower Latency is ALWAYS better when
    measured in picoseconds (not clocks).
    -----------------
    Often, as noted, each population member (or test member, or whatever it
    is called) would usually be represented in some bit-redundant format
    (such as each bit expanded out to a full byte for majority-8, or 3
    parallel copies for majority-3).
    Majority-8 was usually lookup table driven.
    Majority-3 was usually (A&B)|(B&C)|(A&C).

    Majority-3 is 1 gate delay (inverting) 222AOI
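
    In software the same (A&B)|(B&C)|(A&C) form votes all 64 bit
    positions of three redundant copies at once:

        #include <stdint.h>
        #include <stdio.h>

        /* Bitwise 2-of-3 majority vote. */
        static uint64_t maj3(uint64_t a, uint64_t b, uint64_t c)
        {
            return (a & b) | (b & c) | (a & c);
        }

        int main(void)
        {
            /* One copy has a flipped bit; the vote recovers the value. */
            printf("%#llx\n", (unsigned long long)maj3(0xF0, 0xF1, 0xF0));
            return 0;
        }
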
    ------------
    Can note, the Zen+ in my main PC seemingly has an odd property:
    Under 25% CPU load, per thread performance is at maximum;
    Around 25-50%, per-thread drops, but still often positive benefit;
    Over 50%, per-thread drops notably,
    so 100% isn't much better than 50%.

    Granted, the 50-100% domain is mostly hyperthreading territory.

    If it hurts, stop doing it. That is, turn off HypoThreading.

    But, it seems like there is some shared resource that becomes a
    bottleneck by around the time one hits 4 threads.

    Scheduling across multiple cores is known to be more than cubic.

    Had noted that it seems to apply mostly to memory-medium and
    memory-heavy use-cases, where:
    memory-medium: ~ 10 to 100MB of working data;
    memory-heavy: over 100MB of working data.
    Where, most of the data is touched continuously.

    Suspect cache hierarchy.

    If the task is primarily bound by things like branching or ALU/FPU,
    there does not seem to be a fall-off.

    Suspect cache.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to BGB on Tue Jun 17 21:58:17 2025
    On Tue, 17 Jun 2025 13:34:20 -0500
    BGB <cr88192@gmail.com> wrote:

    On 6/17/2025 7:58 AM, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or
    sometimes SUM( Results[type] × Operands[type] ); hardly more than
    quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost
    of the forwarding exceeds that of the function units (such
    as an additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a
    CPU. But since a faster CPU - or, more specifically, a CPU with
    faster single-thread performance - is much more useful than more
    CPUs, it's still well worth the cost.


    It depends on the use case.

    But, at least in my experience with ARM hardware, in-order actually
    seems to hold up pretty well here.

    Like, if there were a 200% or more difference for OoO performance vs in-order performance relative to clock speed; maybe...

    But, seemingly the delta is often modest enough that one can still
    make a case for in-order in cases where you don't actually need
    maximum single-thread performance.



    For Arm architecture, the difference in single-thread performance
    between the fastest available in-order cores (ARM Cortex-A520) and
    the fastest available OoO cores (Apple M4, Qualcomm Oryon) is huge.
    Probably, over 5x. Even Arm's own ARM Cortex-X925 is several times
    faster than A520.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to BGB on Tue Jun 17 22:33:36 2025
    On Tue, 17 Jun 2025 14:13:02 -0500
    BGB <cr88192@gmail.com> wrote:

    On 6/17/2025 1:58 PM, Michael S wrote:
    On Tue, 17 Jun 2025 13:34:20 -0500
    BGB <cr88192@gmail.com> wrote:

    On 6/17/2025 7:58 AM, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or
    sometimes SUM( Results[type] × Operands[type] ); hardly more
    than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the
    cost of the forwarding exceeds that of the function
    units (such as an additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding
    stops making sense.

    Out of order execution can cost more than the functional units in
    a CPU. But since a faster CPU - or, more specifically, a CPU with
    faster single-thread performance - is much more useful than more
    CPUs, it's still well worth the cost.


    It depends on the use case.

    But, at least in my experience with ARM hardware, in-order actually
    seems to hold up pretty well here.

    Like, if there were a 200% or more difference for OoO performance
    vs in-order performance relative to clock speed; maybe...

    But, seemingly the delta is often modest enough that one can still
    make a case for in-order in cases where you don't actually need
    maximum single-thread performance.



    For Arm architecture, the difference in single-thread performance
    between the fastest available in-order cores (ARM Cortex-A520) and
    the fastest available OoO cores (Apple M4, Qualcomm Oryon) is huge. Probably, over 5x. Even Arm's own ARM Cortex-X925 is several times
    faster than A520.


    For ARM, main reference points I had was A53 vs A72.
    A72 was faster, but not drastically...


    Arm Cortex A15/A57/A72 family of OoO cores designed in Austin, TX was no
    good.
    Arm Cortex A9/A12/A73/A75 family of OoO cores designed in Sophia
    Antipolis was significantly better.
    The next Austin-designed family (all current middle and high end Arm
    cores starting from A76) is better yet. Less so in perf/W and perf/area,
    more so in absolute performance.

    Cambridge-designed Cortex-A53 is a very good design, but it was never
    meant to have top performance. And it is old.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Jun 17 16:51:17 2025
    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
    ^^
    still?

    I understand that was your sales pitch, and I assume you had good
    reasons to think it was indeed true, but is it still the case now?

    AFAICT (see for example Anton's benchmarks in this regard) with current
    CPUs, "LBIO cores" are not terribly more power-efficient than big OoO cores.

    Or at least, it seems that the big OoO cores are not significantly less power-efficient when they are computing at the same speed as LBIO cores.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Jun 17 16:43:00 2025
    MitchAlsup1 [2025-06-17 17:45:23] wrote:
    On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:
    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.
    I can see that it isn't too hard on the logic for the register
    allocator,
    You are missing the BIG problem::

    The register allocator allocates Rk for calculation j and later allocates
    Rm for instruction p; then a few instructions later the code generator
    notices that Rk and Rm need to be paired or shared and they were not
    originally.

    What do you mean by "a few instructions later"? The above was stated in
    the context of a register allocator based on something like Chaitin's algorithm, which does not proceed "instruction by instruction" but
    instead takes a whole function (or basic block), builds an interference
    graph from it, then chooses registers for the vars looking only at that interference graph.

    but I suspect it will lead to more register saving and
    restoring.
    And reg-reg MOVment.

    Of course. The point is simply that in practice (for some particular
    compiler at least), the cost of restricting register access by using
    only 4bits despite the existence of 32 registers was found to be small.

    Note also that you can reduce this cost by relaxing the constraint and
    using 5bit for those instructions where there's enough encoding space.
    (or inversely, increase the cost by using yet fewer bits for those
    instructions where the encoding space is really tight).

    There's also a good chance that you can further reduce the cost by using
    a sensible mapping from 4bit specifiers instead of a randomized one.

    IOW, the point is that just because you have chosen to have 2^N
    registers in your architecture doesn't mean you have to offer access to
    all 2^N registers in every instruction that can access registers.
    It's clearly more convenient if you can offer that access, but if needed
    you can steal a bit here and there without having too serious an impact
    on performance.

    Consider a two instruction sequence where the output of the
    first instruction is an input to the second. The first instruction has
    only a choice of 16 registers, not 32. And the second instruction also
    has 16 registers, but on average only half of them will be in the 16
    included in the first instruction. So instead of 32 registers to choose
    from you only have 8.

    Right. But in practice, the register allocator can often choose the
    rest of the register assignment such that one of those 8 is available.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Jun 17 21:11:11 2025
    On Tue, 17 Jun 2025 20:51:17 +0000, Stefan Monnier wrote:

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
    ^^
    still?

    I understand that was your sales pitch, and I assume you had good
    reasons to think it was indeed true, but is it still the case now?

    I have seen no dramatic change in the ratio of logic to SRAM to pins
    in the last 15 years, and if anything, the more layers of metal, the
    smaller the LBIO can be whereas GBOoO tend to use more of the layers.

    AFAICT (see for example Anton's benchmarks in this regard) with current
    CPUs, "LBIO cores" are not terribly more power-efficient than big OoO
    cores.

    I understand this point, and do not disagree. A lot of work has gone
    into decreasing the power consumed by instruction queueing--converting
    value-capturing reservation stations into value-free reservation
    stations has done a lot of this, while, at the same time taking pressure
    off of forwarding (at a minor cost in latency).

    Execution power is up only by the instruction rate multiplier and
    some minor term in power consumed in instructions that get thrown away
    via misprediction, or get run more than once due to replay.

    Or at least, it seems that the big OoO cores are not significantly less power-efficient when they are computing at the same speed as LBIO cores.

    The LBIO cores are more heavily dependent on low latency whereas
    GBOoO cores are more tolerant of latency and of order.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Jun 17 21:19:23 2025
    On Tue, 17 Jun 2025 19:04:49 +0000, BGB wrote:

    On 6/17/2025 12:59 PM, quadibloc wrote:
    On Tue, 17 Jun 2025 17:41:10 +0000, MitchAlsup1 wrote:

    But that is NOT the arithmetic you are looking at::

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
    So the arithmetic is: can your OoO core be 10× faster than the LBIO core
    ??
    And the answer is NO.

    But my code runs faster on the OoO core than on ten LBIO cores, because
    nobody knows how to make effective use of ten cores to solve the
    problem.

    So the fact that it uses 10x the electrical power, while only having 2x
    the raw power - for an embarrassingly parallel problem, which doesn't
    happen to be the one I need to solve - doesn't matter.

    That's why OoO chips sell so well.


    Errm, this doesn't agree with my experience.

    More like the OoO chips are around 20-40% faster, but depending on
    workload.

    Then you are latency bound, not compute bound.


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Jun 17 18:14:48 2025
    MitchAlsup1 [2025-06-17 21:18:29] wrote:
    On Tue, 17 Jun 2025 20:43:00 +0000, Stefan Monnier wrote:
    What do you mean by "a few instructions later"? The above was stated in
    the context of a register allocator based on something like Chaitin's
    algorithm, which does not proceed "instruction by instruction" but
    instead takes a whole function (or basic block), builds an interference
    graph from it, then chooses registers for the vars looking only at that
    interference graph.

    I am regurgitating conversations I have had with compiler people over
    the last 40 years. Nothing I have seen in ISA design has moderated
    these problems--but I, personally, have not been inside a compiler
    for 41 years, either (1983). So, find a compiler writer to set this
    record straight. I continue to be told: it is enough harder that you
    should design ISA so you don't need pairing or sharing, ever.

    Ah, well, pairing is a different problem than the "incomplete register specifiers" I'm talking about. Indeed, it can be much more difficult to
    adapt a Chaitin-style allocator to handle pairing because it can't be
    expressed simply in the interference graph.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Jun 17 21:18:29 2025
    On Tue, 17 Jun 2025 20:43:00 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-06-17 17:45:23] wrote:
    On Tue, 17 Jun 2025 1:26:01 +0000, Stephen Fuld wrote:
    On 6/16/2025 9:17 AM, Stefan Monnier wrote:
    I vaguely remember reading somewhere that it doesn't have to be too bad:
    e.g. if you reduce register-specifiers to just 4bits for a 32-register
    architecture and kind of "randomize" which of the 16 values refer to
    which of the 32 registers for each instruction, it's fairly easy to
    adjust a register allocator to handle this correctly (assuming you
    choose your instructions beforehand, you simply mark, for each
    instruction, the unusable registers as "interfering"), and the end
    result is often almost as good as if you had 5bits to specify
    the registers.
    I can see that it isn't too hard on the logic for the register
    allocator,
    You are missing the BIG problem::

    The register allocator allocates Rk for calculation j and later allocates
    Rm for instruction p; then a few instructions later the code generator
    notices that Rk and Rm need to be paired or shared and they were not
    originally.

    What do you mean by "a few instructions later"? The above was stated in
    the context of a register allocator based on something like Chaitin's algorithm, which does not proceed "instruction by instruction" but
    instead takes a whole function (or basic block), builds an interference
    graph from it, then chooses registers for the vars looking only at that interference graph.

    I am regurgitating conversations I have had with compiler people over
    the last 40 years. Nothing I have seen in ISA design has moderated
    these problems--but I, personally, have not been inside a compiler
    for 41 years, either (1983). So, find a compiler writer to set this
    record straight. I continue to be told: it is enough harder that you
    should design ISA so you don't need pairing or sharing, ever.

    but I suspect it will lead to more register saving and
    restoring.
    And reg-reg MOVment.

    Of course. The point is simply that in practice (for some particular compiler at least), the cost of restricting register access by using
    only 4bits despite the existence of 32 registers was found to be small.

    Note also that you can reduce this cost by relaxing the constraint and
    using 5bit for those instructions where there's enough encoding space.
    (or inversely, increase the cost by using yet fewer bits for those instructions where the encoding space is really tight).

    There's also a good chance that you can further reduce the cost by using
    a sensible mapping from 4bit specifiers instead a randomized one.

    IOW, the point is that just because you have chosen to have 2^N
    registers in your architecture doesn't mean you have to offer access to
    all 2^N registers in every instruction that can access registers.
    It's clearly more convenient if you can offer that access, but if needed
    you can steal a bit here and there without having too serious an impact
    on performance.

    Consider a two instruction sequence where the output of the
    first instruction is an input to the second. The first instruction has
    only a choice of 16 registers, not 32. And the second instruction also
    has 16 registers, but on average only half of them will be in the 16
    included in the first instruction. So instead of 32 registers to choose
    from you only have 8.

    Right. But in practice, the register allocator can often choose the
    rest of the register assignment such that one of those 8 is available.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Stephen Fuld on Tue Jun 17 16:47:55 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).

    OS/360 "relocatable" ... included address constants in executable images
    that had to be modified when first loaded into real storage (which
    continued after move to virtual storage).

    The initial decision to add virtual memory to all 370s was based on the
    fact that OS/360 "MVT" storage management was so bad that (concurrently
    loaded) executable sizes had to be specified four times larger than used
    ... so typical 1mbyte (real storage) 370/165 only ran four concurrently executing regions, insufficient to keep 165 busy and justified. Running
    MVT in a (single) 16mbyte virtual address space, aka VS2/SVS (sort of
    like running MVT in a CP67 16mbyte virtual machine) allowed concurrently running regions to be increased by a factor of four (modulo 4bit storage protection keys required for isolating each region) with little or no
    paging.

    As systems got larger they needed to run more than 15 concurrent regions (storage protect key=0 for kernel, 1-15 for regions). As a result they
    move to VS2/MVS ... a separate 16mbyte virtual address space for each
    region (to eliminate storage protect key 15 limit on concurrently
    executing regions). However since OS/360 APIs were heavily pointer
    passing, they map an 8mbyte kernel image into every virtual address
    space (allowing pointer passing kernel calls to use passed pointer
    directly) ... leaving 8mbyte for each region.

    However kernel subsystems were also mapped into their own, separate
    16mbyte virtual address space. For (pointer passing) application calls
    to subsystem, a one megabyte "common segment area" ("CSA") was mapped
    into every 16mbyte virtual address space for pointer passing API calls
    to subsystems ... leaving 7mbytes for every application.

    However, by later half of 70s & 3033 processor, since the total common
    segment API data space was somewhat proportional to number of subsystems
    and number of concurrently executing regions ... the one mbyte "common
    SEGMENT area" was becoming 5-6mbyte "common SYSTEM area", leaving only 2-3mbytes for applications ... but frequently threatening to become
    8mbyte (leaving zero bytes for applications).

    This was part of desperate need to migrate from MVS to 370/XA and MVS/XA
    with 31-bit addressing as well as "access registers" ... where call to subsystem switched the caller's address space pointer to the secondary
    address space and loads the called subsystem address space pointer into
    the primary address space ... allowing subsystem to directly address
    caller's API data in (secondary address space) private area (not needing
    to be placed in a "CSA"). The subsystem then returns to the caller
    ... and the caller's address space pointer is switched back from
    secondary to primary.

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to BGB on Tue Jun 17 23:35:23 2025
    On 6/17/2025 11:16 PM, BGB wrote:
    On 6/17/2025 4:19 PM, MitchAlsup1 wrote:
    On Tue, 17 Jun 2025 19:04:49 +0000, BGB wrote:

    On 6/17/2025 12:59 PM, quadibloc wrote:
    On Tue, 17 Jun 2025 17:41:10 +0000, MitchAlsup1 wrote:

    But that is NOT the arithmetic you are looking at::

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the
    power.

    So the arithmetic is: can your OoO core be 10× faster than the LBIO
    core
    ??
    And the answer is NO.

    But my code runs faster on the OoO core than on ten LBIO cores, because
    nobody knows how to make effective use of ten cores to solve the
    problem.

    So the fact that it uses 10x the electrical power, while only having 2x
    the raw power - for an embarrassingly parallel problem, which doesn't
    happen to be the one I need to solve - doesn't matter.

    That's why OoO chips sell so well.


    Errm, this doesn't agree with my experience.

    More like the OoO chips are around 20-40% faster, but depending on
    workload.

    Then you are latency bound, not compute bound.


    Possibly...

    A lot of the code doesn't do that much math or dense logic on the data.
    But, a whole lot of mostly shoveling data around, often through lookup
    tables or similar.


    But, if the usual claim is that it is N times faster, this would imply
    it is N times faster across the board, rather than "N times fast, but
    only if the logic happens to have lots of complex math expressions and similar."

    I disagree. To me, N times faster doesn't mean across the board, i.e.
    on every workload, but on average across a variety of workloads.

    And, of course, one of the big advantages of OoO is better latency
    tolerance i.e. it can often do something useful while waiting for a load instruction to complete. That may explain your interpreter/compiler
    results.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Wed Jun 18 06:58:51 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.
    ^^
    still?

    I understand that was your sales pitch, and I assume you had good
    reasons to think it was indeed true, but is it still the case now?

    AFAICT (see for example Anton's benchmarks in this regard) with current
    CPUs, "LBIO cores" are not terribly more power-efficient than big OoO cores.

    I have not done any measurements on power-efficiency of in-order
    vs. OoO cores myself, but Andrei Frumusanu measured the in-order
    Cortex-A55, the OoO Cortex-A75 and the OoO Samsung M4 in the Exynos
    9820 and published those measurements on anandtech <https://www.anandtech.com/show/14072/the-samsung-galaxy-s10plus-review/4>

    Concerning power-efficiency (on SPEC2006 Int+FP Geomean), <https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
    is most relevant: If you use the A75 and the A55 at their most
    efficient points, the A55 is slightly more efficient, but more than 3
    times slower. As soon as you want more performance from the A55, it
    becomes so inefficient that the A75 will beat it at power-efficiency.

    Concerning area, in <2024Jan24.225412@mips.complang.tuwien.ac.at> I
    estimated that the A75 has 3-4 times the size of the A55 (on the same
    chip, i.e., the same number of metal layers etc.), for 3-4 times more performance. So in-order does not look more area-efficient, either.

    So why do ARM still do in-order Cortex-A cores? Maybe for the bottom
    of the smartphone market who only care about cost. And maybe the
    smartphone manufacturers want to brag about the number of cores their
    SoC has without paying the licensing and area costs for so many OoO
    cores.

    Or at least, it seems that the big OoO cores are not significantly less >power-efficient when they are computing at the same speed as LBIO cores.

    Looking at <https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>
    for the 5 SPEC2006 Int+FP Geomean speed, the OoO A75 is more than 3
    times more efficient and the OoO Samsung M4 is more than 2 times more
    efficient than the in-order A55. OTOH, for the 1.1 SPEC2006 Int+FP
    Geomean speed, the A75 at its slowest speed is slightly less
    efficient, and the M4 is about 1.5 times less efficient than the A55
    (assuming in both cases that the OoO cores go into a low-power state
    after they have finished the job and consume too little additional
    energy while waiting to cause a significant change in power
    consumption).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Wed Jun 18 07:31:55 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Ah, well, pairing is a different problem than the "incomplete register
    specifiers" I'm talking about. Indeed, it can be much more difficult to
    adapt a Chaitin-style allocator to handle pairing because it can't be
    expressed simply in the interference graph.

    I remember reading a paper about register allocation for register
    pairs, but don't find that paper right now. Anyway, what I read was
    based on graph-colouring IIRC and it looked pretty plausible and not
    too complicated. But the devil is in the details.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Lynn Wheeler on Wed Jun 18 13:48:15 2025
    Lynn Wheeler <lynn@garlic.com> writes:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Yes, but as I have argued before, this was a mistake, and in any event
    base registers became obsolete when virtual memory became available
    (though, of course, IBM kept it for backwards compatibility).



    As systems got larger they needed to run more than 15 concurrent regions
    (storage protect key=0 for kernel, 1-15 for regions).

    Back in the late 70's, _The Adolescence of P-1_ was published, wherein
    the protagonist uses a timing loop to obtain a storage protect key
    of zero. Which led to the development of a massively distributed
    AI. It's still a fine tale, albeit somewhat dated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Wed Jun 18 14:10:42 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Tue, 17 Jun 2025 12:58:06 +0000, quadibloc wrote:
    On Tue, 17 Jun 2025 6:28:43 +0000, BGB wrote:
    On 6/16/2025 6:37 PM, MitchAlsup1 wrote:

    The cost of register forwarding is:: Results × Operands or sometimes
    SUM( Results[type] × Operands[type] ); hardly more than quadratic.

    Quadratic is still a lot worse than linear.

    You don't have to go very far in a quadratic curve before the cost of
    the forwarding exceeds that of the function units (such as an
    additional ALU or similar).

    Yes, but that's not necessarily the point at which forwarding stops
    making sense.

    Out of order execution can cost more than the functional units in a CPU.
    But since a faster CPU - or, more specifically, a CPU with faster
    single-thread performance - is much more useful than more CPUs, it's
    still well worth the cost.

    In fact, just as I've included VLIW as a basic feature in the Concertina
    II, as a way to explicitly do what OoO does transparently for the
    programmer,

    Have enough instructions in the queue to deal with memory delays which
    cannot be determined by the compiler in a reasonable way? How does
    it do that?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Wed Jun 18 14:45:24 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 17 Jun 2025 1:20:40 +0000, quadibloc wrote:

    On Mon, 16 Jun 2025 23:37:05 +0000, MitchAlsup1 wrote:

    Then again, SW over the last
    15 years has demonstrated no ability to use "lots" of small cheap cores.

    And as long as that remains true, out-of-order execution will continue
    to be popular, and there will also be strong pressure to find exotic
    materials that can be used to make faster transistors - and faster
    interconnects between them.

    While I am willing to agree that we can do better in using multiple
    cores, I also think that even after we do all that we can in that area,
    a single core that is N times faster will still be better than N cores.

    But that is NOT the arithmetic you are looking at::

    A core 1/2 as fast as an Opteron Rev F is 1/12 the size, 1/10 the power.

    So the arithmetic is: can your OoO core be 10× faster than the LBIO core
    ??
    And the answer is NO.

    The highest performing OoO is M4 right now and it is 2× faster than
    Opteron Rev F (after normalizing frequency)--perhaps I should say
    2× more instructions per clock. If M4 area was equal to Opteron area
    (highly doubtful after normalizing) it would still be a factor of 5-6×
    more area than 12 LBIO cores.

    I think the situation is much more nuanced. I recently bought
    a few mini PCs and did a comparison with bigger machines. A single
    core of a 6-core Zen 2 machine (Ryzen 5 3600) is about 3 times
    faster than a reasonably modern Celeron, about 10 times faster
    than a Celeron N3060 and about 7 times faster than a core in
    an Allwinner H2 chip. The newer 12-core Ryzen 9 7900 has an about
    60% faster core than Zen 2 and has a TDP of 65 W, that is 5.5 W
    per core. IIUC low end PC-s have a TDP of about 2-3 W and two cores,
    so about 1 W per core. Of course, single core performance on a
    multicore processor is inflated due to the increase of clock
    frequency (and power) when only one core is active. Also,
    I used Dhrystone to get numbers. But I have similar
    performance ratios on my real loads (including parallel
    ones which use all cores). IIUC in-order mini PC-s
    have really poor performance; the best ones are out of
    order, but still significantly slower than a big core.

    Anyway, my results are in line with results given by Anton,
    and do not agree with your estimate.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jun 18 11:22:12 2025
    Chris M. Thomasson [2025-06-18 00:47:51] wrote:
    On 6/17/2025 1:43 PM, Stefan Monnier wrote:
    What do you mean by "a few instructions later"? The above was stated in
    the context of a register allocator based on something like Chaitin's
    algorithm, which does not proceed "instruction by instruction" but
    [...]
    Fwiw, here is some old code of mine, a region allocator in C that should still work today... Sorry for butting in:

    Hmmm "region" != "register".


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jun 18 11:50:26 2025
    Anton Ertl [2025-06-18 07:31:55] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Ah, well, pairing is a different problem than the "incomplete register
    specifiers" I'm talking about. Indeed, it can be much more difficult to
    adapt a Chaitin-style allocator to handle pairing because it can't be
    expressed simply in the interference graph.
    I remember reading a paper about register allocation for register
    pairs, but don't find that paper right now. Anyway, what I read was
    based on graph-colouring IIRC and it looked pretty plausible and not
    too complicated. But the devil is in the details.

    Preston Briggs (who used to be a regular here) discusses such an
    allocator in his PhD thesis (https://repository.rice.edu/items/2ea2032a-0872-43a1-90c0-564c1dd2275f).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Wed Jun 18 09:55:21 2025
    On 6/15/2025 12:29 PM, quadibloc wrote:
    On Sat, 14 Jun 2025 16:45:23 +0000, Stephen Fuld wrote:
    On 6/14/2025 3:48 AM, Thomas Koenig wrote:

    Which made nonsense the concept of making data relocatable by
    always using base registers.

    Forgive me, but I don't see why.  When the program is linked, the COMMON
    block is at some fixed displacement from the start of the program.  So
    the program can "compute" the real address of the data in common blocks
    from the address in its base register.

    The purpose of a COMMON block is to share variables between the main
    program and subroutines.

    Sure.

    On the System/360, a FORTRAN compiler typically compiled each subroutine
    in a program separately from every other subroutine. They just got
    linked together by the linking loader in order to run.

    I'm not sure what you mean by "linking loader". The linkage editor (IIRC
    IEWL) linked together all of the object modules created by the compiler.
    Loading the program was a different operation (again IIRC, done by the
    initiator in each partition).

    So no subroutine would know where a COMMON block created by the loader
    for the main program would be unless that information was given to
    it - and
    the loader would give it that information, in the form of a full 24-bit address constant, so it didn't have to be passed as a parameter.

    So the linkage editor knows where the common block (and hence all of the variables within it) is located relative to the start of the program,
    but it doesn't know where in real storage the program will be loaded.
    So all references to variables in the common block are resolved,
    relative to the start of the program at link time. But you still need
    the base register scheme to resolve the address based on where in real
    storage the program gets loaded. This all has nothing to do with
    addresses being passed as parameters.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Wed Jun 18 18:16:37 2025
    On Wed, 18 Jun 2025 15:14:06 +0000, quadibloc wrote:

    On Wed, 18 Jun 2025 14:10:42 +0000, Thomas Koenig wrote:
    quadibloc <quadibloc@gmail.com> schrieb:

    In fact, just as I've included VLIW as a basic feature in the Concertina
    II, as a way to explicitly do what OoO does transparently for the
    programmer,

    Have enough instructions in the queue to deal with memory delays which
    cannot be determined by the compiler in a reasonable way? How does
    it do that?

    VLIW only deals with one of the things OoO solves: stuff like
    read-after-write pipeline hazards. It doesn't address cache misses in
    any way.

    Statically scheduled VLIW is dependent on there being no variable
    latency results. So,
    a) cache misses
    b) TLB misses
    c) FDIV/SQRT taking variable cycles
    d) some kinds of Store buffering

    are all off the table in a VLIW design, but are on the table in any
    other design with discovered forwarding.

    For example, in 1991 we were working on an FDIV algorithm in FMUL
    (Goldschmidt) that always delivered correct results in 17 cycles
    (rather standard for the day). We discovered that we could deliver
    a result in 12 cycles that needed to be fixed up 1/128 of the time.
    So, would you rather have an FDIV in 12.25 RMS clocks or 17 static
    clocks?
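
    A sketch of that expected-latency arithmetic in C (the 32-cycle fixup
    cost is my inference from the quoted 12.25 figure, not a number given
    in the post):

        #include <stdio.h>

        int main(void) {
            /* 12-cycle fast path, fixed up once per 128 divides; the
               32-cycle fixup cost is an inference, not a given number */
            double expected = 12.0 + (1.0 / 128.0) * 32.0;
            printf("expected FDIV latency: %.2f cycles (vs 17 static)\n",
                   expected);
            return 0;
        }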

    Statically scheduled VLIW takes this game away.

    So that's total bad news, right?

    Grim, maybe. Bad, not necessarily.

    It proves VLIW is useless?

    Not at all--it demonstrates that VLIW is less than ideal when dealing
    with unpredictable latencies.


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 19 01:03:37 2025
    On Wed, 18 Jun 2025 22:00:54 +0000, quadibloc wrote:

    On Wed, 18 Jun 2025 18:16:37 +0000, MitchAlsup1 wrote:
    On Wed, 18 Jun 2025 15:14:06 +0000, quadibloc wrote:

    So that's total bad news, right?

    Grim, maybe, Bad, not necessarily.

    It proves VLIW is useless?

    Not at all--it demonstrates the VLIW is less than ideal when dealing
    with unpredictable latencies.

    I'm surprised, though, that you did not continue onwards, and comment on
    the part where I blamed you for finding a resolution to this problem.

    The resolution to the problem means the VLIW-ness of that ISA is no
    longer necessary.

    Because, unless my memory is very faulty, you noted that the OoO implementation of the 6600 _is_ adequate for dealing with unpredictable latencies, such as those from cache misses (even if the 6600 didn't have
    a cache; instead, it had extra memory under program control)... and so
    it seemed to me that since VLIW can theoretically handle register
    hazards almost as well as Tomasulo, it could complement a 6600-style
    pipeline to provide a match for the resource hog OoO style in common use today.

    But once you have dynamic scheduling*, you no longer need VLIW-ness
    to jam all the instructions in per clock--you can do it for a typical
    von Neumann instruction set.

    VLIWs, as used in the past (MultiFlow...to...Mill), are all statically
    scheduled.

    Once you are no longer statically scheduled, the VLI part is not needed,
    and indeed serves as an arbitrary limit to the widths you can achieve
    in practice.

    (*) dynamic scheduling is how one tolerates unknowable latencies.
    VLIW scheduling is how one tolerates only known latencies.


    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jun 19 08:56:34 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Anton Ertl [2025-06-18 07:31:55] wrote:
    I remember reading a paper about register allocation for register
    pairs, but don't find that paper right now. Anyway, what I read was
    based on graph-colouring IIRC and it looked pretty plausible and not
    too complicated. But the devil is in the details.

    Preston Briggs (who used to be a regular here) discusses such an
    allocator in his PhD thesis

    Given that I found nothing else, that was probably it.

    (https://repository.rice.edu/items/2ea2032a-0872-43a1-90c0-564c1dd2275f).

    Oh boy, not only did Rice produce the pdf from the scan of a paper
    copy, one of the pages in the pairing chapter was also not properly
    scanned. Fortunately, I have a digital copy of Briggs' Thesis, and I
    have now temporarily put it online. I will send a copy to Rice, maybe
    they will update their copy.

    http://www.complang.tuwien.ac.at/anton/tmp/briggs-thesis.ps.gz

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 19 13:17:39 2025
    On Thu, 19 Jun 2025 1:32:07 +0000, quadibloc wrote:

    On Thu, 19 Jun 2025 1:03:37 +0000, MitchAlsup1 wrote:

    The resolution to the problem means the VLIW-ness of that ISA is no
    longer necessary.

    That may be.

    But because I'm not the expert on things like this that you are, I don't
    feel that I can dispute the conventional wisdom. The conventional
    wisdom, as practiced by Intel and AMD and pretty much the whole CPU
    industry is that the dynamic scheduling design as used in the Control
    Data 6600 is inadequate, and one has to go to register rename or the equivalent Tomasulo Algorithm in order to achieve acceptable
    performance.

    Nothing prevents a Scoreboard from using renamed registers.

    The 6600 doesn't cover all register hazards.

    It covers RAW, WAR, and WAW, but ignores the RAR hazard.
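
    A minimal illustration in C, with variables standing in for registers
    (a hypothetical three-instruction sequence, not 6600 code):

        #include <stdio.h>

        int main(void) {
            int r1, r4, r2 = 2, r3 = 3, r5 = 5, r6 = 6, r7 = 7;
            r1 = r2 + r3;   /* i1: writes r1                   */
            r4 = r1 + r5;   /* i2: RAW -- must see i1's result */
            r1 = r6 + r7;   /* i3: WAW vs i1, WAR vs i2        */
            printf("%d %d\n", r4, r1);  /* 10 13 only if order holds */
            return 0;
        }

    Two reads of the same register (RAR) impose no ordering at all, which
    is why the scoreboard can ignore them.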

    VLIW can deal with register hazards, but it doesn't help at all with
    cache misses. I have no reason to doubt your claim that the 6600
    mechanism is adequate to deal with cache misses, though; that's why I
    noted combining the two as an option.

    Maybe you are right that this is useless, but I'm not in a position to
    dispute what Patterson and Hennessy have proclaimed and the industry
    has accepted.

    But I'm saying that even if Patterson and Hennessy _are_ right, adding
    VLIW provides a method by which your goal - getting rid of the bulk of
    the transistor and power overhead of OoO by going to the 6600 design -
    would _still_ be achievable, since adding VLIW is essentially trivial.

    That is not my goal. My goal is VAX-instruction count with RISC-like pipelineability.

    Sure, I could be wrong - and 6600 by itself is plenty good enough. But
    given all the naysayers out there, a way out of the GBOoO rut that
    people might be willing to believe could work has got to have some
    value.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 19 13:26:28 2025
    On Thu, 19 Jun 2025 2:29:27 +0000, quadibloc wrote:

    I've decided that this would be a good time to review the difference
    between the 6600 scoreboard and modern OoO.

    Having refreshed my memory, I see the issue is that when there is a WAR hazard, a 6600-style scoreboard simply stalls the functional unit.

    The Scoreboard stalls DELIVERY of the result (W) until all consumers
    of the previous value have consumed that value (AR), something Tomasulo
    does not even try to do.

    Tomasulo or register renaming provides extra storage, either in the reservation stations or in rename registers, so that if the desired
    result register is not yet available, the result can just go in an extra place.

    One can use a skid buffer per function unit to hold calculated results
    awaiting write permission, avoiding the WAR-hazard stalls in getting
    instruction calculations started.

    This suggests that, just as some caches are designed in a very simple
    fashion, one could have a "stupid" form of register rename - say, each
    register has its own rename register - that could be added to a
    scoreboard. I would have thought that people are already doing this,
    but they're calling it full register-rename OoO and not
    scoreboarding-plus, because that's better marketing-wise if nothing
    else.

    Nothing in the scoreboard prevents you from using renamed registers.

    RISC mitigates WAR hazards by having 32 registers instead of, say, 8 or
    16.

    And since the Scoreboard is quadratic in area, going from 8 registers
    (6600) to 32 (R3000) makes the gate count in the scoreboard go up by
    (32/8)^2 = 16×.

    VLIW marks out groups of instructions that don't have RAW or WAR
    hazards. A scoreboard keeps track of dependencies, so it can delay only
    those instructions affected by a cache miss. Since a 6600 scoreboard
    does have to _detect_ WAR hazards, even if it doesn't handle them as
    well as Tomasulo, putting in a bit to indicate one is present is indeed
    not needed, so you are right there... at least for an older-style
    computer.

    But a lot of computers these days have multiple copies of each
    functional unit; that is, they're superscalar. So indicating that
    several instructions can be executed together with no need for any
    thought would seem to make things go faster.

    Except, of course, when there's a chance one is trying to execute instructions at a time when all dependencies are not resolved - some registers aren't loaded yet with the data some of those instructions
    will need. They all have to go through the scoreboard to check that. But
    the instructions in a group are guaranteed not to depend on *each
    other*, so they can be checked against the scoreboard _in parallel_.
    That's what the VLIW bits can help with.

    So you say ...

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Fri Jun 20 05:56:56 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 12 Jun 2025 8:38:06 +0000, Anton Ertl wrote:
    [program counter as GPR]
    Nowadays, you can afford it, but the question still is whether it is
    cost-effective. Looking at recent architectures, none has PC
    addressable like a GPR, so no, it does not look to be cost-effective
    to have the PC addressable like a GPR. AMD64 and ARM A64 have
    PC-relative addressing modes, while RISC-V does not.

    Consider that in an 8-wide machine, IP gets added to 8 times per cycle,
    whereas no GPR has a property anything like that.

    Actually the 6-wide renamer of Golden Cove (Alder Lake P-core) can
    handle 6 dependent adds of small constants to GPRs per cycle, and I
    would be surprised if the 8-wide Lion Cove (Arrow Lake P-Core) would
    not be able to do 8 dependent adds of small constants to GPRs.

    However, in the case of the PC, I think you can produce the 8 PC
    values with less effort. The decoder knows where each instruction has
    started, so it just needs to propagate this information as the
    instruction-specific PC value to the execution engine.
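
    A minimal sketch in C of that propagation (the 8-wide figure and the
    per-instruction length array are assumptions for illustration):

        #include <stdint.h>

        /* the decoder knows each instruction's start, so the eight
           per-instruction PC values are just a running sum of lengths */
        void assign_pcs(uint64_t block_pc, const uint8_t len[8],
                        uint64_t pc[8]) {
            uint64_t off = 0;
            for (int i = 0; i < 8; i++) {
                pc[i] = block_pc + off;
                off += len[i];
            }
        }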

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to quadibloc on Fri Jun 20 17:11:31 2025
    quadibloc wrote:
    On Sat, 14 Jun 2025 8:35:31 +0000, Robert Finch wrote:

    Packing and unpacking decimal floats can be done inexpensively and fast
    relative to the size, speed of the decimal float operations. For my own
    implementation I just unpack and repack for all ops and then registers
    do not need any more than 128-bits.

    I also unpack the hidden first bit on IEEE-754 floats.

    The idea is that the ISA may be used for a wide variety of
    implementations, and on at least some of them, anything that takes an
    amount of time above zero may make a difference.

    You might not be aware that by unpacking the hidden bit, you at the same
    time destroy the very nice feature of a maximally dense packing, where a
    simple unsigned increment always brings you to the next possible
    floating point value, and on rounding up you get from exp + 0xfff..f
    mantissa directly to exp+1 + 0x000..0

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Fri Jun 20 15:59:14 2025
    On Fri, 20 Jun 2025 15:30:59 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:

    Ideally, one has an ISA where nearly all registers are the same:
    No distinction between base/index/data registers;
    No distinction between integer and floating point registers;
    No distinction between general registers and SIMD registers;
    ...

    ------------------------
    But I felt that this was OK, since as everybody knows, strings really
    only have to be able to be at least 80 characters long. Hmm... wait a
    moment, aren't 132-character strings sometimes needed?

    Line printers are/were 132 characters wide.

    Oh, well.

    Oh well, indeed!

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From moi@21:1/5 to All on Fri Jun 20 17:12:51 2025
    On 20/06/2025 16:59, MitchAlsup1 wrote:
    On Fri, 20 Jun 2025 15:30:59 +0000, quadibloc wrote:

    On Wed, 11 Jun 2025 9:42:47 +0000, BGB wrote:

    Ideally, one has an ISA where nearly all registers are the same:
       No distinction between base/index/data registers;
       No distinction between integer and floating point registers;
       No distinction between general registers and SIMD registers;
       ...

    ------------------------
    But I felt that this was OK, since as everybody knows, strings really
    only have to be able to be at least 80 characters long. Hmm... wait a
    moment, aren't 132-character strings sometimes needed?

    Line printers are/were 132 characters wide.

    Some were. Others were only 120 wide and others were 160 wide.

    --
    Bill F.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Jun 20 12:43:29 2025
    quadibloc [2025-06-14 15:22:20] wrote:
    I also unpack the hidden first bit on IEEE-754 floats.
    The idea is that the ISA may be used for a wide variety of
    implementations, and on at least some of them, anything that takes an
    amount of time above zero may make a difference.

    Do you have any evidence that hiding the leading 1 bit takes more time
    than not hiding? I can think of reasons why either of the two options
    could be marginally cheaper than the other, but in all cases I can think
    of, it would make *very little* difference, if any.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Fri Jun 20 18:34:33 2025
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were >>>>> pretty far along the way (both were released in November 1995), so
    it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore >>>>> it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    Hydro had 77K employees in 130 countries, there was no such thing as a
    simple hierarchical setup.

    Rather more like a loose federation across varying local environments.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Jun 20 17:38:59 2025
    On Fri, 20 Jun 2025 16:43:29 +0000, Stefan Monnier wrote:

    quadibloc [2025-06-14 15:22:20] wrote:
    I also unpack the hidden first bit on IEEE-754 floats.
    The idea is that the ISA may be used for a wide variety of
    implementations, and on at least some of them, anything that takes an
    amount of time above zero may make a difference.

    Do you have any evidence that hiding the leading 1 bit takes more time
    than not hiding? I can think of reasons why either of the two options
    could be marginally cheaper than the other, but in all cases I can think
    of, it would make *very little* difference, if any.

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).
    Hiding the top bit after rounded number is 0-gates.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Jun 20 13:48:20 2025
    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Fri Jun 20 18:47:53 2025
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    quadibloc <quadibloc@gmail.com> schrieb:
    On Sat, 14 Jun 2025 16:05:53 +0000, Anton Ertl wrote:

    The project started in earnest in 1994, when IBM had sold OoO
    mainframes for several years and the Pentium Pro and HP PA 8000 were >>>>>> pretty far along the way (both were released in November 1995), so >>>>>> it's not as if the two involved companies did not have in-house
    knowledge of the capabilities of OoO. They apparently chose to ignore >>>>>> it.
    ...
    It always surprises me that people think of corporations as
    monolithic entities, when they are in fact comprised of very
    different groups with very different tasks and very different
    agendas and interests.

    Corporations are organized hierarchically.

    Have you ever worked in a large corporation? (Just asking).

    Hydro had 77K employees in 130 countries, there was no such thing as a
    simple hierarchical setup.

    Rather more like a loose federation across varying local environments.

    My point was that, even if the org chart shows a hierarchy, actual
    dynamics are _much_ more complex.

    That there are a lot of personal and group interests, a lot of
    communications upwards are targeted towards what people think the
    respective manager wants to hear. There are people trying to prove
    that their work is valuable (and most people think theirs is).
    Plus, there is a lot of talk like "X likes this" or "Y likes that",
    or "don't let management worry". Plus, there are managers who
    routinely ignore feedback because they don't have the spine to
    tell their own managers bad news, or who discourage it (later
    complaining that people don't inform them).

    One classic example was something that I only heard about,
    that was way above my level.

    At a meeting, some higher-up told his direct reports that anybody
    who said "Yes, but" would be removed from his leadership position,
    the only acceptable expression was "Yes, and". And guess what -
    said manager later sunk a huge project because, ahem, people didn't
    tell him the bad news, and ignoring a problem usually doesn't make
    it better.

    And now I'll stop before I get to the bad stuff :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Jun 20 20:46:44 2025
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?

    hidden = operand.exponent != 0

    Which is an 11-input NAND gate. I suspect you could assume it is 1
    and special case the result, but even special casing the result
    cannot be less than 1-gate (a multiplexer).

    You DO end up special casing Infinities and NaNs anyway.

    Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.
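
    In C, the same two tests applied while unpacking a binary64 (a sketch;
    the variable names are mine):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        int main(void) {
            double d = 0.75;
            uint64_t bits;
            memcpy(&bits, &d, sizeof bits);
            int exp     = (int)((bits >> 52) & 0x7FF);
            int hidden  = (exp != 0);       /* the 11-input "!= 0"   */
            int special = (exp == 0x7FF);   /* Inf/NaN: all-ones exp */
            uint64_t sig = ((uint64_t)hidden << 52)
                         | (bits & 0xFFFFFFFFFFFFFULL);
            printf("exp=%d hidden=%d special=%d sig=%014llx\n",
                   exp, hidden, special, (unsigned long long)sig);
            return 0;
        }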


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Fri Jun 20 18:13:56 2025
    MitchAlsup1 [2025-06-20 20:46:44] wrote:
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:
    Creating the hidden bit is 2-gates of delay (H,F,D,Q).
    How come it's not free in hardware?
    Is it only because of denormalized?

    hidden = operand.exponent != 0

    IOW denormalized.
    I'd attribute the cost to "denormalized" rather than to "hidden bit", then 🙂

    You DO end up special casing Infinities and NaNs; anyway.

    Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.

    Right, but this one is quite different since the result doesn't depend
    on the actual numerical computation on the mantissa (I'd assume that the
    number of cycles (or gate delays) to determine the desired output in the
    case of Inf/NaN inputs is smaller than to compute an add/mul, so this
    test of those 11 bits is not on the critical path).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Sat Jun 21 01:48:51 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?

    hidden = operand.exponent != 0

    Which is an 11-input NAND gate. I suspect you could assume it is 1
    and special case the result, but even special casing the result
    cannot be less than 1-gate (a multiplexer).

    You DO end up special casing Infinities and NaNs; anyway.

    Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.

    But does an explicit bit lead to a difference? IIUC the FPU needs
    special cases anyway. I would guess that a normal/special flag
    could save some time, but once the FPU knows that it is dealing with
    normal numbers the hidden bit should be effectively free.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Waldek Hebisch on Sat Jun 21 02:51:25 2025
    On Sat, 21 Jun 2025 1:48:51 +0000, Waldek Hebisch wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?

    hidden = operand.exponent != 0

    Which is an 11-input NAND gate. I suspect you could assume it is 1
    and special case the result, but even special casing the result
    cannot be less than 1-gate (a multiplexer).

    You DO end up special casing Infinities and NaNs; anyway.

    Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.

    But does an explicit bit lead to a difference? IIUC the FPU needs
    special cases anyway. I would guess that a normal/special flag
    could save some time, but once the FPU knows that it is dealing with
    normal numbers the hidden bit should be effectively free.

    FADD/SUB starts out with an exponent subtract, so you have time to
    "invent" the hidden bit at almost zero cost.

    FMUL/MAC/DIV/SQRT starts out with a big multiplexer (float,double)
    that allows one to invent the hidden bit at near zero cost.

    FCMP does not need to "invent" the hidden bit because it is a sign-
    magnitude integer compare (with special cases).

    So, the cost is between actual zero and nearly zero if your
    circuit designer is worth being on the payroll.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Tue Jun 24 08:15:35 2025
    MitchAlsup1 wrote:
    On Fri, 20 Jun 2025 17:48:20 +0000, Stefan Monnier wrote:

    Creating the hidden bit is 2-gates of delay (H,F,D,Q).

    How come it's not free in hardware?
    Is it only because of denormalized?

        hidden = operand.exponent != 0

    Which is an 11-input NAND gate. I suspect you could assume it is 1
    and special case the result, but even special casing the result
    cannot be less than 1-gate (a multiplexer).

    You DO end up special casing Infinities and NaNs; anyway.

       Special = operand.exponent == 0b11111111111

    Which is an 11-input AND gate.

    The way I understand hardware, you would do this in parallel with the
    regular fp op, so that all special inputs have a short-circuit result,
    and then a final mux which selects either the normal or the special result?

    I.e. only that single mux (one or two gate delays?) is part of the
    actual latency for fpu ops?


    This was the way I implemented FPU emulation on the Mill which has a
    "free" mux as one of the phases of all operations.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to BGB on Wed Jun 25 22:24:00 2025
    BGB <cr88192@gmail.com> wrote:

    But, if the usual claim is that it is N times faster, this would imply
    it is N times faster across the board, rather than "N times faster, but
    only if the logic happens to have lots of complex math expressions and similar."

    I have a contrived program which on machines from about 2010 peaked
    at about 10 MIPS; on earlier machines, starting from about 1990, it
    was closer to 2 MIPS. Basically the program is doing pointer chasing
    in a somewhat irregular pattern covering the whole memory. AFAICS it
    needed 2 RAM accesses per instruction (one for the second-level page
    table entry, one for the actual data); on modern machines with
    multilevel page tables it may be more (but modern machines tend to
    have quite large caches, and the few top levels of the page tables
    may fit in the on-chip cache).
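
    A minimal sketch in C of that kind of pointer chasing (the sizes and
    the mixing constant are arbitrary choices of mine, not the original
    program):

        #include <stdio.h>
        #include <stdlib.h>

        int main(void) {
            size_t n = (size_t)1 << 24;       /* larger than any cache */
            size_t *next = malloc(n * sizeof *next);
            if (!next) return 1;
            /* odd multiplier mod a power of two gives a permutation, so
               the chain wanders irregularly over the whole array */
            for (size_t i = 0; i < n; i++)
                next[i] = (i * 2654435761u + 1) & (n - 1);
            size_t i = 0;
            for (long s = 0; s < 100000000L; s++)
                i = next[i];          /* each load depends on the last */
            printf("%zu\n", i);       /* keep the chain observable */
            free(next);
            return 0;
        }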

    While this is a very unnatural program, it clearly shows that modern
    machines are fast only when caching/prefetching works as expected,
    and that badly behaving programs may be much slower than the execution
    speed of the core. And of course, within the core there are
    more factors that can cause slowdown.

    So any speed claims are probabilistic and implicitly or explicitly
    assume some program behaviour.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Waldek Hebisch on Wed Jun 25 23:00:14 2025
    On Wed, 25 Jun 2025 22:24:00 +0000, Waldek Hebisch wrote:

    BGB <cr88192@gmail.com> wrote:

    But, if the usual claim is that it is N times faster, this would imply
    it is N times faster across the board, rather than "N times faster, but
    only if the logic happens to have lots of complex math expressions and
    similar."

    I have a contrived program which on machines from about 2010 peaked
    at about 10 MIPS; on earlier machines, starting from about 1990, it
    was closer to 2 MIPS. Basically the program is doing pointer chasing
    in a somewhat irregular pattern covering the whole memory. AFAICS it
    needed 2 RAM accesses per instruction (one for the second-level page
    table entry, one for the actual data); on modern machines with
    multilevel page tables it may be more (but modern machines tend to
    have quite large caches, and the few top levels of the page tables
    may fit in the on-chip cache).

    In 1992 over the July 4th weekend, I forgot to stop a simulation
    of my 6-wide Mc 88120 processor running Matrix 300. The below
    numbers are for the 16KB DM cache:

    The first ~2B instructions ran at 5.98 IPC
    The next  ~2B instructions ran at 2.4 IPC
    The next  ~2B instructions ran at 0.6 IPC
    And the disk on the VAX ran out of space on the last transpose of
    Matrix300.

    When we traced it all down, it was due to TLB thrashing. Converting
    from a 64-entry FA TLB to a 256-entry DM TLB made the anomalies go
    away completely.

    While this is a very unnatural program, it clearly shows that modern
    machines are fast only when caching/prefetching works as expected,
    and that badly behaving programs may be much slower than the execution
    speed of the core. And of course, within the core there are
    more factors that can cause slowdown.

    Caches, TLBs, and predictors all have to work as anticipated, or
    performance drops precipitously.

    So any speed claims are probabilistic and implicitly or explicitly
    assume some program behaviour.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Sat Jun 28 14:32:44 2025
    On Thu, 12 Jun 2025 19:24:36 +0000, MitchAlsup1 wrote:
    On Thu, 12 Jun 2025 15:38:30 +0000, quadibloc wrote:

    VLIW, in the sense of the Itanium or the TMS 320C6000, offers the
    promise of achieving OoO level performance without the costs of OoO.

    Pick a VLIW that was successful like x86 or ARM in the marketplace.

    I now can see your point more clearly; I've researched the Itanium and the Intel i860 after watching a YouTube video on the i860 in order to update
    my web pages on the history of computers.

    Intel sold a few of its Touchstone Delta prototypes of its Paragon supercomputer to NASA and the like. This supercomputer was based on the
    Intel i860, and apparently it worked well enough there, having an
    appropriate instruction load.

    According to the video, the i860 failed because the interrupts that
    would be encountered frequently on a general-purpose computer severely
    degraded performance on the i860 with its long pipeline. Even CISC
    machines get longer and longer pipelines in order to improve performance;
    but then, Intel's Pentium IV _also_ failed in the marketplace, and was
    replaced by Intel Core microprocessors, which were an improved version of
    the Pentium III microarchitecture.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to quadibloc on Tue Jul 22 04:30:28 2025
    On Tue, 10 Jun 2025 22:53:27 +0000, quadibloc wrote:

    Include pairs of short instructions as part of the ISA, but make the
    short instructions 14 bits long instead of 15 so they get only 1/16 of
    the opcode space. This way, the compromise is placed in something that's
    less important. In the CISC mode, 17-bit short instructions will still
    be present, after all.

    After this change, I have been busily making minor tweaks to the ISA.

    The latest one involved a header format which allowed room for fourteen alternate 17-bit short instructions in a block, in order to permit
    a higher level of superscalar operation.

    I made opcode space for this header by using two opcodes from the standard memory-reference instruction set for it; they were the ones formerly used
    for load address and jump to subroutine with offset.

    I was not happy with doing this, however. Right now, I am engaging in a
    mighty struggle to squeeze the available opcode space to avoid doing this. However, try as I may, it may well be that the cost of this will turn out
    to be too great. But if I can manage it, a significant restructuring of
    the opcodes of this iteration of Concertina II may be coming soon.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to John Savard on Tue Jul 22 16:29:28 2025
    On Tue, 22 Jul 2025 04:30:28 +0000, John Savard wrote:

    However, try as I may, it may well be that the cost of this will turn
    out to be too great. But if I can manage it, a significant restructuring
    of the opcodes of this iteration of Concertina II may be coming soon.

    I have now revised my pages on Concertina II to reflect this latest
    change. Its most shocking result is that the three-operand arithmetic instructions in the basic 32-bit instruction set now only have six-bit
    opcodes. However, this didn't actually result in the omission of any
    useful instructions that had been defined for them when they had seven-bit opcodes.

    And the header mechanism, of course, allows the instruction set to be
    massively extended. Thus, I shouldn't really view this as an unacceptable
    cost requiring me to do a major rollback of the design... I think.

    But I'm not sure; cramming more and more stuff in has brought me to a point
    of being uneasy.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to quadibloc on Wed Jul 23 16:03:33 2025
    On Fri, 20 Jun 2025 19:46:42 +0000, quadibloc wrote:

    More importantly, I need 256-character strings if I'm using them as
    translate tables. Fine, I can use a pair of registers for a long string.

    I've realized now that I can have eight 256-character string registers
    if I instead use the extended register bank of 128 floating-point
    registers for the string registers; this provides another use for a
    set of registers that would otherwise be little used outside of VLIW
    code.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to Stephen Fuld on Thu Jul 24 10:47:09 2025
    On Sat, 14 Jun 2025 09:24:02 -0700, Stephen Fuld wrote:

    On S/360, that is exactly what you did. The first instruction in an
    assembler program was typically BALR (Branch And Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register.

    That's almost right. However, you can't really "BALR to the next
    instruction", because BALR is a register-to-register instruction.
    Therefore, it doesn't reference memory.

    It's the register-to-register version of BAL, the jump to subroutine instruction (Branch and Link), and because of that, it doesn't do any branching, and has no branch target.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Sat Jul 26 05:57:49 2025
    On Thu, 22 May 2025 17:42:14 +0000, MitchAlsup1 wrote:
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:

    What is Concertina 2?

    Roughly speaking, it is a design where most of the non-power of 2 data
    types are being supported {36-bits, 48-bits, 60-bits} along with the
    standard power of 2 lengths {8, 16, 32, 64}.

    As this is such a fondly remembered feature, I have finally gotten
    around to adding one header type to the ISA that enables it. I do,
    however, carefully note that this is a highly specialized feature,
    and thus it is not expected to be included in most implementations.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Sat Jul 26 06:14:47 2025
    On Thu, 22 May 2025 17:42:14 +0000, MitchAlsup1 wrote:

    This creates "interesting" situations with respect to instruction
    formatting and to the requirements of constants in support of those instructions; and interesting requirements in other areas of ISA.

    Oh, there are indeed challenges, but they're hardly insurmountable.

    Compilers are the obvious case. Since the instruction set is built
    around 32-bit instructions, obviously the architecture will need to
    be running in conventional mode for compilation.

    The data width is, of course, specified by the block header. It
    isn't a switchable mode. So a program can have memory allocated to
    it of different widths, put pointers to those regions of memory in
    different base registers, and include code operating on data of
    those various lengths.

    So the compiler can call subroutines designed to craft things like
    36-bit floats for inclusion in object modules, from data placed in
    registers by normal code.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Mon Jul 28 23:18:52 2025
    On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now while this
    makes the puny 16-register file seem larger,
    what VAX designers forgot, is that each address mode was an instruction
    in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    Fancy addressing modes certainly aren't _free_. However, they are,
    in my opinion, often cheaper than achieving the same thing with an
    extra instruction.

    So it makes sense to add an addressing mode _if_ what that addressing
    mode does is pretty common.

    That being said, though, designing a new machine today like the VAX
    would be a huge mistake.

    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to BGB on Fri Aug 1 04:42:28 2025
    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to Chris M. Thomasson on Fri Aug 1 04:31:00 2025
    On Tue, 17 Jun 2025 12:45:44 -0700, Chris M. Thomasson wrote:
    On 6/17/2025 10:59 AM, quadibloc wrote:

    So the fact that it uses 10x the electrical power, while only having 2x
    the raw power - for an embarrassingly parallel problem, which doesn't
    happen to be the one I need to solve - doesn't matter.

    Can you break your processing down into units that can be executed in parallel, or do you get into an interesting issue where step B cannot
    proceed until step A is finished?

    I'm assuming that the latter case is true often enough for real-world
    programs that out-of-order processors with massive overhead and power consumption are worth using instead of many small processors in
    parallel with greater throughput.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Savard on Fri Aug 1 05:03:07 2025
    John Savard <quadibloc@invalid.invalid> schrieb:
    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    Which is what everybody does. Loading a register with the address
    of a small array on the stack is a simple addition, usually one
    cycle latency. If the array came as an argument, it is (usually)
    in a register to start with. If you allocate the array dynamically,
    you get its address for free after the function call. If you have
    enough GP registers, chances are it will still be in a register;
    otherwise you can spill it to the stack and restore it, with restoring
    needing one L1 cache access.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Tue Aug 26 21:46:24 2025
    BGB <cr88192@gmail.com> posted:

    On 7/28/2025 6:18 PM, John Savard wrote:
    On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now while this
    makes the puny 16-register file seem larger,
    what VAX designers forgot, is that each address mode was an instruction
    in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    Fancy addressing modes certainly aren't _free_. However, they are,
    in my opinion, often cheaper than achieving the same thing with an
    extra instruction.

    So it makes sense to add an addressing mode _if_ what that addressing
    mode does is pretty common.


    The use of addressing modes drops off pretty sharply though.

    Like, if one could stat it out, one might see a static-use pattern
    something like:
    80%: [Rb+disp]
    15%: [Rb+Ri*Sc]
    3%: (Rb)+ / -(Rb)
    1%: [Rb+Ri*Sc+Disp]
    <1%: Everything else

    Since RISC-V only has [Rb+disp12], the other 20% take at least 2
    instructions. Simple math indicates this requires 1.2+
    instructions/mem-ref instead of 1.0 instructions/mem-ref. disp12 does
    not help either.
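
    The weighted average behind that 1.2 figure, as a one-liner in C
    (using the static-use percentages from the list above):

        #include <stdio.h>

        int main(void) {
            /* 80% of mem-refs need 1 instruction, ~20% need at least 2 */
            printf("instructions per mem-ref: %.2f\n",
                   0.80 * 1.0 + 0.20 * 2.0);   /* -> 1.20 */
            return 0;
        }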

    My 66000 does not have (Rb)+ or -(Rb), and most RISC machines don't
    either. On the other hand, I see more [Rb+Ri<<s+disp] than 1%--more
    like 3%-4%--this is partially due to using indexing rather than
    incrementation when doing loops:

        MOV   Ri,#0
        VEC   R15,{}
        LDD   R9,[R3,Ri<<3+disp]
        calc                       ; loop body
        LOOP  LT,Ri,#1,Rn
    instead of:
        MOV   Ri,#0
        LDA   R14,[R3+disp]
        VEC   R15,{}
        LDD   R9,(R14)+
        calc                       ; loop body
        LOOP  LT,Ri,#1,Rn
    {and the second loop has an additional ADD in it}

    Though, I am counting [PC+Disp] and [GP+Disp] as part of [Rb+Disp] here.

    Granted, the dominance of [Rb+Disp] does drop off slightly when
    considering dynamic instruction use. Part of it is due to the
    prolog/epilog sequences.

    I have a lot of [IP,DISP] due to the way the compiler places data.

    If one had instead used (SP)+ and -(SP) addressing for prologs and
    epilogs, then one might see around 20% or so going to these instead.
    Or, if one had PUSH/POP, to PUSH/POP.

    ENTER and EXIT compress prologues and epilogues to a single instruction
    each. They also have the option of placing the preserved registers in
    a place where the called subroutine cannot damage them.

    The discrepancy between static and dynamic instruction counts is then
    mostly due to things like loops and similar.

    Estimating the effect of loops in a compiler is hard, but I had noted
    that assuming a scale factor of around 1.5^D for loop nesting depth (D)
    seemed to be in the right area. Many loops end up unreached or only
    running a few times, so, perhaps counter-intuitively, it is often
    faster to assume that a loop body will likely only cycle 2 or 3 times
    rather than 100s or 1000s, and trying to aggressively optimize loops
    by assuming large N tends to be detrimental to performance.

    VAX compilers set the loop count = 10 and did OK for their era. A
    low count (like 10) balances the small loops (letters in a name)
    against the larger loops like Matrix300.

    Well, and at least thus far, profiler-driven optimization isn't really a thing in my case.


    -----------------------

    One could maybe argue for some LoadOp instructions, but even this is debatable. If the compiler is designed mostly for Load/Store, and the
    ISA has a lot of registers, the relative benefit of LoadOp is reduced.

    LoadOp being mostly a benefit if the value is loaded exactly once, and
    there is some other ALU operation or similar that can be fused with it.

    Practically, it limits the usefulness of LoadOp mostly to saving an instruction for things like:
    z=arr[i]+x;


    But, the relative incidence of things like this is low enough as to not
    save that much.

    The other thing is that one has to implement it in a way that does not increase pipeline length,

    This is the key point about LD-OPs: if you build a pipeline to support
    them, then you will suffer when the instruction stream is independent
    RISC-like instructions--conversely, if you build the pipeline for
    RISC-like instructions, LD-OPs take a penalty unless you buy off on
    medium OoO, at least.

    since if one makes the pipeline longer for the sake of LoadOp or
    OpStore, then this is likely to be a net negative for performance vs
    prioritizing Load/Store, unless the pipeline had already needed to be
    lengthened for other reasons.

    And thus, this is why RISC-machines largely avoid LD-OPs.

    One can be like, "But what if the local variables are not in registers?"
    but on a machine with 32 or 64 registers, most likely your local
    variable is already going to be in a register.

    So, the main potential merit of LoadOp is that it "doesn't hurt as bad
    on a register-starved machine".

    So does poking your eye with a hot knife.

    That being said, though, designing a new machine today like the VAX
    would be a huge mistake.

    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.


    Yeah.

    There are some living descendants of that family, but pretty much
    everything now is Reg/Mem or Load/Store with a greatly reduced set of addressing modes.


    John Savard


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Wed Aug 27 01:01:21 2025
    John Savard <quadibloc@invalid.invalid> posted:

    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That is actually what we did on the Mc88100, and while a lot better
    than just [Base+Disp] it is still not as good as [Rb+Ri<<s+Disp]; the
    latter saves instructions that merely create constants.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)