• Re: fitting programs in Why I've Dropped In

    From John Levine@21:1/5 to All on Sat Jun 14 22:12:15 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why? I remember reading that the systems software people
    spent a lot of work on an overlay mechanism, so the thinking at IBM at
    the time was apparently not about keeping several programs in RAM at
    the same time, but about running one program at one time, and finding
    ways to make that program fit into available RAM.

    They did both. OS/360 divided up memory into partitions, at boot time in MFT or dynamically in MVT. Each job step said how big a partition it needed,
    and if you ran out of space, your program failed. Many of the utilities
    had different versions for different partition sizes, which I assume were
    the same code with more or less overlaying.

    In any case, it's no problem to add a virtual-memory mechanism that is
    not visible to user-level, or maybe even kernel-level (does the
    original S/360 have that?) programs, whether it's paged virtual memory
    or a simple base+range mechanism.

    That's what the 360/67 and 370 DAT did. CP/67 and later VM/370 took
    advantage of the fact that nearly everything that affects or observes
    the global environment traps in user mode so they could provide a
    simulated kernel mode good enough to fool most operating systems.
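    (To make the trap-and-emulate idea concrete, here is a toy sketch in
    C. Everything in it, the trap names and the vm structure, is invented
    for illustration; CP/67's real interface looked nothing like this.)

        #include <stdio.h>

        /* Invented trap codes standing in for privileged operations. */
        enum trap { T_SET_KEY, T_START_IO, T_LOAD_PSW };

        struct vm { unsigned keys[16]; int io_pending; unsigned psw; };

        /* When the guest, really running in user mode, executes a
           privileged instruction, it traps here; the hypervisor applies
           the effect to the guest's virtual state and resumes it. */
        static void emulate(struct vm *g, enum trap t, unsigned operand)
        {
            switch (t) {
            case T_SET_KEY:  g->keys[operand & 15] = operand >> 4; break;
            case T_START_IO: g->io_pending = 1;                    break;
            case T_LOAD_PSW: g->psw = operand;                     break;
            }
        }

        int main(void)
        {
            struct vm guest = {0};
            emulate(&guest, T_SET_KEY, 0x35);  /* guest "sets" a storage key */
            printf("key[5] = %u\n", guest.keys[5]);
            return 0;
        }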
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 15 17:00:47 2025
    According to Scott Lurndal <slp53@pacbell.net>:
    An interesting development is that, e.g., on Ultrix on DECstations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 Unix that first had static shared libraries
    linked at a specific address.

    BSD/OS, the commercial descendant of 4BSD, also had them around the same time. I am pretty sure they were separately developed since SVR3 used COFF and BSD
    as I recall still used a.out.

    There were some configuration files that set the addresses for each library to prevent overlap, and some kludgery that let programs override a few library routines like malloc() with local versions.
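    (As a purely hypothetical illustration of the idea, with invented
    syntax and addresses, not the actual BSD/OS file format, such a
    configuration file might have read:

        # one line per static shared library: name, fixed base, size limit
        libc_s   0x00400000  0x00100000
        libm_s   0x00500000  0x00040000
        # the library build checks that no two ranges overlap
    )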

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 15 17:54:06 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    No, I meant statically linked libraries.

    Static linking does not require any coordination. Every executable
    gets its own copy of the library parts it uses linked to fit into the
    executable's address space, i.e., with static linking libraries are
    not shared.

    With traditional static linking they aren't shared, but with statically linked shared libraries they are.

    On BSD/OS a whole library was linked into a single shared segment with a fixed address, and it created a stub library with the addresses of each routine in the
    segment. The standard C library was one shared library, and there were a few other libraries, maybe a math library. In theory you could make your own shared libraries but in practice there were a few shipped with the system, built to ensure that no shared libraries used overlapping addresses.

    When you linked a program, your program got the library routine addresses from the stubs, and your program had something at the beginning saying which shared libraries it used. At program startup time, it just mapped in the shared libraries. There was no runtime linking or relocation since the library addresses were all set at the time the library was built.
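    (In modern terms the startup step amounts to something like the
    following C sketch, using POSIX mmap() with MAP_FIXED; the path,
    address, and size are invented, and the real BSD/OS mechanism
    differed in detail.)

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>

        /* Map one prelinked shared library at its fixed build-time
           address. No relocation or symbol lookup happens here: the
           executable was linked against stubs that already contain
           these absolute addresses. */
        int main(void)
        {
            int fd = open("/shlib/libc_s", O_RDONLY);   /* invented path */
            if (fd < 0) { perror("open"); return 1; }
            void *want = (void *)0x00400000;            /* invented base */
            if (mmap(want, 0x100000, PROT_READ | PROT_EXEC,
                     MAP_PRIVATE | MAP_FIXED, fd, 0) == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            return 0;
        }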

    It wasn't anywhere near as flexible as dynamic libraries, but it
    worked well for what it did: every program on the system shared a
    single copy of the C library (and whatever other shared libraries
    there were), and program startup was fast since there was no runtime
    linking or relocation.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 15 20:20:05 2025
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    The low end 360s were too underpowered to do
    time sharing and any sort of dynamic relocation would have just made them more
    expensive with no practical benefit. A 360/30 was a lot slower than a PDP-8 but
    made up for it with much better I/O devices.

    While I admit that I was less familiar with the lower end systems, I think
    the extra expense would have been a single register in the CPU to hold
    the base, and a few extra instructions at task switch time to save and
    reload it. Not very much. And the benefits to the larger systems would
    have been significant when they implemented interactive usage.

    The 360/30 was byte serial and stored the 16 registers in core (and I mean core.) According to my copy of the 360/30 Functional Characteristics manual, a register to register load took 17us, memory to register took 24us, with an additional 4.5us if it was indexed. I'd think the time to add a system base register would be about the same as the indexing time, as would the comparison to the bound register, so that's an extra 9us for every instruction, which would
    be about a 30% slowdown. If they put those registers in logic to speed it up, it'd be a significant extra chunk of hardware.
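    (For scale: an extra 9us on the 28.5us indexed fetch is about 32%,
    and on the 24us unindexed one about 37%, so 30% is the right order
    for a typical mix.)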

    The guys who designed the 360 thought really hard about a design that could scale
    up and down and have multiple efficient implementations. The /30 was the most popular of the 360 line. IBM shipped thousands of them. They made a few mistakes
    (hex floating point and the high address byte) but not big ones.

    Remember that S/360 was mostly aimed at batch processing where each program starts and runs until it's done. The higher end systems did multiprogramming so
    they could run some other batch program in the short interval while waiting for
    a disk or tape or card operation.

    True, and good points.

    The 360/30's channel was implemented in the CPU microcode, borrowing cycles as needed.
    I gather that when it was running a disk operation, the CPU pretty much halted. Swapping on a system that slow would have made no sense.

    They included what they called teleprocessing but those systems were transaction
    monitors built in the SAGE model with a queue of short chunks of code running to
    completion. Relocation and swapping wouldn't help there either.

    Agreed. Although how much were the choices in implementing
    teleprocessing influenced by the hardware design choices? I don't know
    and haven't thought about it at all.

    The SAGE programming model has been quite successful for systems that
    need fast realtime response, even 70 years later. SABRE (originally on
    7090, later on 360s) used it. The CICS transaction monitor uses it.
    Take a look inside the Python Twisted library, or JavaScript's
    node.js, and
    there it is.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Al Kossow@21:1/5 to John Levine on Sun Jun 15 18:48:20 2025
    On 6/15/25 1:20 PM, John Levine wrote:

    The SAGE programming model has been quite successful for systems that
    need fast realtime response, even 70 years later.

    Do the SDC SAGE programming documents exist online anywhere?
    MANY years ago, one of the Smithsonian curators showed me a line
    of binders documenting the software, but it wasn't possible to
    scan or copy them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Wed Jun 18 10:35:20 2025
    On 6/15/2025 1:20 PM, John Levine wrote:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    The low end 360s were too underpowered to do
    time sharing and any sort of dynamic relocation would have just made them more
    expensive with no practical benefit. A 360/30 was a lot slower than a PDP-8 but
    made up for it with much better I/O devices.

    While I admit that I was less familiar with the lower end systems, I think
    the extra expense would have been a single register in the CPU to hold
    the base, and a few extra instructions at task switch time to save and
    reload it. Not very much. And the benefits to the larger systems would
    have been significant when they implemented interactive usage.

    The 360/30 was byte serial and stored the 16 registers in core (and I mean core.) According to my copy of the 360/30 Functional Characteristics manual, a
    register to register load took 17us, memory to register took 24us, with an additional 4.5us if it was indexed. I'd think the time to add a system base register would be about the same as the indexing time, as would the comparison
    to the bound register, so that's an extra 9us for every instruction, which would
    be about a 30% slowdown. If they put those registers in logic to speed it up, it'd be a significant extra chunk of hardware.

    First John, I want to thank you for forcing me to think about this.
    Good to keep the brain active!

    I think your analysis is flawed. While you would have to add the
    contents of the system base register to compute the memory address for
    the memory to register load, you would save having to add the contents
    of the base register specified by the instruction. So I think it would be a wash.

    Furthermore, since the S/360 used storage keys for protection, there is
    no need for a bounds register.

    Lastly, since programs were loaded on page (4K) boundaries and the max
    memory on the /30 (I had to look this up) was 64K, the system base
    register would only have had to be 4 bits (64K in 4K units is only 16
    positions!), so maybe small enough to invest in actual hardware to
    hold it. If so, it would have been a significant speedup, as you
    wouldn't have had to load the base register value from core.



    The guys who designed the 360 thought really hard about a design that could scale
    up and down and have multiple efficient implementations.

    I absolutely believe that. And it was a new concept at the time, so
    more kudos!

    The /30 was the most
    popular of the 360 line. IBM shipped thousands of them. They made a few mistakes
    (hex floating point and the high address byte) but not big ones.

    I agree about the two you mentioned, but I would also include the
    pointer-based parameter passing to the OS, for the reasons that Lynn has
    so eloquently explained.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to sfuld@alumni.cmu.edu.invalid on Wed Jun 18 19:51:08 2025
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    The 360/30 was byte serial and stored the 16 registers in core (and I
    mean core.) According to my copy of the 360/30 Functional
    Characteristics manual, a register to register load took 17us, memory
    to register took 24us, with an additional 4.5us if it was indexed. I'd
    think the time to add a system base register would be about the same
    as the indexing time, as would the comparison
    to the bound register, so that's an extra 9us for every instruction, which would
    be about a 30% slowdown. If they put those registers in logic to speed it up,
    it'd be a significant extra chunk of hardware.

    First John, I want to thank you for forcing me to think about this.
    Good to keep the brain active!

    I think your analysis is flawed. While you would have to add the
    contents of the system base register to compute the memory address for
    the memory to register load, you would save having to add the contents
    of the base register specified by the instruction. So I think it would be a wash.

    But the most important goal of the 360 was a single architecture: run the same code on every model. This mutant /30 would presumably have 16 bit direct addresses only, and so much for upward compatibility with models with more memory.

    In the IBM Systems Journal architecture article they said:

    It was decided to commit the system completely to a base-register technique;
    the direct part of the address, the displacement, was made so small (12 bits, or
    4096 characters) that direct addressing is a practical programming technique
    only on very small models. This commitment implies that all programs are
    location-independent, except for constants used to load the base registers.
    Thus, all programs can easily be relocated.

    I think they meant it was easy to relocate programs when they were loaded, which
    is true, no fiddly instruction patching needed. The idea that you would move a program after it was loaded was at the time an exotic high end feature. There was a relocation option for the 7094 but it was an RPQ, not in the regular catalog, and only used for CTSS:

    https://bitsavers.org/pdf/ibm/7094/L22-6641-3_RPQ_E07291_880287_7090-7094_Multiprogramming_Package.pdf

    The 360/20 was sort of like what you're proposing, a 16 bit system
    that was as compatible with real 360s as they could make it. It had 8
    registers numbered 8 to 15. In a program address, if the high bit of
    the register number was 1, it was a B+D address, but if it was zero,
    the low 15 bits were a direct address, which was plenty since the
    biggest /20 was 16K.
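    (The decode rule is simple enough to sketch in C; the field layout is
    inferred from the description above, an illustration rather than a
    cycle-accurate model.)

        /* 360/20-style address decode: a 16-bit address field whose
           register-number high bit selects the form. */
        unsigned ea_360_20(unsigned short field, const unsigned short regs[16])
        {
            unsigned reg = (field >> 12) & 0xF;   /* 4-bit register number */
            if (reg & 8)                          /* registers 8..15: B+D */
                return (unsigned)regs[reg] + (field & 0x0FFF);
            return field & 0x7FFF;                /* 15-bit direct address */
        }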
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Wed Jun 18 15:30:56 2025
    On 6/18/2025 12:51 PM, John Levine wrote:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    The 360/30 was byte serial and stored the 16 registers in core (and I
    mean core.) According to my copy of the 360/30 Functional
    Characteristics manual, a register to register load took 17us, memory
    to register took 24us, with an additional 4.5us if it was indexed. I'd
    think the time to add a system base register would be about the same
    as the indexing time, as would the comparison
    to the bound register, so that's an extra 9us for every instruction, which would
    be about a 30% slowdown. If they put those registers in logic to speed it up,
    it'd be a significant extra chunk of hardware.

    First John, I want to thank you for forcing me to think about this.
    Good to keep the brain active!

    I think your analysis is flawed. While you would have to add the
    contents of the system base register to compute the memory address for
    the memory to register load, you would save having to add the contents
    of the base register specified by the instruction. So I think it would be a wash.

    But the most important goal of the 360 was a single architecture: run the same
    code on every model. This mutant /30 would presumably have 16 bit direct addresses only, and so much for upward compatibility with models with more memory.

    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems
    like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most
    definitely NOT suggesting a different architecture for the smaller
    versus larger systems.



    In the IBM Systems Journal architecture article they said:

    It was decided to commit the system completely to a base-register technique;
    the direct part of the address, the displacement, was made so small (12 bits, or
    4096 characters) that direct addressing is a practical programming technique
    only on very small models. This commitment implies that all programs are
    location-independent, except for constants used to load the base registers.
    Thus, all programs can easily be relocated.

    I think they meant it was easy to relocate programs when they were loaded, which
    is true, no fiddly instruction patching needed.

    Agreed.

    The idea that you would move a
    program after it was loaded was at the time an exotic high end feature.

    I understand that. But so was a system needing more than 24 bits of
    address, yet you readily admit that not requiring the high order 8 bits
    of an address to be zero was a mistake. In both cases, they were
    mistakes of not anticipating future developments. Of course, I realize
    that, as I think Yogi Berra said, "Predictions are hard, especially
    about the future!".



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to sfuld@alumni.cmu.edu.invalid on Thu Jun 19 01:23:52 2025
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems
    like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most definitely NOT suggesting a different architecture for the smaller
    versus larger systems.

    I'm confused. Can you give some examples in this system of how it
    would still provide 24 bit addressing, using the same instruction set
    on large and small models, and not making the instructions bigger?

    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Wed Jun 18 19:41:10 2025
    On 6/18/2025 6:23 PM, John Levine wrote:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems
    like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most
    definitely NOT suggesting a different architecture for the smaller
    versus larger systems.

    I'm confused. Can you give some examples in this system of how it
    would still provide 24 bit addressing, using the same instruction set
    on large and small models, and not making the instructions bigger?

    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Wed Jun 18 23:10:07 2025
    On 6/18/2025 10:36 PM, quadibloc wrote:
    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register.  You still need those.  You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    It does/did for many/most architectures, even ones contemporaneous with the S/360.


    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    Then the S/360 failed as you had to load either an index register or a
    base register. Without that, you would require, in S/360's time, 24 bit offsets.


    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional instructions.

    Again, many/most contemporaneous architectures didn't support this.


    That's how they were able to behave back when memory was 64K bytes in
    size.

    Now that memory is bigger, a base register, set once at the start of the program, and then basically forgotten about, lets the program behave basically the same way. When an address is indexed, also specify an
    index register.

    But let's say your program has more than 4K of non-array data. Then you
    either have to reload the base register or use multiple base registers,
    which reduces the number of registers available for other things.


    Anything else would involve slowing down the program by adding extra instructions that weren't required back when memory was smaller. That's
    doing extra work to answer the same questions.

    Of course, the other problem is that base registers use up registers.

    Yes

    So
    in my CISC design, I had a separate bank of eight base registers
    distinct from the eight general registers. When there are 32 registers, though, using two or three of them as base registers is not bad enough
    to make that necessary.

    But S/360 (which is what we were discussing) had only 16 GPRs.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Thu Jun 19 12:12:59 2025
    John Levine <johnl@taugh.com> schrieb:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most definitely NOT suggesting a different architecture for the smaller
    versus larger systems.

    I'm confused. Can you give some examples in this system of how it
    would still provide 24 bit addressing, using the same instruction set
    on large and small models, and not making the instructions bigger?

    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    Some thoughts...

    Assume memory instructions along the lines of a RISC-style
    load/store instruction, using a 16-bit signed displacement.
    Storing a pointer to the data and then accessing it via, for
    example, 1234(r3) would add the hidden base register, the
    index register and the displacement - same amount of effort
    as a base register, an index register and a 12-bit displacement.
    (It would also make efficient use of the 8-bit and 16-bit
    adders of the low-end machines :-)

    For instructions which would not need an index register, that
    is an additional effort of one addition, which, as you pointed
    out, could be a significant slowdown, especially for the 360/30.

    But...

    Assume that the machine has a "real" program counter and a
    "user-visisble" program counter. Normally, the machine operates
    on the real one; the user-visible one is only computed if the user
    program asks about this.

    Then consider PC-relative branches, and a PC-relative addressing
    mode via a special register number, so 1234(PC) would then only
    need a single addition and be faster. Tell people about this,
    and they will bend over backwards to use it (especially since
    +/- 8 kb would be quite large by the standards of the day).

    I think such a machine would have, on average, higher performance
    than what they actually built.
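    (A sketch of the two calculations being compared, in C; the names are
    mine, with a 16-bit signed displacement per the proposal above.)

        /* Proposed form: hidden base + index + 16-bit signed disp,
           a three-input add like base+index+disp12 on the real 360. */
        unsigned ea_indexed(unsigned hidden_base, unsigned idx, short disp16)
        {
            return hidden_base + idx + (int)disp16;
        }

        /* PC-relative form: a single addition, because the "real"
           (already relocated) program counter is used directly. */
        unsigned ea_pcrel(unsigned real_pc, short disp16)
        {
            return real_pc + (int)disp16;
        }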

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to quadibloc on Thu Jun 19 09:35:43 2025
    quadibloc wrote:
    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional instructions.

    [reg], [reg+disp], [reg+index+disp] are all different address calculations.

    The only memory address mode that's functionally mandatory is [reg].
    After that the question is which calculations occur frequently enough to warrant being integrated into their own instruction (address modes).

    Others are then relegated to separate address calculations, and it
    depends on the complexity of a specific address expression how it
    maps onto a particular ISA as to how many instructions it takes.

    Personally I think those 4 bits for the second address register
    would be better allocated to having a 16-bit displacement.
    Note also that the 360 index register was not scaled, and so an
    array index value was not directly usable for other than byte arrays.
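    (In C terms, the cost of an unscaled index register is the explicit
    shift needed to turn an element index into a byte index; a sketch,
    not 360 code.)

        #include <string.h>

        /* Indexing a 4-byte-element array with an unscaled index:
           the i << 2 is the extra instruction the hardware doesn't do. */
        int load_element(const unsigned char *base, unsigned i)
        {
            int v;
            memcpy(&v, base + (i << 2), sizeof v);
            return v;
        }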

    That's how they were able to behave back when memory was 64K bytes in
    size.

    The program's logical address calculation is independent of how much
    physical memory is attached to a system.

    Now that memory is bigger, a base register, set once at the start of the program, and then basically forgotten about, lets the program behave basically the same way. When an address is indexed, also specify an
    index register.

    If you want a base+index<<scale address calculation then include
    instructions that do just that.

    Using an integer general register for program relocation was a flawed
    approach. It uses a critical 4 instruction bits for a second register
    specifier that doesn't work for program relocation and loses 4 bits
    from the displacement which frequently could use them.

    The correct design was to have separate base and bounds registers for
    program relocation, managed by the OS outside program control.
    When the OS switches tasks it loads the integer and float registers,
    sets base and bounds physical offsets for it, and Bob's your uncle.
    Also all tasks are dynamically relocatable.

    The cost is just the two base and bounds relocation registers.
    The same ALU is still used for AGEN to calculate [reg+disp+base] and
    send the physical address to the Memory Address Register (MAR).
    While the bus cycle sequencer is accessing memory the ALU can be used
    to do the bounds check and maybe abort the access.
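    (A sketch of that path in C, with invented field widths and names;
    the point is that the bounds compare can overlap the memory cycle.)

        /* OS-managed base and bounds: the program's effective address
           is checked against the task's limit and offset by its base. */
        struct task { unsigned base, bound; };

        int translate(const struct task *t, unsigned ea, unsigned *phys)
        {
            if (ea >= t->bound)   /* may proceed in parallel with access */
                return -1;        /* abort: addressing exception */
            *phys = t->base + ea; /* same ALU as the [reg+disp] AGEN */
            return 0;
        }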

    And IBM could have charged extra for the base and bounds registers
    (which would have been present in all models, just enabled by a jumper).

    Anything else would involve slowing down the program by adding extra instructions that weren't required back when memory was smaller. That's
    doing extra work to answer the same questions.

    This is the ISA design trade off - which address calculations occur
    frequently enough to warrant their own instructions (address modes).

    Of course, the other problem is that base registers use up registers. So
    in my CISC design, I had a separate bank of eight base registers
    distinct from the eight general registers. When there are 32 registers, though, using two or three of them as base registers is not bad enough
    to make that necessary.

    John Savard

    I would rather have a [base+index<<scale+disp] address mode using
    integer registers and let the compiler decide how best to use them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 19 13:37:48 2025
    On Thu, 19 Jun 2025 5:36:29 +0000, quadibloc wrote:

    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    And this is what RISC is BAD at doing.

    Consider RISC-V accessing a static array more than 4 Petabytes away
    from the address of the instruction being performed. First one has to
    create an address into the Literal pool, then load the pointer to the
    static variable, then finally LD the static variable::

    AUIPC R7,hi(&static_array)
    LDD R7,lo(&static_array)(R7)
    SLL R6,R6,#2
    ADD R8,R7,R6
    LDW R7,0(R8)

    Whereas a reasonable ISA allows::

    LDW R7,[IP,R6<<2,Static_array-.]

    1 instruction rather than 5, with a latency of LD-pipeline rather
    than 2-load-pipeline+index shifting.

    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional instructions.

    That's how they were able to behave back when memory was 64K bytes in
    size.

    You still want a minimum instruction count, even when memory is 2^64
    bytes in size.

    Now that memory is bigger, a base register, set once at the start of the program, and then basically forgotten about, lets the program behave basically the same way. When an address is indexed, also specify an
    index register.

    For statically linked object modules--maybe.
    For dynamically linked objects--at best the jury is still out.

    Anything else would involve slowing down the program by adding extra instructions that weren't required back when memory was smaller. That's
    doing extra work to answer the same questions.

    Like my example above. How does CII do with the above ??

    Of course, the other problem is that base registers use up registers. So
    in my CISC design, I had a separate bank of eight base registers
    distinct from the eight general registers. When there are 32 registers, though, using two or three of them as base registers is not bad enough
    to make that necessary.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to John Levine on Thu Jun 19 10:32:40 2025
    John Levine wrote:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems
    like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most
    definitely NOT suggesting a different architecture for the smaller
    versus larger systems.

    I'm confused. Can you give some examples in this system of how it
    would still provide 24 bit addressing, using the same instruction set
    on large and small models, and not making the instructions bigger?

    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    The address modes supported by an ISA are just optimization points
    not limitations. If an address calculation is more complex than is
    supported by the address modes of the LD/ST instructions themselves
    then it must be calculated separately using integer instructions
    into a temp register then used as an address.

    On a 360 if I'm accessing a struct larger than 4kB then I would
    load a 32-bit immediate offset into a temp register, say R1.
    (I don't know 360 but I don't see any 32-bit load immediate
    instructions so it looks like I'd have to do something like Load
    Address LA to load a 12-bit constant, left shift it 12 bits, then LA
    to add the low 12 bits. Basically construct a large constant from
    smaller ones the way RISCs do.)

    Then with the offset in R1 and struct address in R2 do a LD R3,[R2+R1],
    using the implicit base+index addition in the address mode.
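    (Spelled out, the sequence would be something like this: standard
    S/360 mnemonics in the comment, the equivalent arithmetic in C; HI
    and LO are the two 12-bit halves of a 24-bit offset.)

        /*  LA   R1,HI          R1 = high 12 bits (LA clears the rest)
            SLL  R1,12          shift them into position
            LA   R1,LO(,R1)     R1 = (HI << 12) + LO
            L    R3,0(R1,R2)    load via base R2 + index R1            */
        unsigned make_offset(unsigned hi12, unsigned lo12)
        {
            return ((hi12 & 0xFFFu) << 12) | (lo12 & 0xFFFu);
        }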

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Thu Jun 19 14:54:04 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    On a 360 if I'm accessing a struct larger than 4kB then I would
    load a 32-bit immediate offset into a temp register, say R1.
    (I don't know 360 but I don't see any 32-bit load immediate
    instructions so it looks like I'd have to do something like Load
    Address LA to load a 12-bit constant, left shift it 12 bits, then LA
    to add the low 12 bits. Basically construct a large constant from
    smaller ones the way RISCs do.)

    They usually loaded constants from memory close to the routine itself.
    https://bitsavers.org/pdf/ibm/360/training/GC20-1646-5_A_Programmers_Introduction_to_IBM_System360_Assembly_Language_196907.pdf
    is a nice introduction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Thu Jun 19 15:11:29 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Note also that the 360 index register was not scaled, and so an
    array index value was not directly usable for other than byte arrays.

    There is always strength reduction. It seems the original FORTRAN
    compiler did a lot of that for the 704, but I'm not sure that the
    /360 compilers did - from what I read, they regressed in code
    generation quality.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to sfuld@alumni.cmu.edu.invalid on Thu Jun 19 17:52:49 2025
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K. There's one base register, and the address field in an instruction is only 16 bits. How am I supposed to address a megabyte with only a 16 bit offset?

    If the idea is that each program is limited to 64K even though the overall system address space is bigger, BTDT on a PDP-11 and would prefer not to go back.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Thu Jun 19 12:36:14 2025
    On 6/19/2025 10:52 AM, John Levine wrote:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K.

    Index registers. You still need those, for exactly that reason. But
    you don't need a second mechanism, i.e. base registers specified in the instruction. One mechanism is sufficient. If you have 32 bit
    registers, as the S/360 did, you can address up to 4GB.


    There's one base register,
    and the address field in an instruction is only 16 bits. How am I supposed to address a megabyte with only a 16 bit offset?

    See above

    If the idea is that each program is limited to 64K even though the overall system address space is bigger, BTDT on a PDP-11 and would prefer not to go back.


    No, that is not the idea. I agree that it would be terrible if it was!



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Thu Jun 19 20:25:00 2025
    John Levine <johnl@taugh.com> schrieb:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K.

    By using registers and constants pointing to elsewhere?

    IBM /360 had 48-bit instructions, so an instruction loading a
    32-bit constant into a register would have been entirely feasible.
    Load a target address where you need to access things, and do
    your memory operations there.

    There's one base register,
    and the address field in an instruction is only 16 bits. How am I supposed to address a megabyte with only a 16 bit offset?

    I'm not sure what proposal you are replying to, or that there wasn't
    some miscommunication somewhere along the line.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Jun 19 21:45:24 2025
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K.

    Index registers. You still need those, for exactly that reason. But
    you don't need a second mechanism, i.e. base registers specified in the instruction. One mechanism is sufficient. If you have 32 bit
    registers, as the S/360 did, you can address up to 4GB.

    I don't get the impression that we are thinking about the same S/360.

    The 360 had four instruction formats. RR was register to register, no
    problem there. RX was memory to register, with a four bit register operand, four bit base register, four bit index register, and 12 bit displacement.
    As I understand it, you'd change that to 16 bit displacement relative to
    an implicit base register and still have the optional index register.

    But there are two other instruction formats SS and SI that have four bit base register, 12 bit displacement, and no index register. What happens to them? 16 bit displacement so you can only address 64K? Reuse the base register bits as an
    index register so you can only address 4K directly?

    In case it's not obvious, all programs but the most trivial used multiple base registers. First you'd have one to point to the code and static data. For I/O you'd
    set another register to point to an I/O buffer, and use that register as the base register in SS and SI instructions to move stuff in and out of the buffer, then pass the buffer to the operating system, update the register to point to the next buffer and do it again. If you were doing a read-compute-write loop, you'd have one base register for the read buffer and one for the write buffer.

    Same for any non-trivial data structure, you set a register to point to a structure and use it as the base register to refer to fields. A single global base register couldn't do any of that.

    I read somewhere that they did simulations and found that a typical program used
    four base registers at a time.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Thu Jun 19 16:05:16 2025
    On 6/19/2025 2:45 PM, John Levine wrote:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K.

    Index registers. You still need those, for exactly that reason. But
    you don't need a second mechanism, i.e. base registers specified in the
    instruction. One mechanism is sufficient. If you have 32 bit
    registers, as the S/360 did, you can address up to 4GB.

    I don't get the impression that we are thinking about the same S/360.


    Could be, and I defer to your greater knowledge of S/360, so I may have
    left something out. But in the following, I don't think so. See below.

    Look at it this way. (I was led to this by John Savard's comments in
    another post in this thread.)

    because index registers are for
    displacing from an address; base registers build an address.

    So he seems to regard them as different. But both base registers and
    index registers are actually the same GPRs. There is no physical
    difference. The only difference is in one's head as to how they are
    used, and in some cases, not even there (and probably, though not
    necessarily, the actual value in them). If you regard them all as
    index registers (again, no physical change, just a change in the way
    you look at them), it may make things clearer.


    The 360 had four instruction formats.

    Yes.

    RR was register to register, no
    problem there. RX was memory to register, with a four bit register operand, four bit base register, four bit index register, and 12 bit displacement.
    As I understand it, you'd change that to 16 bit displacement relative to
    an implicit base register and still have the optional index register.

    Correct.


    But there are two other instruction formats SS and SI that have four bit base register, 12 bit displacement, and no index register. What happens to them?

    Let's start with the SS instructions. The documentation says the
    instruction has two base registers. But if I said, with no actual
    change to the hardware, it has two index registers, it would perform
    exactly as it does now. Yes, it only has 12 bit displacements, but that
    is no different from what it has now. So other than the name in the documentation, things are exactly as they were. So while you haven't
    gained a larger displacement, you haven't lost any addressing capability
    that you now have. The actual value in the index register might be
    different from what it would have been if you called it a base register,
    but, if so, that only means a different value for the displacement. But
    if you load the same value in the "index" register as you previously did
    the "base" register, the code is indistinguishable.

    The SI instructions are similar. Just call the "base register" an index register. You still have to arrange that it points within 4096 of the storage operand, but you had to do that with the base register anyway. So
    nothing lost, nothing gained.


    16
    bit displacement so you can only address 64K? Reuse the base register bits as an
    index register so you can only address 4K directly?

    You have exactly the same capability as you have now, except increased displacement for RX instructions. And, by virtue of the hidden base
    register, you gain the ability to relocate programs after their initial
    load.


    In case it's not obvious, all programs but the most trivial used multiple base
    registers.

    Sure.

    First you'd have one to point to the code and static data.

    Perhaps more than one if you have more than 4K of these. With my
    proposal, you wouldn't need the one that points to the beginning of the program, as that is the contents of the system base register once the
    program is loaded.

    For I/O you'd
    set another register to point to an I/O buffer, and use that register as the base register in SS and SI instructions to move stuff in and out of the buffer,
    then pass the buffer to the operating system, update the register to point to the next buffer and do it again. If you were doing a read-compute-write loop,
    you'd have one base register for the read buffer and one for the write buffer.

    Same for any non-trivial data structrure, you set a register to point to a structure and use it as the base register to refer to fields. A single global base register couldn't do any of that.

    If you use those same registers exactly as you say, but called them
    index registers, nothing would change.

    I read somewhere that they did simulations and found that a typical program used
    four base registers at a time.

    I believe that. Just think of them as four index registers.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 19 23:36:42 2025
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    --------------------------
    Using an integer general register for program relocation was a flawed approach. It uses a critical 4 instruction bits for a second register specifier that doesn't work for program relocation and loses 4 bits
    from the displacement which frequently could use them.

    I think they (IBM) originally thought that their base registers would
    be fixed register numbers so that relocation software could update
    them as segments moved around (pre release), and that they realized
    later that this was a folly.

    The correct design was to have separate base and bounds registers for
    program relocation, managed by the OS outside program control.

    Had they (IBM) decided that (say) R12-R15 were relocation registers
    {R13==code, R14==data, R15==BSS, R12==stack} so that these segments
    could be relocated dynamically.

    When the OS switches tasks it loads the integer and float registers,
    sets base and bounds physical offsets for it, and Bob's your uncle.
    Also all tasks are dynamically relocatable.

    Yes, when the OS moves one of those segments, the OS changes the value
    in the register (and the protection bits in each page).

    Data would be accessed with [R14+Rindex+DISP12], ...

    And this would lead to lots of addressing problems, not the least of
    which was FORTRAN pass-by-address subroutine arguments--which needed
    either indirection in the callee or creation of a non-relocatable base
    register (using something like LA R2,[R14,,,,]).

    But the architectural choice had already been made and could not be
    unmade. So, they (IBM) decided to "live" with it (for then). And
    once they discovered "Translation" they decided to live with it
    for a long time--until the DAT box showed up (/67).

    The cost is just the two base and bounds relocation registers.

    This was another solution, and probably would have delayed the initial
    machine sales by several months; and Somebody up the chain decided
    to go with what they had.

    The same ALU is still used for AGEN to calculate [reg+disp+base] and
    send the physical address to the Memory Address Register (MAR).

    Remember: They (IBM) knew that [base+index+disp12] took only
    1 gate delay longer to calculate than [base+index] or [base+disp12]

    While the bus cycle sequencer is accessing memory the ALU can be used
    to do the bounds check and maybe abort the access.

    x286 style.

    And IBM could have charged extra for the base and bounds registers
    (which would have been present in all models, just enabled by a jumper).

    Delay was the enemy.

    Anything else would involve slowing down the program by adding extra
    instructions that weren't required back when memory was smaller. That's
    doing extra work to answer the same questions.

    This is the ISA design trade off - which address calculations occur frequently enough to warrant their own instructions (address modes).

    I think this was the cross product of::
    a) designers missing the base register relocation problem
    b) management needing cash flow

    Of course, the other problem is that base registers use up registers. So
    in my CISC design, I had a separate bank of eight base registers
    distinct from the eight general registers. When there are 32 registers,
    though, using two or three of them as base registers is not bad enough
    to make that necessary.

    John Savard

    I would rather have a [base+index<<scale+disp] address mode using
    integer registers and let the compiler decide how best to use them.

    I continue to think that RISC-V has too few addressing modes

    MEM Rd,offset12(reg)

    is inefficient, and leads to executing more instructions with an
    overall increase in latency.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 19 23:18:44 2025
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:

    quadibloc wrote:
    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional
    instructions.

    [reg], [reg+disp], [reg+index+disp] are all different address
    calculations.

    Yes, but the first 2 are a STRICT subset of the last one. So, you
    build the AGEN unit to perform the last one, and have DECODE feed
    zeros (0s) when you don't need the first two.
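    (In other words, as a hedged sketch rather than any particular
    machine:)

        /* One AGEN datapath computes [base+index+disp]; DECODE feeds a
           zero for whatever component a simpler mode doesn't use. */
        unsigned agen(unsigned base, unsigned index, unsigned disp)
        {
            return base + index + disp;
        }
        /* [reg]            -> agen(reg, 0, 0)
           [reg+disp]       -> agen(reg, 0, disp)
           [reg+index+disp] -> agen(reg, index, disp) */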

    The only memory address mode that's functionally mandatory is [reg].

    Leading to poor addressability and larger instruction count.

    After that the question is which calculations occur frequently enough to warrant being integrated into their own instruction (address modes).

    Having spent 7 years doing x86, the answer was clear to me::

    [base+Rindex<<2+Displacement]

    Others are then relegated to separate address calculations, and it
    depends on the complexity of a specific address expression how it
    maps onto a particular ISA as to how many instructions it takes.

    So, now you are claiming that adding instructions and latency to
    memory access is not harming performance !?!?!?!

    Clearly you don't "get it"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Jun 19 23:45:32 2025
    On Thu, 19 Jun 2025 23:05:16 +0000, Stephen Fuld wrote:

    On 6/19/2025 2:45 PM, John Levine wrote:
    --------------
    Let's start with the SS instructions. The documentation says the
    instruction has two base registers. But if I said, with no actual
    change to the hardware, it has two index registers, it would perform
    exactly as it does now.

    Had the SS instructions had both a base register and an index
    register, many of the relocation problems would have gone away.

    Yes, it only has 12 bit displacements, but that
    is no different from what it has now.

    The difference is that they had no (remotely relocatable) way of expressing

    Array_of_struct[i].struct.foo = Array_of_struct[j].struct.bar

    If you add index to base, then you have a window where remote
    relocation fails, and you have no way to add the index to the
    constant. On the other hand, if SS were of the form:

    OP [base+index],[base+index]

    relocation works, but now one has to LA a bunch of constants, leading
    to longer access sequences (the same problem facing ISAs with poor
    address modes.).

    So other than the name in the documentation, things are exactly as they were.

    Arithmetically, yes; practically, no.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Thu Jun 19 20:36:36 2025
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    On a 360 if I'm accessing a struct larger than 4kB then I would
    load a 32-bit immediate offset into a temp register, say R1.
    (I don't know 360 but I don't see any 32-bit load immediate
    instructions so it looks like I'd have to do something like Load
    Address LA to load a 12-bit constant, left shift it 12 bits, then LA
    to add the low 12 bits. Basically construct a large constant from
    smaller ones the way RISCs do.)

    They usually loaded constants from memory close to the routine itself.
    https://bitsavers.org/pdf/ibm/360/training/GC20-1646-5_A_Programmers_Introduction_to_IBM_System360_Assembly_Language_196907.pdf
    is a nice introduction.

    This way would also need a BAL to copy the PC into a base register,
    then L at PC-offset to load a 32-bit offset into an index register,
    then an RX instruction using the base+index address.

    I was looking for ways that don't require an extra memory access
    and can also be used for 32-bit integer calculations.

    Ideally an instruction to Load Immediate of 32-bits into a register,
    an 8-bit opcode, 4-bit function code, 4-bit dest reg, and two
    16-bit immediates (a variation on the 48-bit instruction format).

    Alternatively a variation on 32-bit formats using two instructions,
    a Load Immediate High which shifts the 16-bit immediate to the
    dest register upper end, plus an Add Immediate of 16-low bits.
    An 8-bit opcode, a 4-bit function code field, a 4-bit source/dest register,
    and a 16 bit value. Also useful for many other operations with 16-bit
    immediate values, sub, mul, div, and, or, xor.
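
    In C terms, the two-instruction composition sketched above amounts to
    this (the function name is made up):

    #include <stdint.h>

    /* Compose a 32-bit constant from two 16-bit immediates, the way
       the proposed Load Immediate High + Add Immediate pair would:
       plant the upper halfword, then add in the lower one. */
    static uint32_t make_const(uint16_t hi, uint16_t lo) {
        uint32_t r = (uint32_t)hi << 16;  /* Load Immediate High */
        r += lo;                          /* Add Immediate, low 16 bits */
        return r;
    }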

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to ThatWouldBeTelling@thevillage.com on Fri Jun 20 01:10:46 2025
    It appears that EricP <ThatWouldBeTelling@thevillage.com> said:
    I was looking for ways that don't require an extra memory access
    and can also be used for 32-bit integer calculations.

    Ideally an instruction to Load Immediate of 32-bits into a register,
    an 8-bit opcode, 4-bit function code, 4-bit dest reg, and two
    16-bit immediates (a variation on the 48-bit instruction format).

    On zSeries that's load immediate, LGFI, which puts a 32 bit immediate
    value in a register. There's also LAY, load address with a 20 bit
    signed displacement rather than 12 bit unsigned, and load address
    relative long LARL with a 32 bit displacement added to the current
    address.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Fri Jun 20 01:32:01 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    --------------------------
    Using an integer general register for program relocation was a flawed
    approach. It uses a critical 4 instruction bits for a second register
    specifier that doesn't work for program relocation and loses 4 bits
    from the displacement which frequently could use them.

    I think they (IBM) originally thought that their base registers would
    be fixed register numbers so that relocation software could update
    them as segments moved around (pre release), and that they realized
    later that this was a folly.

    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded. It's really easy to relocate 360 code
    at load time, just add the offset to all of the address constants in
    memory. After that, as we've seen, not so much.

    once they discovered "Translation" they decided to live with it
    for a long time--until the DAT box showed up (/67).

    According to Pugh et al, who were there, they knew about the Atlas One
    Level Store (OLS), both the attractive idea of unifying RAM and disk
    storage, and that its performance was terrible. (We now call that
    thrashing.)

    They knew about CTSS, which ran on IBM hardware with base-and-bounds relocation. IBM Research had built a few experimental time-sharing
    systems. But the technical risk of adding dynamic address modification
    of any sort to what was already a very large and risky project was too
    much, so they didn't.

    MIT and Bell Labs were already thinking in 1964 about the project that
    became Multics; IBM offered some proposed hardware, which they
    rejected. That percolated up within IBM; senior management was unhappy
    about it, and in less than a year IBM came up with the /67 with virtual
    memory, but Multics already had other plans. IBM produced TSS which was
    a disaster (I used it) but the hardware was fine and MTS and CP/67 got
    good time-sharing performance.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 20 01:15:31 2025
    On Fri, 20 Jun 2025 0:36:36 +0000, EricP wrote:

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    On a 360 if I'm accessing a struct larger than 2kB then I would
    load a 32-bit immediate offset into a temp register, say R1.
    (I don't know 360 but I don't see any 32-bit load immediate instructions
    so it looks like I'd have to do something like Load Address LA to load
    a 12-bit constant, left shift it 12 bits, then LA to add a low 12-bits.
    Basically construct a large constant from smaller ones the way RISCs do.)

    They usually loaded constants from memory close to the routine itself.
    https://bitsavers.org/pdf/ibm/360/training/GC20-1646-5_A_Programmers_Introduction_to_IBM_System360_Assembly_Language_196907.pdf
    is a nice introduction.

    This way would also need a BAL to copy the PC into a base register,
    then L at PC-offset to load a 32-bit offset into an index register,
    then an RX instruction using the base+index address.

    I was looking for ways that don't require an extra memory access
    and can also be used for 32-bit integer calculations.

    /360 assembly performed a lot of "literal pool" accesses to get the
    constants needed to run the program at hand. Back in 1973 when I was
    looking, I found a lot of these kinds of accesses, where the linker
    would make holes in the subroutine for ease of access to those
    constants. It looked strange, but they got it to work.

    Ideally an instruction to Load Immediate of 32-bits into a register,
    an 8-bit opcode, 4-bit function code, 4-bit dest reg, and two
    16-bit immediates (a variation on the 48-bit instruction format).

    This is one of my railing points:: an ISA should use no instructions
    to use a constant as an operand in an instruction, nor use any
    registers to hold a use-once constant.

    Alternatively a variation on 32-bit formats using two instructions,
    a Load Immediate High which shifts the 16-bit immediate to the
    dest register upper end, plus an Add Immediate of 16-low bits.
    An 8-bit opcode, a 4-bit function code field, a 4-bit source/dest
    register, and a 16 bit value. Also useful for many other operations
    with 16-bit immediate values, sub, mul, div, and, or, xor.

    Universal Constants means you need to do none of this.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Fri Jun 20 07:42:02 2025
    On 6/19/2025 12:00 PM, quadibloc wrote:
    On Thu, 19 Jun 2025 17:52:49 +0000, John Levine wrote:

    It appears that Stephen Fuld  <sfuld@alumni.cmu.edu.invalid> said:
    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K. There's one base
    register, and the address field in an instruction is only 16 bits.
    How am I supposed to address a megabyte with only a 16 bit offset?

    Given that he continued to write:

    If you mean how does a single program address more than 16MB, the answer is by using an index register.  You still need those.  You just don't
    need two registers (base and index) when one will do.

    he gave an answer to addressing more than 64K.

    16 MB is addressed by 24 bits, and is thus the entire address
    space of System/360. I presume that was just a typo.

    I disagree with his solution, because index registers are for
    displacing from an address; base registers build an address.

    As I posted in a response to John Levine, that distinction is in your
    head, not in the hardware. Each is a use of the same 16 GPRs. The
    contents of each are added to a displacement in the instruction to get
    an address.

    Locations in memory should be able to be addressed in a static
    manner.

    I don't know what that means. What is a "static manner"?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Fri Jun 20 15:55:54 2025
    On Fri, 20 Jun 2025 14:51:16 +0000, quadibloc wrote:

    On Thu, 19 Jun 2025 21:45:24 +0000, John Levine wrote:

    But there are two other instruction formats SS and SI that have four bit
    base register, 12 bit displacement, and no index register. What happens
    to them? 16 bit displacement so you can only address 64K? Reuse the base
    register bits as an index register so you can only address 4K directly?

    Since he is "really" talking about the fact that using base registers,
    in addition to index registers, is a mistake on my new Concertina II
    design, the fact that the string and packed decimal memory-to-memory instructions, with no room for indexing, couldn't do without the base register... is merely a historical sidelight.

    The System/360 design could just have added 64-bit instructions, I
    suppose.

    In principle, indeed, one doesn't "need" base registers. One can use the index registers as base registers, and then use another register with
    the base plus the array displacement whenever one accesses an array. I
    think base registers are a better idea; array accesses are common enough
    that saving an instruction for them makes sense.

    I did feel the 68000 design made a mistake with its address registers.

    The A and D registers provided the ability to write 2 registers per
    microcycle, improving 68000 and 68010 performance.

    Using separate registers, on a CISC design with register banks of only 8 registers, for the base registers makes sense. They're mostly static,
    and they take up precious register space. But indexes are computed, and
    so integer GPRs, not address registers, ought to have been used for
    that, in my opinion.

    This may have been mitigated, though; I think the 68000 had forms of the arithmetic instructions that worked with the address registers instead.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Fri Jun 20 17:19:37 2025
    John Levine <johnl@taugh.com> schrieb:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    --------------------------
    Using an integer general register for program relocation was a flawed
    approach. It uses a critical 4 instruction bits for a second register
    specifier that doesn't work for program relocation and loses 4 bits
    from the displacement which frequently could use them.

    I think they (IBM) originally thought that their base registers would
    be fixed register numbers so that relocation software could update
    them as segments moved around (pre release), and that they realized
    later that this was a folly.

    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to tkoenig@netcologne.de on Fri Jun 20 18:06:18 2025
    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation and virtual
    memory, but considered it too risky to add to an already ambitious project.

    Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.

    They added VM to S/370 but in the intervening years both the hardware and the understanding of how VM works had gotten a lot better. It is my impression that early VM systems were wildly overoptimistic about how little physical memory they needed. Fortunately, Moore's law made memory sizes grow enough to solve that problem by brute force, somewhat aided by better understanding of working sets.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Fri Jun 20 18:31:48 2025
    John Levine <johnl@taugh.com> schrieb:
    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation and virtual
    memory, but considered it too risky to add to an already ambitious project.

    Brooks himself wrote he considered not adding virtual memory to the /360
    a mistake, so...

    Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.

    I think we can be in agreement that it was, indeed, a mistake,
    but obviously not fatal.

    IBM had good peripherals, they had a good upgrade path to very
    powerful machines, and they were bit-compatible for user programs
    (plus, they put in the microcode emulation of the 1401 so their
    customers could transition smoothly - that was a genius move,
    the /360 probably would have been far less of a success
    if that had not been possible). All of these were good reasons
    to buy these machines.

    Customers could and did work around the memory fragmentation,
    but it didn't make their lives easier.

    But IBM severely underestimated the software complexity of the
    system they were creating, hence the delays and "The Mythical
    Man-Month" (and such abominations as JCL. Which way around is
    that COND parameter again? But because I made some money
    working on mainframes as a student, I cannot complain - nobody
    ever challenged the hours I billed because mainframes are
    complex, as everybody knows, and JCL was a large part of that :-)

    They added VM to S/370 but in the intervening years both the hardware and the understanding of how VM works had gotten a lot better. It is my impression that
    early VM systems were wildly overoptimistic about how little physical memory they needed. Fortunately, Moore's law made memory sizes grow enough to solve that problem by brute force, somewhat aided by better understanding of working
    sets.

    You mean a major selling point for virtual memory was that people
    didn't think they had to buy that much expensive core storage?
    Sounds plausible.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Jun 20 15:13:35 2025
    MitchAlsup1 wrote:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:

    quadibloc wrote:
    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional
    instructions.

    [reg], [reg+disp], [reg+index+disp] are all different address
    calculations.

    Yes, but the first 2 are a STRICT subset of the last one. So, you
    build the AGEN unit to perform the last one, and have DECODE feed
    zeros (0s) for the fields the simpler modes don't use.

    Yes, back then there would be just one ALU/AGEN for the core, used for
    pretty much all arithmetic, though sometimes it would have a separate incrementer/decrementer so it could overlap with the ALU/AGEN.


    The only memory address mode that's functionally mandatory is [reg].

    Leading to poor addressability and larger instruction count.

    It's an optimization allowing 16-bit instructions that would only be used
    for a limited set of operations, just loads and stores of a few data types.

    LD and ST for halfword, word, single and double use 8 of the 256 opcodes
    and save 2 of 4 instruction bytes on, say, 10% of loads and stores.

    VAX usage stats show that 9% to 15% of address specifiers are what
    it called register-deferred [reg], register address with no offset.
    That was for compiled languages (Fortran, Pascal, Cobol, etc),
    none of which were assembler or C, where *p pointer access is more common.

    x64 usage stats show ~7% register-indirect [reg] and 32% displacement [reg+disp], 0.85% scaled indexed, and 21% absolute.

    After that the question is which calculations occur frequently enough to
    warrant being integrated into the instruction set as address modes.

    Having spent 7 years doing x86, the answer was clear to me::

    [base+Rindex<<2+Displacement]

    I assume you mean a 2-bit scale, not the constant 2.
    Yes, that's the maximal answer (though I would have a 3-bit scale).
    VAX usage stats show ~8% operand specifiers are indexed, ~10% displacement
    (of various sizes) so there are size savings to be had for supporting
    smaller variations.

    WRT 360, the maximal address mode would be [rBase+rIndex+imm24]
    in a 48-bit instruction, and then have smaller variations as optimizations.
    e.g. [rBase+imm12] in a 32-bit instruction.

    Others are then relegated to separate address calculations, and it
    depends on the complexity of a specific address expression how it
    maps onto a particular ISA as to how many instructions it takes.

    So, now you are claiming that adding instructions and latency to
    memory access is not harming performance !?!?!?!

    Clearly you don't "get it"

    No, I am pointing to the reality that each ISA chooses certain
    operations to perform more optimally than others.
    If my ISA has a 3-bit scale field and yours has 2,
    and if the expression is an index to an fp64 complex array,
    then I use just 1 instruction while you need 2.

    360 has [base+index+imm12] but does not have scaled index so for array
    indexing on >1 byte it must copy an array index to a temp register,
    then left shift. The extra copy is required because shift left
    operates on a single source-dest register only.
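
    In C terms, the sequence the compiler must emit looks like this
    sketch (assuming 8-byte elements; the helper is illustrative):

    #include <stdint.h>

    /* Without a scaled-index mode, the element index must first be
       turned into a byte offset in a temporary (a copy, then a shift);
       only then can [base+index+disp] addressing be used. */
    static int64_t load_elem(const int64_t *a, long i) {
        long byte_off = i << 3;  /* the extra copy-and-shift */
        return *(const int64_t *)((const char *)a + byte_off);
    }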

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 20 12:27:19 2025
    On 6/20/2025 11:31 AM, Thomas Koenig wrote:
    John Levine <johnl@taugh.com> schrieb:
    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation and virtual
    memory, but considered it too risky to add to an already ambitious project.

    Brooks himself wrote he considered not adding virtual memory to the /360
    a mistake, so...

    Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.

    I think we can be in agreement that it was, indeed, a mistake,
    but obviously not fatal.

    IBM had good peripherals, they had a good upgrade path to very
    powerful machines, and they were bit-compatible for user programs
    (plus, they put in the microcode emulation of the 1401 so their
    customers could transition smoothly - that was a genius move,
    the /360 probably would have been far less of a success
    if that had not been possible). All of these were good reasons
    to buy these machines.

    And the larger machines had emulation of the 7080. And IBM had great
    marketing and armies of customer engineers, and relationships with
    company CEOs, a huge installed base of EAM machines (card sorters,
    tabulators etc.) that was ripe for upgrading, etc. There were many
    factors that contributed to its success.


    Customers could and did work around the memory fragmentation,
    but it didn't make their lives easier.

    But IBM severely underestimated the software complexity of the
    system they were creating, hence the delays and "The Mythical
    Man-Month" (and such abominations as JCL. Which way around is
    that COND parameter again?

    All good points.

    I do want to say that although I believe IBM made some mistakes on
    the S/360, I don't want to take away from their good decisions or
    detract from their success.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Fri Jun 20 20:35:23 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    No, I am pointing to the reality that each ISA chooses certain
    operations to perform more optimally than others.
    If my ISA has a 3-bit scale field and yours has 2,
    and if the expression is an index to an fp64 complex array,
    then I use just 1 instruction while you need 2.

    Hmm... assuming you have base+index addressing without
    scaling (and without implied scaling), you can do
    (for four-byte sizes)

    for (i=0; i<n; i++) {
    c[i] = a[i] + b[i]
    }

    and assuming that R1 points at a[0], R2 at b[0] and R3 at c[0]
    and that R4 is zero initially, you can do (pseudo-assembly),
    and R7 holds 4*n

    .Loop:
    ld R5,[R1,R4]      ; R5 = a[i]
    ld R6,[R2,R4]      ; R6 = b[i]
    add R5,R5,R6
    st R5,[R3,R4]      ; c[i] = a[i] + b[i]
    add R4,R4,#4       ; advance the byte offset
    cmp R4,R7          ; stop when the offset reaches 4*n (in R7)
    blt .Loop

    For this simple loop, there is no disadvantage to not
    having scaled index registers. This can be different
    when the value of the index variable is needed for
    something else, or for accessing something that has
    a different size.
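
    A case of that kind, sketched in C (the function is made up for
    illustration): one index used at two element sizes, so a single
    pre-scaled byte offset can no longer serve both accesses.

    #include <stdint.h>

    /* The same i addresses 4-byte and 8-byte elements; with only a
       pre-scaled byte counter, one of the two accesses needs its own
       shifted copy of the index every iteration, which is exactly
       where a scaled addressing mode saves an instruction. */
    void weighted_sum(double *sum, const int32_t *w, const double *x, int n) {
        for (int i = 0; i < n; i++)
            *sum += w[i] * x[i];  /* i<<2 for w, i<<3 for x */
    }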

    360 has [base+index+imm12] but does not have scaled index so for array indexing on >1 byte it must copy an array index to a temp register,
    then left shift. The extra copy is required because shift left
    operates on a single source-dest register only.

    Not needed, see above (too lazy to look up the /360 assembler :-)



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 20 21:09:30 2025
    On Fri, 20 Jun 2025 19:13:35 +0000, EricP wrote:

    MitchAlsup1 wrote:
    -------------

    Having spent 7 years doing x86, the answer was clear to me::

    [base+Rindex<<2+Displacement]

    I assume you mean a 2-bit scale, not the constant 2.

    The example was a WORD being accessed out of an array, so I did mean #2.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Fri Jun 20 21:26:48 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    I do want to say that because I believe that IBM made some mistakes on
    the S/360, I don't want to take away their good decisions or detract
    from their success.

    It was a revolutionary concept and a revolutionary class of machines.

    But reading about the system and its design process makes me itch
    to go find my old time machine (I misplaced it somewhere) and
    influence the course of computer history by pointing out some
    of the quirks and unnecessary complexities to the /360 team.

    The other point in time would probably have been Data General or
    DEC circa 1975, to steer their Fountainhead and VAX projects,
    respectively, towards RISC and graph-coloring register allocation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jun 20 21:27:55 2025
    On Fri, 20 Jun 2025 20:35:23 +0000, Thomas Koenig wrote:

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    No, I am pointing to the reality that each ISA chooses certain
    operations to perform more optimally than others.
    If my ISA has a 3-bit scale field and yours has 2,
    and if the expression is an index to an fp64 complex array,
    then I use just 1 instruction while you need 2.

    Hmm... assuming you have base+index addressing without
    scaling (and without implied scaling), you can do
    (for four-byte sizes)

    static int64_t a[100], b[100], c[100];

    for (i=0; i<n; i++) {
    c[i] = a[i] + b[i]
    }

    and assuming that R1 points at a[0], R2 at b[0] and R3 at c[0]
    and that R4 is zero initially, you can do (pseudo-assembly),
    and R7 holds 4*n

    .Loop:
    ld R5,[R1,R4]
    ld R6,[R2,R4]
    add R5,R5,R6
    st R5,[R3,R4]
    add R4,R4,#4
    cmp R4,R7
    blt .Loop

    MOV R4,#0
    VEC R16,{}
    LDD R5,[R1,R4<<3]
    LDD R6,[R2,R4<<3]
    ADD R5,R5,R6
    STD R5,[R3,R4<<3]
    LOOP LT,R4,#1,Rn

    The loop consists of 4 instructions of loop workload and 1 instruction
    of loop-overhead:: 5 instructions in 5 words.

    Whereas: RISC-V would need::

    MOV R4,#0
    SLA Rn,Rn,#3
    loop:
    LDD R5,[R1]
    LDD R6,[R2]
    ADD R5,R5,R6
    STD R5,[R3]
    ADD R1,R1,#8
    ADD R2,R2,#8
    ADD R3,R3,#8
    ADD R4,R4,#8
    BLT R4,Rn,loop

    And at the exit of the loop, R1, R2, and R3 are no longer pointing
    at the starting points of their arrays, potentially adding more
    instructions: 9 instructions and 9 words.

    The loop consists of 4 instructions of the loop workload, and 4
    instructions of loop-overhead {and possibly 3 instructions to
    recover the array pointers.}

    I am sensitive to this because the 88K Greenhills compiler would
    produce the latter instead of the former even though the former
    was significantly faster and smaller and was part of the 88K ISA.

    For this simple loop, there is no disadvantage of not
    having scaled index registers. This can be different
    when the value of the index variable is needed for
    something else, or for accessing something that has
    a different size.

    360 has [base+index+imm12] but does not have scaled index so for array
    indexing on >1 byte it must copy an array index to a temp register,
    then left shift. The extra copy is required because shift left
    operates on a single source-dest register only.

    Mostly the IBM compilers strength-reduced the indexing to appear as

    for (i=0; i<4*n; i+=4) {
    c[i/4] = a[i/4] + b[i/4]
    }

    Which is still necessary/useful for non-primitive types.
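
    In source terms that strength reduction amounts to the following C
    sketch: the per-iteration scaling disappears into pointer increments.

    #include <stdint.h>

    /* Strength-reduced form of c[i] = a[i] + b[i]: the index and its
       implicit scaling are replaced by pointers bumped once per
       iteration, which is what unscaled base+index addressing favors. */
    void add_arrays(int64_t *c, const int64_t *a, const int64_t *b, long n) {
        const int64_t *end = a + n;
        while (a < end)
            *c++ = *a++ + *b++;
    }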

    Not needed, see above (too lazy to look up the /360 assembler :-)



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 20 21:48:37 2025
    On Fri, 20 Jun 2025 19:13:35 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    -------------

    [reg], [reg+disp], [reg+index+disp] are all different address
    calculations.

    Yes, but the first 2 are a STRICT subset of the last one. So, you
    build the AGEN unit to perform the last one, and have DECODE feed
    zeros (0s) for the fields the simpler modes don't use.

    Yes, back then there would be just one ALU/AGEN for the core, used for
    pretty much all arithmetic, though sometimes it would have a separate incrementer/decrementer so it could overlap with the ALU/AGEN.

    Mc 88100 had::
    a) integer ALU (+ and -)
    b) address ALU (+ and <<{0,1,2,3})
    c) PC ALU (INC4, Disp16, Disp26)
    mostly because we did not want to route data to the ALU, and
    occasionally we wanted to use several FUs simultaneously.

    Note: Integer adder needed negate to perform SUB, this takes the
    same gate delay as AGEN with <<{0,1,2,3} with add-only.

    Even Mc68000 had 3 adders {PC, D, A}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Fri Jun 20 21:34:24 2025
    On Fri, 20 Jun 2025 18:06:18 +0000, John Levine wrote:

    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation and virtual
    memory, but considered it too risky to add to an already ambitious
    project.

    Considering that S/360 outsold all of its competitors combined, it's
    hard to argue it was a major mistake.

    What outsold the competitors is the ISA remaining stable over machine
    size and machine generation--preserving the software investment.

    Over in the number crunching side of things (CDC 6600-7600--CRAY)
    one had to hold onto Fortran decks and recompile for each machine.

    The attack of the Killer Micro's did not appear until circa 1977.

    Side note: CRAY sold a lot of vector processors at the time when NEC
    had higher performance CPUs and larger memories, because the CRAY
    machines had memory BW to spare that the I/O devices could use. So, a
    CRAY could be doing compute while storing the previous workload onto
    disk, and while loading the next workload from disk, completely
    overlapped with the computation.

    IBM had good peripherals, too.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Fri Jun 20 23:20:12 2025
    On 6/20/2025 5:13 PM, quadibloc wrote:
    On Sat, 21 Jun 2025 0:06:35 +0000, quadibloc wrote:

    On Fri, 20 Jun 2025 23:57:29 +0000, quadibloc wrote:

    On Fri, 20 Jun 2025 21:34:24 +0000, MitchAlsup1 wrote:

    The attack of the Killer Micro's did not appear until circa 1977.

    That could be considered the very beginning, as that's when the Altair
    8800 came out and so on.

    And since the context was discussing events before 1977, that's good
    enough to say that back then, micros weren't a problem for sure.

    But 8-bit microprocessors didn't kill minis and mainframes. They weren't powerful enough to compete. When did micros really become killers?

    Well, they certainly were killers when the Pentium II came out in 1997,
    but I'd say that's rather a late date.

    Instead, micros were lethal to a lot of larger systems even before they
    reached that level of performance. In 1987, halfway between those two
    dates, Intel came out with the 387. Hardware floating point for a 32 bit system? It's about at that point that anything larger became
    questionable.

    And I was able to find out that the phrase was coined by Eugene Brooks
    in 1990, in the title of a paper at Supercomputing 1990.

    1989 certainly included some momentous events - the Cyrix FasMath 83D87,
    and the Intel 486, with hardware floating-point standard.

    And don't forget, the 486 also included on chip cache.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Stephen Fuld on Sat Jun 21 12:04:30 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 6/15/2025 1:20 PM, John Levine wrote:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    The low end 360s were too underpowered to do
    time sharing and any sort of dynamic relocation would have just made them more
    expensive with no practical benefit. A 360/30 was a lot slower than a PDP-8 but
    made up for it with much better I/O devices.

    While I admit that I was less familiar with the lower end systems, I think
    the extra expense would have been a single register in the CPU to hold
    the base, and a few extra instructions at task switch time to save and
    reload it. Not very much. And the benefits to the larger systems would
    have been significant when they implemented interactive usage.

    The 360/30 was byte serial and stored the 16 registers in core (and I mean
    core.) According to my copy of the 360/30 Functional Characteristics manual,
    a register to register load took 17us, memory to register took 24us, with
    an additional 4.5us if it was indexed. I'd think the time to add a system
    base register would be about the same as the indexing time, as would the
    comparison to the bound register, so that's an extra 9us for every
    instruction, which would be about a 30% slowdown. If they put those
    registers in logic to speed it up, it'd be a significant extra chunk of
    hardware.

    First, John, I want to thank you for forcing me to think about this.
    Good to keep the brain active!

    I think your analysis is flawed. While you would have to add the
    contents of the system base register to compute the memory address for
    the memory to register load, you would save having to add the contents
    of the base register specified in the instruction. So I think it would be a wash.

    Furthermore, since the S/360 used storage keys for protection, there is
    no need for a bounds register.

    Lastly, since programs were loaded on page boundaries and the max memory
    on the /30 (I had to look this up) was 64K, the system base register
    would only have had to be 4 bits, so maybe small enough to invest in
    actual hardware to hold it. IF so, it would have been a significant
    speedup, as you wouldn't have had to load the base register value from core.

    I am not sure of loading only at page boundaries. 360/30 had no
    paging hardware and AFAIK storage keys were optional, so in
    principle it could load anywhere. Also, it could load multiple
    modules as part of a single program, so even with use of storage
    keys they would run with a single key, so no need to go to a
    separate page.

    Actually, with your proposal one would lose or cripple the ability to
    load modules at different locations (thanks to multiple base
    registers, such a module could access data in different modules).

    Concerning a hardware base register, note that 360 instructions
    were interpreted by microcode. Fetching the corresponding
    microinstruction would be a substantial cost for models keeping
    microcode in core (that is, the 20 and 25). On the 30 it would be
    a smaller penalty, but still non-negligible.

    So, a non-negligible performance loss and a loss of functionality
    to get an "exotic" feature. If the 360 architects had considered
    all of this, I think their decision would not have changed.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Sat Jun 21 14:33:58 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    John Levine <johnl@taugh.com> schrieb:

    Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.

    I think we can be in agreement that it was, indeed, a mistake,
    but obviously not fatal.

    IBM had good peripherals, they had a good upgrade path to very
    powerful machines, and they were bit-compatible for user programs
    (plus, they put in the microcode emulation of the 1401 so their
    customers could transition smoothly - that was a genius move,
    the /360 probably would have been far less of a success
    if that had not been possible). All of these were good reasons
    to buy these machines.

    Burroughs did the same, adding B300 emulation to the B3500.


    Customers could and did work around the memory fragmentation,
    but it didn't make their lives easier.

    But IBM severely underestimated the software complexity of the
    system they were creating, hence the delays and "The Mythical
    Man-Month" (and such abominations as JCL. Which way around is
    that COND parameter again? But because I made some money
    working on mainframes as a student, I cannot complain - nobody
    ever challenged the hours I billed because mainframes are
    complex, as everybody knows, and JCL was a large part of that :-)

    JCL was, indeed, rather horrible. Something Burroughs avoided.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 21 14:36:19 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 20 Jun 2025 18:06:18 +0000, John Levine wrote:

    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation
    and virtual memory, but considered it too risky to add to an already
    ambitious project.

    Considering that S/360 outsold all of its competitors combined, it's
    hard to argue it was a major mistake.

    What outsold the competitors is the ISA remaining stable over machine
    size and machine generation--preserving the software investment.

    Over in the number crunching side of things (CDC 6600-7600--CRAY)
    one had to hold onto Fortran decks and recompile for each machine.

    Burroughs, on the other hand, had binary compatibility throughout
    the lifetime of their mainframe lines; even after a major architectural
    redesign in the early 80s, compiled applications from 1966 still
    ran fine.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to quadibloc on Sat Jun 21 14:25:18 2025
    quadibloc <quadibloc@gmail.com> wrote:
    On Fri, 20 Jun 2025 21:34:24 +0000, MitchAlsup1 wrote:

    The attack of the Killer Micro's did not appear until circa 1977.

    That could be considered the very beginning, as that's when the Altair
    8800 came out and so on.

    And since the context was discussing events before 1977, that's good
    enough to say that back then, micros weren't a problem for sure.

    But 8-bit microprocessors didn't kill minis and mainframes. They weren't powerful enough to compete. When did micros really become killers?

    Well, they certainly were killers when the Pentium II came out in 1997,
    but I'd say that's rather a late date.

    Instead, micros were lethal to a lot of larger systems even before they reached that level of performance. In 1987, halfway between those two
    dates, Intel came out with the 387. Hardware floating point for a 32 bit system? It's about at that point that anything larger became
    questionable.

    I think you underestimate the impact of micros. At the lowest end, the
    ZX Spectrum and Commodore 64 gave nontrivial compute power at low cost.
    There were the IBM PC and 68000-based workstations. So already around 1983
    micros limited the market for low end minis (and, due to minis, the market
    for low end mainframes was limited earlier). Around 1990 there were RISC
    workstations and minis were legacy. VAX switched to microprocessors
    and DEC decided to replace VAX with Alpha. IBM started using
    microprocessors for its mainframes around 1993.

    If you consider that designers have to look forward a few years,
    then 1977 looks like a reasonable boundary date: before it, a
    microprocessor was quite unlikely to be a good choice; after it,
    one frequently had to be considered.

    BTW: Already the 8086 could be paired with the 8087, which allowed
    cost-effective floating point computation.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Stefan Monnier on Sat Jun 21 16:50:01 2025
    On 21/06/2025 16:39, Stefan Monnier wrote:
    What define(s|d) a "mini" or a "mainframe"?
    For "micro" AFAIK the definition is/was "single-chip CPU", so I guess
    "mini" would be something like "CPU made of 74xxx thingies?" and as for
    how to distinguish them from mainframes, I don't know.

    The old definition I recall is:
    If you can pick it up it's a micro.

    If you can't pick it up, but you can see over it, it's a mini.

    If you can't see over it it's a mainframe.

    It's obviously a bit of a joke - but I don't think I've heard anything
    better.

    Andy

    --
    Do not listen to rumour, but, if you do, do not believe it.
    Gandhi.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Sat Jun 21 11:39:26 2025
    I think you underestimate the impact of micros. At the lowest end, the
    ZX Spectrum and Commodore 64 gave nontrivial compute power at low cost.
    There were the IBM PC and 68000-based workstations. So already around 1983
    micros limited the market for low end minis (and, due to minis, the market
    for low end mainframes was limited earlier).

    What define(s|d) a "mini" or a "mainframe"?
    For "micro" AFAIK the definition is/was "single-chip CPU", so I guess
    "mini" would be something like "CPU made of 74xxx thingies?" and as for
    how to distinguish them from mainframes, I don't know.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sat Jun 21 14:57:16 2025
    MitchAlsup1 wrote:
    On Fri, 20 Jun 2025 19:13:35 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    -------------

    [reg], [reg+disp], [reg+index+disp] are all different address
    calculations.

    Yes, but the first 2 are a STRICT subset of the last one. So, you
    build the AGEN unit to perform the last one, and have DECODE feed
    zeros (0s) for the fields the simpler modes don't use.

    Yes, back then there would be just one ALU/AGEN for the core, used for
    pretty much all arithmetic, though sometimes it would have a separate
    incrementer/decrementer so it could overlap with the ALU/AGEN.

    Mc 88100 had::
    a) integer ALU (+ and -)
    b) address ALU (+ and <<{0,1,2,3})
    c) PC ALU (INC4, Disp16, Disp26)
    mostly because we did not want to route data to the ALU, and
    occasionally we wanted to use several FUs simultaneously.

    Note: Integer adder needed negate to perform SUB, this takes the
    same gate delay as AGEN with <<{0,1,2,3} with add-only.

    Even Mc66000 had 3 adders {PC, D, A}

    I had a look at the 360-30 uArch in the IBM Field Engineering manual

    http://www.bitsavers.org/pdf/ibm/360/fe/2030/Y24-3360-1_2030_FE_Theory_Opns_Jun67.pdf

    and there is basically nothing to it.
    It is literally just a bunch of registers, an 8-bit ALU, a bunch of 8-bit
    buses from the registers to the ALU and a result bus back to the registers,
    a microcode read-only memory called CROS (Capacitive Read Only Storage) cards, and a microcode counter-sequencer. Understandably its performance was something like 34.5 kIPS, as in 34,500 Instructions Per Second.

    Here is a picture of the TROS (Transformer Read Only Storage)
    from the Model 20 microcode:

    https://static.righto.com/images/ibm-360-50/tros.jpg

    Ken Shirriff also shows the much fancier -50 uArch:

    https://static.righto.com/images/ibm-360-50/diagram-w900.jpg

    Simulating the IBM 360/50 mainframe from its microcode http://www.righto.com/2022/01/ibm360model50.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Waldek Hebisch on Sat Jun 21 20:32:47 2025
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I am not sure of loading only at page boundaries. 360/30 had no
    paging hardware and AFAIK storage keys were optional,

    There was no mention of this in the Principles of Operation,
    and its timing is given in the System/360 Model 30 Functional
    Characteristics document, so I don't think this is true.

    so in
    principle it could load anywhere.

    We should also consider what the machine was capable of running.
    Like all of /360 it was supposed to have run OS/360, but
    that was running late and was too big, so smaller systems
    were used. These were generally only capable of running
    one program at a time, so the point where to load becomes
    sort of moot. (Also, DOS/360 does not seem to have had a
    relocating loader, so everything had to be loaded at
    a pre-determined address.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Thomas Koenig on Sun Jun 22 01:26:46 2025
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I am not sure of loading only at page boundaries. 360/30 had no
    paging hardware and AFAIK storage keys were optional,

    There was no mention of this in the Principles of Operation,
    and its timing is given in the System/360 Model 30 Functional
    Characteristics document, so I don't think this is true.

    If you mean loading, it is really an OS function, not an architecture feature.

    so in
    principle it could load anywhere.

    We should also consider what the machine was capable of running.
    Like all of /360 it was supposed to have run OS/360, but
    that was running late and was too big, so smaller systems
    were used. These were generally only capable of running
    one program at a time, so the point where to load becomes
    sort of moot. (Also, DOS/360 does not seem to have had a
    relocating loader, so everything had to be loaded at
    a pre-determined address.)

    AFAIK in OS/360 overlays were separately loaded, just like
    programs. So even with one program running one was likely
    to want several modules, each at its own load address.

    I am not sure what DOS was doing, but many OS/360 programs
    were supposed to run under DOS. Since overlays were used
    quite a lot I would expect DOS to support them. And due
    to conventions with base registers, supporting overlays loaded
    at arbitrary locations probably did not require much effort.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 22 01:31:52 2025
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I am not sure of loading only at page boundaries. 360/30 had no
    paging hardware and AFAIK storage keys were optional,

    There was no mention of this in the Principles of Operation,
    and its timing is given in the System/360 Model 30 Functional
    Characteristics document, so I don't think this is true.

    It's on page 11 of Functional Characteristics. Storage Protection
    was an optional feature.

    We should also consider what the machine was capable of running.
    Like all of /360 it was supposed to have run OS/360, but
    that was running late and was too big, so smaller systems
    were used.

    I saw someone run OS on a 64K /30 but you're right, DOS and TOS
    were much more common.

    These were generally only capable of running
    one program at a time, so the point where to load becomes
    sort of moot. (Also, DOS/360 does not seem to have had a
    relocating loader, so everything had to be loaded at
    a pre-determined address.)

    I think you're right but I don't understand your point. All models of
    the 360 had the same architecture and the same instruction set so even
    if DOS didn't do load time relocation, other operating systems did and
    they ran on the same machines.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to It appears that Waldek Hebisch on Sun Jun 22 01:36:25 2025
    It appears that Waldek Hebisch <antispam@fricas.org> said:
    AFAIK in OS/360 overlays were separately loaded, just like
    programs. So even with one program running one was likely
    to want several modules, each at its own load address.

    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the appropriate overlay when you called down into one.

    One load module could also run another using system calls which was occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routines for a sort run.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Sun Jun 22 08:57:04 2025
    John Levine <johnl@taugh.com> schrieb:
    It appears that Waldek Hebisch <antispam@fricas.org> said:
    AFAIK in OS/360 overlays were separately loaded, just like
    programs. So even with one program running one was likely
    to want several modules, each at it own load adress.

    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    Of course I could drink a coffee during the 20 or so minutes wall-time
    it took a program with one of the graphics libraries I was using to link
    (the jobs were high priority, so they were running right away)
    but nobody can drink that much coffee.

    One load module could also run another using system calls which was occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routines for a sort run.

    Sort of early JIT, then (pun intended).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 22 17:52:44 2025
    According to Thomas Koenig <tkoenig@netcologne.de>:
    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the
    appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    The manual says there were two versions of the linker, E level 15K and
    F level 44K, and three subversions of the latter, 44K, 88K, and 128K.
    It says in the three versions of F "the logic and control flow is
    identical" but the bigger ones are faster, which suggests to me that
    they unfolded some overlays.

    15K is really small, it must have overlaid like crazy.

    One load module could also run another using system calls which was
    occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routins for a sort run.

    Sort of early JIT, then (pun intended).

    Very much so. Tape systems spent more time sorting than doing anything
    else, so they had all sorts of hacks to speed it up. Precompiling the
    inner loop was just one of them. I gather they wrote their own channel programs, too.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Sun Jun 22 18:25:02 2025
    On Sun, 22 Jun 2025 17:52:44 +0000, John Levine wrote:

    According to Thomas Koenig <tkoenig@netcologne.de>:
    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the
    appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    The manual says there were two versions of the linker, E level 15K and
    F level 44K, and three subversions of the latter, 44K, 88K, and 128K.
    It says in the three versions of F "the logic and control flow is
    identical" but the bigger ones are faster which suggests to me that
    they unfolded some overlays.

    15K is really small, it must have overlaid like crazy.

    In 1975-77 I worked on a Sigma 5 computer system that only had 16KB
    of core. You had to fit the OS and your application into 16K,
    with the OS eating up 5-6K of your memory. So, yes, everything
    was overlaid to the hilt.

    Our application was::
    a) real time capture of A/D readout from NMR into 4K array
    b) convert to float
    c) FFT on the float data
    d) conjugate multiply with precomputed 4K array
    e) FFT-1
    g) write the data onto the Tektronix display in graphics form
    h) with ability to save to disk/tape for later.

    So, yes, there were a lot of overlays !!

    Later we added the computer driving a frequency generator while
    capturing the A/D data "coherently".

    One load module could also run another using system calls which was
    occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routins for a sort run.

    Sort of early JIT,then (pun intended).

    Very much so. Tape systems spent more time sorting than doing anything
    else, so they had all sorts of hacks to speed it up. Precompiling the
    inner loop was just one of them. I gather they wrote their own channel programs, too.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Sun Jun 22 20:29:41 2025
    John Levine <johnl@taugh.com> writes:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the
    appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something
    almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    The manual says there were two versions of the linker, E level 15K and
    F level 44K, and three subversions of the latter, 44K, 88K, and 128K.
    It says in the three versions of F "the logic and control flow is
    identical" but the bigger ones are faster which suggests to me that
    they unfolded some overlays.

    15K is really small, it must have overlaid like crazy.

    One load module could also run another using system calls which was
    occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routines for a sort run.

    Sort of early JIT, then (pun intended).

    Very much so. Tape systems spent more time sorting than doing anything
    else, so they had all sorts of hacks to speed it up. Precompiling the
    inner loop was just one of them. I gather they wrote their own channel
    programs, too.

    The Burroughs medium systems Sort intrinsic would even read the tape
    backwards to improve sort speed. The author of the intrinsic was
    justifiably proud of the performance for a variety of source and
    destination media. A 16-unit tape sort/merge was really impressive
    to watch.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 22 22:44:44 2025
    According to Scott Lurndal <slp53@pacbell.net>:
    Sort of early JIT,then (pun intended).

    Very much so. Tape systems spent more time sorting than doing anything >>else, so they had all sorts of hacks to speed it up. Precompiling the >>inner loop was just one of them. I gather they wrote their own channel >>programs, too.

    The Burroughs medium systems Sort intrinsic would even read the tape
    backwards to improve sort speed.

    That was a standard trick. A tape sort read the inputs, wrote sorted
    runs of records on several tapes, then repeatedly merged the runs from
    one group of tapes to another until there was one big sorted run. I
    think sometime in the 1950s someone noticed that rather than rewinding
    between passes, you could just read the tapes backward and sort in
    reverse order. You might end up with the final sort backwards and have
    to do one more pass to make it forwards, but I gather it was worth it.
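
    A toy illustration of that trick, with arrays standing in for tapes
    (everything here is invented): merging two ascending runs by consuming
    them from the high end, the way a drive reads backward after writing,
    yields one descending run, ready to be read backward again on the next
    pass with no rewind anywhere.

        #include <stdio.h>

        /* Merge runs a[0..na) and b[0..nb), both ascending, reading each
           from its high end; out[] comes out descending. */
        static void merge_backward(const int *a, int na,
                                   const int *b, int nb, int *out)
        {
            int i = na - 1, j = nb - 1, k = 0;
            while (i >= 0 && j >= 0)
                out[k++] = (a[i] >= b[j]) ? a[i--] : b[j--];
            while (i >= 0) out[k++] = a[i--];
            while (j >= 0) out[k++] = b[j--];
        }

        int main(void)
        {
            int run1[] = { 2, 5, 9 };        /* sorted run on tape 1 */
            int run2[] = { 1, 6, 7, 8 };     /* sorted run on tape 2 */
            int merged[7];

            merge_backward(run1, 3, run2, 4, merged);
            for (int k = 0; k < 7; k++)      /* prints 9 8 7 6 5 2 1 */
                printf("%d ", merged[k]);
            putchar('\n');
            return 0;
        }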

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Mon Jun 23 06:07:11 2025
    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the
    appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something
    almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    The manual says there were two versions of the linker, E level 15K and
    F level 44K, and three subversions of the latter, 44K, 88K, and 128K.
    It says in the three versions of F "the logic and control flow is
    identical" but the bigger ones are faster which suggests to me that
    they unfolded some overlays.

    15K is really small, it must have overlaid like crazy.

    And they probably didn't touch it again... The machine I worked
    on was a Fujitsu rebranded as a Siemens 7881. I didn't know the
    original Fujitsu name at the time. It ran BS 3000, which was an
    MVS clone. And with a main memory of 2*16MB and a normal job size
    of 1MB or more (no reason to select anything less), it still ran
    dead slow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Mon Jun 23 17:13:34 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Mon, 23 Jun 2025 6:07:11 +0000, Thomas Koenig wrote:

    And they probably didn't touch it again... The machine I worked
    on was a Fujitsu rebranded as a Siemens 7881. I didn't know the
    original Fujitsu name at the time. It ran BS 3000, which was an
    MVS clone.

    I tried to look it up, and found it was really a Siemens 7.881-2
    (the punctuation is important). And this was one of Fujitsu's larger
    scale systems, intended to compete with the IBM 3800, so if it ran
    dead slow, that is surprising.

    What I meant was that the linker ran awfully slowly if there was
    anything big to link. Apart from that, I really didn't have any
    meaningful comparisons; it was the first large system I ever
    worked on.

    For some reason, they renamed the standard IBM utilities, so
    IEBGENER became JSEGENER (but an IEBGENER alias was still
    provided).

    Also, the English in their documentation was really strange.
    When the computer center switched to an IBM 3090, that was
    a dramatic improvement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Savard on Tue Jul 29 08:45:14 2025
    John Savard <quadibloc@invalid.invalid> writes:
    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.

    Microcode may have been a good thing somewhat earlier when ROM or the
    writable control store (WCS) could be run at speeds much higher than
    core memory (how was the WCS actually implemented?), but core memory
    had been replaced by semiconductor DRAM by the time the VAX was
    introduced, and that was faster (already the Nova 800 of 1971 had an
    800ns cycle, and Acorn managed to access DRAM at 8MHz (but only when
    staying within the same row) in 1987); my guess is that in the VAX
    11/780 timeframe, 2-3MHz DRAM access within a row would have been
    possible. Moreover, the VAX 11/780 has a cache (it also has a WCS).
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    Nevertheless, if I time-traveled to the start of the VAX design, and
    was put in charge of designing the VAX, I would design a RISC, and I
    am sure that it would outperform the actual VAX 11/780 by at least a
    factor of 2. So no, I don't think that the VAX architecture was a
    good match for the technology of the time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Wed Jul 30 05:59:18 2025
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the
    overcomplex instruction and address modes and the tiny 512 byte page
    size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
    scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not
    expect compilers to improve as fast as they did, leading to a machine
    which was fun to program in assembler but full of stuff that was
    useless to compilers and instructions like POLY that should have been
    subroutines. The 801 project and PL.8 compiler were well underway at
    IBM by the time the VAX shipped, but DEC presumably didn't know about
    it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    block for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.
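
    For reference, what POLY boils down to is a Horner's-rule loop, which
    a subroutine does just as well; a sketch (simplified, not the VAX
    operand encoding):

        /* Evaluate c[d]*x^d + ... + c[1]*x + c[0] by Horner's rule:
           one multiply and one add per coefficient. */
        double poly(double x, const double c[], int degree)
        {
            double r = c[degree];
            for (int i = degree - 1; i >= 0; i--)
                r = r * x + c[i];
            return r;
        }
        /* e.g., a sine approximation: sin(x) ~= x * poly(x*x, coeffs, n)
           with coeffs taken from the usual minimax tables. */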

    Related to the microcode issue they also don't seem to have
    anticipated how important pipelining would be. Some minor changes to
    the VAX, like not letting one address modify another in the same
    instruction, would have made it a lot easier to pipeline.

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to
    achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Another issue is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what
    RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of
    conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Jul 29 16:44:35 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex instruction and address modes and the tiny 512 byte page size.

    Another, which is not entirely their fault, is that they did not expect compilers to improve as fast as they did, leading to a machine which was fun to program in assembler but full of stuff that was useless to compilers and instructions like POLY that should have been subroutines. The 801 project and PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC presumably didn't know about it.

    Related to the microcode issue they also don't seem to have anticipated how important pipelining would be. Some minor changes to the VAX, like not letting one address modify another in the same instruction, would have made it a lot easier to pipeline.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Wed Aug 27 00:35:18 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the
    overcomplex instruction and address modes and the tiny 512 byte page
    size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not
    expect compilers to improve as fast as they did, leading to a machine
    which was fun to program in assembler but full of stuff that was
    useless to compilers and instructions like POLY that should have been
    subroutines. The 801 project and PL.8 compiler were well underway at
    IBM by the time the VAX shipped, but DEC presumably didn't know about
    it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    Much more than just "well aware": there were at least 15 grad
    students at CMU working on optimizing compilers AND the VAX ISA,
    with Wulf, Newell, and Bell leading the pack.

    POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
    block for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    Hold on a minute:: My Transcendentals are done in POLY-like fashion,
    it is just that the constants come from ROM inside the FPU, instead
    of user defined DRAM coefficients. Thus, POLY is good, POLY as an
    instruction is bad.

    Related to the microcode issue they also don't seem to have
    anticipated how important pipelining would be. Some minor changes to
    the VAX, like not letting one address modify another in the same
    instruction, would have made it a lot easier to pipeline.

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Compilers have taught us that one-address-mode per instruction is
    "sufficient" {if you are going to have address modes.}

    My work on My 66000 has taught me that 1 constant per instruction
    is nearly sufficient. The only places I break this are ST #val[disp]
    and LOOP cnd,Ri,#inc,#max.

    Pipeline work over 1983-to-current has shown that LD and OPs perform
    just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
    were LD and OP, and there are ways to perform LD and OP as if it were
    LD+OP.

    Another issue is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    Condition codes get hard when DECODE width grows greater than 3.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Wed Aug 27 05:12:57 2025
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.6 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Wed Aug 27 17:19:06 2025
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.
    ...
    [...] POLY as an
    instruction is bad.

    Exactly.

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    It's better to forget this misinformation, and instead remember that
    the VAX has an average CPI of 10.6 (Table 8 of <https://american.cs.ucdavis.edu/academic/readings/papers/p301-emer.pdf>)
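
    The arithmetic, for concreteness (the 11/780's 200ns cycle means a
    5MHz clock):

        native MIPS = clock rate / CPI
        at  5.0 CPI: 5MHz /  5.0 = 1.0  MIPS  (the folklore figure)
        at 10.6 CPI: 5MHz / 10.6 = 0.47 MIPS  (the measured machine)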

    Table 9 of that reference is also interesting:

    CALL/RET instructions take an average of 45 cycles, Character
    instructions (I guess this means stuff like EDIT) take an average of
    117 cycles, and Decimal instructions take an average of 101 cycles.
    It seems
    that these instructions all have no special hardware support on the
    VAX 11/780 and do it all through microcode. So replacing Character
    and Decimal instructions with calls to functions on a RISC-VAX could
    easily outperform the VAX 11/780 even without special hardware
    support. Now add decimal support like the HPPA has done or string
    support like the Alpha has done, and you see even better speed for
    these instructions.

    For CALL/RET, one might use one of the modern calling conventions.
    However, this loses some capabilities compared to the VAX. So one may
    prefer to keep frame pointers by default and maybe other features that
    allow, e.g., universal cross-language debugging on the VAX without monstrosities like ELF and DWARF.

    Pipeline work over 1983-to-current has shown that LD and OPs perform
    just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
    were LD and OP, and there are way to perform LD and OP as if it were
    LD+OP.

    I don't know what you are getting at here. When implementing the 486,
    Intel chose the following pipeline:

    Instruction Fetch
    Instruction Decode
    Mem1
    Mem2/OP
    Writeback

    This meant that load-and-op instructions take 2 cycles (and RMW
    instructions take three); it gave us the address-generation interlock (op-to-load latency 2), and 3-cycle taken branches. An alternative
    would have been:

    Instruction Fetch
    Instruction Decode
    Mem1
    Mem2
    OP
    Writeback

    This would have resulted in a max throughput of 1 CPI for sequences
    of load-and-op instructions, but would have resulted in an AGI of 3
    cycles, and 4-cycle taken branches.
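
    To make the AGI concrete, a small sketch; the cycle counts in the
    comments refer to the two hypothetical pipelines above:

        #include <stdio.h>

        struct node { struct node *next; int val; };

        int main(void)
        {
            /* A 3-element circular list to chase. */
            struct node n0, n1, n2;
            n0 = (struct node){ &n1, 0 };
            n1 = (struct node){ &n2, 1 };
            n2 = (struct node){ &n0, 2 };

            struct node *p = &n0;
            for (int i = 0; i < 6; i++) {
                /* Each load's result is the next load's address, so every
                   step pays the full op-to-address latency: 2 cycles on
                   the 486-style pipeline, 3 on the Mem1/Mem2/OP variant. */
                p = p->next;
                printf("%d ", p->val);
            }
            putchar('\n');
            return 0;
        }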

    For the Bonnell, Intel chose such a pipeline (IIRC with a third mem
    stage), but the Bonnell has a branch predictor, so the longer branch
    latency usually does not hurt.

    AFAIK IBM used such a pipeline for some S/360 descendants.

    Condition codes get hard when DECODE width grows greater than 3.

    And yet the widest implementations (up to 10 wide up to now) are of
    ISAs that have condition-code registers. Even particularly nasty ones
    in the case of AMD64.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)