Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
As we have discussed, the S/360 designers needed some mechanism to
allow a program to be loaded at an arbitrary location in memory.
Did they? Why? I remember reading that the systems software people
spent a lot of effort on an overlay mechanism, so the thinking at IBM at
the time was apparently not about keeping several programs in RAM at
the same time, but about running one program at a time, and finding
ways to make that program fit into the available RAM.
In any case, it's no problem to add a virtual-memory mechanism that is
not visible to user-level, or maybe even kernel-level (does the
original S/360 have that?) programs, whether it's paged virtual memory
or a simple base+range mechanism.
An interesting development is that, e.g., on Ultrix on DECstations
programs were statically linked for a specific address. Then dynamic
linking became fashionable; on Linux at first dynamically-linked
s/linux/svr3/. It was SVR3 Unix that first had static libraries linked
at a specific address.
No, I meant statically linked libraries.
Static linking does not require any coordination. Every executable
gets its own copy of the library parts it uses, linked to fit into the
executable's address space, i.e., with static linking libraries are
not shared.
The low end 360s were too underpowered to do
time sharing and any sort of dynamic relocation would have just made them more
expensive with no practical benefit. A 360/30 was a lot slower than a PDP-8 but
made up for it with much better I/O devices.
While I admit that I was less familiar with the lower end systems, I think
the extra expense would have been a single register in the CPU to hold
the base, and a few extra instructions at task switch time to save and
reload it. Not very much. And the benefits to the larger systems would
have been significant when they implemented interactive usage.
Remember that S/360 was mostly aimed at batch processing where each program
starts and runs until it's done. The higher end systems did multiprogramming so
they could run some other batch program in the short interval while waiting for
a disk or tape or card operation.
True, and good points.
They included what they called teleprocessing but those systems were transaction
monitors built in the SAGE model with a queue of short chunks of code running to
completion. Relocation and swapping wouldn't help there either.
Agreed. Although how much were the choices in implementing
teleprocessing influenced by the hardware design choices? I don't know
and haven't thought about it at all.
The SAGE programming model has been quite successful for systems that
need fast realtime response, even 70 years later.
According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
The low end 360s were too underpowered to do
time sharing and any sort of dynamic relocation would have just made them more
expensive with no practical benefit. A 360/30 was a lot slower than a PDP-8 but
made up for it with much better I/O devices.
While I admit that I was less familiar with the lower end systems, I think
the extra expense would have been a single register in the CPU to hold
the base, and a few extra instructions at task switch time to save and
reload it. Not very much. And the benefits to the larger systems would
have been significant when they implemented interactive usage.
The 360/30 was byte serial and stored the 16 registers in core (and I mean
core). According to my copy of the 360/30 Functional Characteristics manual,
a register to register load took 17us, a memory to register load took 24us,
with an additional 4.5us if it was indexed. I'd think the time to add a
system base register would be about the same as the indexing time, as would
the comparison to the bound register, so that's an extra 9us for every
instruction, which would be about a 30% slowdown. If they put those
registers in logic to speed it up, it'd be a significant extra chunk of
hardware.
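As a rough check on that arithmetic (my numbers, going only by the figures quoted above, and assuming the hidden base add and the bounds compare would each cost about the same as the 4.5us indexing step):

```python
# Rough arithmetic on the quoted 360/30 timings. Assumption: the base add
# and the bounds compare each cost about as much as the 4.5us indexing step.
mem_load = 24.0            # us, memory to register load
index_step = 4.5           # us, one extra address-arithmetic step
overhead = 2 * index_step  # base add + bounds compare = 9us

slowdown = overhead / (mem_load + index_step)  # vs. an indexed load
print(round(slowdown, 3))  # 0.316, i.e. roughly a 30% slowdown
```

Against an unindexed 24us load the same 9us would be 37.5%, so "about 30%" fits the indexed case.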
The guys who designed the 360 thought really hard about a design that could scale
up and down and have multiple efficient implementations.
The /30 was the most
popular of the 360 line. IBM shipped thousands of them. They made a few mistakes
(hex floating point and the high address byte) but not big ones.
The 360/30 was byte serial and stored the 16 registers in core (and I mean
core). According to my copy of the 360/30 Functional Characteristics manual, a
register to register load took 17us, memory to register took 24us, with an
additional 4.5us if it was indexed. I'd think the time to add a system base
register would be about the same as the indexing time, as would the comparison
to the bound register, so that's an extra 9us for every instruction, which would
be about a 30% slowdown. If they put those registers in logic to speed it up,
it'd be a significant extra chunk of hardware.
First John, I want to thank you for forcing me to think about this.
Good to keep the brain active!
I think your analysis is flawed. While you would have to add the
contents of the system base register to compute the memory address for
the memory to register load, you would save having to add the contents
of the base register from the instruction. So I think it would be a wash.
It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
The 360/30 was byte serial and stored the 16 registers in core (and I mean
core). According to my copy of the 360/30 Functional Characteristics manual, a
register to register load took 17us, memory to register took 24us, with an
additional 4.5us if it was indexed. I'd think the time to add a system base
register would be about the same as the indexing time, as would the comparison
to the bound register, so that's an extra 9us for every instruction, which would
be about a 30% slowdown. If they put those registers in logic to speed it up,
it'd be a significant extra chunk of hardware.
First John, I want to thank you for forcing me to think about this.
Good to keep the brain active!
I think your analysis is flawed. While you would have to add the
contents of the system base register to compute the memory address for
the memory to register load, you would save having to add the contents
of the base register from the instruction. So I think it would be a wash.
But the most important goal of the 360 was a single architecture, run the same
code on every model. This mutant /30 would presumably have 16 bit direct addresses only, and so much for upward compatibility with models with more memory.
In the IBMSJ architecture article they said:
It was decided to commit the system completely to a base-register technique;
the direct part of the address, the displacement, was made so small (12 bits, or
4096 characters) that direct addressing is a practical programming technique
only on very small models. This commitment implies that all programs are
location-independent, except for constants used to load the base registers.
Thus, all programs can easily be relocated.
I think they meant it was easy to relocate programs when they were loaded, which
is true, no fiddly instruction patching needed.
The idea that you would move a
program after it was loaded was at the time an exotic high end feature.
But that is not what I suggested. Let's go back a bit. I suggested
that the choice of using visible registers for base registers was a
mistake made by the S/360 architects. You responded by saying that while
that would have been OK for the larger systems, on smaller systems
like the /30, it would have substantially hurt performance. I think
that I showed above that this didn't have to be the case. So I repeat
my suggestion that a hidden base register would have been a better
choice, both for the bigger models, and even for the /30. I am most
definitely NOT suggesting a different architecture for the smaller
versus larger systems.
It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
But that is not what I suggested. Let's go back a bit. I suggested
that the choice of using visible registers for base registers was a
mistake made by the S/360 architects. You responded by saying that while
that would have been OK for the larger systems, on smaller systems
like the /30, it would have substantially hurt performance. I think
that I showed above that this didn't have to be the case. So I repeat
my suggestion that a hidden base register would have been a better
choice, both for the bigger models, and even for the /30. I am most
definitely NOT suggesting a different architecture for the smaller
versus larger systems.
I'm confused. Can you give some examples in this system of how it
would still provide 24 bit addressing, using the same instruction set
on large and small models, and not making the instructions bigger?
Each address in an instruction was only 16 bits, which they happened
to split 4 bits for the base register and 12 for the displacement. If
you get rid of the register, you still only have 16 bits. On a larger
model with, say, a megabyte of memory, how does a program address that?
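To make the split concrete, here is a small sketch (the register values are invented for illustration) of how a 16-bit base/displacement field resolves to a 24-bit effective address:

```python
def decode_address(halfword, regs):
    """Resolve a 16-bit address field: high 4 bits name a base register,
    low 12 bits are the displacement. Register 0 means 'no base', per
    the S/360 convention."""
    base_reg = (halfword >> 12) & 0xF
    disp = halfword & 0xFFF
    base = regs[base_reg] if base_reg != 0 else 0
    return (base + disp) & 0xFFFFFF   # effective addresses are 24 bits

regs = [0] * 16
regs[12] = 0x050000                   # illustrative base register setting
assert decode_address(0xC123, regs) == 0x050123  # R12 base + 0x123 disp
```

Dropping the base register field gives you 16 displacement bits and 64K of direct reach, which is the problem being posed here.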
On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:
If you mean how does a single program address more than 16MB, the answer
is by using an index register. You still need those. You just don't
need two registers (base and index) when one will do.
One does not do adequately.
What one expects memory reference instructions to be able to do is:
Normally, to be able to access any part of memory in a simple manner
which does not require any additional instructions.
When indexed, to include the address of an array, and, in an index
register, a displacement from the start of an array. With no additional instructions.
That's how they were able to behave back when memory was 64K bytes in
size.
Now that memory is bigger, a base register, set once at the start of the program, and then basically forgotten about, lets the program behave basically the same way. When an address is indexed, also specify an
index register.
Anything else would involve slowing down the program by adding extra instructions that weren't required back when memory was smaller. That's
doing extra work to answer the same questions.
Of course, the other problem is that base registers use up registers.
So in my CISC design, I had a separate bank of eight base registers
distinct from the eight general registers. When there are 32 registers,
though, using two or three of them as base registers is not bad enough
to make that necessary.
It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
But that is not what I suggested. Let's go back a bit. I suggested
that the choice of using visible registers for base registers was a
mistake made by the S/360 architects. You responded by saying that while
that would have been OK for the larger systems, on smaller systems
like the /30, it would have substantially hurt performance. I think
that I showed above that this didn't have to be the case. So I repeat
my suggestion that a hidden base register would have been a better
choice, both for the bigger models, and even for the /30. I am most
definitely NOT suggesting a different architecture for the smaller
versus larger systems.
I'm confused. Can you give some examples in this system of how it
would still provide 24 bit addressing, using the same instruction set
on large and small models, and not making the instructions bigger?
Each address in an instruction was only 16 bits, which they happened
to split 4 bits for the base register and 12 for the displacement. If
you get rid of the register, you still only have 16 bits. On a larger
model with say, a megabyte of memory, how does a program address that?
On a 360 if I'm accessing a struct larger than 2kB then I would
load a 32-bit immediate offset into a temp register, say R1.
(I don't know 360 but I don't see any 32-bit load immediate instructions
so it looks like I'd have to do something like Load Address LA to load
a 12-bit constant, left shift it 12 bits, then LA to add the low 12 bits.
Basically construct a large constant from smaller ones the way RISCs do.)
Note also that the 360 index register was not scaled, and so an array
index value was not directly usable for anything other than byte arrays.
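The LA/shift/LA sequence can be sketched arithmetically; this mirrors the steps described above, not actual 360 code (on which, like EricP, I claim no authority):

```python
def build_constant(hi12, lo12):
    """Build a 24-bit constant from two 12-bit pieces, mirroring the
    LA / shift-left-12 / LA sequence sketched above."""
    r = hi12 & 0xFFF        # LA R1,hi12      -- load a 12-bit constant
    r <<= 12                # SLL R1,12       -- shift into the upper position
    r += lo12 & 0xFFF       # LA R1,lo12(R1)  -- add the low 12 bits
    return r

assert build_constant(0x123, 0x456) == 0x123456   # a 24-bit offset
```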
It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
Each address in an instruction was only 16 bits, which they happened
to split 4 bits for the base register and 12 for the displacement. If
you get rid of the register, you still only have 16 bits. On a larger
model with say, a megabyte of memory, how does a program address that?
If you mean how does a single program address more than 16MB, ...
No, I mean how does a program address more than 64K.
There's one base register,
and the address field in an instruction is only 16 bits. How am I supposed to address a megabyte with only a 16 bit offset?
If the idea is that each program is limited to 64K even though the overall system address space is bigger, BTDT on a PDP-11 and would prefer not to go back.
If you mean how does a single program address more than 16MB, ...
No, I mean how does a program address more than 64K.
Index registers. You still need those, for exactly that reason. But
you don't need a second mechanism, i.e. base registers specified in the
instruction. One mechanism is sufficient. If you have 32 bit
registers, as the S/360 did, you can address up to 4GB.
According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
If you mean how does a single program address more than 16MB, ...
No, I mean how does a program address more than 64K.
Index registers. You still need those, for exactly that reason. But
you don't need a second mechanism, i.e. base registers specified in the
instruction. One mechanism is sufficient. If you have 32 bit
registers, as the S/360 did, you can address up to 4GB.
I don't get the impression that we are thinking about the same S/360.
because index registers are for
displacing from an address; base registers build an address.
The 360 had four instruction formats.
RR was register to register, no
problem there. RX was memory to register, with a four bit register operand, four bit base register, four bit index register, and 12 bit displacement.
As I understand it, you'd change that to 16 bit displacement relative to
an implicit base register and still have the optional index register.
But there are two other instruction formats, SS and SI, that have a four
bit base register, 12 bit displacement, and no index register. What happens
to them? A 16 bit displacement so you can only address 64K? Reuse the base
register bits as an index register so you can only address 4K directly?
In case it's not obvious, all programs but the most trivial used multiple base
registers.
First you'd have one to point to the code and static data.
For I/O you'd
set another register to point to an I/O buffer, and use that register as the base register in SS and SI instructions to move stuff in and out of the buffer,
then pass the buffer to the operating system, update the register to point to the next buffer and do it again. If you were doing a read-compute-write loop,
you'd have one base register for the read buffer and one for the write buffer.
Same for any non-trivial data structure, you set a register to point to a
structure and use it as the base register to refer to fields. A single
global base register couldn't do any of that.
I read somewhere that they did simulations and found that a typical program used
four base registers at a time.
Using an integer general register for program relocation was a flawed approach. It uses a critical 4 instruction bits for a second register specifier that doesn't work for program relocation and loses 4 bits
from the displacement which frequently could use them.
The correct design was to have separate base and bounds registers for
program relocation, managed by the OS outside program control.
When the OS switches tasks it loads the integer and float registers,
sets base and bounds physical offsets for it, and Bob's your uncle.
Also all tasks are dynamically relocatable.
The cost is just the two base and bounds relocation registers.
The same ALU is still used for AGEN to calculate [reg+disp+base] and
send the physical address to the Memory Address Register (MAR).
While the bus cycle sequencer is accessing memory the ALU can be used
to do the bounds check and maybe abort the access.
And IBM could have charged extra for the base and bounds registers
(which would have been present in all models, just enabled by a jumper).
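A minimal model of that scheme; the name `agen` is mine, and the overlapped hardware sequencing is collapsed into a single function for illustration:

```python
def agen(reg_val, disp, base, bound):
    """Compute reg + disp, bounds-check it against the task's limit, and
    relocate by the OS-managed base. In hardware the bounds check would
    overlap the bus cycle, as described above, and a violation would
    abort the access."""
    logical = reg_val + disp
    if logical >= bound:
        raise MemoryError("bounds violation")
    return base + logical              # physical address sent to the MAR

assert agen(reg_val=0x200, disp=0x10, base=0x40000, bound=0x1000) == 0x40210
```

On a task switch the OS would just reload `base` and `bound` for the new task; the program itself never sees either register.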
Anything else would involve slowing down the program by adding extra
instructions that weren't required back when memory was smaller. That's
doing extra work to answer the same questions.
This is the ISA design trade off - which address calculations occur frequently enough to warrant their own instructions (address modes).
Of course, the other problem is that base registers use up registers. So
in my CISC design, I had a separate bank of eight base registers
distinct from the eight general registers. When there are 32 registers,
though, using two or three of them as base registers is not bad enough
to make that necessary.
John Savard
I would rather have a [base+index<<scale+disp] address mode using
integer registers and let the compiler decide how best to use them.
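That mode is just a three-term effective-address calculation; a sketch, with a scale of 2 (4-byte elements) chosen purely for illustration:

```python
def effective_address(base, index, disp, scale=2):
    """[base + index<<scale + disp]: base and index are ordinary integer
    registers; scale shifts the index up to the element size."""
    return base + (index << scale) + disp

# Element 5 of a 4-byte-element array starting at 0x1000:
assert effective_address(base=0x1000, index=5, disp=0) == 0x1014
```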
quadibloc wrote:
On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:
If you mean how does a single program address more than 16MB, the answer
is by using an index register. You still need those. You just don't
need two registers (base and index) when one will do.
One does not do adequately.
What one expects memory reference instructions to be able to do is:
Normally, to be able to access any part of memory in a simple manner
which does not require any additional instructions.
When indexed, to include the address of an array, and, in an index
register, a displacement from the start of an array. With no additional
instructions.
[reg], [reg+disp], [reg+index+disp] are all different address
calculations.
The only memory address mode that's functionally mandatory is [reg].
After that the question is which calculations occur frequently enough to warrant being integrated into their own instruction (address modes).
Others are then relegated to separate address calculations, and how many
instructions a particular address expression takes depends on its
complexity and on how it maps onto a given ISA.
On 6/19/2025 2:45 PM, John Levine wrote:
--------------
Let's start with the SS instructions. The documentation says the
instruction has two base registers. But if I said, with no actual
change to the hardware, it has two index registers, it would perform
exactly as it does now.
Yes, it only has 12 bit displacements, but that
is no different from what it has now.
So other than the name in the documentation, things are exactly as they were.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
On a 360 if I'm accessing a struct larger than 2kB then I would
load a 32-bit immediate offset into a temp register, say R1.
(I don't know 360 but I don't see any 32-bit load immediate instructions
so it looks like I'd have to do something like Load Address LA to load
a 12-bit constant, left shift it 12 bits, then LA to add a low 12-bits.
Basically construct a large constant from smaller ones the way RISC do.)
They usually loaded constants from memory close to the routine itself.
https://bitsavers.org/pdf/ibm/360/training/GC20-1646-5_A_Programmers_Introduction_to_IBM_System360_Assembly_Language_196907.pdf
is a nice introduction.
I was looking for ways that don't require an extra memory access
and can also be used for 32-bit integer calculations.
Ideally an instruction to Load Immediate of 32-bits into a register,
an 8-bit opcode, 4-bit function code, 4-bit dest reg, and two
16-bit immediates (a variation on the 48-bit instruction format).
On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
--------------------------
Using an integer general register for program relocation was a flawed
approach. It uses a critical 4 instruction bits for a second register
specifier that doesn't work for program relocation and loses 4 bits
from the displacement which frequently could use them.
I think they (IBM) originally thought that their base registers would
be fixed register numbers so that relocation software could update
them as segments moved around (pre release), and that they realized
later that this was a folly.
Once they discovered "Translation" they decided to live with it
for a long time--until the DAT box showed up (/67).
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
On a 360 if I'm accessing a struct larger than 2kB then I would
load a 32-bit immediate offset into a temp register, say R1.
(I don't know 360 but I don't see any 32-bit load immediate instructions
so it looks like I'd have to do something like Load Address LA to load
a 12-bit constant, left shift it 12 bits, then LA to add a low 12-bits.
Basically construct a large constant from smaller ones the way RISC do.)
They usually loaded constants from memory close to the routine itself.
https://bitsavers.org/pdf/ibm/360/training/GC20-1646-5_A_Programmers_Introduction_to_IBM_System360_Assembly_Language_196907.pdf
is a nice introduction.
This way would also need a BAL to copy the PC into a base register,
then L at PC-offset to load a 32-bit offset into an index register,
then an RX instruction using the base+index address.
I was looking for ways that don't require an extra memory access
and can also be used for 32-bit integer calculations.
Ideally an instruction to Load Immediate of 32-bits into a register,
an 8-bit opcode, 4-bit function code, 4-bit dest reg, and two
16-bit immediates (a variation on the 48-bit instruction format).
Alternatively a variation on 32-bit formats using two instructions,
a Load Immediate High which shifts the 16-bit immediate to the
dest register upper end, plus an Add Immediate of 16-low bits.
An 8-bit opcode, a 4-bit function code field, a 4-bit source/dest
register, and a 16-bit value. Also useful for many other operations with
16-bit immediate values: sub, mul, div, and, or, xor.
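The two-instruction variant can be sketched as follows; the mnemonics LIH/AIU are my inventions, and I treat the low half as an unsigned add (a sign-extending Add Immediate would need the high half adjusted when the low 16 bits have the top bit set):

```python
MASK32 = 0xFFFFFFFF

def load_imm_high(imm16):
    """LIH Rd,imm16 (hypothetical): place 16 bits in the upper half."""
    return (imm16 & 0xFFFF) << 16

def add_imm_unsigned(reg, imm16):
    """AIU Rd,imm16 (hypothetical): add an unsigned low 16 bits."""
    return (reg + (imm16 & 0xFFFF)) & MASK32

r = load_imm_high(0xDEAD)
r = add_imm_unsigned(r, 0xBEEF)
assert r == 0xDEADBEEF     # full 32-bit immediate built in two steps
```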
On Thu, 19 Jun 2025 17:52:49 +0000, John Levine wrote:
It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
Each address in an instruction was only 16 bits, which they happened
to split 4 bits for the base register and 12 for the displacement. If
you get rid of the register, you still only have 16 bits. On a larger
model with say, a megabyte of memory, how does a program address that?
If you mean how does a single program address more than 16MB, ...
No, I mean how does a program address more than 64K. There's one base
register, and the address field in an instruction is only 16 bits. How am
I supposed to address a megabyte with only a 16 bit offset?
Given that he continued to write:
If you mean how does a single program address more than 16MB, the answer
is by using an index register. You still need those. You just don't
need two registers (base and index) when one will do.
he gave an answer to addressing more than 64K.
16 MB is addressed by 24 bits, and is thus the entire address
space of System/360. I presume that was just a typo.
I disagree with his solution, because index registers are for
displacing from an address; base registers build an address.
Locations in memory should be able to be addressed in a static
manner.
On Thu, 19 Jun 2025 21:45:24 +0000, John Levine wrote:
But there are two other instruction formats SS and SI that have four bit
base register, 12 bit displacement, and no index register. What happens to
them? 16 bit displacement so you can only address 64K? Reuse the base
register bits as an index register so you can only address 4K directly?
Since he is "really" talking about the fact that using base registers,
in addition to index registers, is a mistake on my new Concertina II
design, the fact that the string and packed decimal memory-to-memory instructions, with no room for indexing, couldn't do without the base register... is merely a historical sidelight.
The System/360 design could just have added 64-bit instructions, I
suppose.
In principle, indeed, one doesn't "need" base registers. One can use the index registers as base registers, and then use another register with
the base plus the array displacement whenever one accesses an array. I
think base registers are a better idea; array accesses are common enough
that saving an instruction for them makes sense.
I did feel the 68000 design made a mistake with its address registers.
Using separate registers, on a CISC design with register banks of only 8 registers, for the base registers makes sense. They're mostly static,
and they take up precious register space. But indexes are computed, and
so integer GPRs, not address registers, ought to have been used for
that, in my opinion.
This may have been mitigated, though; I think the 68000 had forms of the arithmetic instructions that worked with the address registers instead.
John Savard
According to MitchAlsup1 <mitchalsup@aol.com>:
On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
--------------------------
Using an integer general register for program relocation was a flawed
approach. It uses a critical 4 instruction bits for a second register
specifier that doesn't work for program relocation and loses 4 bits
from the displacement which frequently could use them.
I think they (IBM) originally thought that their base registers would
be fixed register numbers so that relocation software could update
them as segments moved around (pre release), and that they realized
later that this was a folly.
I'm pretty sure you're wrong. They didn't think they needed to move
a program after it was loaded.
I'm pretty sure you're wrong. They didn't think they needed to move
a program after it was loaded.
Which was a mistake, but one that had no impact on MFT (you had a
fixed number of regions with fixed memory there); it mattered once
they released MVT, because then memory fragmentation became
inevitable.
They may not have considered that early enough in the project.
It appears that Thomas Koenig <tkoenig@netcologne.de> said:
I'm pretty sure you're wrong. They didn't think they needed to move
a program after it was loaded.
Which was a mistake, but one that had no impact on MFT (you had a
fixed number of regions with fixed memory there); it mattered once
they released MVT, because then memory fragmentation became
inevitable.
They may not have considered that early enough in the project.
See the message I sent yesterday. They knew about dynamic relocation and virtual
memory, but considered it too risky to add to an already ambitious project.
Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.
They added VM to S/370 but in the intervening years both the hardware and
the understanding of how VM works had gotten a lot better. It is my
impression that early VM systems were wildly overoptimistic about how
little physical memory they needed. Fortunately, Moore's law made memory
sizes grow enough to solve that problem by brute force, somewhat aided by
better understanding of working sets.
On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
quadibloc wrote:
On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:
If you mean how does a single program address more than 16MB, the
answer
is by using an index register. You still need those. You just don't
need two registers (base and index) when one will do.
One does not do adequately.
What one expects memory reference instructions to be able to do is:
Normally, to be able to access any part of memory in a simple manner
which does not require any additional instructions.
When indexed, to include the address of an array, and, in an index
register, a displacement from the start of an array. With no additional
instructions.
[reg], [reg+disp], [reg+index+disp] are all different address
calculations.
Yes, but the first 2 are a STRICT subset of the last one. So, you
build the AGEN unit to perform the last one, and have DECODE feed
zeros (0s) for the parts you don't use.
The only memory address mode that's functionally mandatory is [reg].
Leading to poor addressability and larger instruction count.
After that the question is which calculations occur frequently enough to
warrant being integrated into their own instruction (address modes).
Having spent 7 years doing x86, the answer was clear to me::
[base+Rindex<<2+Displacement]
Others are then relegated to separate address calculations; how many
instructions a given address expression takes depends on how it maps
onto the particular ISA.
So, now you are claiming that adding instructions and latency to
memory access is not harming performance !?!?!?!
Clearly you don't "get it"
John Levine <johnl@taugh.com> schrieb:
It appears that Thomas Koenig <tkoenig@netcologne.de> said:
I'm pretty sure you're wrong. They didn't think they needed to move
a program after it was loaded.
Which was a mistake, but one that had no impact on MFT (you had a
fixed number of regions with a fixed memory there), but once
they released MVT, because then memory fragmentation became
inevitable.
They may not have considered that early enough in the project.
See the message I sent yesterday. They knew about dynamic relocation and virtual
memory, but considered it too risky to add to an already ambitious project.
Brooks himself wrote he considered not adding virtual memory to the /360
a mistake, so...
Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.
I think we can be in agreement that it was, indeed, a mistake,
but obviously not fatal.
IBM had good peripherals, they had a good upgrade path to very
powerful machines, and they were bit-compatible for user programs
(plus, they put in the microcode emulation of the 1401 so their
customers could transition smoothly - that was a genius move,
the /360 would probably have been far less of a success
if that had not been possible). All of these were good reasons
to buy these machines.
Customers could and did work around the memory fragmentation,
but it didn't make their lives easier.
But IBM severely underestimated the software complexity of the
system they were creating, hence the delays and "The Mythical
Man-Month" (and such abominations as JCL. Which way around is
that COND parameter again?)
No, I am pointing to the reality that each ISA chooses certain
operations to perform more optimally than others.
If my ISA has a 3-bit scale field and yours has 2,
and if the expression is an index to an fp64 complex array,
then I use just 1 instruction while you need 2.
360 has [base+index+imm12] but does not have scaled index so for array indexing on >1 byte it must copy an array index to a temp register,
then left shift. The extra copy is required because shift left
operates on a single source-dest register only.
MitchAlsup1 wrote:
Having spent 7 years doing x86, the answer was clear to me::
[base+Rindex<<2+Displacement]
I assume you mean a 2-bit scale, not the constant 2.
I do want to say that because I believe that IBM made some mistakes on
the S/360, I don't want to take away their good decisions or detract
from their success.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
No, I am pointing to the reality that each ISA chooses certain
operations to perform more optimally than others.
If my ISA has a 3-bit scale field and yours has 2,
and if the expression is an index to an fp64 complex array,
then I use just 1 instruction while you need 2.
Hmm... assuming you have base+index addressing without
scaling (and without implied scaling), you can do
(for four-byte sizes)
for (i=0; i<n; i++) {
c[i] = a[i] + b[i]
}
and assuming that R1 points at a[0], R2 at b[0] and R3 at c[0],
that R4 is zero initially, and that R7 holds 4*n, you can do
(pseudo-assembly):
.Loop:
ld R5,[R1,R4]
ld R6,[R2,R4]
add R5,R5,R6
st R5,[R3,R4]
add R4,R4,#4
cmp R4,R7
blt .Loop
For this simple loop, there is no disadvantage to not
having scaled index registers. This can be different
when the value of the index variable is needed for
something else, or for accessing something that has
a different size.
360 has [base+index+imm12] but does not have scaled index so for array
indexing on >1 byte it must copy an array index to a temp register,
then left shift. The extra copy is required because shift left
operates on a single source-dest register only.
Not needed, see above (too lazy to look up the /360 assembler :-)
MitchAlsup1 wrote:
On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
[reg], [reg+disp], [reg+index+disp] are all different address
calculations.
Yes, but the first 2 are a STRICT subset of the last one. So, you
build the AGEN unit to perform the last one, and have DECODE feed
zeros (0s) when you don't need the first two.
Yes, back then there would be just one ALU/AGEN for the core, used for
pretty much all arithmetic, though sometimes it would have a separate incrementer/decrementer so it can overlap it with the ALU/AGEN.
On Sat, 21 Jun 2025 0:06:35 +0000, quadibloc wrote:
On Fri, 20 Jun 2025 23:57:29 +0000, quadibloc wrote:
On Fri, 20 Jun 2025 21:34:24 +0000, MitchAlsup1 wrote:
The attack of the Killer Micro's did not appear until circa 1977.
That could be considered the very beginning, as that's when the Altair
8800 came out and so on.
And since the context was discussing events before 1977, that's good
enough to say that back then, micros weren't a problem for sure.
But 8-bit microprocessors didn't kill minis and mainframes. They weren't
powerful enough to compete. When did micros really become killers?
Well, they certainly were killers when the Pentium II came out in 1997,
but I'd say that's rather a late date.
Instead, micros were lethal to a lot of larger systems even before they
reached that level of performance. In 1987, halfway between those two
dates, Intel came out with the 387. Hardware floating point for a 32-bit
system? It's about at that point that anything larger became
questionable.
And I was able to find out that the phrase was coined by Eugene Brooks
in 1990, in the title of a paper at Supercomputing 1990.
1989 certainly included some momentous events - the Cyrix FasMath 83D87,
and the Intel 486, with hardware floating-point standard.
On 6/15/2025 1:20 PM, John Levine wrote:
According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
The low end 360s were too underpowered to do
time sharing and any sort of dynamic relocation would have just made them more
expensive with no practical benefit. A 360/30 was a lot slower than a PDP-8 but
made up for it with much better I/O devices.
While I admit I was less familiar with the lower end systems, I think
the extra expense would have been a single register in the CPU to hold
the base, and a few extra instructions at task switch time to save and
reload it. Not very much. And the benefits to the larger systems would
have been significant when they implemented interactive usage.
The 360/30 was byte serial and stored the 16 registers in core (and I mean
core.) According to my copy of the 360/30 Functional Characteristics manual, a
register to register load took 17us, memory to register took 24us, with an
additional 4.5us if it was indexed. I'd think the time to add a system base
register would be about the same as the indexing time, as would the comparison
to the bound register, so that's an extra 9us for every instruction, which would
be about a 30% slowdown. If they put those registers in logic to speed it up,
it'd be a significant extra chunk of hardware.
First John, I want to thank you for forcing me to think about this.
Good to keep the brain active!
I think your analysis is flawed. While you would have to add the
contents of the system base register to compute the memory address for
the memory to register load, you would save having to add the contents
of the base register from the instruction. So I think it would be a wash.
Furthermore, since the S/360 used storage keys for protection, there is
no need for a bounds register.
Lastly, since programs were loaded on page boundaries and the max memory
on the /30 (I had to look this up) was 64K, the system base register
would only have had to be 4 bits! So it was maybe small enough to invest
in actual hardware to hold it. If so, it would have been a significant
speedup, as you wouldn't have had to load the base register value from core.
John Levine <johnl@taugh.com> schrieb:
Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.
I think we can be in agreement that it was, indeed, a mistake,
but obviously not fatal.
IBM had good peripherals, they had a good upgrade path to very
powerful machines, and they were bit-compatible for user programs
(plus, they put in the microcode emulation of the 1401 so their
customers could transition smoothly - that was a genius move,
the /360 would probably have been far less of a success
if that had not been possible). All of these were good reasons
to buy these machines.
Customers could and did work around the memory fragmentation,
but it didn't make their lives easier.
But IBM severely underestimated the software complexity of the
system they were creating, hence the delays and "The Mythical
Man-Month" (and such abominations as JCL. Which way around is
that COND parameter again? But because I made some money
working on mainframes as a student, I cannot complain - nobody
ever challenged the hours I billed because mainframes are
complex, as everybody knows, and JCL was a large part of that :-)
On Fri, 20 Jun 2025 18:06:18 +0000, John Levine wrote:
It appears that Thomas Koenig <tkoenig@netcologne.de> said:
I'm pretty sure you're wrong. They didn't think they needed to move
a program after it was loaded.
Which was a mistake, but one that had no impact on MFT (you had a
fixed number of regions with a fixed memory there), but once
they released MVT, because then memory fragmentation became
inevitable.
They may not have considered that early enough in the project.
See the message I sent yesterday. They knew about dynamic relocation and
virtual
memory, but considered it too risky to add to an already ambitious
project.
Considering that S/360 outsold all of its competitors combined, it's
hard to
argue it was a major mistake.
What outsold the competitors is the ISA remaining stable over machine
size and machine generation--preserving the software investment.
Over in the number crunching side of things (CDC 6600-7600--CRAY)
one had to hold onto Fortran decks and recompile for each machine.
On Fri, 20 Jun 2025 21:34:24 +0000, MitchAlsup1 wrote:
The attack of the Killer Micro's did not appear until circa 1977.
That could be considered the very beginning, as that's when the Altair
8800 came out and so on.
And since the context was discussing events before 1977, that's good
enough to say that back then, micros weren't a problem for sure.
But 8-bit microprocessors didn't kill minis and mainframes. They weren't powerful enough to compete. When did micros really become killers?
Well, they certainly were killers when the Pentium II came out in 1997,
but I'd say that's rather a late date.
Instead, micros were lethal to a lot of larger systems even before they reached that level of performance. In 1987, halfway between those two
dates, Intel came out with the 387. Hardware floating point for a 32 bit system? It's about at that point that anything larger became
questionable.
What define(s|d) a "mini" or a "mainframe"?
For "micro" AFAIK the definition is/was "single-chip CPU", so I guess
"mini" would be something like "CPU made of 74xxx thingies?" and as for
how to distinguish them from mainframes, I don't know.
I think you underestimate the impact of micros. At the lowest end the
ZX Spectrum and Commodore 64 gave nontrivial compute power at low cost.
There were the IBM PC and 68000-based workstations. So already around
1983 micros limited the market for low-end minis (and due to minis the
market for low-end mainframes was limited earlier).
On Fri, 20 Jun 2025 19:13:35 +0000, EricP wrote:
MitchAlsup1 wrote:
On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
[reg], [reg+disp], [reg+index+disp] are all different address
calculations.
Yes, but the first 2 are a STRICT subset of the last one. So, you
build the AGEN unit to perform the last one, and have DECODE feed
zeros (0s) when you don't need the first two.
Yes, back then there would be just one ALU/AGEN for the core, used for
pretty much all arithmetic, though sometimes it would have a separate
incrementer/decrementer so it can overlap it with the ALU/AGEN.
The MC88100 had::
a) integer ALU (+ and -)
b) address ALU (+ and <<{0,1,2,3})
c) PC ALU (INC4, Disp16, Disp26)
mostly because we did not want to route data to the ALU, and
occasionally we wanted to use several FUs simultaneously.
Note: Integer adder needed negate to perform SUB, this takes the
same gate delay as AGEN with <<{0,1,2,3} with add-only.
Even the MC68000 had 3 adders {PC, D, A}
I am not sure of loading only at page boundaries. 360/30 had no
paging hardware and AFAIK storage keys were optional,
so in
principle it could load anywhere.
Waldek Hebisch <antispam@fricas.org> schrieb:
I am not sure of loading only at page boundaries. 360/30 had no
paging hardware and AFAIK storage keys were optional,
There was no mention of this in the Principles of Operation,
and its timing is given in the System/360 Model 30 Functional
Characteristics document, so I don't think this is true.
so in
principle it could load anywhere.
We should also consider what the machine was capable of running.
Like all of /360 it was supposed to have run OS/360, but
that was running late and was too big, so smaller systems
were used. These were generally only capable of running
one program at a time, so the point of where to load becomes
sort of moot. (Also, DOS/360 does not seem to have had a
relocating loader, so everything had to be loaded at
a pre-determined address.)
Waldek Hebisch <antispam@fricas.org> schrieb:
I am not sure of loading only at page boundaries. 360/30 had no
paging hardware and AFAIK storage keys were optional,
There was no mention of this in the Principles of Operation,
and its timing is given in the System/360 Model 30 Functional
Characteristics document, so I don't think this is true.
We should also consider what the machine was capable of running.
Like all of /360 it was supposed to have run OS/360, but
that was running late and was too big, so smaller systems
were used.
These were generally only capable of running
one program at a time, so the point of where to load becomes
sort of moot. (Also, DOS/360 does not seem to have had a
relocating loader, so everything had to be loaded at
a pre-determined address.)
AFAIK in OS/360 overlays were separately loaded, just like
programs. So even with one program running one was likely
to want several modules, each at its own load address.
It appears that Waldek Hebisch <antispam@fricas.org> said:
AFAIK in OS/360 overlays were separately loaded, just like
programs. So even with one program running one was likely
to want several modules, each at its own load address.
A load module could contain multiple tree structured overlays, and in
OS at least, the linker added glue code that loaded and relocated the appropriate overlay when you called down into one.
One load module could also run another using system calls, which was
occasionally useful; e.g., the sort program could call the linkage
editor to make a loadable module of the specific comparison and exit
routines for a sort run.
A load module could contain multiple tree structured overlays, and in
OS at least, the linker added glue code that loaded and relocated the
appropriate overlay when you called down into one.
Over-use of that technique may have been the reason why the linker
was so sloooooow, even on machines with adequate memory, or maybe
it was something else. (I think they replaced it with something
almost-compatible, and even hijacked the IEWL name, almost
unheard of at IBM).
One load module could also run another using system calls which was
occasionally useful, e.g., the sort program could call the linkage
editor to make a loadable module of the specific comparison and exit
routins for a sort run.
Sort of early JIT, then (pun intended).
According to Thomas Koenig <tkoenig@netcologne.de>:
A load module could contain multiple tree structured overlays, and in
OS at least, the linker added glue code that loaded and relocated the
appropriate overlay when you called down into one.
Over-use of that technique may have been the reason why the linker
was so sloooooow, even on machines with adequate memory, or maybe
it was something else. (I think they replaced it with something
almost-compatible, and even hijacked the IEWL name, almost
unheard of at IBM).
The manual says there were two versions of the linker, E level 15K and
F level 44K, and three subversions of the latter, 44K, 88K, and 128K.
It says in the three versions of F "the logic and control flow is
identical" but the bigger ones are faster which suggests to me that
they unfolded some overlays.
15K is really small, it must have overlaid like crazy.
One load module could also run another using system calls which was
occasionally useful, e.g., the sort program could call the linkage
editor to make a loadable module of the specific comparison and exit
routines for a sort run.
Sort of early JIT, then (pun intended).
Very much so. Tape systems spent more time sorting than doing anything
else, so they had all sorts of hacks to speed it up. Precompiling the
inner loop was just one of them. I gather they wrote their own channel programs, too.
Sort of early JIT, then (pun intended).
Very much so. Tape systems spent more time sorting than doing anything
else, so they had all sorts of hacks to speed it up. Precompiling the
inner loop was just one of them. I gather they wrote their own channel
programs, too.
The Burroughs medium systems Sort intrinsic would even read the tape
backwards to improve sort speed.
On Mon, 23 Jun 2025 6:07:11 +0000, Thomas Koenig wrote:
And they probably didn't touch it again... The machine I worked
on was a Fujitsu rebranded as a Siemens 7881. I didn't know the
original Fujitsu name at the time. It ran BS 3000, which was an
MVS clone.
I tried to look it up, and found it was really a Siemens 7.881-2 (the punctuation is important). And this was one of Fujitsu's larger scale systems, intended to compete with the IBM 3800, so if it ran dead
slow, that is surprising.
But the VAX, in its day, was very successful. And I don't think that
this was just a result of riding on the coattails of the huge popularity
of the PDP-11. It was a good match to the technology *of its time*,
that being machines that were implemented using microcode.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but
neither the VAX designers nor their competition realized this, and
commercial RISCs only appeared in 1986.
That is certainly true but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex
instructions and address modes and the tiny 512 byte page size.
Another, which is not entirely their fault, is that they did not expect
compilers to improve as fast as they did, leading to a machine which was fun to
program in assembler but full of stuff that was useless to compilers and
instructions like POLY that should have been subroutines. The 801 project and
PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
presumably didn't know about it.
Related to the microcode issue they also don't seem to have anticipated how
important pipelining would be. Some minor changes to the VAX, like not letting
one address modify another in the same instruction, would have made it a lot
easier to pipeline.
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but
neither the VAX designers nor their competition realized this, and
commercial RISCs only appeared in 1986.
That is certainly true but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex
instructions and address modes and the tiny 512 byte page size.
Concerning code density, while VAX code is compact, RISC-V code with the
C extension is more compact
<2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.
Another aspect from those measurements is that the 68k instruction set
(with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.
Another, which is not entirely their fault, is that they did not expect
compilers to improve as fast as they did, leading to a machine which was fun to
program in assembler but full of stuff that was useless to compilers and
instructions like POLY that should have been subroutines. The 801 project and
PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
presumably didn't know about it.
DEC probably was aware from the work of William Wulf and his students
what optimizing compilers can do and how to write them. After all,
they used his language BLISS and its compiler themselves.
POLY would have made sense in a world where microcode makes sense: if
microcode can be executed faster than subroutines, put a building
block for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for the VAX, POLY did not make
sense for it, either.
Related to the microcode issue they also don't seem to have anticipated how
important pipelining would be. Some minor changes to the VAX, like not letting
one address modify another in the same instruction, would have made it a lot
easier to pipeline.
My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
given us.
Another issue would be is how to implement the PDP-11 emulation mode.
I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
prefer a ARM/SPARC/HPPA-like handling of conditions.
- anton
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:...
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
[...] POLY as an
instruction is bad.
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
Pipeline work from 1983 to the present has shown that LD and OP perform
just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
were LD and OP, and there are ways to perform LD and OP as if it were
LD+OP.
Condition codes get hard when DECODE width grows greater than 3.