• Re: fitting programs in Why I've Dropped In

    From John Levine@21:1/5 to All on Sat Jun 14 22:12:15 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    As we have discussed, the S/360 designers needed some mechanism to
    allow a program to be loaded at an arbitrary location in memory.

    Did they? Why? I remember reading that the systems software people
    spent a lot of work on an overlay mechanism, so the thinking at IBM at
    the time was apparently not about keeping several programs in RAM at
    the same time, but about running one program at one time, and finding
    ways to make that program fit into available RAM.

    They did both. OS/360 divided up memory into partitions, at boot time in MFT or dynamically in MVT. Each job step said how big a partition it needed,
    and if you ran out of space, your program failed. Many of the utilities
    had different versions for different partition sizes, which I assume were
    the same code with more or less overlaying.

    In any case, it's no problem to add a virtual-memory mechanism that is
    not visible to user-level, or maybe even kernel-level (does the
    original S/360 have that?) programs, whether it's paged virtual memory
    or a simple base+range mechanism.

    That's what the 360/67 and 370 DAT did. CP/67 and later VM/370 took
    advantage of the fact that nearly everything that affects or observes
    the global environment traps in user mode so they could provide a
    simulated kernel mode good enough to fool most operating systems.
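    (To make the trap-and-emulate idea concrete, here is a toy sketch in
    C. Everything in it, the trap names and the vm structure, is invented
    for illustration; CP/67's real interface looked nothing like this.)

        #include <stdio.h>

        /* Invented trap codes standing in for privileged operations. */
        enum trap { T_SET_KEY, T_START_IO, T_LOAD_PSW };

        struct vm { unsigned keys[16]; int io_pending; unsigned psw; };

        /* When the guest, really running in user mode, executes a
           privileged instruction, it traps here; the hypervisor applies
           the effect to the guest's virtual state and resumes it. */
        static void emulate(struct vm *g, enum trap t, unsigned operand)
        {
            switch (t) {
            case T_SET_KEY:  g->keys[operand & 15] = operand >> 4; break;
            case T_START_IO: g->io_pending = 1;                    break;
            case T_LOAD_PSW: g->psw = operand;                     break;
            }
        }

        int main(void)
        {
            struct vm guest = {0};
            emulate(&guest, T_SET_KEY, 0x35);  /* guest "sets" a storage key */
            printf("key[5] = %u\n", guest.keys[5]);
            return 0;
        }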
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 15 17:00:47 2025
    According to Scott Lurndal <slp53@pacbell.net>:
    An interesting development is that, e.g., on Ultrix on DECstations
    programs were statically linked for a specific address. Then dynamic
    linking became fashionable; on Linux at first dynamically-linked

    s/linux/svr3/. It was SVR3 Unix that first had static shared libraries
    linked at a specific address.

    BSD/OS, the commercial descendant of 4BSD, also had them around the same time. I am pretty sure they were separately developed since SVR3 used COFF and BSD
    as I recall still used a.out.

    There were some configuration files that set the addresses for each library to prevent overlap, and some kludgery that let programs override a few library routines like malloc() with local versions.
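    (As a purely hypothetical illustration of the idea, with invented
    syntax and addresses, not the actual BSD/OS file format, such a
    configuration file might have read:

        # one line per static shared library: name, fixed base, size limit
        libc_s   0x00400000  0x00100000
        libm_s   0x00500000  0x00040000
        # the library build checks that no two ranges overlap
    )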

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 15 17:54:06 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    No, I meant statically linked libraries.

    Static linking does not require any coordination. Every executable
    gets its own copy of the library parts it uses linked to fit into the
    executable's address space, i.e., with static linking libraries are
    not shared.

    With traditional static linking they aren't shared, but with statically linked shared libraries they are.

    On BSD/OS a whole library was linked into a single shared segment with a fixed address, and it created a stub library with the addresses of each routine in the
    segment. The standard C library was one shared library, and there were a few other libraries, maybe a math library. In theory you could make your own shared libraries but in practice there were a few shipped with the system, built to ensure that no shared libraries used overlapping addresses.

    When you linked a program, your program got the library routine addresses from the stubs, and your program had something at the beginning saying which shared libraries it used. At program startup time, it just mapped in the shared libraries. There was no runtime linking or relocation since the library addresses were all set at the time the library was built.
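    (In modern terms the startup step amounts to something like the
    following C sketch, using POSIX mmap() with MAP_FIXED; the path,
    address, and size are invented, and the real BSD/OS mechanism
    differed in detail.)

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>

        /* Map one prelinked shared library at its fixed build-time
           address. No relocation or symbol lookup happens here: the
           executable was linked against stubs that already contain
           these absolute addresses. */
        int main(void)
        {
            int fd = open("/shlib/libc_s", O_RDONLY);   /* invented path */
            if (fd < 0) { perror("open"); return 1; }
            void *want = (void *)0x00400000;            /* invented base */
            if (mmap(want, 0x100000, PROT_READ | PROT_EXEC,
                     MAP_PRIVATE | MAP_FIXED, fd, 0) == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            return 0;
        }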

    It wasn't anywhere near as flexible as dynamic libraries, but it
    worked well for what it did: every program on the system shared a
    single copy of the C library (and whatever other shared libraries
    there were), and program startup was fast since there was no runtime
    linking or relocation.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 15 20:20:05 2025
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    The low end 360s were too underpowered to do
    time sharing and any sort of dynamic relocation would have just made them more
    expensive with no practical benefit. A 360/30 was a lot slower than a PDP-8 but
    made up for it with much better I/O devices.

    While I admit that I was less familiar with the lower end systems, I think
    the extra expense would have been a single register in the CPU to hold
    the base, and a few extra instructions at task switch time to save and
    reload it. Not very much. And the benefits to the larger systems would
    have been significant when they implemented interactive usage.

    The 360/30 was byte serial and stored the 16 registers in core (and I mean core.) According to my copy of the 360/30 Functional Characteristics manual, a register to register load took 17us, memory to register took 24us, with an additional 4.5us if it was indexed. I'd think the time to add a system base register would be about the same as the indexing time, as would the comparison to the bound register, so that's an extra 9us for every instruction, which would
    be about a 30% slowdown. If they put those registers in logic to speed it up, it'd be a significant extra chunk of hardware.
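    (For scale: an extra 9us on the 28.5us indexed fetch is about 32%,
    and on the 24us unindexed one about 37%, so 30% is the right order
    for a typical mix.)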

    The guys who designed the 360 thought really hard about a design that could scale
    up and down and have multiple efficient implementations. The /30 was the most popular of the 360 line. IBM shipped thousands of them. They made a few mistakes
    (hex floating point and the high address byte) but not big ones.

    Remember that S/360 was mostly aimed at batch processing where each program starts and runs until it's done. The higher end systems did multiprogramming so
    they could run some other batch program in the short interval while waiting for
    a disk or tape or card operation.

    True, and good points.

    The 360/30's channel was implemented in the CPU microcode, borrowing cycles as needed.
    I gather that when it was running a disk operation, the CPU pretty much halted. Swapping on a system that slow would have made no sense.

    They included what they called teleprocessing but those systems were transaction
    monitors built in the SAGE model with a queue of short chunks of code running to
    completion. Relocation and swapping wouldn't help there either.

    Agreed. Although how much were the choices in implementing
    teleprocessing influenced by the hardware design choices? I don't know
    and haven't thought about it at all.

    The SAGE programming model has been quite successful for systems that
    need fast realtime response, even 70 years later. SABRE (originally on
    7090, later on 360s) used it. The CICS transaction monitor uses it.
    Take a look inside the Python Twisted library, or JavaScript's
    node.js, and
    there it is.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Al Kossow@21:1/5 to John Levine on Sun Jun 15 18:48:20 2025
    On 6/15/25 1:20 PM, John Levine wrote:

    The SAGE programming model has been quite successful for systems that
    need fast realtime response, even 70 years later.

    Do the SDC SAGE programming documents exist online anywhere?
    MANY years ago, one of the Smithsonian curators showed me a line
    of binders documenting the software, but it wasn't possible to
    scan or copy them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Wed Jun 18 10:35:20 2025
    On 6/15/2025 1:20 PM, John Levine wrote:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    The low end 360s were too underpowered to do
    time sharing and any sort of dynamic relocation would have just made them more
    expensive with no practical benefit. A 360/30 was a lot slower than a PDP-8 but
    made up for it with much better I/O devices.

    While I admit that I was less familiar with the lower end systems, I think
    the extra expense would have been a single register in the CPU to hold
    the base, and a few extra instructions at task switch time to save and
    reload it. Not very much. And the benefits to the larger systems would
    have been significant when they implemented interactive usage.

    The 360/30 was byte serial and stored the 16 registers in core (and I mean core.) According to my copy of the 360/30 Functional Characteristics manual, a
    register to register load took 17us, memory to register took 24us, with an additional 4.5us if it was indexed. I'd think the time to add a system base register would be about the same as the indexing time, as would the comparison
    to the bound register, so that's an extra 9us for every instruction, which would
    be about a 30% slowdown. If they put those registers in logic to speed it up, it'd be a significant extra chunk of hardware.

    First John, I want to thank you for forcing me to think about this.
    Good to keep the brain active!

    I think your analysis is flawed. While you would have to add the
    contents of the system base register to compute the memory address for
    the memory to register load, you would save having to add the contents
    of the base register specified by the instruction. So I think it would be a wash.

    Furthermore, since the S/360 used storage keys for protection, there is
    no need for a bounds register.

    Lastly, since programs were loaded on page (4K) boundaries and the max
    memory on the /30 (I had to look this up) was 64K, the system base
    register would only have had to be 4 bits (64K in 4K units is only 16
    positions!), so maybe small enough to invest in actual hardware to
    hold it. If so, it would have been a significant speedup, as you
    wouldn't have had to load the base register value from core.



    The guys who designed the 360 thought really hard about a design that could scale
    up and down and have multiple efficient implementations.

    I absolutely believe that. And it was a new concept at the time, so
    more kudos!

    The /30 was the most
    popular of the 360 line. IBM shipped thousands of them. They made a few mistakes
    (hex floating point and the high address byte) but not big ones.

    I agree about the two you mentioned, but I would also include the
    pointer-based parameter passing to the OS, for the reasons that Lynn has
    so eloquently explained.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to sfuld@alumni.cmu.edu.invalid on Wed Jun 18 19:51:08 2025
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    The 360/30 was byte serial and stored the 16 registers in core (and I
    mean core.) According to my copy of the 360/30 Functional
    Characteristics manual, a register to register load took 17us, memory
    to register took 24us, with an additional 4.5us if it was indexed. I'd
    think the time to add a system base register would be about the same
    as the indexing time, as would the comparison
    to the bound register, so that's an extra 9us for every instruction, which would
    be about a 30% slowdown. If they put those registers in logic to speed it up,
    it'd be a significant extra chunk of hardware.

    First John, I want to thank you for forcing me to think about this.
    Good to keep the brain active!

    I think your analysis is flawed. While you would have to add the
    contents of the system base register to compute the memory address for
    the memory to register load, you would save having to add the contents
    of the base register specified by the instruction. So I think it would be a wash.

    But the most important goal of the 360 was a single architecture: run the same code on every model. This mutant /30 would presumably have 16 bit direct addresses only, and so much for upward compatibility with models with more memory.

    In the IBM Systems Journal architecture article they said:

    It was decided to commit the system completely to a base-register technique;
    the direct part of the address, the displacement, was made so small (12 bits, or
    4096 characters) that direct addressing is a practical programming technique
    only on very small models. This commitment implies that all programs are
    location-independent, except for constants used to load the base registers.
    Thus, all programs can easily be relocated.

    I think they meant it was easy to relocate programs when they were loaded, which
    is true, no fiddly instruction patching needed. The idea that you would move a program after it was loaded was at the time an exotic high end feature. There was a relocation option for the 7094 but it was an RPQ, not in the regular catalog, and only used for CTSS:

    https://bitsavers.org/pdf/ibm/7094/L22-6641-3_RPQ_E07291_880287_7090-7094_Multiprogramming_Package.pdf

    The 360/20 was sort of like what you're proposing, a 16 bit system
    that was as compatible with real 360s as they could make it. It had 8
    registers numbered 8 to 15. In a program address, if the high bit of
    the register number was 1, it was a B+D address, but if it was zero,
    the low 15 bits were a direct address, which was plenty since the
    biggest /20 was 16K.
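    (The decode rule is simple enough to sketch in C; the field layout is
    inferred from the description above, an illustration rather than a
    cycle-accurate model.)

        /* 360/20-style address decode: a 16-bit address field whose
           register-number high bit selects the form. */
        unsigned ea_360_20(unsigned short field, const unsigned short regs[16])
        {
            unsigned reg = (field >> 12) & 0xF;   /* 4-bit register number */
            if (reg & 8)                          /* registers 8..15: B+D */
                return (unsigned)regs[reg] + (field & 0x0FFF);
            return field & 0x7FFF;                /* 15-bit direct address */
        }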
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Wed Jun 18 15:30:56 2025
    On 6/18/2025 12:51 PM, John Levine wrote:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    The 360/30 was byte serial and stored the 16 registers in core (and I
    mean core.) According to my copy of the 360/30 Functional
    Characteristics manual, a register to register load took 17us, memory
    to register took 24us, with an additional 4.5us if it was indexed. I'd
    think the time to add a system base register would be about the same
    as the indexing time, as would the comparison
    to the bound register, so that's an extra 9us for every instruction, which would
    be about a 30% slowdown. If they put those registers in logic to speed it up,
    it'd be a significant extra chunk of hardware.

    First John, I want to thank you for forcing me to think about this.
    Good to keep the brain active!

    I think your analysis is flawed. While you would have to add the
    contents of the system base register to compute the memory address for
    the memory to register load, you would save having to add the contents
    of the base register specified by the instruction. So I think it would be a wash.

    But the most important goal of the 360 was a single architecture: run the same
    code on every model. This mutant /30 would presumably have 16 bit direct addresses only, and so much for upward compatibility with models with more memory.

    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems
    like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most
    definitely NOT suggesting a different architecture for the smaller
    versus larger systems.



    In the IBM Systems Journal architecture article they said:

    It was decided to commit the system completely to a base-register technique;
    the direct part of the address, the displacement, was made so small (12 bits, or
    4096 characters) that direct addressing is a practical programming technique
    only on very small models. This commitment implies that all programs are
    location-independent, except for constants used to load the base registers.
    Thus, all programs can easily be relocated.

    I think they meant it was easy to relocate programs when they were loaded, which
    is true, no fiddly instruction patching needed.

    Agreed.

    The idea that you would move a
    program after it was loaded was at the time an exotic high end feature.

    I understand that. But so was a system needing more than 24 bits of
    address, yet you readily admit that not requiring the high order 8 bits
    of an address to be zero was a mistake. In both cases, they were
    mistakes of not anticipating future developments. Of course, I realize
    that, as I think Yogi Berra said, "Predictions are hard, especially
    about the future!".



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to sfuld@alumni.cmu.edu.invalid on Thu Jun 19 01:23:52 2025
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems
    like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most definitely NOT suggesting a different architecture for the smaller
    versus larger systems.

    I'm confused. Can you give some examples in this system of how it
    would still provide 24 bit addressing, using the same instruction set
    on large and small models, and not making the instructions bigger?

    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Wed Jun 18 19:41:10 2025
    On 6/18/2025 6:23 PM, John Levine wrote:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems
    like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most
    definitely NOT suggesting a different architecture for the smaller
    versus larger systems.

    I'm confused. Can you give some examples in this system of how it
    would still provide 24 bit addressing, using the same instruction set
    on large and small models, and not making the instructions bigger?

    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Wed Jun 18 23:10:07 2025
    On 6/18/2025 10:36 PM, quadibloc wrote:
    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register.  You still need those.  You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    It does/did for many/most architectures, even ones contemporaneous with the S/360.


    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    Then the S/360 failed as you had to load either an index register or a
    base register. Without that, you would require, in S/360's time, 24 bit offsets.


    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional instructions.

    Again, many/most contemporaneous architectures didn't support this.


    That's how they were able to behave back when memory was 64K bytes in
    size.

    Now that memory is bigger, a base register, set once at the start of the program, and then basically forgotten about, lets the program behave basically the same way. When an address is indexed, also specify an
    index register.

    But let's say your program has more than 4K of non-array data. Then you
    either have to reload the base register or use multiple base registers,
    which reduces the number of registers available for other things.


    Anything else would involve slowing down the program by adding extra instructions that weren't required back when memory was smaller. That's
    doing extra work to answer the same questions.

    Of course, the other problem is that base registers use up registers.

    Yes

    So
    in my CISC design, I had a separate bank of eight base registers
    distinct from the eight general registers. When there are 32 registers, though, using two or three of them as base registers is not bad enough
    to make that necessary.

    But S/360 (which is what we were discussing) had only 16 GPRs.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Thu Jun 19 12:12:59 2025
    John Levine <johnl@taugh.com> schrieb:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most definitely NOT suggesting a different architecture for the smaller
    versus larger systems.

    I'm confused. Can you give some examples in this system of how it
    would still provide 24 bit addressing, using the same instruction set
    on large and small models, and not making the instructions bigger?

    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    Some thoughts...

    Assume memory instructions along the lines of a RISC-style
    load/store instruction, using a 16-bit signed displacement.
    Storing a pointer to the data and then accessing it via, for
    example, 1234(r3) would add the hidden base register, the
    index register and the displacement - same amount of effort
    as a base register, an index register and a 12-bit displacement.
    (It would also make efficient use of the 8-bit and 16-bit
    adders of the low-end machines :-)

    For instructions which would not need an index register, that
    is an additional effort of one addition, which, as you pointed
    out, could be a significant slowdown, especially for the 360/30.

    But...

    Assume that the machine has a "real" program counter and a
    "user-visisble" program counter. Normally, the machine operates
    on the real one; the user-visible one is only computed if the user
    program asks about this.

    Then consider PC-relative branches, and a PC-relative addressing
    mode via a special register number, so 1234(PC) would then only
    need a single addition and be faster. Tell people about this,
    and they will bend over backwards to use it (especially since
    +/- 8 kb would be quite large by the standards of the day).

    I think such a machine would have, on average, higher performance
    than what they actually built.
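    (A sketch of the two calculations being compared, in C; the names are
    mine, with a 16-bit signed displacement per the proposal above.)

        /* Proposed form: hidden base + index + 16-bit signed disp,
           a three-input add like base+index+disp12 on the real 360. */
        unsigned ea_indexed(unsigned hidden_base, unsigned idx, short disp16)
        {
            return hidden_base + idx + (int)disp16;
        }

        /* PC-relative form: a single addition, because the "real"
           (already relocated) program counter is used directly. */
        unsigned ea_pcrel(unsigned real_pc, short disp16)
        {
            return real_pc + (int)disp16;
        }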

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to quadibloc on Thu Jun 19 09:35:43 2025
    quadibloc wrote:
    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional instructions.

    [reg], [reg+disp], [reg+index+disp] are all different address calculations.

    The only memory address mode that's functionally mandatory is [reg].
    After that the question is which calculations occur frequently enough to warrant being integrated into their own instruction (address modes).

    Others are then relegated to separate address calculations, and it
    depends on the complexity of a specific address expression how it
    maps onto a particular ISA as to how many instructions it takes.

    Personally I think those 4 bits for the second address register
    would be better allocated to having a 16-bit displacement.
    Note also that the 360 index register was not scaled, and so an
    array index value was not directly usable for other than byte arrays.
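    (In C terms, the cost of an unscaled index register is the explicit
    shift needed to turn an element index into a byte index; a sketch,
    not 360 code.)

        #include <string.h>

        /* Indexing a 4-byte-element array with an unscaled index:
           the i << 2 is the extra instruction the hardware doesn't do. */
        int load_element(const unsigned char *base, unsigned i)
        {
            int v;
            memcpy(&v, base + (i << 2), sizeof v);
            return v;
        }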

    That's how they were able to behave back when memory was 64K bytes in
    size.

    The program's logical address calculation is independent of how much
    physical memory is attached to a system.

    Now that memory is bigger, a base register, set once at the start of the program, and then basically forgotten about, lets the program behave basically the same way. When an address is indexed, also specify an
    index register.

    If you want a base+index<<scale address calculation then include
    instructions that do just that.

    Using an integer general register for program relocation was a flawed
    approach. It uses a critical 4 instruction bits for a second register
    specifier that doesn't work for program relocation and loses 4 bits
    from the displacement which frequently could use them.

    The correct design was to have separate base and bounds registers for
    program relocation, managed by the OS outside program control.
    When the OS switches tasks it loads the integer and float registers,
    sets base and bounds physical offsets for it, and Bob's your uncle.
    Also all tasks are dynamically relocatable.

    The cost is just the two base and bounds relocation registers.
    The same ALU is still used for AGEN to calculate [reg+disp+base] and
    send the physical address to the Memory Address Register (MAR).
    While the bus cycle sequencer is accessing memory the ALU can be used
    to do the bounds check and maybe abort the access.
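    (A sketch of that path in C, with invented field widths and names;
    the point is that the bounds compare can overlap the memory cycle.)

        /* OS-managed base and bounds: the program's effective address
           is checked against the task's limit and offset by its base. */
        struct task { unsigned base, bound; };

        int translate(const struct task *t, unsigned ea, unsigned *phys)
        {
            if (ea >= t->bound)   /* may proceed in parallel with access */
                return -1;        /* abort: addressing exception */
            *phys = t->base + ea; /* same ALU as the [reg+disp] AGEN */
            return 0;
        }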

    And IBM could have charged extra for the base and bounds registers
    (which would have been present in all models, just enabled by a jumper).

    Anything else would involve slowing down the program by adding extra instructions that weren't required back when memory was smaller. That's
    doing extra work to answer the same questions.

    This is the ISA design trade off - which address calculations occur
    frequently enough to warrant their own instructions (address modes).

    Of course, the other problem is that base registers use up registers. So
    in my CISC design, I had a separate bank of eight base registers
    distinct from the eight general registers. When there are 32 registers, though, using two or three of them as base registers is not bad enough
    to make that necessary.

    John Savard

    I would rather have a [base+index<<scale+disp] address mode using
    integer registers and let the compiler decide how best to use them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Thu Jun 19 13:37:48 2025
    On Thu, 19 Jun 2025 5:36:29 +0000, quadibloc wrote:

    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    And this is what RISC is BAD at doing.

    Consider RISC-V accessing a static array more than 4 Petabytes away
    from the address of the instruction being performed. First one has to
    create an address into the Literal pool, then load the pointer to the
    static variable, then finally LD the static variable::

    AUIPC R7,hi(&static_array)
    LDD R7,lo(&static_array)(R7)
    SLL R6,R6,#2
    ADD R8,R7,R6
    LDW R7,0(R8)

    Whereas a reasonable ISA allows::

    LDW R7,[IP,R6<<2,Static_array-.]

    1 instruction rather than 5, with a latency of LD-pipeline rather
    than 2-load-pipeline+index shifting.

    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional instructions.

    That's how they were able to behave back when memory was 64K bytes in
    size.

    You still want a minimum instruction count, even when memory is 2^64
    bytes in size.

    Now that memory is bigger, a base register, set once at the start of the program, and then basically forgotten about, lets the program behave basically the same way. When an address is indexed, also specify an
    index register.

    For statically linked object modules--maybe.
    For dynamically linked objects--at best the jury is still out.

    Anything else would involve slowing down the program by adding extra instructions that weren't required back when memory was smaller. That's
    doing extra work to answer the same questions.

    Like my example above. How does CII do with the above ??

    Of course, the other problem is that base registers use up registers. So
    in my CISC design, I had a separate bank of eight base registers
    distinct from the eight general registers. When there are 32 registers, though, using two or three of them as base registers is not bad enough
    to make that necessary.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to John Levine on Thu Jun 19 10:32:40 2025
    John Levine wrote:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    But that is not what I suggested. Let's go back a bit. I suggested
    that the choice of using visible registers for base registers was a
    mistake made by the S/360 architects. You responded by pointing out that
    while that would have been OK for the larger systems, on smaller systems
    like the /30, it would have substantially hurt performance. I think
    that I showed above that this didn't have to be the case. So I repeat
    my suggestion that a hidden base register would have been a better
    choice, both for the bigger models, and even for the /30. I am most
    definitely NOT suggesting a different architecture for the smaller
    versus larger systems.

    I'm confused. Can you give some examples in this system of how it
    would still provide 24 bit addressing, using the same instruction set
    on large and small models, and not making the instructions bigger?

    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    The address modes supported by an ISA are just optimization points
    not limitations. If an address calculation is more complex than is
    supported by the address modes of the LD/ST instructions themselves
    then it must be calculated separately using integer instructions
    into a temp register then used as an address.

    On a 360 if I'm accessing a struct larger than 4kB then I would
    load a 32-bit immediate offset into a temp register, say R1.
    (I don't know 360 but I don't see any 32-bit load immediate
    instructions so it looks like I'd have to do something like Load
    Address LA to load a 12-bit constant, left shift it 12 bits, then LA
    to add the low 12 bits. Basically construct a large constant from
    smaller ones the way RISCs do.)

    Then with the offset in R1 and struct address in R2 do a LD R3,[R2+R1],
    using the implicit base+index addition in the address mode.
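    (Spelled out, the sequence would be something like this: standard
    S/360 mnemonics in the comment, the equivalent arithmetic in C; HI
    and LO are the two 12-bit halves of a 24-bit offset.)

        /*  LA   R1,HI          R1 = high 12 bits (LA clears the rest)
            SLL  R1,12          shift them into position
            LA   R1,LO(,R1)     R1 = (HI << 12) + LO
            L    R3,0(R1,R2)    load via base R2 + index R1            */
        unsigned make_offset(unsigned hi12, unsigned lo12)
        {
            return ((hi12 & 0xFFFu) << 12) | (lo12 & 0xFFFu);
        }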

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Thu Jun 19 14:54:04 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    On a 360 if I'm accessing a struct larger than 4kB then I would
    load a 32-bit immediate offset into a temp register, say R1.
    (I don't know 360 but I don't see any 32-bit load immediate
    instructions so it looks like I'd have to do something like Load
    Address LA to load a 12-bit constant, left shift it 12 bits, then LA
    to add the low 12 bits. Basically construct a large constant from
    smaller ones the way RISCs do.)

    They usually loaded constants from memory close to the routine itself.
    https://bitsavers.org/pdf/ibm/360/training/GC20-1646-5_A_Programmers_Introduction_to_IBM_System360_Assembly_Language_196907.pdf
    is a nice introduction.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Thu Jun 19 15:11:29 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Note also that the 360 index register was not scaled, and so an
    array index value was not directly usable for other than byte arrays.

    There is always strength reduction. It seems the original FORTRAN
    compiler did a lot of that for the 704, but I'm not sure that the
    /360 compilers did - from what I read, they regressed in code
    generation quality.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to sfuld@alumni.cmu.edu.invalid on Thu Jun 19 17:52:49 2025
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K. There's one base register, and the address field in an instruction is only 16 bits. How am I supposed to address a megabyte with only a 16 bit offset?

    If the idea is that each program is limited to 64K even though the overall system address space is bigger, BTDT on a PDP-11 and would prefer not to go back.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Thu Jun 19 12:36:14 2025
    On 6/19/2025 10:52 AM, John Levine wrote:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K.

    Index registers. You still need those, for exactly that reason. But
    you don't need a second mechanism, i.e. base registers specified in the instruction. One mechanism is sufficient. If you have 32 bit
    registers, as the S/360 did, you can address up to 4GB.


    There's one base register,
    and the address field in an instruction is only 16 bits. How am I supposed to address a megabyte with only a 16 bit offset?

    See above

    If the idea is that each program is limited to 64K even though the overall system address space is bigger, BTDT on a PDP-11 and would prefer not to go back.


    No, that is not the idea. I agree that it would be terrible if it was!



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Thu Jun 19 20:25:00 2025
    John Levine <johnl@taugh.com> schrieb:
    It appears that Stephen Fuld <sfuld@alumni.cmu.edu.invalid> said:
    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K.

    By using registers and constants pointing to elsewhere?

    IBM /360 had 48-bit instructions, so an instruction loading a
    32-bit constant into a register would have been entirely feasible.
    Load a target address where you need to access things, and do
    your memory operations there.

    There's one base register,
    and the address field in an instruction is only 16 bits. How am I supposed to address a megabyte with only a 16 bit offset?

    I'm not sure what proposal you are replying to, or that there wasn't
    some miscommunication somewhere along the line.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Jun 19 21:45:24 2025
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K.

    Index registers. You still need those, for exactly that reason. But
    you don't need a second mechanism, i.e. base registers specified in the instruction. One mechanism is sufficient. If you have 32 bit
    registers, as the S/360 did, you can address up to 4GB.

    I don't get the impression that we are thinking about the same S/360.

    The 360 had four instruction formats. RR was register to register, no
    problem there. RX was memory to register, with a four bit register operand, four bit base register, four bit index register, and 12 bit displacement.
    As I understand it, you'd change that to 16 bit displacement relative to
    an implicit base register and still have the optional index register.

    But there are two other instruction formats SS and SI that have four bit base register, 12 bit displacement, and no index register. What happens to them? 16 bit displacement so you can only address 64K? Reuse the base register bits as an
    index register so you can only address 4K directly?

    In case it's not obvious, all programs but the most trivial used multiple base registers. First you'd have one to point to the code and static data. For I/O you'd
    set another register to point to an I/O buffer, and use that register as the base register in SS and SI instructions to move stuff in and out of the buffer, then pass the buffer to the operating system, update the register to point to the next buffer and do it again. If you were doing a read-compute-write loop, you'd have one base register for the read buffer and one for the write buffer.

    Same for any non-trivial data structure, you set a register to point to a structure and use it as the base register to refer to fields. A single global base register couldn't do any of that.

    I read somewhere that they did simulations and found that a typical program used
    four base registers at a time.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to John Levine on Thu Jun 19 16:05:16 2025
    On 6/19/2025 2:45 PM, John Levine wrote:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K.

    Index registers. You still need those, for exactly that reason. But
    you don't need a second mechanism, i.e. base registers specified in the
    instruction. One mechanism is sufficient. If you have 32 bit
    registers, as the S/360 did, you can address up to 4GB.

    I don't get the impression that we are thinking about the same S/360.


    Could be, and I defer to your greater knowledge of S/360, so I may have
    left something out. But in the following, I don't think so. See below.

    Look at it this way. (I was led to this by John Savard's comments in
    another post in this thread.)

    because index registers are for
    displacing from an address; base registers build an address.

    So he seems to regard them as different. But both base registers and
    index registers are actually the same GPRs. There is no physical
    difference. The only difference is in one's head as to how they are
    used, and in some cases, not even there (and probably, though not
    necessarily, the actual value in them). If you regard them all as
    index registers (again, no physical change, just a change in the way
    you look at them), it may make things clearer.


    The 360 had four instruction formats.

    Yes.

    RR was register to register, no
    problem there. RX was memory to register, with a four bit register operand, four bit base register, four bit index register, and 12 bit displacement.
    As I understand it, you'd change that to 16 bit displacement relative to
    an implicit base register and still have the optional index register.

    Correct.


    But there are two other instruction formats SS and SI that have four bit base register, 12 bit displacement, and no index register. What happens to them?

    Let's start with the SS instructions. The documentation says the
    instruction has two base registers. But if I said, with no actual
    change to the hardware, it has two index registers, it would perform
    exactly as it does now. Yes, it only has 12 bit displacements, but that
    is no different from what it has now. So other than the name in the documentation, things are exactly as they were. So while you haven't
    gained a larger displacement, you haven't lost any addressing capability
    that you now have. The actual value in the index register might be
    different from what it would have been if you called it a base register,
    but, if so, that only means a different value for the displacement. But
    if you load the same value in the "index" register as you previously did
    the "base" register, the code is indistinguishable.

    The SI instructions are similar. Just call the "base register" an index register. You still have to arrange that it points within 4096 of the storage operand, but you had to do that with the base register anyway. So
    nothing lost, nothing gained.


    16
    bit displacement so you can only address 64K? Reuse the base register bits as an
    index register so you can only address 4K directly?

    You have exactly the same capability as you have now, except increased displacement for RX instructions. And, by virtue of the hidden base
    register, you gain the ability to relocate programs after their initial
    load.


    In case it's not obvious, all programs but the most trivial used multiple base
    registers.

    Sure.

    First you'd have one to point to the code and static data.

    Perhaps more than one if you have more than 4K of these. With my
    proposal, you wouldn't need the one that points to the beginning of the program, as that is the contents of the system base register once the
    program is loaded.

    For I/O you'd
    set another register to point to an I/O buffer, and use that register as the base register in SS and SI instructions to move stuff in and out of the buffer,
    then pass the buffer to the operating system, update the register to point to the next buffer and do it again. If you were doing a read-compute-write loop,
    you'd have one base register for the read buffer and one for the write buffer.

    Same for any non-trivial data structrure, you set a register to point to a structure and use it as the base register to refer to fields. A single global base register couldn't do any of that.

    If you use those same registers exactly as you say, but called them
    index registers, nothing would change.

    I read somewhere that they did simulations and found that a typical program used
    four base registers at a time.

    I believe that. Just think of them as four index registers.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 19 23:36:42 2025
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    --------------------------
    Using an integer general register for program relocation was a flawed approach. It uses a critical 4 instruction bits for a second register specifier that doesn't work for program relocation and loses 4 bits
    from the displacement which frequently could use them.

    I think they (IBM) originally thought that their base registers would
    be fixed register numbers so that relocation software could update
    them as segments moved around (pre release), and that they realized
    later that this was a folly.

    The correct design was to have separate base and bounds registers for
    program relocation, managed by the OS outside program control.

    Had they (IBM) decided that (say) R12-R15 were relocation registers
    {R13==code, R14==data, R15==BSS, R12==stack} so that these segments
    could be relocated dynamically.

    When the OS switches tasks it loads the integer and float registers,
    sets base and bounds physical offsets for it, and Bob's your uncle.
    Also all tasks are dynamically relocatable.

    Yes, when the OS moves one of those segments, the OS changes the value
    in the register (and the protection bits in each page).

    Data would be accessed with [R14+Rindex+DISP12], ...

    And this would lead to lots of addressing problems, not the least of
    which was FORTRAN pass-by-address subroutine arguments--which needed
    either indirection in the callee or creation of a non-relocatable base
    register (using something like LA R2,[R14,,,,]).

    But the architectural choice had already been made and could not be
    unmade. So, they (IBM) decided to "live" with it (for then). And
    once they discovered "Translation" they decided to live with it
    for a long time--until the DAT box showed up (/67).

    The cost is just the two base and bounds relocation registers.

    This was another solution, and probably would have delayed the initial
    machine sales by several months; and Somebody up the chain decided
    to go with what they had.

    The same ALU is still used for AGEN to calculate [reg+disp+base] and
    send the physical address to the Memory Address Register (MAR).

    Remember: They (IBM) knew that [base+index+disp12] took only
    1 gate delay longer to calculate than [base+index] or [base+disp12]

    While the bus cycle sequencer is accessing memory the ALU can be used
    to do the bounds check and maybe abort the access.

    x286 style.

    And IBM could have charged extra for the base and bounds registers
    (which would have been present in all models, just enabled by a jumper).

    Delay was the enemy.

    Anything else would involve slowing down the program by adding extra
    instructions that weren't required back when memory was smaller. That's
    doing extra work to answer the same questions.

    This is the ISA design trade off - which address calculations occur frequently enough to warrant their own instructions (address modes).

    I think this was the cross product of::
    a) designers missing the base register relocation problem
    b) management needing cash flow

    Of course, the other problem is that base registers use up registers. So
    in my CISC design, I had a separate bank of eight base registers
    distinct from the eight general registers. When there are 32 registers,
    though, using two or three of them as base registers is not bad enough
    to make that necessary.

    John Savard

    I would rather have a [base+index<<scale+disp] address mode using
    integer registers and let the compiler decide how best to use them.

    I continue to think that RISC-V has too few addressing modes

    MEM Rd,offset12(reg)

    is inefficient, and leads to executing more instructions with an
    overall increase in latency.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 19 23:18:44 2025
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:

    quadibloc wrote:
    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional
    instructions.

    [reg], [reg+disp], [reg+index+disp] are all different address
    calculations.

    Yes, but the first 2 are a STRICT subset of the last one. So, you
    build the AGEN unit to perform the last one, and have DECODE feed
    zeros (0s) when you don't need the first two.
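    (In other words, as a hedged sketch rather than any particular
    machine:)

        /* One AGEN datapath computes [base+index+disp]; DECODE feeds a
           zero for whatever component a simpler mode doesn't use. */
        unsigned agen(unsigned base, unsigned index, unsigned disp)
        {
            return base + index + disp;
        }
        /* [reg]            -> agen(reg, 0, 0)
           [reg+disp]       -> agen(reg, 0, disp)
           [reg+index+disp] -> agen(reg, index, disp) */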

    The only memory address mode that's functionally mandatory is [reg].

    Leading to poor addressability and larger instruction count.

    After that the question is which calculations occur frequently enough to warrant being integrated into their own instruction (address modes).

    Having spent 7 years doing x86, the answer was clear to me::

    [base+Rindex<<2+Displacement]

    Others are then relegated to separate address calculations, and it
    depends on the complexity of a specific address expression how it
    maps onto a particular ISA as to how many instructions it takes.

    So, now you are claiming that adding instructions and latency to
    memory access is not harming performance !?!?!?!

    Clearly you don't "get it"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Jun 19 23:45:32 2025
    On Thu, 19 Jun 2025 23:05:16 +0000, Stephen Fuld wrote:

    On 6/19/2025 2:45 PM, John Levine wrote:
    --------------
    Let's start with the SS instructions. The documentation says the
    instruction has two base registers. But if I said, with no actual
    change to the hardware, it has two index registers, it would perform
    exactly as it does now.

    Had the SS instructions had both a base register and an index
    register, many of the relocation problems would have gone away.

    Yes, it only has 12 bit displacements, but that
    is no different from what it has now.

    The difference is that they had no (remotely relocatable) way of expressing

    Array_of_struct[i].struct.foo = Array_of_struct[j].struct.bar

    If you add index to base, then you have a window where remote
    relocation fails, and you have no way to add the index to the
    constant. On the other hand, if SS were of the form:

    OP [base+index],[base+index]

    relocation works, but now one has to LA a bunch of constants, leading
    to longer access sequences (the same problem facing ISAs with poor
    address modes.).

    So other than the name in the documentation, things are exactly as they were.

    Arithmetically, yes; practically, no.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Thu Jun 19 20:36:36 2025
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    On a 360 if I'm accessing a struct larger than 4kB then I would
    load a 32-bit immediate offset into a temp register, say R1.
    (I don't know 360 but I don't see any 32-bit load immediate
    instructions so it looks like I'd have to do something like Load
    Address LA to load a 12-bit constant, left shift it 12 bits, then LA
    to add the low 12 bits. Basically construct a large constant from
    smaller ones the way RISCs do.)

    They usually loaded constants from memory close to the routine itself.
    https://bitsavers.org/pdf/ibm/360/training/GC20-1646-5_A_Programmers_Introduction_to_IBM_System360_Assembly_Language_196907.pdf
    is a nice introduction.

    This way would also need a BAL to copy the PC into a base register,
    then L at PC-offset to load a 32-bit offset into an index register,
    then an RX instruction using the base+index address.

    I was looking for ways that don't require an extra memory access
    and can also be used for 32-bit integer calculations.

    Ideally an instruction to Load Immediate of 32-bits into a register,
    an 8-bit opcode, 4-bit function code, 4-bit dest reg, and two
    16-bit immediates (a variation on the 48-bit instruction format).

    Alternatively a variation on 32-bit formats using two instructions,
    a Load Immediate High which shifts the 16-bit immediate to the
    dest register upper end, plus an Add Immediate of 16-low bits.
    An 8-bit opcode, a 4-bit function code field, a 4-bit source/dest register,
    and a 16 bit value. Also useful for many other operations with 16-bit
    immediate values, sub, mul, div, and, or, xor.
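
    In C terms, the two-instruction composition sketched above amounts to
    this (the function name is made up):

    #include <stdint.h>

    /* Compose a 32-bit constant from two 16-bit immediates, the way
       the proposed Load Immediate High + Add Immediate pair would:
       plant the upper halfword, then add in the lower one. */
    static uint32_t make_const(uint16_t hi, uint16_t lo) {
        uint32_t r = (uint32_t)hi << 16;  /* Load Immediate High */
        r += lo;                          /* Add Immediate, low 16 bits */
        return r;
    }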

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to ThatWouldBeTelling@thevillage.com on Fri Jun 20 01:10:46 2025
    It appears that EricP <ThatWouldBeTelling@thevillage.com> said:
    I was looking for ways that don't require an extra memory access
    and can also be used for 32-bit integer calculations.

    Ideally an instruction to Load Immediate of 32-bits into a register,
    an 8-bit opcode, 4-bit function code, 4-bit dest reg, and two
    16-bit immediates (a variation on the 48-bit instruction format).

    On zSeries that's load immediate, LGFI, which puts a 32 bit immediate
    value in a register. There's also LAY, load address with a 20 bit
    signed displacement rather than 12 bit unsigned, and load address
    relative long LARL with a 32 bit displacement added to the current
    address.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Fri Jun 20 01:32:01 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    --------------------------
    Using an integer general register for program relocation was a flawed
    approach. It uses a critical 4 instruction bits for a second register
    specifier that doesn't work for program relocation and loses 4 bits
    from the displacement which frequently could use them.

    I think they (IBM) originally thought that their base registers would
    be fixed register numbers so that relocation software could update
    them as segments moved around (pre release), and that they realized
    later that this was a folly.

    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded. It's really easy to relocate 360 code
    at load time, just add the offset to all of the address constants in
    memory. After that, as we've seen, not so much.

    once they discovered "Translation" they decided to live with it
    for a long time--until the DAT box showed up (/67).

    According to Pugh et al, who were there, they knew about the Atlas One
    Level Store (OLS), both the attractive idea of unifying RAM and disk
    storage, and that its performance was terrible. (We now call that
    thrashing.)

    They knew about CTSS, which ran on IBM hardware with base-and-bounds relocation. IBM Research had built a few experimental time-sharing
    systems. But the technical risk of adding dynamic address modification
    of any sort to what was already a very large and risky project was too
    much, so they didn't.

    MIT and Bell Labs were already thinking in 1964 about the project that
    became Multics; IBM offered some proposed hardware, which they
    rejected. That percolated up within IBM; senior management was unhappy
    about it, and in less than a year IBM came up with the /67 with virtual
    memory, but Multics already had other plans. IBM produced TSS which was
    a disaster (I used it) but the hardware was fine and MTS and CP/67 got
    good time-sharing performance.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 20 01:15:31 2025
    On Fri, 20 Jun 2025 0:36:36 +0000, EricP wrote:

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    On a 360 if I'm accessing a struct larger than 2kB then I would
    load a 32-bit immediate offset into a temp register, say R1.
    (I don't know 360 but I don't see any 32-bit load immediate instructions
    so it looks like I'd have to do something like Load Address LA to load
    a 12-bit constant, left shift it 12 bits, then LA to add a low 12-bits.
    Basically construct a large constant from smaller ones the way RISCs do.)

    They usually loaded constants from memory close to the routine itself.
    https://bitsavers.org/pdf/ibm/360/training/GC20-1646-5_A_Programmers_Introduction_to_IBM_System360_Assembly_Language_196907.pdf
    is a nice introduction.

    This way would also need a BAL to copy the PC into a base register,
    then L at PC-offset to load a 32-bit offset into an index register,
    then an RX instruction using the base+index address.

    I was looking for ways that don't require an extra memory access
    and can also be used for 32-bit integer calculations.

    /360 assembly performed a lot of "literal pool" accesses to get the
    constants needed to run the program at hand. Back in 1973 when I was
    looking, I found a lot of these kinds of accesses, where the linker
    would make holes in the subroutine for ease of access to those
    constants. It looked strange, but they got it to work.

    Ideally an instruction to Load Immediate of 32-bits into a register,
    an 8-bit opcode, 4-bit function code, 4-bit dest reg, and two
    16-bit immediates (a variation on the 48-bit instruction format).

    This is one of my railing points:: an ISA should use no instructions
    to use a constant as an operand in an instruction, nor use any
    registers to hold a use-once constant.

    Alternatively a variation on 32-bit formats using two instructions,
    a Load Immediate High which shifts the 16-bit immediate to the
    dest register upper end, plus an Add Immediate of 16-low bits.
    An 8-bit opcode, a 4-bit function code field, a 4-bit source/dest
    register, and a 16 bit value. Also useful for many other operations
    with 16-bit immediate values, sub, mul, div, and, or, xor.

    Universal Constants means you need to do none of this.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Fri Jun 20 07:42:02 2025
    On 6/19/2025 12:00 PM, quadibloc wrote:
    On Thu, 19 Jun 2025 17:52:49 +0000, John Levine wrote:

    It appears that Stephen Fuld  <sfuld@alumni.cmu.edu.invalid> said:
    Each address in an instruction was only 16 bits, which they happened
    to split 4 bits for the base register and 12 for the displacement. If
    you get rid of the register, you still only have 16 bits. On a larger
    model with say, a megabyte of memory, how does a program address that?

    If you mean how does a single program address more than 16MB, ...

    No, I mean how does a program address more than 64K. There's one base
    register, and the address field in an instruction is only 16 bits.
    How am I supposed to address a megabyte with only a 16 bit offset?

    Given that he continued to write:

    If you mean how does a single program address more than 16MB, the answer is by using an index register.  You still need those.  You just don't
    need two registers (base and index) when one will do.

    he gave an answer to addressing more than 64K.

    16 MB is addressed by 24 bits, and is thus the entire address
    space of System/360. I presume that was just a typo.

    I disagree with his solution, because index registers are for
    displacing from an address; base registers build an address.

    As I posted in a response to John Levine, that distinction is in your
    head, not in the hardware. Each is a use of the same 16 GPRs. The
    contents of each are added to a displacement in the instruction to get
    an address.

    Locations in memory should be able to be addressed in a static
    manner.

    I don't know what that means. What is a "static manner"?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to quadibloc on Fri Jun 20 15:55:54 2025
    On Fri, 20 Jun 2025 14:51:16 +0000, quadibloc wrote:

    On Thu, 19 Jun 2025 21:45:24 +0000, John Levine wrote:

    But there are two other instruction formats SS and SI that have four bit
    base register, 12 bit displacement, and no index register. What happens
    to them? 16 bit displacement so you can only address 64K? Reuse the base
    register bits as an index register so you can only address 4K directly?

    Since he is "really" talking about the fact that using base registers,
    in addition to index registers, is a mistake on my new Concertina II
    design, the fact that the string and packed decimal memory-to-memory instructions, with no room for indexing, couldn't do without the base register... is merely a historical sidelight.

    The System/360 design could just have added 64-bit instructions, I
    suppose.

    In principle, indeed, one doesn't "need" base registers. One can use the index registers as base registers, and then use another register with
    the base plus the array displacement whenever one accesses an array. I
    think base registers are a better idea; array accesses are common enough
    that saving an instruction for them makes sense.

    I did feel the 68000 design made a mistake with its address registers.

    The A and D registers provided the ability to write 2 registers per
    microcycle, improving 68000 and 68010 performance.

    Using separate registers, on a CISC design with register banks of only 8 registers, for the base registers makes sense. They're mostly static,
    and they take up precious register space. But indexes are computed, and
    so integer GPRs, not address registers, ought to have been used for
    that, in my opinion.

    This may have been mitigated, though; I think the 68000 had forms of the arithmetic instructions that worked with the address registers instead.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Fri Jun 20 17:19:37 2025
    John Levine <johnl@taugh.com> schrieb:
    According to MitchAlsup1 <mitchalsup@aol.com>:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    --------------------------
    Using an integer general register for program relocation was a flawed
    approach. It uses a critical 4 instruction bits for a second register
    specifier that doesn't work for program relocation and loses 4 bits
    from the displacement which frequently could use them.

    I think they (IBM) originally thought that their base registers would
    be fixed register numbers so that relocation software could update
    them as segments moved around (pre release), and that they realized
    later that this was a folly.

    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to tkoenig@netcologne.de on Fri Jun 20 18:06:18 2025
    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation and virtual
    memory, but considered it too risky to add to an already ambitious project.

    Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.

    They added VM to S/370 but in the intervening years both the hardware and the understanding of how VM works had gotten a lot better. It is my impression that early VM systems were wildly overoptimistic about how little physical memory they needed. Fortunately, Moore's law made memory sizes grow enough to solve that problem by brute force, somewhat aided by better understanding of working sets.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Fri Jun 20 18:31:48 2025
    John Levine <johnl@taugh.com> schrieb:
    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation and virtual
    memory, but considered it too risky to add to an already ambitious project.

    Brooks himself wrote he considered not adding virtual memory to the /360
    a mistake, so...

    Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.

    I think we can be in agreement that it was, indeed, a mistake,
    but obviously not fatal.

    IBM had good peripherals, they had a good upgrade path to very
    powerful machines, and they were bit-compatible for user programs
    (plus, they put in the microcode emulation of the 1401 so their
    customers could transition smoothly - that was a genius move,
    the /360 probably would have been far less of a success
    if that had not been possible). All of these were good reasons
    to buy these machines.

    Customers could and did work around the memory fragmentation,
    but it didn't make their lives easier.

    But IBM severely underestimated the software complexity of the
    system they were creating, hence the delays and "The Mythical
    Man-Month" (and such abominations as JCL. Which way around is
    that COND parameter again? But because I made some money
    working on mainframes as a student, I cannot complain - nobody
    ever challenged the hours I billed because mainframes are
    complex, as everybody knows, and JCL was a large part of that :-)

    They added VM to S/370 but in the intervening years both the hardware and the understanding of how VM works had gotten a lot better. It is my impression that
    early VM systems were wildly overoptimistic about how little physical memory they needed. Fortunately, Moore's law made memory sizes grow enough to solve that problem by brute force, somewhat aided by better understanding of working
    sets.

    You mean a major selling point for virtual memory was that people
    didn't think they had to buy that much expensive core storage?
    Sounds plausible.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Jun 20 15:13:35 2025
    MitchAlsup1 wrote:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:

    quadibloc wrote:
    On Thu, 19 Jun 2025 2:41:10 +0000, Stephen Fuld wrote:

    If you mean how does a single program address more than 16MB, the answer
    is by using an index register. You still need those. You just don't
    need two registers (base and index) when one will do.

    One does not do adequately.

    What one expects memory reference instructions to be able to do is:

    Normally, to be able to access any part of memory in a simple manner
    which does not require any additional instructions.

    When indexed, to include the address of an array, and, in an index
    register, a displacement from the start of an array. With no additional
    instructions.

    [reg], [reg+disp], [reg+index+disp] are all different address
    calculations.

    Yes, but the first 2 are a STRICT subset of the last one. So, you
    build the AGEN unit to perform the last one, and have DECODE feed
    zeros (0s) for the fields the simpler modes don't use.

    Yes, back then there would be just one ALU/AGEN for the core, used for
    pretty much all arithmetic, though sometimes it would have a separate incrementer/decrementer so it could overlap with the ALU/AGEN.


    The only memory address mode that's functionally mandatory is [reg].

    Leading to poor addressability and larger instruction count.

    It's an optimization allowing 16-bit instructions that would only be used
    for a limited set of operations, just loads and stores of a few data types.

    LD and ST for halfword, word, single and double use 8 of the 256 opcodes
    and save 2 of 4 instruction bytes on, say, 10% of loads and stores.

    VAX usage stats show that 9% to 15% of address specifiers are what
    it called register-deferred [reg], register address with no offset.
    That was for compiled languages (Fortran, Pascal, Cobol, etc),
    none of which were assembler or C, where *p pointer access is more common.

    x64 usage stats show ~7% register-indirect [reg] and 32% displacement [reg+disp], 0.85% scaled indexed, and 21% absolute.

    After that the question is which calculations occur frequently enough to
    warrant being integrated into the instruction set as address modes.

    Having spent 7 years doing x86, the answer was clear to me::

    [base+Rindex<<2+Displacement]

    I assume you mean a 2-bit scale, not the constant 2.
    Yes, that's the maximal answer (though I would have a 3-bit scale).
    VAX usage stats show ~8% operand specifiers are indexed, ~10% displacement
    (of various sizes) so there are size savings to be had for supporting
    smaller variations.

    WRT 360, the maximal address mode would be [rBase+rIndex+imm24]
    in a 48-bit instruction, and then have smaller variations as optimizations.
    e.g. [rBase+imm12] in a 32-bit instruction.

    Others are then relegated to separate address calculations, and it
    depends on the complexity of a specific address expression how it
    maps onto a particular ISA as to how many instructions it takes.

    So, now you are claiming that adding instructions and latency to
    memory access is not harming performance !?!?!?!

    Clearly you don't "get it"

    No, I am pointing to the reality that each ISA chooses certain
    operations to perform more optimally than others.
    If my ISA has a 3-bit scale field and yours has 2,
    and if the expression is an index to an fp64 complex array,
    then I use just 1 instruction while you need 2.

    360 has [base+index+imm12] but does not have scaled index so for array
    indexing on >1 byte it must copy an array index to a temp register,
    then left shift. The extra copy is required because shift left
    operates on a single source-dest register only.
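
    In C terms, the sequence the compiler must emit looks like this
    sketch (assuming 8-byte elements; the helper is illustrative):

    #include <stdint.h>

    /* Without a scaled-index mode, the element index must first be
       turned into a byte offset in a temporary (a copy, then a shift);
       only then can [base+index+disp] addressing be used. */
    static int64_t load_elem(const int64_t *a, long i) {
        long byte_off = i << 3;  /* the extra copy-and-shift */
        return *(const int64_t *)((const char *)a + byte_off);
    }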

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Fri Jun 20 12:27:19 2025
    On 6/20/2025 11:31 AM, Thomas Koenig wrote:
    John Levine <johnl@taugh.com> schrieb:
    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation and virtual
    memory, but considered it too risky to add to an already ambitious project.

    Brooks himself wrote he considered not adding virtual memory to the /360
    a mistake, so...

    Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.

    I think we can be in agreement that it was, indeed, a mistake,
    but obviously not fatal.

    IBM had good peripherals, they had a good upgrade path to very
    powerful machines, and they were bit-compatible for user programs
    (plus, they put in the microcode emulation of the 1401 so their
    customers could transition smoothly - that was a genius move,
    the /360 probably would have been far less of a success
    if that had not been possible). All of these were good reasons
    to buy these machines.

    And the larger machines had emulation of the 7080. And IBM had great
    marketing and armies of customer engineers, and relationships with
    company CEOs, a huge installed base of EAM machines (card sorters,
    tabulators etc.) that was ripe for upgrading, etc. There were many
    factors that contributed to its success.


    Customers could and did work around the memory fragmentation,
    but it didn't make their lives easier.

    But IBM severely underestimated the software complexity of the
    system they were creating, hence the delays and "The Mythical
    Man-Month" (and such abominations as JCL. Which way around is
    that COND parameter again?

    All good points.

    I do want to say that although I believe IBM made some mistakes on
    the S/360, I don't want to take away from their good decisions or
    detract from their success.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Fri Jun 20 20:35:23 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    No, I am pointing to the reality that each ISA chooses certain
    operations to perform more optimally than others.
    If my ISA has a 3-bit scale field and yours has 2,
    and if the expression is an index to an fp64 complex array,
    then I use just 1 instruction while you need 2.

    Hmm... assuming you have base+index addressing without
    scaling (and without implied scaling), you can do
    (for four-byte sizes)

    for (i=0; i<n; i++) {
    c[i] = a[i] + b[i]
    }

    and assuming that R1 points at a[0], R2 at b[0] and R3 at c[0]
    and that R4 is zero initially, you can do (pseudo-assembly),
    and R7 holds 4*n

    .Loop:
    ld R5,[R1,R4]      ; R5 = a[i]
    ld R6,[R2,R4]      ; R6 = b[i]
    add R5,R5,R6
    st R5,[R3,R4]      ; c[i] = a[i] + b[i]
    add R4,R4,#4       ; advance the byte offset
    cmp R4,R7          ; stop when the offset reaches 4*n (in R7)
    blt .Loop

    For this simple loop, there is no disadvantage to not
    having scaled index registers. This can be different
    when the value of the index variable is needed for
    something else, or for accessing something that has
    a different size.
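
    A case of that kind, sketched in C (the function is made up for
    illustration): one index used at two element sizes, so a single
    pre-scaled byte offset can no longer serve both accesses.

    #include <stdint.h>

    /* The same i addresses 4-byte and 8-byte elements; with only a
       pre-scaled byte counter, one of the two accesses needs its own
       shifted copy of the index every iteration, which is exactly
       where a scaled addressing mode saves an instruction. */
    void weighted_sum(double *sum, const int32_t *w, const double *x, int n) {
        for (int i = 0; i < n; i++)
            *sum += w[i] * x[i];  /* i<<2 for w, i<<3 for x */
    }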

    360 has [base+index+imm12] but does not have scaled index so for array indexing on >1 byte it must copy an array index to a temp register,
    then left shift. The extra copy is required because shift left
    operates on a single source-dest register only.

    Not needed, see above (too lazy to look up the /360 assembler :-)



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 20 21:09:30 2025
    On Fri, 20 Jun 2025 19:13:35 +0000, EricP wrote:

    MitchAlsup1 wrote:
    -------------

    Having spent 7 years doing x86, the answer was clear to me::

    [base+Rindex<<2+Displacement]

    I assume you mean a 2-bit scale, not the constant 2.

    The example was a WORD being accessed out of an array, so I did mean #2.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Fri Jun 20 21:26:48 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    I do want to say that because I believe that IBM made some mistakes on
    the S/360, I don't want to take away their good decisions or detract
    from their success.

    It was a revolutionary concept and a revolutionary class of machines.

    But reading about the system and its design process makes me itch
    to go find my old time machine (I misplaced it somewhere) and
    influence the course of computer history by pointing out some
    of the quirks and unnecessary complexities to the /360 team.

    The other point in time would probably have been Data General or
    DEC circa 1975, to steer their Fountainhead and VAX projects,
    respectively, towards RISC and graph-coloring register allocation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jun 20 21:27:55 2025
    On Fri, 20 Jun 2025 20:35:23 +0000, Thomas Koenig wrote:

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    No, I am pointing to the reality that each ISA chooses certain
    operations to perform more optimally than others.
    If my ISA has a 3-bit scale field and yours has 2,
    and if the expression is an index to an fp64 complex array,
    then I use just 1 instruction while you need 2.

    Hmm... assuming you have base+index addressing without
    scaling (and without implied scaling), you can do
    (for four-byte sizes)

    static int64_t a[100], b[100], c[100];

    for (i=0; i<n; i++) {
    c[i] = a[i] + b[i]
    }

    and assuming that R1 points at a[0], R2 at b[0] and R3 at c[0]
    and that R4 is zero initially, you can do (pseudo-assembly),
    and R7 holds 4*n

    .Loop:
    ld R5,[R1,R4]
    ld R6,[R2,R4]
    add R5,R5,R6
    st R5,[R3,R4]
    add R4,R4,#4
    cmp R4,R7
    blt .Loop

    MOV R4,#0
    VEC R16,{}
    LDD R5,[R1,R4<<3]
    LDD R6,[R2,R4<<3]
    ADD R5,R5,R6
    STD R5,[R3,R4<<3]
    LOOP LT,R4,#1,Rn

    The loop consists of 4 instructions of loop workload and 1 instruction
    of loop-overhead:: 5 instructions in 5 words.

    Whereas: RISC-V would need::

    MOV R4,#0
    SLA Rn,Rn,#3
    loop:
    LDD R5,[R1]
    LDD R6,[R2]
    ADD R5,R5,R6
    STD R5,[R3]
    ADD R1,R1,#8
    ADD R2,R2,#8
    ADD R3,R3,#8
    ADD R4,R4,#8
    BLT R4,Rn,loop

    And at the exit of the loop, R1, R2, and R3 are no longer pointing
    at the starting points of their arrays, potentially adding more
    instructions: 9 instructions and 9 words.

    The loop consists of 4 instructions of the loop workload, and 4
    instructions of loop-overhead {and possibly 3 instructions to
    recover the array pointers.}

    I am sensitive to this because the 88K Greenhills compiler would
    produce the latter instead of the former even though the former
    was significantly faster and smaller and was part of the 88K ISA.

    For this simple loop, there is no disadvantage of not
    having scaled index registers. This can be different
    when the value of the index variable is needed for
    something else, or for accessing something that has
    a different size.

    360 has [base+index+imm12] but does not have scaled index so for array
    indexing on >1 byte it must copy an array index to a temp register,
    then left shift. The extra copy is required because shift left
    operates on a single source-dest register only.

    Mostly the IBM compilers strength-reduced the indexing to appear as

    for (i=0; i<4*n; i+=4) {
    c[i/4] = a[i/4] + b[i/4]
    }

    Which is still necessary/useful for non-primitive types.
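
    In source terms that strength reduction amounts to the following C
    sketch: the per-iteration scaling disappears into pointer increments.

    #include <stdint.h>

    /* Strength-reduced form of c[i] = a[i] + b[i]: the index and its
       implicit scaling are replaced by pointers bumped once per
       iteration, which is what unscaled base+index addressing favors. */
    void add_arrays(int64_t *c, const int64_t *a, const int64_t *b, long n) {
        const int64_t *end = a + n;
        while (a < end)
            *c++ = *a++ + *b++;
    }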

    Not needed, see above (too lazy to look up the /360 assembler :-)



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 20 21:48:37 2025
    On Fri, 20 Jun 2025 19:13:35 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    -------------

    [reg], [reg+disp], [reg+index+disp] are all different address
    calculations.

    Yes, but the first 2 are a STRICT subset of the last one. So, you
    build the AGEN unit to perform the last one, and have DECODE feed
    zeros (0s) for the fields the simpler modes don't use.

    Yes, back then there would be just one ALU/AGEN for the core, used for
    pretty much all arithmetic, though sometimes it would have a separate incrementer/decrementer so it could overlap with the ALU/AGEN.

    Mc 88100 had::
    a) integer ALU (+ and -)
    b) address ALU (+ and <<{0,1,2,3})
    c) PC ALU (INC4, Disp16, Disp26)
    mostly because we did not want to route data to the ALU, and
    occasionally we wanted to use several FUs simultaneously.

    Note: Integer adder needed negate to perform SUB, this takes the
    same gate delay as AGEN with <<{0,1,2,3} with add-only.

    Even Mc68000 had 3 adders {PC, D, A}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Fri Jun 20 21:34:24 2025
    On Fri, 20 Jun 2025 18:06:18 +0000, John Levine wrote:

    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation and virtual
    memory, but considered it too risky to add to an already ambitious
    project.

    Considering that S/360 outsold all of its competitors combined, it's
    hard to argue it was a major mistake.

    What outsold the competitors is the ISA remaining stable over machine
    size and machine generation--preserving the software investment.

    Over in the number crunching side of things (CDC 6600-7600--CRAY)
    one had to hold onto Fortran decks and recompile for each machine.

    The attack of the Killer Micro's did not appear until circa 1977.

    Side note: CRAY sold a lot of vector processors at the time when NEC
    had higher performance CPUs and larger memories, because the CRAY
    machines had memory BW to spare that the I/O devices could use. So, a
    CRAY could be doing compute while storing the previous workload onto
    disk, and while loading the next workload from disk, completely
    overlapped with the computation.

    IBM had good peripherals, too.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to quadibloc on Fri Jun 20 23:20:12 2025
    On 6/20/2025 5:13 PM, quadibloc wrote:
    On Sat, 21 Jun 2025 0:06:35 +0000, quadibloc wrote:

    On Fri, 20 Jun 2025 23:57:29 +0000, quadibloc wrote:

    On Fri, 20 Jun 2025 21:34:24 +0000, MitchAlsup1 wrote:

    The attack of the Killer Micro's did not appear until circa 1977.

    That could be considered the very beginning, as that's when the Altair
    8800 came out and so on.

    And since the context was discussing events before 1977, that's good
    enough to say that back then, micros weren't a problem for sure.

    But 8-bit microprocessors didn't kill minis and mainframes. They weren't powerful enough to compete. When did micros really become killers?

    Well, they certainly were killers when the Pentium II came out in 1997,
    but I'd say that's rather a late date.

    Instead, micros were lethal to a lot of larger systems even before they
    reached that level of performance. In 1987, halfway between those two
    dates, Intel came out with the 387. Hardware floating point for a 32 bit system? It's about at that point that anything larger became
    questionable.

    And I was able to find out that the phrase was coined by Eugene Brooks
    in 1990, in the title of a paper at Supercomputing 1990.

    1989 certainly included some momentous events - the Cyrix FasMath 83D87,
    and the Intel 486, with hardware floating-point standard.

    And don't forget, the 486 also included on chip cache.

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Stephen Fuld on Sat Jun 21 12:04:30 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 6/15/2025 1:20 PM, John Levine wrote:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    The low end 360s were too underpowered to do
    time sharing and any sort of dynamic relocation would have just made them more
    expensive with no practical benefit. A 360/30 was a lot slower than a PDP-8 but
    made up for it with much better I/O devices.

    While I admit that I was less familiar with the lower end systems, I think
    the extra expense would have been a single register in the CPU to hold
    the base, and a few extra instructions at task switch time to save and
    reload it. Not very much. And the benefits to the larger systems would
    have been significant when they implemented interactive usage.

    The 360/30 was byte serial and stored the 16 registers in core (and I mean
    core.) According to my copy of the 360/30 Functional Characteristics manual,
    a register to register load took 17us, memory to register took 24us, with
    an additional 4.5us if it was indexed. I'd think the time to add a system
    base register would be about the same as the indexing time, as would the
    comparison to the bound register, so that's an extra 9us for every
    instruction, which would be about a 30% slowdown. If they put those
    registers in logic to speed it up, it'd be a significant extra chunk of
    hardware.

    First, John, I want to thank you for forcing me to think about this.
    Good to keep the brain active!

    I think your analysis is flawed. While you would have to add the
    contents of the system base register to compute the memory address for
    the memory to register load, you would save having to add the contents
    of the base register specified in the instruction. So I think it would be a wash.

    Furthermore, since the S/360 used storage keys for protection, there is
    no need for a bounds register.

    Lastly, since programs were loaded on page boundaries and the max memory
    on the /30 (I had to look this up) was 64K, the system base register
    would only have had to be 4 bits, so maybe small enough to invest in
    actual hardware to hold it. IF so, it would have been a significant
    speedup, as you wouldn't have had to load the base register value from core.

    I am not sure of loading only at page boundaries. 360/30 had no
    paging hardware and AFAIK storage keys were optional, so in
    principle it could load anywhere. Also, it could load multiple
    modules as part of a single program, so even with use of storage
    keys they would run with a single key, so no need to go to a
    separate page.

    Actually, with your proposal one would lose or cripple the ability to
    load modules at different locations (thanks to multiple base
    registers, such a module could access data in different modules).

    Concerning a hardware base register, note that 360 instructions
    were interpreted by microcode. Fetching the corresponding
    microinstruction would be a substantial cost for models keeping
    microcode in core (that is, the 20 and 25). On the 30 it would be
    a smaller penalty, but still non-negligible.

    So, a non-negligible performance loss and a loss of functionality
    to get an "exotic" feature. If the 360 architects had considered
    all of this, I think their decision would not have changed.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Sat Jun 21 14:33:58 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    John Levine <johnl@taugh.com> schrieb:

    Considering that S/360 outsold all of its competitors combined, it's hard to argue it was a major mistake.

    I think we can be in agreement that it was, indeed, a mistake,
    but obviously not fatal.

    IBM had good peripherals, they had a good upgrade path to very
    powerful machines, and they were bit-compatible for user programs
    (plus, they put in the microcode emulation of the 1401 so their
    customers could transition smoothly - that was a genius move,
    the /360 probably would have been far less of a success
    if that had not been possible). All of these were good reasons
    to buy these machines.

    Burroughs did the same, adding B300 emulation to the B3500.


    Customers could and did work around the memory fragmentation,
    but it didn't make their lives easier.

    But IBM severely underestimated the software complexity of the
    system they were creating, hence the delays and "The Mythical
    Man-Month" (and such abominations as JCL. Which way around is
    that COND parameter again? But because I made some money
    working on mainframes as a student, I cannot complain - nobody
    ever challenged the hours I billed because mainframes are
    complex, as everybody knows, and JCL was a large part of that :-)

    JCL was, indeed, rather horrible. Something Burroughs avoided.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 21 14:36:19 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 20 Jun 2025 18:06:18 +0000, John Levine wrote:

    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    I'm pretty sure you're wrong. They didn't think they needed to move
    a program after it was loaded.

    Which was a mistake, but one that had no impact on MFT (you had a
    fixed number of regions with a fixed memory there), but it did once
    they released MVT, because then memory fragmentation became
    inevitable.

    They may not have considered that early enough in the project.

    See the message I sent yesterday. They knew about dynamic relocation
    and virtual memory, but considered it too risky to add to an already
    ambitious project.

    Considering that S/360 outsold all of its competitors combined, it's
    hard to argue it was a major mistake.

    What outsold the competitors is the ISA remaining stable over machine
    size and machine generation--preserving the software investment.

    Over in the number crunching side of things (CDC 6600-7600--CRAY)
    one had to hold onto Fortran decks and recompile for each machine.

    Burroughs, on the other hand, had binary compatibility throughout
    the lifetime of their mainframe lines; even after a major architectural
    redesign in the early 80s, compiled applications from 1966 still
    ran fine.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to quadibloc on Sat Jun 21 14:25:18 2025
    quadibloc <quadibloc@gmail.com> wrote:
    On Fri, 20 Jun 2025 21:34:24 +0000, MitchAlsup1 wrote:

    The attack of the Killer Micro's did not appear until circa 1977.

    That could be considered the very beginning, as that's when the Altair
    8800 came out and so on.

    And since the context was discussing events before 1977, that's good
    enough to say that back then, micros weren't a problem for sure.

    But 8-bit microprocessors didn't kill minis and mainframes. They weren't powerful enough to compete. When did micros really become killers?

    Well, they certainly were killers when the Pentium II came out in 1997,
    but I'd say that's rather a late date.

    Instead, micros were lethal to a lot of larger systems even before they reached that level of performance. In 1987, halfway between those two
    dates, Intel came out with the 387. Hardware floating point for a 32 bit system? It's about at that point that anything larger became
    questionable.

    I think you underestimate the impact of micros. At the lowest end, the
    ZX Spectrum and Commodore 64 gave nontrivial compute power at low cost.
    There were the IBM PC and 68000-based workstations. So already around 1983
    micros limited the market for low end minis (and, due to minis, the market
    for low end mainframes was limited earlier). Around 1990 there were RISC
    workstations and minis were legacy. VAX switched to microprocessors
    and DEC decided to replace VAX with Alpha. IBM started using
    microprocessors for its mainframes around 1993.

    If you consider that designers have to look forward a few years,
    then 1977 looks like a reasonable boundary date: before it, a
    microprocessor was quite unlikely to be a good choice; after it,
    one frequently had to be considered.

    BTW: Already the 8086 could be paired with the 8087, which allowed
    cost-effective floating point computation.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Stefan Monnier on Sat Jun 21 16:50:01 2025
    On 21/06/2025 16:39, Stefan Monnier wrote:
    What define(s|d) a "mini" or a "mainframe"?
    For "micro" AFAIK the definition is/was "single-chip CPU", so I guess
    "mini" would be something like "CPU made of 74xxx thingies?" and as for
    how to distinguish them from mainframes, I don't know.

    The old definition I recall is:
    If you can pick it up it's a micro.

    If you can't pick it up, but you can see over it, it's a mini.

    If you can't see over it it's a mainframe.

    It's obviously a bit of a joke - but I don't think I've heard anything
    better.

    Andy

    --
    Do not listen to rumour, but, if you do, do not believe it.
    Gandhi.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Sat Jun 21 11:39:26 2025
    I think you underestimate the impact of micros. At the lowest end, the
    ZX Spectrum and Commodore 64 gave nontrivial compute power at low cost.
    There were the IBM PC and 68000-based workstations. So already around 1983
    micros limited the market for low end minis (and, due to minis, the market
    for low end mainframes was limited earlier).

    What define(s|d) a "mini" or a "mainframe"?
    For "micro" AFAIK the definition is/was "single-chip CPU", so I guess
    "mini" would be something like "CPU made of 74xxx thingies?" and as for
    how to distinguish them from mainframes, I don't know.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sat Jun 21 14:57:16 2025
    MitchAlsup1 wrote:
    On Fri, 20 Jun 2025 19:13:35 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 19 Jun 2025 13:35:43 +0000, EricP wrote:
    -------------

    [reg], [reg+disp], [reg+index+disp] are all different address
    calculations.

    Yes, but the first 2 are a STRICT subset of the last one. So, you
    build the AGEN unit to perform the last one, and have DECODE feed
    zeros (0s) for the fields the simpler modes don't use.

    Yes, back then there would be just one ALU/AGEN for the core, used for
    pretty much all arithmetic, though sometimes it would have a separate
    incrementer/decrementer so it could overlap with the ALU/AGEN.

    Mc 88100 had::
    a) integer ALU (+ and -)
    b) address ALU (+ and <<{0,1,2,3})
    c) PC ALU (INC4, Disp16, Disp26)
    mostly because we did not want to route data to the ALU, and
    occasionally we wanted to use several FUs simultaneously.

    Note: Integer adder needed negate to perform SUB, this takes the
    same gate delay as AGEN with <<{0,1,2,3} with add-only.

    Even Mc66000 had 3 adders {PC, D, A}

    I had a look at the 360-30 uArch in the IBM Field Engineering manual

    http://www.bitsavers.org/pdf/ibm/360/fe/2030/Y24-3360-1_2030_FE_Theory_Opns_Jun67.pdf

    and there is basically nothing to it.
    It is literally just a bunch of registers, an 8-bit ALU, a bunch of 8-bit
    buses from the registers to the ALU and a result bus back to the registers,
    a microcode read-only memory called CROS (Capacitive Read Only Storage) cards, and a microcode counter-sequencer. Understandably its performance was something like 34.5 kIPS, as in 34,500 Instructions Per Second.

    Here is a picture of the TROS (Transformer Read Only Storage)
    from the Model 20 microcode:

    https://static.righto.com/images/ibm-360-50/tros.jpg

    Ken Shirriff also shows the much fancier -50 uArch:

    https://static.righto.com/images/ibm-360-50/diagram-w900.jpg

    Simulating the IBM 360/50 mainframe from its microcode http://www.righto.com/2022/01/ibm360model50.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Waldek Hebisch on Sat Jun 21 20:32:47 2025
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I am not sure of loading only at page boundaries. 360/30 had no
    paging hardware and AFAIK storage keys were optional,

    There was no mention of this in the Principles of Operation,
    and its timing is given in the System/360 Model 30 Functional
    Characteristics document, so I don't think this is true.

    so in
    principle it could load anywhere.

    We should also consider what the machine was capable of running.
    Like all of /360 it was supposed to have run OS/360, but
    that was running late and was too big, so smaller systems
    were used. These were generally only capable of running
    one program at a time, so the point where to load becomes
    sort of moot. (Also, DOS/360 does not seem to have had a
    relocating loader, so everything had to be loaded at
    a pre-determined address.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Thomas Koenig on Sun Jun 22 01:26:46 2025
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I am not sure of loading only at page boundaries. 360/30 had no
    paging hardware and AFAIK storage keys were optional,

    There was no mention of this in the Principles of Operation,
    and its timing is given in the System/360 Model 30 Functional
    Characteristics document, so I don't think this is true.

    If you mean loading, it is really an OS function, not an architecture feature.

    so in
    principle it could load anywhere.

    We should also consider what the machine was capable of running.
    Like all of /360 it was supposed to have run OS/360, but
    that was running late and was too big, so smaller systems
    were used. These were generally only capable of running
    one program at a time, so the point where to load becomes
    sort of moot. (Also, DOS/360 does not seem to have had a
    relocating loader, so everything had to be loaded at
    a pre-determined address.)

    AFAIK in OS/360 overlays were separately loaded, just like
    programs. So even with one program running one was likely
    to want several modules, each at its own load address.

    I am not sure what DOS was doing, but many OS/360 programs
    were supposed to run under DOS. Since overlays were used
    quite a lot I would expect DOS to support them. And due
    to conventions with base registers, supporting overlays loaded
    at arbitrary locations probably did not require much effort.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 22 01:31:52 2025
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I am not sure of loading only at page boundaries. 360/30 had no
    paging hardware and AFAIK storage keys were optional,

    There was no mention of this in the Principles of Operation,
    and its timing is given in the System/360 Model 30 Functional
    Characteristics document, so I don't think this is true.

    It's on page 11 of Functional Characteristics. Storage Protection
    was an optional feature.

    We should also consider what the machine was capable of running.
    Like all of /360 it was supposed to have run OS/360, but
    that was running late and was too big, so smaller systems
    were used.

    I saw someone run OS on a 64K /30 but you're right, DOS and TOS
    were much more common.

    These were generally only capable of running
    one program at a time, so the point where to load becomes
    sort of moot. (Also, DOS/360 does not seem to have had a
    relocating loader, so everything had to be loaded at
    a pre-determined address.)

    I think you're right but I don't understand your point. All models of
    the 360 had the same architecture and the same instruction set so even
    if DOS didn't do load time relocation, other operating systems did and
    they ran on the same machines.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to It appears that Waldek Hebisch on Sun Jun 22 01:36:25 2025
    It appears that Waldek Hebisch <antispam@fricas.org> said:
    AFAIK in OS/360 overlays were separately loaded, just like
    programs. So even with one program running one was likely
    to want several modules, each at its own load address.

    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the appropriate overlay when you called down into one.

    One load module could also run another using system calls which was occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routines for a sort run.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Sun Jun 22 08:57:04 2025
    John Levine <johnl@taugh.com> schrieb:
    It appears that Waldek Hebisch <antispam@fricas.org> said:
    AFAIK in OS/360 overlays were separately loaded, just like
    programs. So even with one program running one was likely
    to want several modules, each at it own load adress.

    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    Of course I could drink a coffee during the 20 or so minutes wall-time
    it took a program with one of the graphics libraries I was using to link
    (the jobs were high priority, so they were running right away)
    but nobody can drink that much coffee.

    One load module could also run another using system calls which was occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routines for a sort run.

    Sort of early JIT, then (pun intended).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 22 17:52:44 2025
    According to Thomas Koenig <tkoenig@netcologne.de>:
    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the
    appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    The manual says there were two versions of the linker, E level 15K and
    F level 44K, and three subversions of the latter, 44K, 88K, and 128K.
    It says in the three versions of F "the logic and control flow is
    identical" but the bigger ones are faster, which suggests to me that
    they unfolded some overlays.

    15K is really small, it must have overlaid like crazy.

    One load module could also run another using system calls which was
    occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routins for a sort run.

    Sort of early JIT, then (pun intended).

    Very much so. Tape systems spent more time sorting than doing anything
    else, so they had all sorts of hacks to speed it up. Precompiling the
    inner loop was just one of them. I gather they wrote their own channel programs, too.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Sun Jun 22 18:25:02 2025
    On Sun, 22 Jun 2025 17:52:44 +0000, John Levine wrote:

    According to Thomas Koenig <tkoenig@netcologne.de>:
    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the
    appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    The manual says there were two versions of the linker, E level 15K and
    F level 44K, and three subversions of the latter, 44K, 88K, and 128K.
    It says in the three versions of F "the logic and control flow is
    identical" but the bigger ones are faster which suggests to me that
    they unfolded some overlays.

    15K is really small, it must have overlaid like crazy.

    In 1975-77 I worked on a Sigma 5 computer system that only had 16KB
    of core. You had to fit the OS and your application into 16K,
    with the OS eating up 5-6K of your memory. So, yes, everything
    was overlaid to the hilt.

    Our application was::
    a) real time capture of A/D readout from NMR into 4K array
    b) convert to float
    c) FFT on the float data
    d) conjugate multiply with precomputed 4K array
    e) FFT-1
    g) write the data onto the Tektronix display in graphics form
    h) with ability to save to disk/tape for later.

    So, yes, there were a lot of overlays !!

    Later we added the computer driving a frequency generator while
    capturing the A/D data "coherently".

    One load module could also run another using system calls which was
    occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routins for a sort run.

    Sort of early JIT,then (pun intended).

    Very much so. Tape systems spent more time sorting than doing anything
    else, so they had all sorts of hacks to speed it up. Precompiling the
    inner loop was just one of them. I gather they wrote their own channel programs, too.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Sun Jun 22 20:29:41 2025
    John Levine <johnl@taugh.com> writes:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the
    appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something
    almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    The manual says there were two versions of the linker, E level 15K and
    F level 44K, and three subversions of the latter, 44K, 88K, and 128K.
    It says in the three versions of F "the logic and control flow is
    identical" but the bigger ones are faster which suggests to me that
    they unfolded some overlays.

    15K is really small, it must have overlaid like crazy.

    One load module could also run another using system calls which was
    occasionally useful, e.g., the sort program could call the linkage
    editor to make a loadable module of the specific comparison and exit
    routines for a sort run.

    Sort of early JIT, then (pun intended).

    Very much so. Tape systems spent more time sorting than doing anything
    else, so they had all sorts of hacks to speed it up. Precompiling the
    inner loop was just one of them. I gather they wrote their own channel
    programs, too.

    The Burroughs medium systems Sort intrinsic would even read the tape
    backwards to improve sort speed. The author of the intrinsic was
    justifiably proud of the performance for a variety of source and
    destination media. A 16-unit tape sort/merge was really impressive
    to watch.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jun 22 22:44:44 2025
    According to Scott Lurndal <slp53@pacbell.net>:
    Sort of early JIT,then (pun intended).

    Very much so. Tape systems spent more time sorting than doing anything >>else, so they had all sorts of hacks to speed it up. Precompiling the >>inner loop was just one of them. I gather they wrote their own channel >>programs, too.

    The Burroughs medium systems Sort intrinsic would even read the tape
    backwards to improve sort speed.

    That was a standard trick. A tape sort read the inputs, wrote sorted
    runs of records on several tapes, then repeatedly merged the runs from
    one group of tapes to another until there was one big sorted run. I
    think sometime in the 1950s someone noticed that rather than rewinding
    between passes, you could just read the tapes backward and sort in
    reverse order. You might end up with the final sort backwards and have
    to do one more pass to make it forwards, but I gather it was worth it.
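
    A toy illustration of that trick, with arrays standing in for tapes
    (everything here is invented): merging two ascending runs by consuming
    them from the high end, the way a drive reads backward after writing,
    yields one descending run, ready to be read backward again on the next
    pass with no rewind anywhere.

        #include <stdio.h>

        /* Merge runs a[0..na) and b[0..nb), both ascending, reading each
           from its high end; out[] comes out descending. */
        static void merge_backward(const int *a, int na,
                                   const int *b, int nb, int *out)
        {
            int i = na - 1, j = nb - 1, k = 0;
            while (i >= 0 && j >= 0)
                out[k++] = (a[i] >= b[j]) ? a[i--] : b[j--];
            while (i >= 0) out[k++] = a[i--];
            while (j >= 0) out[k++] = b[j--];
        }

        int main(void)
        {
            int run1[] = { 2, 5, 9 };        /* sorted run on tape 1 */
            int run2[] = { 1, 6, 7, 8 };     /* sorted run on tape 2 */
            int merged[7];

            merge_backward(run1, 3, run2, 4, merged);
            for (int k = 0; k < 7; k++)      /* prints 9 8 7 6 5 2 1 */
                printf("%d ", merged[k]);
            putchar('\n');
            return 0;
        }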

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Mon Jun 23 06:07:11 2025
    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    A load module could contain multiple tree structured overlays, and in
    OS at least, the linker added glue code that loaded and relocated the
    appropriate overlay when you called down into one.

    Over-use of that technique may have been the reason why the linker
    was so sloooooow, even on machines with adequate memory, or maybe
    it was something else. (I think they replaced it with something
    almost-compatible, and even hijacked the IEWL name, almost
    unheard of at IBM).

    The manual says there were two versions of the linker, E level 15K and
    F level 44K, and three subversions of the latter, 44K, 88K, and 128K.
    It says in the three versions of F "the logic and control flow is
    identical" but the bigger ones are faster which suggests to me that
    they unfolded some overlays.

    15K is really small, it must have overlaid like crazy.

    And they probably didn't touch it again... The machine I worked
    on was a Fujitsu rebranded as a Siemens 7881. I didn't know the
    original Fujitsu name at the time. It ran BS 3000, which was an
    MVS clone. And with a main memory of 2*16MB and a normal job size
    of 1MB or more (no reason to select anything less), it still ran
    dead slow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to quadibloc on Mon Jun 23 17:13:34 2025
    quadibloc <quadibloc@gmail.com> schrieb:
    On Mon, 23 Jun 2025 6:07:11 +0000, Thomas Koenig wrote:

    And they probably didn't touch it again... The machine I worked
    on was a Fujitsu rebranded as a Siemens 7881. I didn't know the
    original Fujitsu name at the time. It ran BS 3000, which was an
    MVS clone.

    I tried to look it up, and found it was really a Siemens 7.881-2
    (the punctuation is important). And this was one of Fujitsu's larger
    scale systems, intended to compete with the IBM 3800, so if it ran
    dead slow, that is surprising.

    What I meant was that the linker ran awfully slowly if there was
    anything big to link. Apart from that, I really didn't have any
    meaningful comparisons; it was the first large system I ever
    worked on.

    For some reason, they renamed the standard IBM utilities, so
    IEBGENER became JSEGENER (but an IEBGENER alias was still
    provided).

    Also, the English in their documentation was really strange.
    When the computer center switched to an IBM 3090, that was
    a dramatic improvement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Savard on Tue Jul 29 08:45:14 2025
    John Savard <quadibloc@invalid.invalid> writes:
    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.

    Microcode may have been a good thing somewhat earlier when ROM or the
    writable control store (WCS) could be run at speeds much higher than
    core memory (how was the WCS actually implemented?), but core memory
    had been replaced by semiconductor DRAM by the time the VAX was
    introduced, and that was faster (already the Nova 800 of 1971 had an
    800ns cycle, and Acorn managed to access DRAM at 8MHz (but only when
    staying within the same row) in 1987); my guess is that in the VAX
    11/780 timeframe, 2-3MHz DRAM access within a row would have been
    possible. Moreover, the VAX 11/780 has a cache (it also has a WCS).
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    Nevertheless, if I time-traveled to the start of the VAX design, and
    was put in charge of designing the VAX, I would design a RISC, and I
    am sure that it would outperform the actual VAX 11/780 by at least a
    factor of 2. So no, I don't think that the VAX architecture was a
    good match for the technology of the time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Wed Jul 30 05:59:18 2025
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the
    overcomplex instruction and address modes and the tiny 512 byte page
    size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
    scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not
    expect compilers to improve as fast as they did, leading to a machine
    which was fun to program in assembler but full of stuff that was
    useless to compilers and instructions like POLY that should have been
    subroutines. The 801 project and PL.8 compiler were well underway at
    IBM by the time the VAX shipped, but DEC presumably didn't know about
    it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    block for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.
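
    For reference, what POLY boils down to is a Horner's-rule loop, which
    a subroutine does just as well; a sketch (simplified, not the VAX
    operand encoding):

        /* Evaluate c[d]*x^d + ... + c[1]*x + c[0] by Horner's rule:
           one multiply and one add per coefficient. */
        double poly(double x, const double c[], int degree)
        {
            double r = c[degree];
            for (int i = degree - 1; i >= 0; i--)
                r = r * x + c[i];
            return r;
        }
        /* e.g., a sine approximation: sin(x) ~= x * poly(x*x, coeffs, n)
           with coeffs taken from the usual minimax tables. */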

    Related to the microcode issue they also don't seem to have
    anticipated how important pipelining would be. Some minor changes to
    the VAX, like not letting one address modify another in the same
    instruction, would have made it a lot easier to pipeline.

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to
    achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Another issue is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what
    RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of
    conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Jul 29 16:44:35 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex instruction and address modes and the tiny 512 byte page size.

    Another, which is not entirely their fault, is that they did not expect compilers to improve as fast as they did, leading to a machine which was fun to program in assembler but full of stuff that was useless to compilers and instructions like POLY that should have been subroutines. The 801 project and PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC presumably didn't know about it.

    Related to the microcode issue they also don't seem to have anticipated how important pipelining would be. Some minor changes to the VAX, like not letting one address modify another in the same instruction, would have made it a lot easier to pipeline.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Wed Aug 27 00:35:18 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the
    overcomplex instruction and address modes and the tiny 512 byte page
    size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not
    expect compilers to improve as fast as they did, leading to a machine
    which was fun to program in assembler but full of stuff that was
    useless to compilers and instructions like POLY that should have been
    subroutines. The 801 project and PL.8 compiler were well underway at
    IBM by the time the VAX shipped, but DEC presumably didn't know about
    it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    Much more than just "well aware": there were at least 15 grad
    students at CMU working on optimizing compilers AND the VAX ISA,
    with Wulf, Newell, and Bell leading the pack.

    POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
    block for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    Hold on a minute:: My Transcendentals are done in POLY-like fashion,
    it is just that the constants come from ROM inside the FPU, instead
    of user defined DRAM coefficients. Thus, POLY is good, POLY as an
    instruction is bad.

    Related to the microcode issue they also don't seem to have
    anticipated how important pipelining would be. Some minor changes to
    the VAX, like not letting one address modify another in the same
    instruction, would have made it a lot easier to pipeline.

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Compilers have taught us that one-address-mode per instruction is
    "sufficient" {if you are going to have address modes.}

    My work on My 66000 has taught me that 1 constant per instruction
    is nearly sufficient. The only places I break this are ST #val[disp]
    and LOOP cnd,Ri,#inc,#max.

    Pipeline work over 1983-to-current has shown that LD and OPs perform
    just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
    were LD and OP, and there are ways to perform LD and OP as if it were
    LD+OP.

    Another issue is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    Condition codes get hard when DECODE width grows greater than 3.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to MitchAlsup on Wed Aug 27 05:12:57 2025
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.6 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MitchAlsup on Wed Aug 27 17:19:06 2025
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.
    ...
    [...] POLY as an
    instruction is bad.

    Exactly.

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    It's better to forget this misinformation, and instead remember that
    the VAX has an average CPI of 10.6 (Table 8 of <https://american.cs.ucdavis.edu/academic/readings/papers/p301-emer.pdf>)
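
    The arithmetic, for concreteness (the 11/780's 200ns cycle means a
    5MHz clock):

        native MIPS = clock rate / CPI
        at  5.0 CPI: 5MHz /  5.0 = 1.0  MIPS  (the folklore figure)
        at 10.6 CPI: 5MHz / 10.6 = 0.47 MIPS  (the measured machine)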

    Table 9 of that reference is also interesting:

    CALL/RET instructions take an average of 45 cycles, Character
    instructions (I guess this means stuff like EDIT) take an average of
    117 cycles, and Decimal instructions take an average of 101 cycles.
    It seems
    that these instructions all have no special hardware support on the
    VAX 11/780 and do it all through microcode. So replacing Character
    and Decimal instructions with calls to functions on a RISC-VAX could
    easily outperform the VAX 11/780 even without special hardware
    support. Now add decimal support like the HPPA has done or string
    support like the Alpha has done, and you see even better speed for
    these instructions.

    For CALL/RET, one might use one of the modern calling conventions.
    However, this loses some capabilities compared to the VAX. So one may
    prefer to keep frame pointers by default and maybe other features that
    allow, e.g., universal cross-language debugging on the VAX without monstrosities like ELF and DWARF.

    Pipeline work over 1983-to-current has shown that LD and OPs perform
    just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
    were LD and OP, and there are way to perform LD and OP as if it were
    LD+OP.

    I don't know what you are getting at here. When implementing the 486,
    Intel chose the following pipeline:

    Instruction Fetch
    Instruction Decode
    Mem1
    Mem2/OP
    Writeback

    This meant that load-and-op instructions take 2 cycles (and RMW
    instructions take three); it gave us the address-generation interlock (op-to-load latency 2), and 3-cycle taken branches. An alternative
    would have been:

    Instruction Fetch
    Instruction Decode
    Mem1
    Mem2
    OP
    Writeback

    This would have resulted in a max throughput of 1 CPI for sequences
    of load-and-op instructions, but would have resulted in an AGI of 3
    cycles, and 4-cycle taken branches.
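
    To make the AGI concrete, a small sketch; the cycle counts in the
    comments refer to the two hypothetical pipelines above:

        #include <stdio.h>

        struct node { struct node *next; int val; };

        int main(void)
        {
            /* A 3-element circular list to chase. */
            struct node n0, n1, n2;
            n0 = (struct node){ &n1, 0 };
            n1 = (struct node){ &n2, 1 };
            n2 = (struct node){ &n0, 2 };

            struct node *p = &n0;
            for (int i = 0; i < 6; i++) {
                /* Each load's result is the next load's address, so every
                   step pays the full op-to-address latency: 2 cycles on
                   the 486-style pipeline, 3 on the Mem1/Mem2/OP variant. */
                p = p->next;
                printf("%d ", p->val);
            }
            putchar('\n');
            return 0;
        }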

    For the Bonnell, Intel chose such a pipeline (IIRC with a third mem
    stage), but the Bonnell has a branch predictor, so the longer branch
    latency usually does not hurt.

    AFAIK IBM used such a pipeline for some S/360 descendants.

    Condition codes get hard when DECODE width grows greater than 3.

    And yet the widest implementations (up to 10 wide up to now) are of
    ISAs that have condition-code registers. Even particularly nasty ones
    in the case of AMD64.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)