• Re: Why VAX Was the Ultimate CISC and Not RISC

    From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sat Mar 1 11:58:17 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Could the VAX have been designed as a
    RISC architecture to begin with? Because not doing so meant that, just
over a decade later, RISC architectures took over the “real computer” market and wiped the floor with DEC’s flagship architecture, performance-wise.

    The answer was no, the VAX could not have been done as a RISC
    architecture. RISC wasn’t actually price-performance competitive until
    the latter 1980s:

    RISC didn’t cross over CISC until 1985. This occurred with the
    availability of large SRAMs that could be used for caches.

    Like other USA-based computer architects, Bell ignores ARM, which
    outperformed the VAX without using caches and was much easier to
    design.

    As for code size, we see significantly smaller code for RISC
    instruction sets with 16/32-bit encodings such as ARM T32/A32 and
    RV64GC than for all CISCs, including AMD64, i386, and S390x <2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
    so much better in this respect that its code is significantly smaller
    than for these CPUs.

    Bottom line: If you sent, e.g., me and the needed documents back in
    time to the start of the VAX project, and gave me a magic wand that
    would convince the DEC management and workforce that I know how to
design their next architecture, and how to compile for it, I would
    give the implementation team RV32GC as architecture to implement, and
    that they should use pipelining for that, and of course also give that
    to the software people.

    As a result, DEC would have had an architecture that would have given
    them superior performance, they would not have suffered from the
    infighting of VAX9000 vs. PRISM etc. (and not from the wrong decision
    to actually build the VAX9000), and might still be going strong to
    this day. They would have been able to extend RV32GC to RV64GC
    without problems, and produce superscalar and OoO implementations.

    OTOH, DEC had great success with the VAX for a while, and their demise
    may have been unavoidable given their market position: Their customers (especially the business customers of VAXen) went to them instead of
    IBM, because they wanted something less costly, and they continued
    onwards to PCs running Linux when they provided something less costly.
    So DEC would also have needed to outcompete Intel and the PC market to
    succeed (and IBM eventually got out of that market).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Mar 1 08:09:35 2025
    Found this paper <https://gordonbell.azurewebsites.net/Digital/Bell_Retrospective_PDP11_paper_c1998.htm>
    at Gordon Bell’s website. Talking about the VAX, which was designed as
    the ultimate “kitchen-sink” architecture, with every conceivable
    feature to make it easy for compilers (and humans) to generate code,
    he explains:

    The VAX was designed to run programs using the same amount of
    memory as they occupied in a PDP-11. The VAX-11/780 memory range
    was 256 Kbytes to 2 Mbytes. Thus, the pressure on the design was
    to have very efficient encoding of programs. Very efficient
    encoding of programs was achieved by having a large number of
    instructions, including those for decimal arithmetic, string
    handling, queue manipulation, and procedure calls. In essence, any
    frequent operation, such as the instruction address calculations,
    was put into the instruction-set. VAX became known as the
    ultimate, Complex (Complete) Instruction Set Computer. The Intel
    x86 architecture followed a similar evolution through various
    address sizes and architectural fads.

    The VAX project started roughly around the time the first RISC
    concepts were being researched. Could the VAX have been designed as a
    RISC architecture to begin with? Because not doing so meant that, just
    over a decade later, RISC architectures took over the “real computer” market and wiped the floor with DEC’s flagship architecture, performance-wise.

    The answer was no, the VAX could not have been done as a RISC
    architecture. RISC wasn’t actually price-performance competitive until
    the latter 1980s:

    RISC didn’t cross over CISC until 1985. This occurred with the
    availability of large SRAMs that could be used for caches. It
    should be noted at the time the VAX-11/780 was introduced, DRAMs
    were 4 Kbits and the 8 Kbyte cache used 1 Kbits SRAMs. Memory
    sizes continued to improve following Moore’s Law, but it wasn’t
    till 1985, that Reduced Instruction Set Computers could be built
    in a cost-effective fashion using SRAM caches. In essence RISC
    traded off cache memories built from SRAMs for the considerably
    faster, and less expensive Read Only Memories that held the more
    complex instructions of VAX (Bell, 1986).

  • From Anton Ertl@21:1/5 to Anton Ertl on Sat Mar 1 17:59:51 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    If you sent, e.g., me and the needed documents back in
    time to the start of the VAX project, and gave me a magic wand that
    would convince the DEC management and workforce that I know how to
design their next architecture, and how to compile for it, I would
    give the implementation team RV32GC as architecture to implement, and
    that they should use pipelining for that, and of course also give that
    to the software people.

    There was also the question of PDP-11 compatibility. I would solve
    that by adding a PDP-11 decoder that produces RV32G instructions (or
    maybe the microcode that the RV32G decoder produces). Low-end models
    may get a dynamic binary translator instead.

    OTOH, DEC had great success with the VAX for a while, and their demise
may have been unavoidable given their market position: Their customers (especially the business customers of VAXen) went to them instead of
    IBM, because they wanted something less costly, and they continued
    onwards to PCs running Linux when they provided something less costly.
So DEC would also have needed to outcompete Intel and the PC market to succeed (and IBM eventually got out of that market).

    OTOH, HP was also a big player in the mini and later workstation
    market, and they managed to survive, albeit by eventually splitting
    themselves into HPE for the big iron, and the other part for the PCs
    and printers. But it may be the exception that proves the rule.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sat Mar 1 18:03:21 2025
    On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Could the VAX have been designed as a
    RISC architecture to begin with? Because not doing so meant that, just
over a decade later, RISC architectures took over the “real computer” market and wiped the floor with DEC’s flagship architecture, performance-wise.

    The answer was no, the VAX could not have been done as a RISC
architecture. RISC wasn’t actually price-performance competitive until the latter 1980s:

    RISC didn’t cross over CISC until 1985. This occurred with the
    availability of large SRAMs that could be used for caches.

    Like other USA-based computer architects, Bell ignores ARM, which outperformed the VAX without using caches and was much easier to
    design.

    Was ARM around when VAX was being designed (~1973) ??

    "The Case for the Reduced Instruction Set Computer" was after
    1980 as a point of temporal reference.

    As for code size, we see significantly smaller code for RISC
    instruction sets with 16/32-bit encodings such as ARM T32/A32 and
    RV64GC than for all CISCs, including AMD64, i386, and S390x <2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
    so much better in this respect that its code is significantly smaller
    than for these CPUs.

VAX's advantage was that it executed fewer instructions (VAX only executed
65% of the number of instructions the R2000 executed).

    My 66000 only needs 70% of the instructions RISC-V requires. Thus
    it is within spitting distance of VAX instruction count while still
    being almost a RISC architecture.

    Bottom line: If you sent, e.g., me and the needed documents back in
    time to the start of the VAX project, and gave me a magic wand that
    would convince the DEC management and workforce

    You would also have to convince the Computer Science department at
    CMU; Where a lot of VAX ideas were dreamed up based on the success
    of the PDP-11.

    that I know how to
design their next architecture, and how to compile for it, I would
    give the implementation team RV32GC as architecture to implement, and
    that they should use pipelining for that, and of course also give that
    to the software people.

    A pipelined machine in 1978 would have had 50% to 100% more circuit
    boards than VAX 11/780, making it a lot more expensive.

    As a result, DEC would have had an architecture that would have given
    them superior performance, they would not have suffered from the
    infighting of VAX9000 vs. PRISM etc. (and not from the wrong decision
    to actually build the VAX9000), and might still be going strong to
    this day. They would have been able to extend RV32GC to RV64GC
    without problems, and produce superscalar and OoO implementations.

    The design point you target for the original VAX would have taken
    significantly longer to design, debug, and ship.

    OTOH, DEC had great success with the VAX for a while, and their demise
    may have been unavoidable given their market position: Their customers (especially the business customers of VAXen) went to them instead of
    IBM, because they wanted something less costly, and they continued
    onwards to PCs running Linux when they provided something less costly.
    So DEC would also have needed to outcompete Intel and the PC market to succeed (and IBM eventually got out of that market).

    Unclear.

    - anton

  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sat Mar 1 20:01:01 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:

    Like other USA-based computer architects, Bell ignores ARM, which
    outperformed the VAX without using caches and was much easier to
    design.

    Was ARM around when VAX was being designed (~1973) ??

    ARM was designed starting in 1983, if Wikipedia is to be believed.

    The only ones experimenting (successfully) with RISC at the time
    the VAX was designed were IBM with the 801, and they were kept
    from realizing their full potential by IBM's desire to not hurt
    their /370 business.

  • From EricP@21:1/5 to Lawrence D'Oliveiro on Sat Mar 1 14:40:55 2025
    Lawrence D'Oliveiro wrote:
    Found this paper <https://gordonbell.azurewebsites.net/Digital/Bell_Retrospective_PDP11_paper_c1998.htm>
    at Gordon Bell’s website. Talking about the VAX, which was designed as
    the ultimate “kitchen-sink” architecture, with every conceivable
    feature to make it easy for compilers (and humans) to generate code,
    he explains:

    The VAX was designed to run programs using the same amount of
    memory as they occupied in a PDP-11. The VAX-11/780 memory range
    was 256 Kbytes to 2 Mbytes. Thus, the pressure on the design was
    to have very efficient encoding of programs. Very efficient
    encoding of programs was achieved by having a large number of
    instructions, including those for decimal arithmetic, string
    handling, queue manipulation, and procedure calls. In essence, any
    frequent operation, such as the instruction address calculations,
    was put into the instruction-set. VAX became known as the
    ultimate, Complex (Complete) Instruction Set Computer. The Intel
    x86 architecture followed a similar evolution through various
    address sizes and architectural fads.

    The VAX project started roughly around the time the first RISC
    concepts were being researched. Could the VAX have been designed as a
    RISC architecture to begin with? Because not doing so meant that, just
    over a decade later, RISC architectures took over the “real computer” market and wiped the floor with DEC’s flagship architecture, performance-wise.

    The answer was no, the VAX could not have been done as a RISC
    architecture. RISC wasn’t actually price-performance competitive until
    the latter 1980s:

    RISC didn’t cross over CISC until 1985. This occurred with the
    availability of large SRAMs that could be used for caches. It
    should be noted at the time the VAX-11/780 was introduced, DRAMs
    were 4 Kbits and the 8 Kbyte cache used 1 Kbits SRAMs. Memory
    sizes continued to improve following Moore’s Law, but it wasn’t
    till 1985, that Reduced Instruction Set Computers could be built
    in a cost-effective fashion using SRAM caches. In essence RISC
    traded off cache memories built from SRAMs for the considerably
    faster, and less expensive Read Only Memories that held the more
    complex instructions of VAX (Bell, 1986).

    If you look at the VAX 8800 or NVAX uArch you see that even in 1990 it was still taking multiple clocks to serially decode each instruction and
    that basically stalls away any benefits a pipeline might have given.

If they had only put in *the things they actually use*
(as shown by DEC's own instruction usage stats from 1982),
    and left out all the things that they rarely or never use,
    it would have had 50 or so opcodes instead of 305,
    at most one operand that addressed memory on arithmetic and logic opcodes
    with 3 address modes (register, register address, register offset address) instead of 0 to 5 variable length operands with 13 address modes each
    (most combinations of which are either silly, redundant, or illegal).

Then they would have been able to parse instructions in one clock,
which makes pipelining a possible consideration,
and simplifies the uArch so now it can all fit on one chip,
which allows it to compete with RISC.

    The reason it was designed the way it was, was because DEC had
    microcode and microprogramming on the brain.
In this 1975 paper Bell and Strecker say it over and over and over.
They were looking at the CPU design as one large parsing machine
    and not as a set of parallel hardware tasks.

    This was their mental mindset just before they started the VAX design:

    What Have We Learned From PDP11, Bell Strecker, 1975 https://gordonbell.azurewebsites.net/Digital/Bell_Strecker_What_we%20_learned_fm_PDP-11c%207511.pdf

  • From John Levine@21:1/5 to All on Sat Mar 1 20:46:29 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    The answer was no, the VAX could not have been done as a RISC
architecture. RISC wasn’t actually price-performance competitive until the latter 1980s:

    RISC didn’t cross over CISC until 1985. This occurred with the
    availability of large SRAMs that could be used for caches.

Like other USA-based computer architects, Bell ignores ARM, which outperformed the VAX without using caches and was much easier to
    design.

    That's not a fair comparison. VAX design started in 1975 and shipped in 1978. The first ARM design started in 1983 with working silicon in 1985. It was a decade later.

On the other hand, I think some things were shortsighted even at the time. As Bell's paper said, they knew about Moore's law but didn't believe it. If they had believed it, they could have made the instructions a little less dense and a lot easier to decode and pipeline. STRETCH did pipelining in the 1950s, so they should have been aware of it and considered that future machines could use it.

As someone else noted, they had microcode on the brain and the VAX instruction set is clearly designed to be decoded by microcode one byte at a time. Address modes can have side-effects so you have to decode them serially or have a big honking hazard scheme. They probably also assumed that microcode ROM would
be faster than RAM, which even in 1975 was not particularly true. Rather than putting every possible instruction into microcode, have a fast subroutine call and make them subroutines which can be cached and pipelined.



    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Lawrence D'Oliveiro@21:1/5 to EricP on Sat Mar 1 23:19:24 2025
    On Sat, 01 Mar 2025 14:40:55 -0500, EricP wrote:

    If you look at the VAX 8800 or NVAX uArch you see that even in 1990 it
    was still taking multiple clocks to serially decode each instruction and
    that basically stalls away any benefits a pipeline might have given.

    How many clocks did Alpha take to process each instruction? Because I
    recall the initial chips had clock speeds several times that of the RISC competition, but performance, while competitive, was not several times
    greater.

  • From MitchAlsup1@21:1/5 to EricP on Sat Mar 1 22:30:32 2025
    On Sat, 1 Mar 2025 19:40:55 +0000, EricP wrote:

    Lawrence D'Oliveiro wrote:
    Found this paper
    <https://gordonbell.azurewebsites.net/Digital/Bell_Retrospective_PDP11_paper_c1998.htm>
    at Gordon Bell’s website. Talking about the VAX, which was designed as
    the ultimate “kitchen-sink” architecture, with every conceivable
    feature to make it easy for compilers (and humans) to generate code,
    he explains:

    The VAX was designed to run programs using the same amount of
    memory as they occupied in a PDP-11. The VAX-11/780 memory range
    was 256 Kbytes to 2 Mbytes. Thus, the pressure on the design was
    to have very efficient encoding of programs. Very efficient
    encoding of programs was achieved by having a large number of
    instructions, including those for decimal arithmetic, string
    handling, queue manipulation, and procedure calls. In essence, any
    frequent operation, such as the instruction address calculations,
    was put into the instruction-set. VAX became known as the
    ultimate, Complex (Complete) Instruction Set Computer. The Intel
    x86 architecture followed a similar evolution through various
    address sizes and architectural fads.

    The VAX project started roughly around the time the first RISC
    concepts were being researched. Could the VAX have been designed as a
    RISC architecture to begin with? Because not doing so meant that, just
    over a decade later, RISC architectures took over the “real computer”
    market and wiped the floor with DEC’s flagship architecture,
    performance-wise.

    The answer was no, the VAX could not have been done as a RISC
    architecture. RISC wasn’t actually price-performance competitive until
    the latter 1980s:

    RISC didn’t cross over CISC until 1985. This occurred with the
    availability of large SRAMs that could be used for caches. It
    should be noted at the time the VAX-11/780 was introduced, DRAMs
    were 4 Kbits and the 8 Kbyte cache used 1 Kbits SRAMs. Memory
    sizes continued to improve following Moore’s Law, but it wasn’t
    till 1985, that Reduced Instruction Set Computers could be built
    in a cost-effective fashion using SRAM caches. In essence RISC
    traded off cache memories built from SRAMs for the considerably
    faster, and less expensive Read Only Memories that held the more
    complex instructions of VAX (Bell, 1986).

    If you look at the VAX 8800 or NVAX uArch you see that even in 1990 it
    was
    still taking multiple clocks to serially decode each instruction and
    that basically stalls away any benefits a pipeline might have given.

If they had only put in *the things they actually use*
(as shown by DEC's own instruction usage stats from 1982),
and left out all the things that they rarely or never use,
it would have had 50 or so opcodes instead of 305,
at most one operand that addressed memory on arithmetic and logic opcodes
with 3 address modes (register, register address, register offset address)
instead of 0 to 5 variable length operands with 13 address modes each
(most combinations of which are either silly, redundant, or illegal).

Except for the 1 memory operand per instruction, the above paragraph
accurately describes My 66000 ISA.

Then they would have been able to parse instructions in one clock,
which makes pipelining a possible consideration,
and simplifies the uArch so now it can all fit on one chip,
which allows it to compete with RISC.

    If VAX had stuck with PDP-11 address modes and simply added the
    {Byte, Half, Word, Double} accesses it would have been a lot easier
    to pipeline.

    The reason it was designed the way it was, was because DEC had
    microcode and microprogramming on the brain.

    As did most of academia at the time.

In this 1975 paper Bell and Strecker say it over and over and over.
They were looking at the CPU design as one large parsing machine
    and not as a set of parallel hardware tasks.

    Orthogonality, Regularity, Expressibility, ...

    This was their mental mindset just before they started the VAX design:

    What Have We Learned From PDP11, Bell Strecker, 1975 https://gordonbell.azurewebsites.net/Digital/Bell_Strecker_What_we%20_learned_fm_PDP-11c%207511.pdf

  • From Anton Ertl@21:1/5 to John Levine on Sat Mar 1 22:25:26 2025
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
The answer was no, the VAX could not have been done as a RISC architecture. RISC wasn’t actually price-performance competitive until the latter 1980s:

    RISC didn’t cross over CISC until 1985. This occurred with the
    availability of large SRAMs that could be used for caches.

Like other USA-based computer architects, Bell ignores ARM, which outperformed the VAX without using caches and was much easier to
    design.

That's not a fair comparison. VAX design started in 1975 and shipped in 1978. The first ARM design started in 1983 with working silicon in 1985. It was a decade later.

    The point is that ARM outperformed VAX without using caches. DRAM
    with 800ns cycle time was available in 1971 (the Nova 800 used it).
    By 1977, when the VAX 11/780 was released, certainly faster DRAM was
    available.

    So I think that, for a VAX-11/780-priced machine, they could have had
    a pipelined RISC that reads instructions from two 32-bit-wide DRAM
banks alternatingly, resulting in maybe 3-4 32-bit words of
instructions delivered per microsecond for straight-line code without
loads or stores. And in RV32GC many instructions take only 16 bits,
so these 3-4 words contain maybe 5-6 instructions. So that might be 5-6 peak
    MIPS, maybe 3 average MIPS, compared to 0.5 VAX MIPS. Some VAX
    instructions have to be replaced with several RISC instructions, so
    let's say these 3 RISC MIPS correspond to 2 VAX MIPS. That would
    still be faster than the VAX 11/780, which reportedly had about 0.5
    MIPS.
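
In C, as a back-of-envelope check (every input below is one of the
assumptions above, not a measurement):

  #include <stdio.h>

  /* Rough version of the estimate above: instruction words fetched from
     two interleaved DRAM banks, RV32GC averaging ~1.5 instructions per
     32-bit word, and 3 RISC instructions doing the work of 2 VAX ones. */
  int main(void)
  {
      double words_per_us   = 3.5;  /* 32-bit words/us, middle of 3-4 */
      double insns_per_word = 1.5;  /* 16/32-bit RV32GC encodings */
      double peak_mips = words_per_us * insns_per_word;  /* ~5.25 */
      double avg_mips  = 3.0;       /* derated for loads/stores, branches */
      double vax_mips  = avg_mips * 2.0 / 3.0;  /* VAX-equivalent MIPS */
      printf("peak %.1f, avg %.1f, ~%.1f VAX MIPS (real VAX-11/780: ~0.5)\n",
             peak_mips, avg_mips, vax_mips);
      return 0;
  }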

    The other thing is that the VAX 11/780 (released 1977) had a 2KB
    cache, so Bell's argument that caches were only available around 1985
    does not hold water on that end, either. So my 1977 RISC project
    would have used that cache, too, increasing the performance of the
    result even more.

    Yes, commercial RISCs only happened in 1986 or so, but there is no
    technical reason for that, only that commercial architects did not
    believe in such things at the time. It took research projects from
    several sources until the concept had enough credibility to be taken
    seriously. That's why I asked for the magic wand for my time-travel
    project.

    It's interesting that this lack of credibility apparently includes
    IBM, whose research lab pioneered the concept. They produced the IBM
    801 with 15MHz clock, probably around the time of the first VAX, but
    the IBM 801 had no MMU; not sure what RAM technology they used.

    IBM tried to commercialize it in the ROMP in the IBM RT PC; Wikipedia
    says: "The architectural work on the ROMP began in late spring of
    1977, as a spin-off of IBM Research's 801 RISC processor ... The first
    examples became available in 1981, and it was first used commercially
    in the IBM RT PC announced in January 1986. ... The delay between the completion of the ROMP design, and introduction of the RT PC was
    caused by overly ambitious software plans for the RT PC and its
    operating system (OS)." And IBM then designed a new RISC, the
    RS/6000, which was released in 1990.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Mar 2 00:16:06 2025
    On Sat, 01 Mar 2025 11:58:17 GMT, Anton Ertl wrote:

    Like other USA-based computer architects, Bell ignores ARM, which outperformed the VAX without using caches and was much easier to design.

    While those ARM chips were legendary for their low power consumption (and
    low transistor count), those Archimedes machines were not exactly low-
    cost, as I recall.

    Without caches, did they have to use faster (and therefore more expensive) memory? Or did they fall back on the classic “wait states”?

  • From MitchAlsup1@21:1/5 to BGB on Sun Mar 2 01:02:04 2025
    On Sat, 1 Mar 2025 22:29:27 +0000, BGB wrote:

    On 3/1/2025 5:58 AM, Anton Ertl wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    ------------------------------
    Would likely need some new internal operators to deal with bit-array operations and similar, with bit-ranges allowed as a pseudo-value type
    (may exist in constant expressions but will not necessarily exist as an actual value type at runtime).
    Say:
    val[63:32]
    Has the (63:32) as a BitRange type, which then has special semantics
    when used as an array index on an integer type, ...

    Mc 88K and My 66000 both have bit-vector operations.

    The previous idea for bitfield extract/insert had turned into a
    composite BITMOV instruction that could potentially do both operations
    in a single instruction (along with moving a bitfield directly between
two registers).

    Using CARRY and extract + insert, one can extract a field spanning
    a doubleword and then insert it into another pair of doublewords.
    1 pseudo-instruction, 2 actual instructions.

    Idea here is that it may do, essentially a combination of a shift and a masked bit-select, say:
    Low 8 bits of immediate encode a shift in the usual format:
    Signed 8-bit shift amount, negative is right shift.
    High bits give a pair of bit-offsets used to compose a bit-mask.
    These will MUX between the shifted value and another input value.

    You want the offset (a 6-bit number) and the size (another 6-bit number)
    in order to identify the field in question.

I am still not sure whether this would make sense in hardware, but it is
not entirely implausible to implement in the Verilog.

    In the extract case, you have the shifter before the masker
    In the insert case, you have the masker before the shifter
    followed by a merge (OR). Both maskers use the size. Offset
    goes only to the shifter.

    Would likely be a 2 or 3 cycle operation, say:
    EX1: Do a Shift and Mask Generation;
    May reuse the normal SHAD unit for the shift;
    Mask-Gen will be specialized logic;
    EX2:
    Do the MUX.
    EX3:
    Present MUX result as output (passed over from EX2).

    I have done these in 1 cycle ...
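
A minimal C sketch of the two dataflows described above (functional
behavior only, not the hardware staging; 64-bit doublewords and the
6-bit offset/size fields are assumed):

  #include <stdint.h>

  /* Extract: shifter before masker; both get offset/size as described. */
  uint64_t bf_extract(uint64_t x, unsigned offset, unsigned size)
  {
      uint64_t mask = (size >= 64) ? ~0ULL : (1ULL << size) - 1;
      return (x >> offset) & mask;
  }

  /* Insert: mask the field, shift it into place, then merge (OR). */
  uint64_t bf_insert(uint64_t x, uint64_t field, unsigned offset,
                     unsigned size)
  {
      uint64_t mask = (size >= 64) ? ~0ULL : (1ULL << size) - 1;
      return (x & ~(mask << offset)) | ((field & mask) << offset);
  }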

  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Mar 2 02:40:45 2025
    On Sat, 01 Mar 2025 22:25:26 GMT, Anton Ertl wrote:

    The other thing is that the VAX 11/780 (released 1977) had a 2KB cache,
    so Bell's argument that caches were only available around 1985 does not
    hold water on that end, either.

    It was about the sizes of the caches, and hence their contribution to the
    cost.

  • From Lynn Wheeler@21:1/5 to Anton Ertl on Sat Mar 1 18:29:50 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    IBM tried to commercialize it in the ROMP in the IBM RT PC; Wikipedia
    says: "The architectural work on the ROMP began in late spring of
    1977, as a spin-off of IBM Research's 801 RISC processor ... The first examples became available in 1981, and it was first used commercially
    in the IBM RT PC announced in January 1986. ... The delay between the completion of the ROMP design, and introduction of the RT PC was
    caused by overly ambitious software plans for the RT PC and its
    operating system (OS)." And IBM then designed a new RISC, the
    RS/6000, which was released in 1990.

    ROMP originally for DISPLAYWRITER follow-on ... running CP.r operating
    system and PL.8 programming language. ROMP was minimal 801, didn't have supervisor/problem mode ... at the time their claim was PL.8 would only generate correct code and CP.r would only load/execute correct programs.
    They claimed 40bit addressing ... 32 bit addresses ... but top four bits selected 16 "segment registers" that contained 12bit
    segment-identifiers. ... aka 28bit segment displacement and 12bit
    segment-ids (40bits) .... and any inline code could change segment
    register value ... as easily as could load any general register.
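
A small C sketch of that translation as described (the exact register
format here is an assumption):

  #include <stdint.h>

  /* 32-bit address: top 4 bits pick one of 16 segment registers, each
     holding a 12-bit segment-id; low 28 bits are the displacement.
     12 + 28 = 40 bits of "virtual" address. */
  uint64_t romp_vaddr(uint32_t addr, const uint16_t segreg[16])
  {
      uint32_t segid = segreg[addr >> 28] & 0xFFFu;  /* 12-bit segment-id */
      uint32_t disp  = addr & 0x0FFFFFFFu;           /* 28-bit displacement */
      return ((uint64_t)segid << 28) | disp;
  }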

    When follow-on to DISPLAYWRITER was canceled, they pivoted to UNIX
    workstation market and got the company that had done AT&T unix port to
    IBM/PC for PC/IX ... to do AIX. Now ROMP needed supervisor/problem mode
    and inline code could no longer change segment register values
    ... needed to have supervisor call.

    Folklore is they also had 200 PL.8 programmers and needed something for
them to do, so they gen'ed an abstract virtual machine system ("VRM") (implemented in PL.8) and had the AIX port be done to the abstract virtual
    machine definition (instead of real hardware) .... claiming that the
    combined effort would be less (total effort) than having the outside
    company do the AIX port to the real hardware (also putting in a lot of
    IBM SNA communication support).

The IBM Palo Alto group had been working on a UCB BSD port to 370, but was redirected to do it instead to bare ROMP hardware ... doing it with
enormously less resources than the VRM+AIX+SNA effort.

    Move to RS/6000 & RIOS (large multi-chip) doubled the 12bit segment-id
    to 24bit segment-id (and some left-over description talked about it
being 52bit addressing) and eliminated the VRM ... and added in some
    amount of BSDisms.

AWD had done their own cards for the PC/RT (16bit AT) bus, including a
4mbit token-ring card. Then for the RS/6000 microchannel, AWD was told
they couldn't do their own cards, but had to use PS2 microchannel cards.
The communication group was fiercely fighting off client/server and
distributed computing and had seriously performance knee-capped the PS2
cards, including the ($800) 16mbit token-ring card (a PS2 microchannel
card which had lower throughput than the PC/RT 4mbit TR card). There
was a joke that a PC/RT 4mbit TR server had higher throughput than an
RS/6000 16mbit TR server. There was also a joke that the RS6000/730 with
VMEbus was a workaround for corporate politics, making it possible to
install high-performance workstation cards.

    We got the HA/6000 project in 1988 (approved by Nick Donofrio),
    originally for NYTimes to move their newspaper system off VAXCluster to RS/6000. I rename it HA/CMP. https://en.wikipedia.org/wiki/IBM_High_Availability_Cluster_Multiprocessing when I start doing technical/scientific cluster scale-up with national
    labs (LLNL, LANL, NCAR, etc) and commercial cluster scale-up with RDBMS
    vendors (Oracle, Sybase, Ingres, Informix that had vaxcluster support in
    same source base with unix). The S/88 product administrator then starts
    taking us around to their customers and also has me do a section for the corporate continuous availability strategy document ... it gets pulled
    when both Rochester/AS400 and POK/(high-end mainframe) complain they
    couldn't meet the requirements.

    Early Jan1992 have a meeting with Oracle CEO and IBM/AWD Hester tells
    Ellison we would have 16-system clusters by mid92 and 128-system
    clusters by ye92. Then late Jan92, cluster scale-up is transferred for
    announce as IBM Supercomputer (for technical/scientific *ONLY*) and we
    are told we can't work on anything with more than four processors (we
leave IBM a few months later). Contributing was that the mainframe DB2 DBMS
group was complaining that, if we were allowed to continue, it would be at
least five years ahead of them.

Neither ROMP nor RIOS supported bus/cache consistency for multiprocessor
operation. The executive we reported to went over to head up ("AIM" -
Apple, IBM, Motorola) Somerset for single chip 801/risc ... but also
adopts the Motorola 88k bus enabling multiprocessor configurations. He later
leaves Somerset for president of (SGI owned) MIPS.

    trivia: I also had HSDT project (started in early 80s), T1 and faster
    computer links, both terrestrial and satellite ... which included custom designed TDMA satellite system done on the other side of the pacific
    ... and put in 3-node system. two 4.5M dishes, one in San Jose and one
    in Yorktown Research (hdqtrs, east coast) and a 7M dish in Austin (where
    much of the RIOS design was going on). San Jose also got an EVE, a
    superfast hardware VLSI logic simulator (scores of times faster than
existing simulation) ... and it was claimed that Austin being able to use
the EVE in San Jose helped bring RIOS in a year early.

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From Anton Ertl@21:1/5 to BGB on Sun Mar 2 11:46:23 2025
    BGB <cr88192@gmail.com> writes:
    It almost seems like they could have tried making a PDP-11 based PC.

    I dimly remember that there were efforts in that direction. But the
    PDP-11 does not even have the cumbersome support for more than 64KB
    that the 8086 has (there were PDP-11s with more, but that was even
    more cumbersome to use).

    DEC also tried their hand in the PC-like business (DEC Rainbow 100).
    They did not succeed. Maybe that's the decisive difference from HP:
HP did succeed in the PC market.

    DEC could have maybe had a marketing advantage in, say, "Hey, our crap
    can run UNIX" and "UNIX is better than DOS".

    That was not what customers were interested in. There were various
    Unix variants available for the PC, but the customers preferred using
    DOS, which was preinstalled and did not cost extra. And even when you
    could install Unix for free in the form of Linux, most customers just
used Windows, which was preinstalled and (probably decisively) had the
network effects on its side.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sun Mar 2 12:03:58 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    How many clocks did Alpha take to process each instruction?

    For the 21064 see slide 15 of <https://people.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture19.pdf>

I.e., about 1 CPI for Ear, and about 4.3 CPI for TPC-B, with other
    benchmarks in between.

    Theoretical bottom CPI (peak performance) of the 21064 is 0.5.
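
To connect CPI to the clock-speed question: native MIPS = clock (MHz) /
CPI. A small C sketch (the 150 MHz clock of the first 21064 systems is
an assumption here; the CPI figures are the ones cited above):

  #include <stdio.h>

  /* MIPS = clock_mhz / CPI for the 21064 CPI figures cited above. */
  int main(void)
  {
      const double clock_mhz = 150.0;              /* assumed first 21064 part */
      const double cpi[]     = { 0.5, 1.0, 4.3 };  /* peak, Ear, TPC-B */
      for (int i = 0; i < 3; i++)
          printf("CPI %.1f -> %.0f MIPS\n", cpi[i], clock_mhz / cpi[i]);
      return 0;  /* 300, 150, ~35: a wide spread at one clock rate */
  }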

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Sun Mar 2 09:34:37 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:
    As for code size, we see significantly smaller code for RISC
    instruction sets with 16/32-bit encodings such as ARM T32/A32 and
    RV64GC than for all CISCs, including AMD64, i386, and S390x
    <2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
    so much better in this respect that its code is significantly smaller
    than for these CPUs.

VAX's advantage was that it executed fewer instructions (VAX only executed
65% of the number of instructions the R2000 executed).

    This agrees with my estimate that a CPU with 3 RV32GC MIPS would have
    the same performance as a CPU with 2 VAX MIPS.

    Bottom line: If you sent, e.g., me and the needed documents back in
    time to the start of the VAX project, and gave me a magic wand that
    would convince the DEC management and workforce

    You would also have to convince the Computer Science department at
    CMU; Where a lot of VAX ideas were dreamed up based on the success
    of the PDP-11.

    Yes, include that in my magic wand.

    A pipelined machine in 1978 would have had 50% to 100% more circuit
    boards than VAX 11/780, making it a lot more expensive.

    What makes you think that a pipelined single-issue RV32GC would take
    more circuit boards than VAX11/780? I have no data about discrete implementations, but if we look at integrated ones and assume that the
    number of transistors or the area corresponds to the number of circuit
    boards in discrete implementations, the evidence goes in the opposite direction:

Transistors  Area(mm^2)  Process  CPU
    125,000       74.82    3um    MicroVAX 78032 (integer-only, some instructions missing)
     68,000       44       3.5um  68000 (integer-only, no MMU)
     45,000       58.52    2um    ROMP (integer-only, no MMU, three pipeline stages)
     25,000       50       3um    ARM1 (integer-only, no MMU, pipelined)
    110,000       ?        1.2um  SPARC MB86900 (integer-only, pipelined)
    110,000       80       2um    MIPS R2000 (integer-only, pipelined)

    It seems that the MMU cost a lot of transistors, while the pipelining
    did not, as especially the ARM1 shows.

The design point you target for the original VAX would have taken significantly longer to design, debug, and ship.

    What makes you think so? A major selling point of RISC especially
    compared to the VAX was that the reduced instruction-set complexity
    reduces the implementation effort. And the fact that the students of
    Berkeley and Stanford could produce their prototypes in a short time
    lends credibility to the claim. We also have some timelines we can
    compare:

    <https://en.wikipedia.org/wiki/PA-RISC> says:
    |In early 1982, work on the Precision Architecture began at HP
    |Laboratories, defining the instruction set and virtual memory
    |system. Development of the first TTL implementation started in April
    |1983. With simulation of the processor having completed in 1983, a
    |final processor design was delivered to software developers in July
    |1984. Systems prototyping followed, with "lab prototypes" being
    |produced in 1985 and product prototypes in 1986.[7]
    |
    |The first processors were introduced in products during 1986.

    <https://www.hpmuseum.net/display_item.php?hw=836> writes:
    |The 3000/930 and 3000/950 were both announced in March of 1986 but did
    |not ship until the second half of 1987.

    So here we actually have a discrete implementation of a RISC.

    You write that VAX work began in 1973; it was introduced in 1977 (but
when were machines shipped to customers?), which would mean that
    development also took 4 years. According to <https://en.wikipedia.org/wiki/VAX-11>, development began in 1976, but
    that is hard to believe, especially given the CISC-based problems such
    as having to keep many pages in physical memory at the same time.

    <https://people.computing.clemson.edu/~mark/330/eagle.html#timeline>
    says about the Data General Eclipse MV/8000:
    |spring 1978 - Eagle project starts
    |summer 1978 - recruiting of Hardy Boys and Microkids
|April 1979 - projected Eagle completion date (missed)
|June 1979 - West presents Eagle at Product Board Meeting
|mid-1979 - Eagle supporter and VP of Engineering Carl Carmen leaves DG
|October 1979 - Gallifrey Eagle moved to software department
|November 1979 - Tom West transferred
|2H79 and early 1980 - difficulties with PAL supplier;
| hardware debugging and software development continue
|April 1980 - public announcement of MV/8000
    |fall 1980 - Eclipse group reorganized

    It's not clear when the MV/8000 was delivered to the customers.

    The same timeline also contains Data General's Plan A (Eagle was Plan
    B), the more ambitious FHP which makes the writable control store
    available to third-party software, i.e., it's in a way a further step
    in the thinking that led to CISC:

    |July 1975 - FHP project starts
    |September 1977 - EGO vs. FHP meeting; FHP version promised in a year
    |early 1979 - news that FHP will miss deadline "by a huge margin"
    |November 1981 - FHP demo

    BTW, the existence of writable control stores before the release of
    the VAX is further counterevidence to the claim that at the time of
    the VAX, microcode ROM was so much faster and that fast SRAM was not
    an option: The DEC PDP-10 KL10 (1975) has 80*1280bits (12.5KB) or
    80*2K bits (20KB) of WCS, depending on the model, and they were the
    machines that the VAX-11/780 was going to replace. So not only did
    RISCs not need cache to perform better than a VAX-11/780, for a
    machine in the price class of the VAX-11/780 you could also have cache
    to gain even more speed; the 2KB of the VAX-11/780 itself would help,
    but bigger caches were possible and could help more.

    <https://en.wikipedia.org/wiki/I386> says:
    |Development of i386 technology began in 1982 under the internal name
|of P3.[4] The tape-out of the 80386 development was finalized in July
|1985.[4] The 80386 was introduced as pre-production samples for
    |software development workstations in October 1985.[5] Manufacturing
    |of the chips in significant quantities commenced in June 1986 ... The
    |first personal computer to make use of the 80386 was the Deskpro 386.

    The Compaq DeskPro 386 was released on September 9, 1986. But the
    i386 was a single chip, not a discrete implementation, which may have
    had an influence on development time. IBM took until July 1987 to
    introduce their first computer with an i386.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sun Mar 2 12:10:42 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Sat, 01 Mar 2025 11:58:17 GMT, Anton Ertl wrote:

    Like other USA-based computer architects, Bell ignores ARM, which
    outperformed the VAX without using caches and was much easier to design.

    While those ARM chips were legendary for their low power consumption (and
    low transistor count), those Archimedes machines were not exactly low-
    cost, as I recall.

    Compared to what? Compared to an 8-bit or 16-bit home computer: no.
    Compared to PCs at the time: yes. Compared to a contemporary VAX: by
    a lot.

   USD   GBP  Year   Model
   149        1985   C64 (64KB RAM)
   699   499  1987   Amiga 500 (512KB RAM)
         799  1987   ARM Archimedes 305 (512KB RAM)
        3500  1989   ARM R140 (4MB RAM, 60MB HDD, RISC iX, no ethernet)
        1995  1990   ARM R225 (8MB RAM, RISC iX)
        3995  1990   ARM R260 (8MB RAM, 100MB HDD, RISC iX)

  6499        1986   Compaq DeskPro 386 Model 40 (1MB RAM, 40MB HDD)
  3769        1987   Macintosh II (1MB RAM, no HDD for this price)
  5000        1987?  MicroVAX II KA620 (1MB RAM)
 81000        1989   MicroVAX 3800
120200        1989   MicroVAX 3900

    The MicroVAX II KA620 seems to be a special model ("a single-board
    MicroVAX II designed for automatic test equipment and manufacturing applications which only ran DEC's real-time VAXELN operating system" <https://alchetron.com/MicroVAX>; all VAX prices are from that site).

    The question of performance came up: <https://en.wikipedia.org/wiki/File:Archimedes_performance.svg> shows
    that the Acorn Archimedes using ARM2 without cache ran Dhrystone at
    2.8 times the speed of a VAX-11/780, and the ARM3 with 4KB cache ran
Dhrystone at 10.5-15.1 times the speed of the VAX-11/780. The graph
    also shows the competition in the home computer and PC space, and
    gives sources for the numbers.

Without caches, did they have to use faster (and therefore more expensive) memory?

    The ARM Archimedes used a 32-bit memory interface which was apparently
    a lot more expensive than the 8-bit memory interface of the C64 and
    the 16-bit memory interface of the Amiga 500. Other than that, I
    don't think there was much in the way of DRAM speed grades at the
    time. Marketing high-speed DRAM to the gullible seems to be a rather
    recent development.

    Or did they fall back on the classic “wait states”?

Wait states make no sense if the CPU has no chance to do other work
    during the wait states (as for the ARM2), so I doubt that. I expect
    that what they did was to just use the clock that the RAM supported.
    They used an 8MHz clock.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Sun Mar 2 13:19:32 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Sat, 01 Mar 2025 22:25:26 GMT, Anton Ertl wrote:

    The other thing is that the VAX 11/780 (released 1977) had a 2KB cache,
    so Bell's argument that caches were only available around 1985 does not
    hold water on that end, either.

    It was about the sizes of the caches

    Sure, more cache is better than less cache, all other things being
    equal, but as the ARM2 without cache demonstrates, a RISC running out
    of RAM can outperform the VAX-11/780 with 2KB (or is it 8KB? see
    below) cache. And as the ARM3 with 4KB cache demonstrates, that
    performance advantage increases by a lot with even a 4KB cache.

    and hence their contribution to the
    cost.

    Interestingly, <http://bitsavers.informatik.uni-stuttgart.de/pdf/datapro/datapro_reports_70s-90s/DEC/M11-384-40_8402_DEC_VAX-11.pdf>
    reports VAX-11/780 cache as having 8KB (4KB for the /750). It also
    gives a delivery date of Jan. 1978 for the VAX-11/780. So that cost
    was obviously acceptable for the VAX-11/780, and, as the ARM3
    performance results show, is not too small to be beneficial to
    performance.

    MIPS used 64KB caches for the R2000? Because they could, in 1986.
    Motorola used 16KB caches for the 88000? Obviously 64KB is not all
    that necessary. Acorn used a 4KB shared cache for ARM3? Because it
    allowed them to do it on a single chip; it still gives good benefits.

    My impression is that Bell was just grasping at straws to justify
    their wrong choices. He looked at other differences (rather than the instruction set) between the MIPS R2000 and the VAX, and if it
    represented something that was not available at acceptable cost in
    1977 (in particular, 64KB caches), he used it as justification for the
    VAX.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to BGB on Sun Mar 2 16:33:31 2025
    On Sat, 1 Mar 2025 17:45:58 -0600
    BGB <cr88192@gmail.com> wrote:


    Not sure about what instruction scheduling was like on the Alpha,

    DEC shipped 4 generations of Alpha CPUs with 3 quite different core microarchitectures.

  • From Robert Swindells@21:1/5 to Anton Ertl on Sun Mar 2 15:42:53 2025
    On Sun, 02 Mar 2025 09:34:37 GMT, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    A pipelined machine in 1978 would have had 50% to 100% more circuit
    boards than VAX 11/780, making it a lot more expensive.

    What makes you think that a pipelined single-issue RV32GC would take
    more circuit boards than VAX11/780? I have no data about discrete implementations, but if we look at integrated ones and assume that the
    number of transistors or the area corresponds to the number of circuit
    boards in discrete implementations, the evidence goes in the opposite direction:

    You could look at the MIT Lisp Machine, it used basically the same chips
    as a VAX 11/780 but was a pipelined load/store architecture internally.

  • From Lynn Wheeler@21:1/5 to Robert Swindells on Sun Mar 2 09:03:53 2025
    Robert Swindells <rjs@fdy2.co.uk> writes:
    You could look at the MIT Lisp Machine, it used basically the same chips
    as a VAX 11/780 but was a pipelined load/store architecture internally.

    from long ago and far away:

    Date: 79/07/11 11:00:03
    To: wheeler

    i heard a funny story: seems the MIT LISP machine people proposed that
    IBM furnish them with an 801 to be the engine for their prototype.
    B.O. Evans considered their request, and turned them down.. offered them
    an 8100 instead! (I hope they told him properly what they thought of
    that)

    ... snip ...

    ... trivia: Evans had asked my wife to review/audit 8100 (had really
    slow, anemic processor) and shortly later it was canceled
    ("decomitted").

    --
    virtualization experience starting Jan1968, online at home since Mar1970

  • From EricP@21:1/5 to Anton Ertl on Sun Mar 2 13:51:36 2025
    Anton Ertl wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    A pipelined machine in 1978 would have had 50% to 100% more circuit
    boards than VAX 11/780, making it a lot more expensive.

    What makes you think that a pipelined single-issue RV32GC would take
    more circuit boards than VAX11/780? I have no data about discrete implementations, but if we look at integrated ones and assume that the
    number of transistors or the area corresponds to the number of circuit
    boards in discrete implementations, the evidence goes in the opposite direction:

    The first article in this Mar-1987 HP Journal is about the
    HP 9000 MODEL 840 and HP 3000 Series 930 implementing the HP PA ISA.
The CPU is 5 boards, 6 with FPU, built with standard and FAST TTL. Implementation started in Apr-1983, prototype ready early 1984.

    "[3 stage] pipeline fetches and executes an instruction every 125 ns,
    a 4096-entry translation lookaside buffer (TLB) for high-speed address translation, and 128K bytes of cache memory."

    "The measured MIPS rate for the Model 840 varies from
    about 3.5 to 8 MIPS with an average of 4.5 to 5."

    which at 125 ns, 8 MHz clock is an IPC of 0.43 to 1.0, avg 0.625.
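
The same arithmetic as a quick check (IPC = measured MIPS / clock MHz):

  #include <stdio.h>

  /* IPC for the Model 840 figures quoted above, at an 8 MHz clock. */
  int main(void)
  {
      const double clock_mhz = 8.0;  /* one pipeline beat per 125 ns */
      const double mips[]    = { 3.5, 4.5, 5.0, 8.0 };
      for (int i = 0; i < 4; i++)
          printf("%.1f MIPS -> IPC %.2f\n", mips[i], mips[i] / clock_mhz);
      return 0;  /* 0.44, 0.56, 0.63, 1.00 */
  }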

    https://archive.org/download/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard.pdf

  • From Anton Ertl@21:1/5 to Robert Swindells on Sun Mar 2 18:30:24 2025
    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Sun, 02 Mar 2025 09:34:37 GMT, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

A pipelined machine in 1978 would have had 50% to 100% more circuit boards than VAX 11/780, making it a lot more expensive.
    ...
    You could look at the MIT Lisp Machine, it used basically the same chips
    as a VAX 11/780 but was a pipelined load/store architecture internally.

    And what was the effect on the number of circuit boards? What effect
    did the load/store architecture have, and what effect did the
    pipelining have?

    It's been a number of years since I read about Lisp Machines and
    Symbolics. My impression was that they were both based on CISCy
    ideas; it's about closing the semantic gap, no? Load/store would
    surprise me.

    And when the RISC revolution came, they could not compete. The RISCy
    way to Lisp implementation was explored in SPUR (and Smalltalk in
    SOAR) (one of which counts as RISC-III and the other as RISC-IV, I
    don't remember which), and commercialized in SPARC's instructions with
    support for tags (not used in the Lisp system that a former comp.arch
    regular contributed to).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Levine@21:1/5 to All on Sun Mar 2 20:02:03 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
That's not a fair comparison. VAX design started in 1975 and shipped in 1978. The first ARM design started in 1983 with working silicon in 1985. It was a decade later.

    The point is that ARM outperformed VAX without using caches. DRAM
    with 800ns cycle time was available in 1971 (the Nova 800 used it).
By 1977, when the VAX 11/780 was released, certainly faster DRAM was available.

Oh, OK. How was the code density? I know ARM was pretty good but VAX
was fantastic since they sacrificed everything else to compact instructions.
The pages were only 512B; they really thought memory was expensive even
though the trend lines were clear.

    IBM tried to commercialize it in the ROMP in the IBM RT PC; ...
    ... The delay between the
    completion of the ROMP design, and introduction of the RT PC was
    caused by overly ambitious software plans for the RT PC and its
    operating system (OS)."

    I was there, designing AIX. IBM couldn't decide what they wanted, and
    they didn't understand Unix, but they wanted it yesterday, so they had
    an elaborate and slow "virtual resource manager" with the operating
    systems running on top. It turned out that the only operating system
    was AIX, with the VRM just extra overhead. We wasted a lot of time
    explaining why we weren't going to do random IBM stuff of which the
    most memorable was user labels in the inodes (well, OS DASD has them.)

    It did not help that I naively believed their initial schedule so we
    based AIX on 4.1BSD rather than the recently released 4.2 and it
    didn't have dynamic shared libraries and other 4.2 stuff. Someone else
    did a 4.2 port that ran way faster than AIX did.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to BGB on Sun Mar 2 13:16:52 2025
    On 3/2/2025 12:39 PM, BGB wrote:
    On 3/2/2025 5:46 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    It almost seems like they could have tried making a PDP-11 based PC.

    I dimly remember that there were efforts in that direction.  But the
    PDP-11 does not even have the cumbersome support for more than 64KB
    that the 8086 has (there were PDP-11s with more, but that was even
    more cumbersome to use).


    I had thought it apparently used a model similar to the 65C816.

    Namely, that you could address 64K code + 64K data at a time, but then
    load a value into a special register to access different RAM banks.

    Granted, no first hand experience with PDP-11.


    DEC also tried their hand in the PC-like business (DEC Rainbow 100).
    They did not succeed.  Maybe that's the decisive difference from HP:
    They did succeed in the PC market.


    I guess they could have also tried competing against the Commodore 64
    and Apple II, which were also popular around that era.

    No idea how their pricing compared with the IBM PC's, but in any case,
    those who had success were generally a lot cheaper.


    Well, except for the Macintosh apparently, which managed to survive with
    its comparatively higher costs.

    Yes, but . . . Its earlier, more expensive incarnation, the Lisa, did
    not survive, which shows there is a limit to how much more people are
    willing to pay. And the Macintosh was initially successful as a sort of
    niche machine for "creative types", as opposed to "business users" who
    used PCs.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Mar 2 21:26:34 2025
    On Sun, 2 Mar 2025 9:34:37 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:
    As for code size, we see significantly smaller code for RISC
    instruction sets with 16/32-bit encodings such as ARM T32/A32 and
    RV64GC than for all CISCs, including AMD64, i386, and S390x
    <2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
    so much better in this respect that its code is significantly smaller
    than for these CPUs.

    VAX's advantage was that it executed fewer instructions (VAX only
    executed 65% of the number of instructions the R2000 executed.)

    This agrees with my estimate that a CPU with 3 RV32GC MIPS would have
    the same performance as a CPU with 2 VAX MIPS.
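
    A back-of-the-envelope check (a sketch; awk used just for the
    arithmetic): if VAX executes 65% as many instructions as the R2000 for
    the same work, equal performance needs a RISC/VAX MIPS ratio of 1/0.65:

    awk 'BEGIN { printf "%.2f\n", 1/0.65 }'    # prints 1.54

    which is close to the 3/2 = 1.50 ratio of the estimate above.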

    Bottom line: If you sent, e.g., me and the needed documents back in
    time to the start of the VAX project, and gave me a magic wand that
    would convince the DEC management and workforce

    You would also have to convince the Computer Science department at
    CMU, where a lot of VAX ideas were dreamed up based on the success
    of the PDP-11.

    Yes, include that in my magic wand.

    A pipelined machine in 1978 would have had 50% to 100% more circuit
    boards than VAX 11/780, making it a lot more expensive.

    What makes you think that a pipelined single-issue RV32GC would take
    more circuit boards than VAX11/780?

    a) 68000, 010, 020 were latch based implementations using cross coupled
    sense amps as the flop-part of the latch.
    b) 88110 was a non-overlapping clock dual latch design.

    ~2/3rds of 68000 transistor count was in the 2 levels of ROM (it was
    both microcoded and nano-coded.)

    So, comparing the area of the 88100 integer unit to the 68020 integer
    unit, and making certain adjustments, the pipelined integer unit took
    a lot more area in the latching of operands and results.

    Where the 68020 would read an operand, run it through the integer
    calculation, and write it back to the still-asserted register select
    line, with only staging (delay) latches in the loop. Now, that loop
    took 2 cycles, but on an R2000 it took 4 (Decode, EX, Cache, Writeback).

    There are a lot more flip-flops in the pipelined path than in the
    "use latches to create optimal delay" unpipelined path.

    I have no data about discrete implementations, but if we look at
    integrated ones and assume that the number of transistors or the area
    corresponds to the number of circuit boards in discrete
    implementations, the evidence goes in the opposite direction:

    Transistors  Area   Process  CPU
    125,000      74.82  3um      MicroVAX 78032 (integer-only, some instructions missing)
       A huge portion of the transistor count was ROM.
     68,000      44     3.5um    68000 (integer-only, no MMU)
       2/3rds of the transistor count was in ROM. So, here we are only
       using ~20K transistors for {address, data, pc, and pins}. Now
       revisit your comparisons.
     45,000      58.52  2um      ROMP (integer-only, no MMU, three pipeline stages)
       Twice the 68K data path transistor count.
     25,000      50     3um      ARM1 (integer-only, no MMU, pipelined)
       This gives some credence that it can be done.
    110,000      ?      1.2um    SPARC MB86900 (integer-only, pipelined)
    110,000      80     2um      MIPS R2000 (integer-only, pipelined)
       These two counteract that credence, with 40K of those transistors
       found in the windowed register file.

    It seems that the MMU cost a lot of transistors, while the pipelining
    did not, as especially the ARM1 shows.

    The design point you target for the original VAX would have taken
    significantly longer to design, debug, and ship.

    What makes you think so? A major selling point of RISC especially
    compared to the VAX was that the reduced instruction-set complexity
    reduces the implementation effort.

    It reduces the effort if you have a RISC ISA; it does not reduce the
    effort as much if you have a VAX ISA with all of its decoding "stuff".

    And the fact that the students of
    Berkeley and Stanford could produce their prototypes in a short time
    lends credibility to the claim.

    Student projects run in quanta of semesters, often building on the
    work of the previous students in the previous semesters, guided
    by professors with an overall direction moving forward. This is no
    different from the VAX prototype designs at CMU.

    But academic efforts do not result in industrial quality products.
    So, the time lines in academia are fundamentally different than in
    industry.

    <snip>

    You write that VAX work began in 1973; it was introduced in 1977 (but
    when were machines shipped to customers?), which would mean that
    development also took 4 years. According to <https://en.wikipedia.org/wiki/VAX-11>, development began in 1976, but
    that is hard to believe, especially given the CISC-based problems such
    as having to keep many pages in physical memory at the same time.

    I remember walking by the 12-person conference room in the CS
    part of CMU in 1973 and listening to the participants discuss
    making a bigger-better PDP-11. The topic was quite advanced
    at that moment, but I can vouch for the '73 date. I had been
    using the PDP-11 ISA for 6 months at that point and was significantly
    impressed with it, more so than with the PDP-10 or IBM 360/67.

    Another time I was walking by they were talking about how to
    adjust BLISS so that it ran well on the "Bigger and better PDP-11".

    Grayson, Bell, and Newell were at both meetings, along with a host
    of students.

    <snip>


    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Mar 2 21:35:03 2025
    On Sun, 2 Mar 2025 12:03:58 +0000, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    How many clocks did Alpha take to process each instruction?

    For the 21064 see slide 15 of <https://people.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture19.pdf>

    Well, that was fun. Thanks.

    I.e., about 1 CPI for Ear, and about 4.3 CPI for TPC-B, with other
    benchmarks in between.

    Theoretical bottom CPI (peak performance) of the 21064 is 0.5.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sun Mar 2 21:56:08 2025
    On Sat, 1 Mar 2025 16:29:27 -0600, BGB wrote:

    It almost seems like they could have tried making a PDP-11 based PC.

    They did. In 1982, DEC already had, not one, but three different lines of
    PCs, based on the PDP-11 (the “Professional” line), the PDP-8 (“DECmate”),
    and a dual-processor Z80/8088 machine (the “Rainbow”). This last one could run 3 OSes: CP/M-80, CP/M-86, and MS-DOS. But only one at a time.

    So you’d think they had their bases covered on the PC front, quite early
    on.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun Mar 2 21:50:14 2025
    On Sat, 1 Mar 2025 19:40:55 +0000, EricP wrote:

    Lawrence D'Oliveiro wrote:
    Found this paper
    <https://gordonbell.azurewebsites.net/Digital/Bell_Retrospective_PDP11_paper_c1998.htm>
    at Gordon Bell’s website. Talking about the VAX, which was designed as
    the ultimate “kitchen-sink” architecture, with every conceivable
    feature to make it easy for compilers (and humans) to generate code,
    he explains:

    The VAX was designed to run programs using the same amount of
    memory as they occupied in a PDP-11. The VAX-11/780 memory range
    was 256 Kbytes to 2 Mbytes. Thus, the pressure on the design was
    to have very efficient encoding of programs. Very efficient
    encoding of programs was achieved by having a large number of
    instructions, including those for decimal arithmetic, string
    handling, queue manipulation, and procedure calls. In essence, any
    frequent operation, such as the instruction address calculations,
    was put into the instruction-set. VAX became known as the
    ultimate, Complex (Complete) Instruction Set Computer. The Intel
    x86 architecture followed a similar evolution through various
    address sizes and architectural fads.

    The VAX project started roughly around the time the first RISC
    concepts were being researched. Could the VAX have been designed as a
    RISC architecture to begin with?

    Not with the people involved in VAX at both DEC and in academia.

    With some other group of individuals who worked under any of
    {Thornton, Cray, Smith, Cocke} and latched onto the things those
    people preached and performed.

    Because not doing so meant that, just
    over a decade later, RISC architectures took over the “real computer”
    market and wiped the floor with DEC’s flagship architecture,
    performance-wise.

    Remember, VAX was envisioned to be the 32-bit PDP-11.

    The answer was no, the VAX could not have been done as a RISC
    architecture. RISC wasn’t actually price-performance competitive until
    the latter 1980s:

    In a rough sense, that statement is true; and add in the fact that
    people building computers at the inception of VAX were still beholden
    to Wilkes (i.e., microcode.)

    If you look at the VAX 8800 or NVAX uArch you see that even in 1990 it
    was still taking multiple clocks to serially decode each instruction,
    and that basically stalls away any benefits a pipeline might have given.

    When compilers spit out code so each VAX instruction contains
    a memory reference (anything from a displacement to an immediate
    to a real memory reference), serial decode is what lands on the
    table.

    If they had just only put in *the things they actually use*

    The problem was that they (the VAX compiler people) emitted code
    using all the address modes, and VAX had a problem in not having
    enough registers (~12, compared to ~29 for the R2000), so all use-once
    memory references used address modes.

    (as shown by DEC's own instruction usage stats from 1982),
    and left out all the things that they rarely or never used,
    it would have had 50 or so opcodes instead of 305,

    IMHO::: VAX should have had 50-60 instructions and not have had the
    polynomial instructions, the queue handling instructions, and
    the editing instructions, CALLS, RET, decimal arithmetic, two
    kinds of FP (and a bunch more I can't think of).

    at most one operand that addressed memory on arithmetic and logic
    opcodes
    with 3 address modes (register, register address, register offset
    address)
    instead of 0 to 5 variable length operands with 13 address modes each
    (most combinations of which are either silly, redundant, or illegal).

    DEC literally jumped the shark. No instruction should have more
    than 3 or 4 operands.

    The reason it was designed the way it was, was because DEC had
    microcode and microprogramming on the brain.

    Yes, exactly, and it was the safe choice when VAX started.

    In this 1975 paper Bell and Strecker say it over and over and over.
    They were looking at the cpu design as one large parsing machine
    and not as a set of parallel hardware tasks.

    We were in the era where your typical instruction took 5 cycles
    to execute, heading to the era where the typical instruction
    takes 3 cycles to execute.

    RISC shows up at 1.4 cycles per instruction heading to 0.5 today.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Sun Mar 2 22:00:11 2025
    On Sun, 2 Mar 2025 13:16:52 -0800, Stephen Fuld wrote:

    And Macintosh was initially successful as a sort of niche machine
    for "creative types", as opposed to "business users" who used PCs.

    Desktop publishing! That was the killer new market that was basically
    created by/on the Macintosh (together with the Apple LaserWriter), and
    which it dominated for many years.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Mar 2 22:00:31 2025
    On Sun, 2 Mar 2025 13:19:32 +0000, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    MIPS used 64KB caches for the R2000? Because they could, in 1986.
    Motorola used 16KB caches for the 88000? Obviously 64KB is not all
    that necessary. Acorn used a 4KB shared cache for ARM3? Because it
    allowed them to do it on a single chip; it still gives good benefits.

    MIPS used SRAM external to the chip and sent out addresses, 1 on the
    high phase of the clock and 1 on the low phase of the clock, violating
    the timing of the SRAM specs themselves. I was told that later MIPS
    had a tester set up to sort SRAMs into those that met its needs and
    those that did not.

    Mc88100 was not allowed to use an interface twice per cycle (the
    test guys objected) so we had to use 2 interfaces, 1 for the I$ and
    1 for the D$. We put FP on die and migrated the TLB to the caches,
    which were 4-way set-associative instead of direct-mapped unified.

    As to ARM's 4KB cache:: the 68020 had a 256-byte cache, with a hit rate
    just good enough (70%) to separate instruction accesses from data
    accesses at the pins. ARM's cache would have been big enough for
    there to be unused cycles on its external interface.

    My impression is that Bell was just grasping at straws to justify
    their wrong choices.

    Likely, but looking at it from the originating time perspective,
    VAX would have lost PDP-11 compatibility if it were more RISC-
    like. I put the mistakes down to hoping the other guys wouldn't
    advance the state of the art as much as they actually did.

    He looked at other differences (rather than the instruction set)
    between the MIPS R2000 and the VAX, and where a difference represented
    something that was not available at acceptable cost in 1977 (in
    particular, 64KB caches), he used it as justification for the VAX.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Sun Mar 2 21:50:15 2025
    On Sun, 2 Mar 2025 20:02:03 -0000 (UTC), John Levine wrote:

    We wasted a lot of time explaining why we weren't going to do random
    IBM stuff of which the most memorable was user labels in the inodes
    (well, OS DASD has them.)

    Sounds like an early form of Linux-style extended attributes? <https://manpages.debian.org/xattr(7)>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sun Mar 2 21:48:36 2025
    On Sun, 02 Mar 2025 13:19:32 GMT, Anton Ertl wrote:

    My impression is that Bell was just grasping at straws to justify their
    wrong choices.

    My impression was not. Note that the “big bang” arrival of RISC in the latter 1980s is pretty much in agreement with his timeline.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Mar 2 21:57:57 2025
    On Sun, 2 Mar 2025 21:26:34 +0000, MitchAlsup1 wrote:

    But academic efforts do not result in industrial quality products.

    *Cough* Unix *cough*

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Sun Mar 2 22:40:11 2025
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    That's not a fair comparison. VAX design started in 1975 and shipped
    in 1978. The first ARM design started in 1983 with working silicon in
    1985. It was a decade later.

    The point is that ARM outperformed VAX without using caches. DRAM
    with 800ns cycle time was available in 1971 (the Nova 800 used it).
    By 1977, when the VAX 11/780 was released, certainly faster DRAM was
    available.

    How was the code density?

    I have no data on that. Interestingly, unlike the 68k, which was
    outcompeted by RISCs at around the same time, the VAX did not have an
    afterlife of hobbyists who produced Linux and Debian ports, so I
    cannot easily make a comparison.

    I know ARM was pretty good but VAX was fantastic since they
    sacrificed everything else for compact instructions.

    I don't think they did. They spent encoding space on instructions
    that were very rare, and AFAIK instructions can be encoded that do not
    work (e.g., a constant as destination). The major idea seems to have
    been orthogonality, not compactness. They did make choices for
    compactness (e.g., the call instruction that includes a bitmask for
    the registers to be saved), but overall other ideas were more
    important.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Mar 2 22:27:59 2025
    According to BGB <cr88192@gmail.com>:
    I had thought it apparently used a model similar to the 65C816.

    Namely, that you could address 64K code + 64K data at a time, but then
    load a value into a special register to access different RAM banks.

    Not really. The low end PDP-11's were 16 bit, 64K was it.

    The larger ones had memory mapping with 8K pages, a size carefully
    chosen to be too large for paging, but too small to map whole programs.
    There were three modes, user, supervisor, and kernel, with 64K instruction
    and data in each. The kernel changed the maps by poking values into
    I/O addresses, so it's not something a normal program could do.

    Unix only used user and kernel so for our early bitmap terminals, I
    mapped the screen's video memory into supervisor data and set the
    mode bits so you could access it with MOVE TO/FROM PREVIOUS DATA
    SPACE. C didn't generate those so we had some little assembler
    routines.

    Given the way the PDP-11 was set up, it's hard to think of a memory
    expansion scheme that wasn't a grotesque kludge so I think it was
    the right decision for VAX to have a new instruction set with a mode
    to run PDP-11 code, sort of like the 386's virtual 86 mode for
    real mode 8086 code.

    That was not what customers were interested in. There were various
    Unix variants available for the PC, but the customers preferred using
    DOS, which was preinstalled and did not cost extra. ...

    Yup. PC/IX was a really nice Unix port for the IBM PC and nobody was interested.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sun Mar 2 23:43:58 2025
    On Sun, 2 Mar 2025 21:57:57 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 2 Mar 2025 21:26:34 +0000, MitchAlsup1 wrote:

    But academic efforts do not result in industrial quality products.

    *Cough* Unix *cough*

    Not sure you can call Bell Labs academia.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Mon Mar 3 02:08:46 2025
    On Sun, 2 Mar 2025 23:43:58 +0000, MitchAlsup1 wrote:

    On Sun, 2 Mar 2025 21:57:57 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 2 Mar 2025 21:26:34 +0000, MitchAlsup1 wrote:

    But academic efforts do not result in industrial quality products.

    *Cough* Unix *cough*

    Not sure you can call Bell Labs academia.

    Berkeley!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Mar 3 02:58:41 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    I know ARM was pretty good but VAX was fantastic since they
    sacrificed everything else for compact instructions.

    I don't think they did. They spent encoding space on instructions
    that were very rare, and AFAIK instructions can be encoded that do not
    work (e.g., a constant as destination). The major idea seems to have
    been orthogonality, not compactness.

    It certainly was orthogonal. I was thinking that they had one-, two-,
    and four-byte offset versions of all of the relative addressing modes,
    which made the code smaller at the cost of forcing operands to be
    decoded one at a time, since you couldn't tell where the N+1st operand
    was until you'd looked at the Nth.

    Nearly all opcodes were one byte, other than the extended-format
    floating-point instructions, so it's hard to see how they could have
    made that much smaller without making it a lot more complicated. On
    the other hand, we can compare it to the S/360 instruction set, which
    was fairly compact but a lot easier to decode: e.g., you could tell
    from the high bits of the first opcode byte how long each instruction
    was and where the operands were, so you could decode the rest and do
    address calculations in parallel.
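
    As a quick sketch (assuming bash for the hex arithmetic), that length
    rule can be written down directly from the top two bits of the first
    opcode byte:

    # 00 -> 2 bytes (RR), 01 -> 4 (RX), 10 -> 4 (RS/SI), 11 -> 6 (SS)
    len360() { case $(( ($1 >> 6) & 3 )) in 0) echo 2;; 1|2) echo 4;; 3) echo 6;; esac; }
    len360 0x1A    # AR   (RR) -> 2
    len360 0x5A    # A    (RX) -> 4
    len360 0xF2    # PACK (SS) -> 6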



    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Mon Mar 3 03:36:04 2025
    On Mon, 3 Mar 2025 02:58:41 -0000 (UTC), John Levine wrote:

    Nearly all opcodes were one byte other than the extended format floating point instructions so it's hard to see how they could have made that
    much smaller without making it a lot more complicated.

    Bell points out that VAX code gets close to the code density of
    equivalent PDP-11 code. That was a major design goal.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Mon Mar 3 07:24:57 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    That's not a fair comparison. VAX design started in 1975 and shipped in 1978.
    The first ARM design started in 1983 with working silicon in 1985. It was a >>>>decade later.

    The point is that ARM outperformed VAX without using caches. DRAM
    with 800ns cycle time was available in 1971 (the Nova 800 used it).
    By 1977, when the VAX 11/780 was released, certainly faster DRAM was >>>available.

    How was the code density?

    I have no data on that. Interestingly, unlike the 68k, which was
    outcompeted by RISCs at around the same time, the VAX did not have an afterlife of hobbyists who produced Linux and Debian ports, so I
    cannot easily make a comparison.

    The VAX is still supported with gcc and binutils, with newlib as
    its C library, so building up a tool chain for assembly/disassembly
    should be doable with a few (CPU) hours; you can then compare
    sizes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Mon Mar 3 07:39:03 2025
    John Levine <johnl@taugh.com> writes:
    Nearly all opcodes were one byte other than the extended format floating point >instructions so it's hard to see how they could have made that much smaller >without making it a lot more complicated.

    One can look at IA-32 and compare what the instruction lengths for
    frequent instructions like "add %reg1,%reg2", "add const,%reg",
    "mov (%reg1),%reg2", and "mov %reg1,(%reg2)" are. I expect that they are shorter
    than on the VAX (exception: if the constant fits in 16 bits, but not
    in 8). Of course there is a difference: VAX has 16 GPRs and IA-32 has
    only 8. AMD64 has 16 GPRs, and needs a REX prefix byte, but only if
    one of the additional registers is used (or 64-bit operation is
    needed), so for the frequent cases it probably still has shorter
    encodings on average than the VAX, especially if compilers prefer
    using the first 8 registers. For three-operand addition, IA-32/AMD64
    has lea, and other three-operand instructions are not that common.
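
    One quick way to check such lengths (a sketch assuming GNU as and
    objdump are installed; the exact bytes can vary with assembler
    defaults):

    printf '%s\n' 'add %ecx,%edx' 'add $1,%edx' 'mov (%ecx),%edx' \
        'mov %edx,(%ecx)' | as --32 -o t.o && objdump -d t.o

    The dump should show 2 bytes for the reg-reg add, 3 for the imm8 add,
    and 2 each for the register-indirect movs, at or below the 3 bytes
    (opcode plus two one-byte operand specifiers) of the VAX equivalents.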

    IA-32 has even shorter encodings for some operations on %eax (stemming
    from the need for compactness on the 8080 and the 8086, and the fact
    that assembly-language programmers are good at exploiting such
    things). One can use this to make the code even shorter by trying to
    get the compiler to use %eax for instructions where such encodings
    exist. Alternatively, one could reassign this encoding space for some
    other purpose, e.g., avoiding the REX prefix in some cases.

    Another opportunity for shorter instructions is that IA-32/AMD64
    supports byte-width register-to-register operations. These encodings
    are unnecessary and can be reused for better purposes.

    Another opportunity for making code shorter is that IA-32/AMD64 has
    redundant encodings for register-to-register operations: e.g., "sub
    %ecx,%edx" can be encoded with the first byte being 0x29 or 0x2b (they
    make a difference if one of the operands is in memory). These
    encodings can be reused; one possibility would be to support only
    load-and-op instructions, not read-modify-write instructions; then the
    first byte 0x29 (for sub, and similarly for the other operations) can
    be used for a different purpose, e.g., avoiding the need for a REX
    prefix.

    One idea I have had is that many instructions encode as a source
    register the same register as the target register of the previous
    instruction. One could just refer to the target of the previous
    instruction and thus save encoding space. The downside is that such
    instructions are no longer complete, but need the previous instruction
    to be decoded, which complicates interrupts and various tools.

    Bottom line: IA-32 is probably more compact than VAX, and even for
    IA-32 one can think of various ways to possibly make it even more
    compact.

    And looking at my latest code size measurements <2024Jan4.101941@mips.complang.tuwien.ac.at>, both armhf (ARM T32) and
    riscv64 (RV64GC) result in shorter code than IA-32 and AMD64:

      bash    grep   gzip
    595204  107636  46744  armhf
    599832  101102  46898  riscv64
    796501  144926  57729  amd64
    853892  152068  61124  i386

    Apparently the additional registers of AMD64 (or maybe the different
    calling convention) result in smaller code than IA-32 despite having
    to use REX prefixes not only if the additional registers are used, but
    also if 64-bit width is required.

    The 16-bit wide encodings of ARM T32 and the RISC-V C extension
    apparently catch many common cases. These load/store architectures
    avoid the encoding waste of having several operation widths[1] and
    redundant encodings for register-to-register operations. However, in
    those cases where load-and-op instructions are useful, they need to
    encode an intermediate register, twice. In those cases where
    read-modify-write instructions are useful, they need to encode an
    intermediate register, 4 times, and the memory operand a second time;
    but obviously, on balance, these instruction sets are more compact.
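
    For example (a sketch assuming a riscv64 GNU cross toolchain; encoded
    sizes depend on the configured -march), the load-and-op case names the
    intermediate register twice where AMD64 folds the load into the add:

    printf 'lw t0,0(a0)\nadd a1,a1,t0\n' \
        | riscv64-linux-gnu-as -o t.o && riscv64-linux-gnu-objdump -d t.o

    Two instructions, with t0 encoded in both, versus a single
    "add (%rdi),%esi" on AMD64 that never names a temporary.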

    [1] RV64G includes 32-bit wide register-register ops that I consider unnecessary: Usually the top 32 bits of 32-bit operations are not
    used, and then one can just use the 64-bit version. In the few cases
    where they are used, a 64-bit operation followed by a sign extension
    will produce the same result. But maybe the RISC-V architects have
    data that shows that the top 32 bits are used more often than I
    expect; maybe in C code with int variables that are used for indexing
    arrays (in that case we can thank the people who decided to go with
    I32LP64 (rather than ILP64) for that).
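
    Concretely (same hedged toolchain assumption as above; sext.w is the
    standard alias for "addiw rd,rs,0"):

    printf 'addw a0,a1,a2\nadd a0,a1,a2\nsext.w a0,a0\n' \
        | riscv64-linux-gnu-as -o t.o && riscv64-linux-gnu-objdump -d t.o

    The addw leaves the same value in a0 as the add-then-sext.w pair, so
    the W-form only pays off when the upper 32 bits actually matter.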

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Mon Mar 3 08:06:08 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    The VAX is still supported with gcc and binutils, with newlib as
    its C library, so building up a tool chain for assembly/disassembly
    should be doable with a few (CPU) hours; you can then compare
    sizes.

    Good. While the CPU hours are not the problem, I cannot spare the
    human hours for such a project.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MarkC@21:1/5 to Anton Ertl on Mon Mar 3 12:43:51 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    John Levine <johnl@taugh.com> writes:
    How was the code density?

    I have no data on that. Interestingly, unlike the 68k, which was
    outcompeted by RISCs at around the same time, the VAX did not have an >afterlife of hobbyists who produced Linux and Debian ports, so I
    cannot easily make a comparison.

    NetBSD still has a VAX port, so the sizes of pre-built packages from
    there might be informative.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Mon Mar 3 14:45:37 2025
    On Mon, 03 Mar 2025 07:39:03 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    And looking at my latest code size measurements <2024Jan4.101941@mips.complang.tuwien.ac.at>, both armhf (ARM T32) and riscv64 (RV64GC) result in shorter code than IA-32 and AMD64:

      bash    grep   gzip
    595204  107636  46744  armhf
    599832  101102  46898  riscv64
    796501  144926  57729  amd64
    853892  152068  61124  i386


    I never measured the size of the gnu utilities, but my measurements of
    a few of my own embedded projects and of some microbenchmarks always
    gave very different ratios.
    That is, in my measurements T32 was also a champion among extant
    32b/64b architectures (extinct nanoMIPS was better), but i386 was MUCH
    closer than in your figures above. Up to twice closer, actually.
    It seems newer gcc is much worse than older versions at generating
    compact i386 code.
    Also, in my measurements T32 was significantly denser than RV64GC,
    although in the case of RV I only did microbenchmarks.

    One of my early measurements that I have bookmarked. https://www.realworldtech.com/forum/?threadid=86001&curpostid=86094

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Mar 3 14:56:13 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 2 Mar 2025 13:19:32 +0000, Anton Ertl wrote:

    My impression is that Bell was just grasping at straws to justify
    their wrong choices.

    Likely, but looking at it from the originating time perspective,
    VAX would have lost PDP-11 compatibility if it were more RISC-
    like.

    In fact, that PDP-11 compatibility provided early market opportunities
    for the VAX as until VMS 2.0, many of the utility programs and vax
    commands were from RSX-11M running in compatibility mode (e.g. PIP).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Mar 3 16:34:35 2025
    Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    It almost seems like they could have tried making a PDP-11 based PC.

    I dimly remember that there were efforts in that direction. But the
    PDP-11 does not even have the cumbersome support for more than 64KB
    that the 8086 has (there were PDP-11s with more, but that was even
    more cumbersome to use).

    DEC also tried their hand in the PC-like business (DEC Rainbow 100).

    When I was hired as the PC guy in Hydro (that Fortune-100 corporation
    with 77K employees in 130+ countries) in 1984, I took over all
    PC-related stuff (HW/SW/OS/add-on HW etc) while the guy who hired me
    kept his beloved DEC Rainbow which he felt had the better architecture:

    For one thing, they did not break Intel's rules about where to place
    the interrupt vectors. In hindsight this was a bad decision, since
    100% compatibility with Microsoft Flight Simulator was an absolute
    requirement at the time.

    They did not succeed. Maybe that's the decisive difference from HP:
    They did succeed in the PC market.

    For some definition of success, i.e., they were sufficiently worse at
    PCs to later merge with Compaq, which was the first significant vendor
    in the PC-compatible marketplace. Columbia beat both of them by half a
    year or so, but faded away a bit later.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to MarkC on Mon Mar 3 16:44:17 2025
    MarkC <usenet@em-ess-see-twenty-seven.me.uk> writes:
    NetBSD still has a VAX port, so the sizes of pre-built packages from
    there might be informative.

    Thanks. The NetBSD pkgsrc is not limited to NetBSD, and there is a
    wide variety of prebuilt stuff there. I took those that sound like architecture names (and probably belong to NetBSD): aarch64 alpha
    amd64 earmv7hf i386 m68k mips64eb mipsel powerpc sparc sparc64 vax

    Unfortunately, they do not seem to port to RISC-V in any form yet, and
    their earmv7hf port uses ARM A32, not T32. So the NetBSD competition
    is performed without entries for those two instruction set encodings
    that showed the smallest code sizes on Debian. Anyway, here are the
    results:

       bash    grep     xz
     710838          42236   m68k
     748354  159304  40930   vax
     829077  176836  42840   amd64
     855400  164188          aarch64
     877284  186924  48032   sparc
     882847  187203  49866   i386
     898532  179844          earmv7hf
     962128  205776  54704   powerpc
    1004864  192256  53632   sparc64
    1025136          51160   mips64eb
    1147664  232688  63456   alpha
    1172692                  mipsel

    I did not find packages for everything on all architectures. In
    particular, I did not find packages for gzip on vax, so I used xz
    instead.

    So VAX is indeed a leading architecture in terms of code size, at
    least if ARM T32 and RISC-V C are not in play.

    Here are the scripts I used:

    for i in aarch64 alpha amd64 earmv7hf i386 m68k mips64eb mipsel \
             powerpc sparc sparc64 vax; do
      mkdir -p $i/unpacked
      (cd $i
       for j in bash-5.2.37.tgz grep-3.11.tgz xz-5.6.2.tgz; do
         wget https://cdn.netbsd.org/pub/pkgsrc/packages/NetBSD/$i/10.1/All/$j
       done)
    done

    This did not get everything, because some packages are in other
    version directories or have other version "numbers", so I manually
    searched for and downloaded some of the packages. Next time I should
    leave some of that to wget, see <https://superuser.com/questions/1424700/wget-download-all-files-starting-with-a-specified-name>.

    for i in aarch64 alpha amd64 earmv7hf m68k mips64eb mipsel powerpc \
             sparc sparc64 vax; do
      echo $i
      mkdir -p $i/unpacked
      (cd $i
       for j in *.tgz; do
         (cd unpacked
          if gzip -t ../$j; then
            tar xfz ../$j 2>/dev/null
          else
            tar xfJ ../$j 2>/dev/null
          fi)
       done)
    done

    for i in *; do
      (cd $i/unpacked/bin
       for j in bash ggrep xz; do
         if test -f $j; then
           objdump -h $j | awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
         else
           echo -n "         "
         fi
       done)
      echo $i
    done | sort -n

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Mon Mar 3 17:21:32 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    They did not succeed. Maybe that's the decisive difference from HP:
    They did succeed in the PC market.

    For some definition of success, i.e they were sufficiently worse at PCs
    to later merge with Compaq who was the first significant vendor in the
    PC Compatible marketplace.

    Pfeiffer got Compaq into trouble by buying DEC and not being able to
    digest it. HP then bought Compaq and was able to digest all the
    parts, leading to a successful PC business (I have no idea how much
    Compaq contributed to that and how much HP did) and a successful HPE;
    pretty much all of the stuff coming from/through DEC went away (I
    think the Tandem legacy may still be identifiable), but maybe they
    managed to keep the customers.

    Columbia beat both of them by half a year or
    so, but faded away a bit later.

    I don't think I ever heard about Columbia. At what did they beat
    Compaq and HP?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Mar 3 18:51:35 2025
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    They did not succeed. Maybe that's the decisive difference from HP:
    They did succeed in the PC market.

    For some definition of success, i.e they were sufficiently worse at PCs
    to later merge with Compaq who was the first significant vendor in the
    PC Compatible marketplace.

    Pfeiffer got Compaq into trouble by buying DEC and not being able to
    digest it. HP then bought Compaq and was able to digest all the
    parts, leading to a successful PC business (I have no idea how much
    Compaq contributed to that and how much HP did) and a successful HPE;
    pretty much all of the stuff coming from/through DEC went away (I
    think the Tandem legacy may still be identifiable), but maybe they
    managed to keep the customers.

    Columbia beat both of them by half a year or
    so, but faded away a bit later.

    I don't think I ever heard about Columbia. At what did they beat
    Compaq and HP?

    Columbia created the first ever PC compatibles (at least afaik); I
    bought a pair (one desktop and one luggable) at the same cost as a
    single IBM PC in order to develop SW for my father-in-law.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Mon Mar 3 19:53:12 2025
    On Mon, 03 Mar 2025 17:21:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    They did not succeed. Maybe that's the decisive difference from
    HP: They did succeed in the PC market.


    Digital sold solid PCs in the 1990s, some under the brand DECpc, others
    under the brand DECstation. Later on (after 1994) they used yet another
    brand that I cannot recollect. Wikipedia says that most of these
    machines were not manufactured in DEC factories, but I would think that
    as a reseller they still got some profit.

    For some definition of success, i.e they were sufficiently worse at
    PCs to later merge with Compaq who was the first significant vendor
    in the PC Compatible marketplace.

    Pfeiffer got Compaq into trouble by buying DEC and not being able to
    digest it. HP then bought Compaq and was able to digest all the
    parts, leading to a successful PC business (I have no idea how much
    Compaq contributed to that and how much HP did)

    My impression is that Compaq's contribution prevailed.

    and a successful HPE;
    pretty much all of the stuff coming from/through DEC went away (I
    think the Tandem legacy may still be identifiable), but maybe they
    managed to keep the customers.


    They preserved VMS. Sold it some 15+ years later.

    Columbia beat both of them by half a year or
    so, but faded away a bit later.

    I don't think I ever heard about Columbia. At what did they beat
    Compaq and HP?

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Mon Mar 3 17:57:37 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Note that the “big bang” arrival of RISC in the
    latter 1980s is pretty much in agreement with his timeline.

    Correlation does not prove causation. And when the facts (performance
    of cacheless and small-cache RISCs) are counterevidence of his
    explanation, his explanation is obviously wrong.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Mon Mar 3 17:32:09 2025
    Michael S <already5chosen@yahoo.com> writes:
    It seems, newer gcc is much worse than older versions at generation of >compact i386 code.

    Yes, a weakness of my measurement method is that if the compiler does
    not care much about code size (and compilers usually do what gives
    good benchmark timings, not what gives the smallest code), it can
    easily make the code a lot bigger in ways that have nothing to do with
    encoding size, e.g., by loop unrolling or padding before branch targets.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Mon Mar 3 17:53:35 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    MarkC <usenet@em-ess-see-twenty-seven.me.uk> writes:
    NetBSD still has a VAX port, so the sizes of pre-built packages from
    there might be informative.

    Thanks. The NetBSD pkgsrc is not limited to NetBSD, and there is a
    wide variety of prebuilt stuff there. I took those that sound like architecture names (and probably belong to NetBSD): aarch64 alpha
    amd64 earmv7hf i386 m68k mips64eb mipsel powerpc sparc sparc64 vax

    Unfortunately, they do not seem to port to RISC-V in any form yet, and
    their earmv7hf port uses ARM A32, not T32. So the NetBSD competition
    is performed without entries for those two instruction set encodings
    that showed the smallest code sizes on Debian. Anyway, here are the
    results:

    bash grep xz
    710838 42236 m68k
    748354 159304 40930 vax
    829077 176836 42840 amd64
    855400 164188 aarch64
    877284 186924 48032 sparc
    882847 187203 49866 i386
    898532 179844 earmv7hf
    962128 205776 54704 powerpc
    1004864 192256 53632 sparc64
    1025136 51160 mips64eb
    1147664 232688 63456 alpha
    1172692 mipsel

    Utilities are often compiled with a medium to high level of
    optimization (like -O2), which can do loop unrolling, inlining,
    and function cloning, all of which can increase code size.
    These decisions can also depend on the number of available registers.

    If your aim is small code size, it is better to compare output
    compiled with -Os.
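
    For instance (a sketch; gcc and binutils assumed, foo.c being any test
    source):

    gcc -O2 -c foo.c -o o2.o && gcc -Os -c foo.c -o os.o
    size o2.o os.o    # compare the text column; -Os avoids most unrolling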

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Mon Mar 3 18:05:27 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    A pipelined machine in 1978 would have had 50% to 100% more circuit
    boards than VAX 11/780, making it a lot more expensive.

    What makes you think that a pipelined single-issue RV32GC would take
    more circuit boards than VAX11/780? I have no data about discrete
    implementations, but if we look at integrated ones and assume that the
    number of transistors or the area corresponds to the number of circuit
    boards in discrete implementations, the evidence goes in the opposite
    direction:

    The first article in this Mar-1987 HP Journal is about the
    HP 9000 MODEL 840 and HP 3000 Series 930 implementing the HP PA ISA.
    The CPU is 5 boards, 6 with FPU, built with standard and FAST TTL.
    Implementation started in Apr-1983, prototype ready early 1984.

    <https://people.csail.mit.edu/emer/media/papers/1999.06.retrospective.vax.pdf> says:

    |the VAX 11/780 CPU spanned about 20 boards.

    Have TTL chips advanced between the VAX and the first HP-PA
    implementation? I don't think so. Are the boards of a different
    size? If the answers to both questions are "no", this would be
    counterevidence to Mitch Alsup's claim.

    "[3 stage] pipeline fetches and executes an instruction every 125 ns,
    a 4096-entry translation lookaside buffer (TLB) for high-speed address >translation, and 128K bytes of cache memory."

    "The measured MIPS rate for the Model 840 varies from
    about 3.5 to 8 MIPS with an average of 4.5 to 5."

    which at 125 ns, 8 MHz clock is an IPC of 0.43 to 1.0, avg 0.625.

    It's interesting that this HP machine needed a cache at 8MHz, while
    the contemporary ARM2 could run from DRAM at the same speed. But
    then, the HP machine supports bigger memories, and includes an MMU,
    both of which slow things down.

    https://archive.org/download/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard.pdf

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Mar 3 20:04:39 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    Have TTL chips advanced between the VAX and the first HP-PA
    implementation? I don't think so.

    Oh yes, they did; there were nine years between the launch of the
    VAX and the launch of HP-PA.

    According to https://www.openpa.net/pa-risc_processor_pa-early.html#ts-1
    the first HP-PA CPU was introduced in 1986, and you can see pictures
    at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/


    For example, you could buy state machines programmable by FPGA in 1986,
    which were not available in 1977. (No idea if HP used them or not.)

    I don't believe that HP used FPGAs for the first PA/HP3000 processor;
    the HP Journal article posted earlier says that they were standard
    TTL SSI logic chips (which allowed them to build several hundred
    prototypes to use for software development).

    The VAX-11 design _shipped_ in 1978, so the logic family used
    was selected several years prior.

    We were designing a new mainframe using ECL gate-arrays in 1986.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Mon Mar 3 19:24:33 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    Have TTL chips advanced between the VAX and the first HP-PA
    implementation? I don't think so.

    Oh yes, they did; there were nine years between the launch of the
    VAX and the launch of HP-PA.

    According to https://www.openpa.net/pa-risc_processor_pa-early.html#ts-1
    the first HP-PA CPU was introduced in 1986, and you can see pictures
    at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/


    For example, you could buy state machines programmable by FPGA in 1986,
    which were not available in 1977. (No idea if HP used them or not.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to Anton Ertl on Mon Mar 3 22:15:45 2025
    On Mon, 03 Mar 2025 16:44:17 GMT, Anton Ertl wrote:

    MarkC <usenet@em-ess-see-twenty-seven.me.uk> writes:
    NetBSD still has a VAX port, so the sizes of pre-built packages from
    there might be informative.

    Thanks. The NetBSD pkgsrc is not limited to NetBSD, and there is a wide variety of prebuilt stuff there. I took those that sound like
    architecture names (and probably belong to NetBSD): aarch64 alpha amd64 earmv7hf i386 m68k mips64eb mipsel powerpc sparc sparc64 vax

    Unfortunately, they do not seem to port to RISC-V in any form yet, and
    their earmv7hf port uses ARM A32, not T32. So the NetBSD competition is performed without entries for those two instruction set encodings that
    showed the smallest code sizes on Debian. Anyway, here are the results:

    You could compare sizes of applications in the base.tgz tarball for each architecture; this is available for RISC-V as well as all the others.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Mon Mar 3 22:41:46 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    Have TTL chips advanced between the VAX and the first HP-PA
    implementation? I don't think so.

    Oh yes, they did; there were nine years between the launch of the
    VAX and the launch of HP-PA.

    So what?

    and you can see pictures
    at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/

    Nice! The pictures are pretty good. I can read the markings on the
    chip. The first chip I looked at was marked 74AS181. TI introduced
    the 74xx series of TTL chips starting in 1964, and when I read TTL, I
    expected to see 74xx chips. The 74181 was introduced in February
    1970, and I expected it to be in the HP-PA CPU, as well as in the VAX
    11/780. <https://en.wikipedia.org/wiki/74181> confirms my expectation
    for the VAX, and the photo confirms my expectation for the first HP-PA
    CPU.

    The AS family was only introduced in 1980, so there were indeed some
    advances between the VAX and this HP-PA CPU. However, as far as the
    number of boards is concerned, a 74AS181 takes as much space as a
    plain 74181, so that difference is irrelevant for that aspect.

    I leave it to you to point out a chip on the HP-PA CPU that did not
    have a same-sized variant available in, say, 1975.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Mon Mar 3 23:23:46 2025
    On Mon, 3 Mar 2025 17:53:35 -0000 (UTC), Thomas Koenig wrote:

    If your aim is small code size, it is better to compare output compiled
    with -Os.

    Then it becomes an artificial benchmark, trying to minimize code size at
    the expense of real-world performance.

    Remember, VAX was built for real-world use, not for academic benchmarks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Mar 3 23:27:57 2025
    On Mon, 03 Mar 2025 17:57:37 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Note that the “big bang” arrival of RISC in the latter 1980s is pretty >> much in agreement with his timeline.

    Correlation does not prove causation.

    Back atcha. Your attempt at drawing correlations (or lack thereof) between Bell’s claims and reality is no more valid than mine.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Mon Mar 3 23:32:07 2025
    On Mon, 3 Mar 2025 16:34:35 +0100, Terje Mathisen wrote:

    ... while the guy who hired me kept his belowed DEC Rainbow which he
    felt had the better architecture:

    I only had a brief exposure to them, but I think they were beautiful
    machines, too.

    For one thing, they did not break Intel's rules about where to place
    the interrupt vectors. In hindsight this was a bad decision, since
    100% compatibility with Microsoft Flight Simulator was an absolute
    requirement at the time.

    This is why I refer to “Microsoft-compatible”, rather than
    “IBM-compatible”, PCs: because it was Microsoft that very quickly took
    over the mantle of arbiter of “compatibility” from IBM.

    Anton Ertl wrote:

    They did not succeed. Maybe that's the decisive difference from HP:
    They did succeed in the PC market.

    Bell mentions that: DEC tried to set a standard (a reasonable thing to try
    in 1982), and failed. They should have quickly pivoted to embracing the
    actual standard that won, but they did not.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Mar 3 23:26:03 2025
    On Mon, 03 Mar 2025 07:39:03 GMT, Anton Ertl wrote:

    ... VAX has 16 GPRs ...

    Technically only 13.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Mon Mar 3 20:12:51 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    A pipelined machine in 1978 would have had 50% to 100% more circuit
    boards than VAX 11/780, making it a lot more expensive.
    What makes you think that a pipelined single-issue RV32GC would take
    more circuit boards than VAX11/780? I have no data about discrete
    implementations, but if we look at integrated ones and assume that the
    number of transistors or the area corresponds to the number of circuit
    boards in discrete implementations, the evidence goes in the opposite
    direction:
    The first article in this Mar-1987 HP Journal is about the
    HP 9000 MODEL 840 and HP 3000 Series 930 implementing the HP PA ISA.
    The cpu is 5 boards, 6 with FPU, built with standard and FAST TTL.
    Implementation started in Apr-1983, prototype ready early 1984.

    <https://people.csail.mit.edu/emer/media/papers/1999.06.retrospective.vax.pdf>
    says:

    |the VAX 11/780 CPU spanned about 20 boards.

    Have TTL chips advanced between the VAX and the first HP-PA
    implementation? I don't think so. Are the boards of a different
    size? If the answers to both questions are "no", this would be counterevidence to Mitch Alsup's claim.

    "[3 stage] pipeline fetches and executes an instruction every 125 ns,
    a 4096-entry translation lookaside buffer (TLB) for high-speed address
    translation, and 128K bytes of cache memory."

    "The measured MIPS rate for the Model 840 varies from
    about 3.5 to 8 MIPS with an average of 4.5 to 5."

    which at 125 ns, 8 MHz clock is an IPC of 0.43 to 1.0, avg 0.625.
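    (For reference, the IPC here is just MIPS divided by clock MHz; a
    minimal C sketch of the arithmetic, using only the figures quoted
    above, with rounding accounting for 0.43 vs. 0.44:)

    #include <stdio.h>

    int main(void) {
        const double clock_mhz = 1000.0 / 125.0;      /* 125 ns -> 8 MHz */
        const double mips[] = { 3.5, 4.5, 5.0, 8.0 }; /* measured range */
        for (int i = 0; i < 4; i++)
            printf("%.1f MIPS / %.0f MHz -> IPC %.2f\n",
                   mips[i], clock_mhz, mips[i] / clock_mhz);
        return 0;
    }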

    It's interesting that this HP machine needed a cache at 8MHz, while
    the contemporary ARM2 could run from DRAM at the same speed. But
    then, the HP machine supports bigger memories, and includes an MMU,
    both of which slow things down.

    https://archive.org/download/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard.pdf

    - anton

    ARM1 was launched 1985, ARM2 in 1986.
    I found an ARM2 manual and the short answer is that the chip
    drives the RAS and CAS signals to the dram directly.
    The chip's clock is adjustable from 100 kHz to 10 MHz
    and you match the cpu clock to your dram timing.
    There is no READY line on the memory bus.

    It does have one interesting feature that if the current address
    is sequential to the prior one it skips the RAS cycle.
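    (A hypothetical C timing model of that sequential-access trick: a full
    RAS+CAS cycle on a row change, CAS-only when the access stays in the
    open row. The constants and the row width are illustrative, not taken
    from an ARM2 or DRAM datasheet:)

    #include <stdio.h>
    #include <stdint.h>

    enum { RAS_CAS_NS = 250, CAS_ONLY_NS = 125 };

    static uint32_t open_row = UINT32_MAX;

    static int access_ns(uint32_t addr) {
        uint32_t row = addr >> 8;       /* assumed row width */
        if (row == open_row)
            return CAS_ONLY_NS;         /* sequential: skip the RAS cycle */
        open_row = row;
        return RAS_CAS_NS;              /* row change: full RAS+CAS */
    }

    int main(void) {
        int total = 0;
        for (uint32_t pc = 0x8000; pc < 0x8020; pc += 4)
            total += access_ns(pc);     /* straight-line instruction fetch */
        printf("8 sequential fetches: %d ns\n", total);
        return 0;
    }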

    It looks like it takes 1 cycle to do a read but their timing diagrams
    are total crap. There are only two and they contradict each other.
    It looks like the RAS signal is changing state on 1/4 cycle for no reason.

    The Motorola Memory Book from 1979 shows MCM4027A 4kb*1 drams with
    80 to 165 ns CAS access, 120 to 250 RAS access, 320 to 375 R/W cycle.
    Similar numbers for MCM4116A 16kb*1 R/W cycle of 500 ns.

    VAX probably used 4kb 500 ns drams.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Mar 4 02:09:04 2025
    According to Brian G. Lucas <bagel99@gmail.com>:
    That was not what customers were interested in. There were various
    Unix variants available for the PC, but the customers preferred using
    DOS, which was preinstalled and did not cost extra. ...

    Yup. PC/IX was a really nice Unix port for the IBM PC and nobody was interested.

    As this (the kernel part) was my project, it was very disappointing. I think IBM priced it such that with DOS being "free", it had no chance.

    Nobody knew what the market for PC/IX was supposed to be beyond some
    handwaving "if 5% if PC users buy it we'll be rich." PC/IX could do
    anything a PDP-11 running Unix could do, give or take peripherals,
    but the PC market was very different from the PDP-11 market and by
    that time the PDP-11 was rather long in the tooth.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Tue Mar 4 10:04:20 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Mon, 3 Mar 2025 17:53:35 -0000 (UTC), Thomas Koenig wrote:

    If your aim is small code size, it is better to compare output compiled
    with -Os.

    Then it becomes an artificial benchmark, trying to minimize code size at
    the expense of real-world performance.

    Remember, VAX was built for real-world use, not for academic benchmarks.

    And supposedly the real-world constraints at the time made it
    necessary to minimize code size. In the current discussion we look at
    how RV32GC might have fared under this constraint. So compiling for
    small code size could be a way to find that out. Whether -Os really
    achieves that is another question (some earlier things I have seen and
    discussed here make me doubt that).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Robert Swindells on Tue Mar 4 08:39:16 2025
    Robert Swindells <rjs@fdy2.co.uk> writes:
    You could compare sizes of applications in the base.tgz tarball for each architecture; this is available for RISC-V as well as all the others.

    I did that, see below. There is one problem: RISC-V is only available
    in the daily builds, many of the other architectures are not. So I
    used 10.0 for all architectures except RISC-V. I also measured the
    daily builds for AMD64, to see how the difference in versions of the
    source affects the code sizes.

    The .text section sizes are (sorted by the ksh column):

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64
    1030288 150686 79852 31492 mvme68k
    779393 155764 75795 31813 vax
    1302254 171505 83249 35085 amd64
    1229032 178332 89180 36876 evbarm-aarch64
    1539052 179055 82280 34717 amd64-daily
    1374961 184458 96971 37218 i386
    1247476 185792 96728 42028 evbarm-earmv7hf
    1333952 187452 96328 39472 sparc
    1586608 204032 106896 45408 evbppc
    1536144 204320 106768 43232 hppa
    1397024 216832 109792 48512 sparc64
    1538536 222336 107776 44912 evbmips-mips64eb
    1623952 243008 122096 50640 evbmips-mipseb
    1689920 251376 120672 51168 alpha
            2324752 2259984 1378000 ia64

    libc seems to be quite different between different architectures,
    probably with specialized code for different architectures (with vax
    being an outlier towards small sizes, see below); the programs seem to
    be less specialized. Looking at the two amd64 results, the current
    differences between 10.0 and daily seem to be small for pax and ed,
    while ksh and libc seem to have grown quite a bit. The RISC-V
    variants use compressed instructions; evbarm-earmv7hf uses A32 (no
    16-bit instructions). The ia64 binaries are statically linked and no
    shared libraries are present in the base package.

    I looked at what the largest functions in libc are:

    For vax:

    000afab4 g F .text 0000266d __ns_sprintrrf
    00079d06 g F .text 00002c09 __vfwprintf_unlocked_l
    000cd5f0 g F .text 00003888 __vfprintf_unlocked_l

    For riscv-riscv32:

    000792ac g F .text 0000238a __vfwprintf_unlocked_l
    001169ec g F .text 000026aa __vfprintf_unlocked_l
    000f64b2 g F .text 00002af0 __ns_sprintrrf
    000c798c l F .text 00002c74 malloc_conf_init_helper
    0013be64 l F .text 0000503a stats_arena_print

    The last two functions do not occur in the vax's libc (as is the case
    for a lot of others), which probably explains much of the size
    difference. __ns_sprintrrf is larger on RISC-V while
    __vfprintf_unlocked_l and __vfwprintf_unlocked_l are smaller; taken
    together, these three functions are a factor of 1.186 larger on the
    vax than on riscv-riscv32. So the difference in libc sizes is
    probably due to additional functions in riscv-riscv32.

    Looking at ksh, pax and ed, the RISC-V variants have the smallest code
    sizes, even for ksh. The VAX has significantly larger sizes, even
    though it is still small relative to most other architectures.

    So if a major goal of the VAX project was to have small code sizes,
    going for RV32GC (riscv-riscv32) would have been a good idea. And an implementation somewhat similar to the HP-PA TS-1 (with smaller cache
    due to the SRAM technology at the time) plus a PDP-11-to-microcode
    decoder would not have increased the cost compared to the actual VAX,
    and probably resulted in faster execution.

    In an earlier posting I suggested a PDP-11->RV32G decoder, but that's
    not a good match given the condition-code architecture of the
    PDP-11 and the CC-less architecture of RISC-V. So one solution is to
    have a microarchitecture that has a CC register for PDP-11 emulation
    and decode the PDP-11 code to that. Another approach would be to add
    carry and overflow to the GPRs as RISC-V extension as I suggested
    elsewhere, and I guess then PDP-11 -> extended RISC-V would be
    possible.

    Instead of having a cache, an interleaved memory subsystem might also
    be able to provide the memory bandwidth to make better use of the RISC execution rate potential. Also, the compressed instructions reduce
    the instruction bandwidth requirements (compared to RISC-V without
    compressed instructions), but require an additional instruction buffer (additional TTL chips).

    Here are the scripts I used:

    for i in alpha evbarm-earmv7hf evbmips-mips64eb evbmips-mipseb evbppc \
             hppa i386 ia64 mvme68k sparc vax; do
      mkdir -p $i/unpacked && \
        (cd $i && wget http://ftp.fr.netbsd.org/pub/NetBSD/NetBSD-10.0/$i/binary/sets/base.tgz)
    done
    for i in amd64 evbarm-aarch64 sparc64; do
      mkdir -p $i/unpacked && \
        (cd $i && wget http://ftp.fr.netbsd.org/pub/NetBSD/NetBSD-10.0/$i/binary/sets/base.txz)
    done
    for i in riscv-riscv32 riscv-riscv64; do
      mkdir -p $i/unpacked && \
        (cd $i && wget http://ftp.fr.netbsd.org/pub/NetBSD-daily/HEAD/latest/$i/binary/sets/base.tgz)
    done
    mkdir -p amd64-daily/unpacked
    cd amd64-daily
    wget http://ftp.fr.netbsd.org/pub/NetBSD-daily/HEAD/latest/amd64/binary/sets/base.tar.xz
    cd ..
    # unpack whichever tarball name was downloaded (.tgz, .txz or .tar.xz)
    for i in *; do
      (cd $i/unpacked
       if test -f ../base.tgz; then tar xfz ../base.tgz
       elif test -f ../base.txz; then tar xfJ ../base.txz
       else tar xfJ ../base.tar.xz; fi)
    done
    for i in *; do
      for j in lib/libc.so bin/ksh bin/pax bin/ed; do
        (cd $i/unpacked
         if test -f $j; then
           objdump -h $j | awk --non-decimal-data '/[.]text/ {printf("%8d ","0x"$3)}'
         else echo -n "         "; fi)
      done
      echo $i
    done | sort -nk2

    For determining the largest functions in libc (in an unpacked/lib
    directory):

    objdump -t libc.so|grep '[.]text'|sort -t '\0' -k1.25

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Tue Mar 4 10:09:09 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I found an ARM2 manual and the short answer is that the chip
    drives the RAS and CAS signals to the dram directly.
    The chip's clock is adjustable from 100 kHz to 10 MHz
    and you match the cpu clock to your dram timing.
    There is no READY line on the memory bus.

    It does have one interesting feature that if the current address
    is sequential to the prior one it skips the RAS cycle.

    Yes, my memory is returning: someone explained here a while ago that
    this allowed fast execution: Instructions are executed sequentially,
    so as long as the row does not change, ARM get the instructions
    quickly out of the DRAM. Every data access incurs a RAS cycle, so the architects added load/store-multiple in order to benefit from this sequential-access optimization also for block copies and for register
    spill and refill around calls.

    Maybe switch from RV32GC to ARM T32 for my better-VAX time-travel
    project:-) Might also help with the condition codes.

    The Motorola Memory Book from 1979 shows MCM4027A 4kb*1 drams with
    80 to 165 ns CAS access, 120 to 250 RAS access, 320 to 375 R/W cycle.
    Similar numbers for MCM4116A 16kb*1 R/W cycle of 500 ns.

    VAX probably used 4kb 500 ns drams.

    But with the sequential-access optimization, 250ns cycles should be
    possible (with waiting on changing rows).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Lucas on Tue Mar 4 16:53:00 2025
    In article <vq4jrl$1cguk$1@dont-email.me>, bagel99@gmail.com (Brian G.
    Lucas) wrote:
    On 3/2/25 5:27 PM, John Levine wrote:
    Yup. PC/IX was a really nice Unix port for the IBM PC and nobody
    was interested.
    As this (the kernel part) was my project, it was very
    disappointing. I think IBM priced it such that with DOS being "free",
    it had no chance.

    Also, timing. According to Wikipedia, PC/IX cost $900 and was released in
    1984. By that time, there was a lot of business software and games
    available for DOS, but presumably, very little for PC/IX?

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to already5chosen@yahoo.com on Tue Mar 4 16:32:42 2025
    On Mon, 3 Mar 2025 19:53:12 +0200, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 03 Mar 2025 17:21:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Digital sold solid PCs in the 1990s. Some under brand DECpc, others
    under brand DEC Station.

    Was there an Intel based DECstation?
    The only ones I ever saw were MIPS based.

    <searches>

    Ahh! Wikipedia says there were 3 different DECstation lines: one based
    on PDP-8, another based on MIPS, and yet another based on Intel.
    Naturally one has to scan/read the entire article to find the Intel
    references.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to George Neuner on Wed Mar 5 01:28:18 2025
    On Tue, 04 Mar 2025 16:32:42 -0500
    George Neuner <gneuner2@comcast.net> wrote:

    On Mon, 3 Mar 2025 19:53:12 +0200, Michael S
    <already5chosen@yahoo.com> wrote:

    On Mon, 03 Mar 2025 17:21:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Digital sold solid PCs in the 1990s. Some under brand DECpc, others
    under brand DEC Station.

    Was there an Intel based DECstation?
    The only ones I ever saw were MIPS based.

    <searches>

    Ahh! Wikipedia says there were 3 different DECstation lines: one based
    on PDP-8, another based on MIPS, and yet another based on Intel.
    Naturally one has to scan/read the entire article to find the Intel references.

    The names I dug out of a 1994 Byte issue are DEC Celebris desktop and
    HiNote laptop. But those are not the names I had in mind.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Tue Mar 4 23:27:38 2025
    On Tue, 04 Mar 2025 10:04:20 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    On Mon, 3 Mar 2025 17:53:35 -0000 (UTC), Thomas Koenig wrote:

    If your aim is small code size, it is better to compare output
    compiled with -Os.

    Then it becomes an artificial benchmark, trying to minimize code size
    at the expense of real-world performance.

    Remember, VAX was built for real-world use, not for academic
    benchmarks.

    And supposedly the real-world constraints at the time made it necessary
    to minimize code size.

    Remember, RAM was much more expensive back then.

    For comparison, when Data General started their “Eagle” project (as chronicled in Tracy Kidder’s book “The Soul Of A New Machine”), which finally shipped as the MV/8000, they decided that having a full 32-bit
    address, VAX-style, was unnecessary, so they used some of those bits--4, I think--to hold privilege levels.

    Overall, they managed to end up with a simpler architecture than VAX. But
    it ran out of address space a little bit sooner.

    In the current discussion we look at how RV32GC might have fared under
    this constraint.

    Sure. Except you need a much more complicated and resource-hungry compiler
    than would have been reasonable to run on a VAX back then.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Wed Mar 5 00:05:30 2025
    On Wed, 5 Mar 2025 01:28:18 +0200, Michael S wrote:

    On Tue, 04 Mar 2025 16:32:42 -0500 George Neuner <gneuner2@comcast.net> wrote:

    Ahh! Wikipedia says there were 3 different DECstation lines: one based
    on PDP-8, annother based on MIPS, and yet another based on Intel.
    Naturally one has to scan/read the entire article to find the Intel
    references.

    The name I dug out of 1994 Byte issue are DEC Celebris desktop and
    HiNote laptop. But that's not the names I had in mind.

    Entirely different decades.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Tue Mar 4 23:29:12 2025
    On Tue, 4 Mar 2025 16:53 +0000 (GMT Standard Time), John Dallman wrote:

    According to Wikipedia, PC/IX cost $900 and was released
    in 1984. By that time, there was a lot of business software and games available for DOS, but presumably, very little for PC/IX?

    How would you have done games without being able to directly address
    screen memory? I’m sure PC/IX, being a Unix-type system, would have disallowed that. And X11 hadn’t even been developed yet.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Mar 5 07:36:36 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    In the current discussion we look at how RV32GC might have fared under
    this constraint.

    Sure. Except you need a much more complicated and resource-hungry compiler than would have been reasonable to run on a VAX back then.

    Looking at compiler technology available in 1975 close to DEC
    [wulf+75] (highly recommended), I don't think so. RISC-V code size
    (as well as VAX code size) benefits from register allocation (present
    in that compiler, running on the PDP11). Instruction scheduling would
    have been helpful for performance, but would not help code size, and
    is not particularly complex or resource-hungry when done on the
    basic-block level (good enough for single-issue RISCs).

    By contrast, making good use of the complex instructions of VAX in a
    compiler consumed significant resources (e.g., Figure 2 of https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
    factor 1.5 more code in the code generator for VAX than for RISC-II).
    Compilers at the time did not use the CISCy features much, which is
    one reason why the IBM 801 project and later the Berkeley RISC and
    Stanford MIPS proposed replacing them with a load/store architecture.
    I think that a lot of that is inherent, but a part of it may be due to
    the state of the art in instruction selection at the time. So RISC
    code comes easily out of compiler technology of the time, and for
    smaller code, you just have to perform register allocation, which was
    possible at the time, as demonstrated by Wulf et al.

    @Book{wulf+75,
    author = {William Wulf and Richard K. Johnsson and Charles
    B. Weinstock and Steven O. Hobbs and Charles M. Geschke},
    title = {The Design of an Optimizing Compiler},
    publisher = {Elsevier},
    year = {1975},
    isbn = {0-444-00164-6},
    annote = {Describes a complete Bliss/11 compiler for the
    PDP-11. It uses some interesting techniques: it
    uses a (hand-constructed) tree parsing automaton for
    parts of the code selection (Section~3.4); it
    optimizes the use of unary complement operators
    (Section~3.3); it uses a smart scheme to represent
    a conservative approximation of the lifetime of
    variables in constant space and uses that for
    register allocation (Sections~4.1.3 and~4.3).}
    }

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to D'Oliveiro on Wed Mar 5 09:16:00 2025
    In article <vq82c8$232tl$7@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    How would you have done games without being able to directly
    address screen memory? I'm sure PC/IX, being a Unix-type system,
    would have disallowed that.

    How?

    There's no memory management hardware in an 8088, and PC/IX ran on a
    basic PC/XT.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Theo@21:1/5 to Lawrence D'Oliveiro on Wed Mar 5 11:58:35 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    The answer was no, the VAX could not have been done as a RISC
    architecture. RISC wasn’t actually price-performance competitive until
    the latter 1980s:

    RISC didn’t cross over CISC until 1985. This occurred with the
    availability of large SRAMs that could be used for caches. It
    should be noted at the time the VAX-11/780 was introduced, DRAMs
    were 4 Kbits and the 8 Kbyte cache used 1 Kbits SRAMs. Memory
    sizes continued to improve following Moore’s Law, but it wasn’t
    till 1985, that Reduced Instruction Set Computers could be built
    in a cost-effective fashion using SRAM caches. In essence RISC
    traded off cache memories built from SRAMs for the considerably
    faster, and less expensive Read Only Memories that held the more
    complex instructions of VAX (Bell, 1986).

    ARM2 had no caches, but was still table-topping in its era.

    The thing often missed in the CISC v RISC debate is the cost of main memory.

    In the 1970s DRAM (or discrete SRAM) was very expensive. So you want a very tight instruction encoding that is maximally expressive - resulting in
    complex microcode and many-cycle instructions. Effectively the microcode
    was a table of library functions and the assembly was more like a series of
    API calls.

    In the mid 1980s (~1984) the Japanese had entered the DRAM market which
    caused the price of DRAMs to fall dramatically. That meant you could have a RISC CPU which was more profligate with its instruction encoding but could
    have a much simplified pipeline and so much better IPC. You didn't need
    to have the microcode library any more, you could just let the compiler do
    it. Also memory bandwidth had improved, allowing better feeding of a more profligate CPU (and compilers had got better too).

    In the late 1980s process improvements meant that on-die caches had become
    more affordable, which assisted memory bandwidth and latency further.

    Theo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Wed Mar 5 15:07:03 2025
    On Wed, 5 Mar 2025 00:05:30 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 5 Mar 2025 01:28:18 +0200, Michael S wrote:

    On Tue, 04 Mar 2025 16:32:42 -0500 George Neuner
    <gneuner2@comcast.net> wrote:

    Ahh! Wikipedia says there were 3 different DECstation lines: one
    based on PDP-8, another based on MIPS, and yet another based on
    Intel. Naturally one has to scan/read the entire article to find
    the Intel references.

    The names I dug out of a 1994 Byte issue are DEC Celebris desktop and
    HiNote laptop. But those are not the names I had in mind.

    Entirely different decades.

    You are not obliged to take part in a discussion you aren't able to
    follow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to Anton Ertl on Wed Mar 5 15:01:11 2025
    On Sun, 02 Mar 2025 18:30:24 GMT, Anton Ertl wrote:

    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Sun, 02 Mar 2025 09:34:37 GMT, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    A pipelined machine in 1978 would have had 50% to 100% more circuit boards than VAX 11/780, making it a lot more expensive.
    ...
    You could look at the MIT Lisp Machine; it used basically the same chips
    as a VAX 11/780 but was a pipelined load/store architecture internally.

    And what was the effect on the number of circuit boards? What effect
    did the load/store architecture have, and what effect did the pipelining have?

    It's been a number of years since I read about Lisp Machines and
    Symbolics. My impression was that they were both based on CISCy ideas;
    it's about closing the semantic gap, no? Load/store would surprise me.

    I don't know the internal architecture of Symbolics machines well enough
    to comment on it, only the MIT/LMI/TI ones.

    The MIT Lisp Machine was described as microcoded but this is more like a
    simple RTOS combined with an interpreter for the 16-bit instructions of
    the higher level emulated stack machine.

    The "micro" instruction set is three address, load/store, even has
    delay slots. There are a lot of registers so the instruction word is wide
    at 56 bits, there is 16kw of SRAM to hold this code.

    Code written in this looks like typical RISC assembler to me; I have added
    TFTP support in it, and there was also an option to compile Lisp down to
    the real instruction set.

    Built using a 74181+74182 ALU with other 74-series logic, the same as a
    VAX 11/780. The pipeline is two instructions deep.

    The design documentation is available online; someone could go through
    that to get the exact number of boards used. The purchase price was lower
    than a VAX though, even with a high-resolution display.

    And when the RISC revolution came, they could not compete. The RISCy
    way to Lisp implementation was explored in SPUR (and Smalltalk in SOAR)
    (one of which counts as RISC-III and the other as RISC-IV, I don't
    remember which), and commercialized in SPARC's instructions with support
    for tags (not used in the Lisp system that a former comp.arch regular contributed to).

    SOAR was before SPUR.

    The tags support on SPARC(32) only helps in Lisp for integer operations
    inline, like using a number as an array offset.

    The same word layout, using the "free" lower bits for tags when you know
    that objects are aligned to larger boundaries, is still used in most Lisp
    systems today, just without any hardware support: you need to generate
    instructions to shift down an integer value before using it.
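    (A minimal C sketch of that layout, with illustrative names rather than
    any particular Lisp system's: 8-byte-aligned heap objects leave the low
    3 bits free, tag 0 marks a fixnum, and reading a fixnum back is the
    shift mentioned above. It assumes an arithmetic right shift, as on
    mainstream compilers:)

    #include <stdio.h>
    #include <stdint.h>
    #include <assert.h>

    typedef intptr_t lispval;
    #define TAG_BITS   3
    #define TAG_MASK   ((1 << TAG_BITS) - 1)
    #define TAG_FIXNUM 0

    static lispval make_fixnum(intptr_t n) { return n << TAG_BITS; }

    static intptr_t fixnum_value(lispval v) {
        assert((v & TAG_MASK) == TAG_FIXNUM);
        return v >> TAG_BITS;           /* shift the tag bits back out */
    }

    int main(void) {
        lispval a = make_fixnum(21), b = make_fixnum(2);
        lispval sum = a + b;            /* addition needs no untagging */
        lispval prod = make_fixnum(fixnum_value(a) * fixnum_value(b));
        printf("%ld %ld\n", (long)fixnum_value(sum), (long)fixnum_value(prod));
        return 0;
    }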

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Wed Mar 5 18:08:01 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    Have TTL chips advanced between the VAX and the first HP-PA
    implementation? I don't think so.

    Oh yes, they did; there were nine years between the launch of the
    VAX and the launch of HP-PA.

    So what?

    and you can see pictures
    at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/

    Nice! The pictures are pretty good. I can read the markings on the
    chip. The first chip I looked at was marked 74AS181. TI introduced
    the 74xx series of TTL chips starting in 1964, and when I read TTL, I expected to see 74xx chips. The 74181 was introduced in February
    1970, and I expected it to be in the HP-PA CPU, as well as in the VAX
    11/780. <https://en.wikipedia.org/wiki/74181> confirms my expectation
    for the VAX, and the photo confirms my expectation for the first HP-PA
    CPU.

    The AS family was only introduced in 1980, so there were some advances
    between the VAX and this HP-PA CPU indeed. However, as far as the
    number of boards is concerned, a 74AS181 takes as much space as a
    plain 74181, so that difference is irrelevant for that aspect.

    I leave it to you to point out a chip on the HP-PA CPU that did not
    have a same-sized variant available in, say, 1975.

    What I found intriguing are the chips that have numbers on paper
    on them, like 09740-81710. That chip has a MMI logo still sticking
    out. This is the logo of Monolithic Memories, Inc. which developed
    the PAL chips of "Soul of a New Machine" and Eclipse MV 8000 fame.
    At https://en.wikipedia.org/wiki/Programmable_Array_Logic you can
    see the logo of the company.

    PALs were not available for the VAX development, and they certainly
    made implementing logic far less cumbersome, and they took up far less
    space than their equivalent in logic gates (again, as described in
    "The Soul of a New Machine", where Tom West gambled the development
    on MMI getting its act together).

    Given a (very rough) estimate that each PAL replaced four standard
    logic chips of similar size, my guess would be that it saved
    them the equivalent of two to three circuit boards, not bad.

    Another striking thing is how densely the circuit boards are packed,
    compared to the VAX boards one finds. I suspect they had access
    to more layers of printed circuit board than DEC ten years earlier.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Mar 5 15:28:16 2025
    Anton Ertl [2025-03-01 11:58:17] wrote:
    Bottom line: If you sent, e.g., me and the needed documents back in
    time to the start of the VAX project, and gave me a magic wand that
    would convince the DEC management and workforce that I know how to
    design their next architecture, and how to compiler for it, I would
    give the implementation team RV32GC as architecture to implement, and
    that they should use pipelining for that, and of course also give that
    to the software people.

    I wonder if an RV32GC would be competitive if implemented in the
    technology available back in 1977 (when the VAX-11/780 came out,
    according to Wikipedia).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stefan Monnier on Wed Mar 5 21:15:44 2025
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    Anton Ertl [2025-03-01 11:58:17] wrote:
    Bottom line: If you sent, e.g., me and the needed documents back in
    time to the start of the VAX project, and gave me a magic wand that
    would convince the DEC management and workforce that I know how to
    design their next architecture, and how to compiler for it, I would
    give the implementation team RV32GC as architecture to implement, and
    that they should use pipelining for that, and of course also give that
    to the software people.

    I wonder if an RV32GC would be competitive if implemented in the
    technology available back in 1977 (when the VAX-11/780 came out,
    according to Wikipedia).

    RISC in general could have been - the 801 (although it was
    implemented in ECL), HP's first HP-PA (implemented in TTL), and
    ARM show that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Wed Mar 5 19:04:20 2025
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    Have TTL chips advanced between the VAX and the first HP-PA
    implementation? I don't think so.
    Oh yes, they did; there were nine years between the launch of the
    VAX and the launch of HP-PA.
    So what?

    and you can see pictures
    at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/
    Nice! The pictures are pretty good. I can read the markings on the
    chip. The first chip I looked at was marked 74AS181. TI introduced
    the 74xx series of TTL chips starting in 1964, and when I read TTL, I
    expected to see 74xx chips. The 74181 was introduced in February
    1970, and I expected it to be in the HP-PA CPU, as well as in the VAX
    11/780. <https://en.wikipedia.org/wiki/74181> confirms my expectation
    for the VAX, and the photo confirms my expectation for the first HP-PA
    CPU.

    The AS family was only introduced in 1980, so there were some advances
    between the VAX and this HP-PA CPU indeed. However, as far as the
    number of boards is concerned, a 74AS181 takes as much space as a
    plain 74181, so that difference is irrelevant for that aspect.

    I leave it to you to point out a chip on the HP-PA CPU that did not
    have a same-sized variant available in, say, 1975.

    What I found intriguing are the chips that have numbers on paper
    on them, like 09740-81710. That chip has a MMI logo still sticking
    out. This is the logo of Monolithic Memories, Inc. which developed
    the PAL chips of "Soul of a New Machine" and Eclipse MV 8000 fame.
    At https://en.wikipedia.org/wiki/Programmable_Array_Logic you can
    see the logo of the company.

    PALs were not available for the VAX development, and they certainly
    made implementing logic far less cumbersome, and they took up far less
    space than their equivalent in logic gates (again, as described in
    "The Soul of a New Machine", where Tom West gambled the development
    on MMI getting its act together).

    Given a (very rough) estimate that each PAL replaced four standard
    logic chips of similar size, my guess would be that it saved
    them the equivalent of two to three circuit boards, not bad.

    Another striking thing is how densely the circuit boards are packed,
    compared to the VAX boards one finds. I suspect they had access
    to more layers of printed circuit board than DEC ten years earlier.

    MMI's PAL (Programmable Array Logic) is a subset of a PLA (Programmable
    Logic Array): a PAL has a programmable AND matrix but a fixed OR matrix;
    a PLA has both the AND and OR matrices programmable.

    Mask-programmed PLAs have been available since 1970, and field-programmable
    FPLAs since 1976 from a number of suppliers (e.g. Signetics).
    https://en.wikipedia.org/wiki/Programmable_logic_array

    If one was building a RISC style ISA cpu in 1975 they could be used
    for decoding and state machines for fetch, load/store, page table walker.
    I don't know the price.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Robert Swindells on Thu Mar 6 00:24:25 2025
    On Wed, 5 Mar 2025 15:01:11 -0000 (UTC), Robert Swindells wrote:

    The same word layout of using the "free" lower bits for tags when
    you know that objects are aligned to larger boundaries is still used
    in most Lisp systems today, just without any hardware support, you
    need to generate instructions to shift down an integer value before
    using it.

    *Lightbulb moment*

    How much would it cost in hardware to add support for ignoring some
    bottommost N bits (N fixed? configurable?) for most accesses?

    This ties in with my idea that it would have been useful to reserve the
    bottom 3 bits for a bit offset, albeit ignored (or even MBZ) by normal load/store instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Mar 6 01:01:41 2025
    On Thu, 6 Mar 2025 0:24:25 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 5 Mar 2025 15:01:11 -0000 (UTC), Robert Swindells wrote:

    The same word layout of using the "free" lower bits for tags when
    you know that objects are aligned to larger boundaries is still used
    in most Lisp systems today, just without any hardware support, you
    need to generate instructions to shift down an integer value before
    using it.

    *Lightbulb moment*

    How much would it cost in hardware to add support for ignoring some bottommost N bits (N fixed? configurable?) for most accesses?

    The gates used to query and control whether that functionality is present
    or absent will be 50× larger than the gates to ignore any misalignment.

    This ties in with my idea that it would have been useful to reserve the bottom 3 bits for a bit offset, albeit ignored (or even MBZ) by normal load/store instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Theo on Thu Mar 6 00:28:34 2025
    On 05 Mar 2025 11:58:35 +0000 (GMT), Theo wrote:

    ARM2 had no caches, but was still table-topping in its era.

    Table-topping in the PC market, perhaps, and less so if you look at price/performance as opposed to performance.

    Was it table-topping in the workstation market? Somehow it never seems to
    have been considered seriously for that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Mar 6 02:30:55 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    By contrast, making good use of the complex instructions of VAX in a
    compiler consumed significant resources (e.g., Figure 2 of
    https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
    factor 1.5 more code in the code generator for VAX than for RISC-II).
    Compilers at the time did not use the CISCy features much, which is
    one reason why the IBM 801 project and later the Berkeley RISC and
    Stanford MIPS proposed replacing them with a load/store architecture.

    I'm not so sure. The IBM Fortran H compiler used a lot of the 360's
    instruction set and it is my recollection that even the dmr C compiler
    would generate memory-to-memory instructions when appropriate. The PL.8
    compiler generated code for 5 architectures including S/360 and 68K, and
    I think I read somewhere that its S/360 code was considerably better
    than the native PL/I compilers.

    I get the impression that they found that once you have a reasonable number of registers, like 16 or more, the benefit of complex instructions drops because you can make good use of the values in the registers.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Thu Mar 6 06:53:23 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Given a (very rough) estimate that each PAL replaced four standard
    logic chips of simlar size, my guess would be that it saved
    them toe equivalent of two to three circuit boards, not bad.

    Another striking thing is how densely the circuit boards are packed,
    compared to the VAX boards one finds. I suspect they had access
    to more layers of printed circuit board than DEC ten years earlier.

    MMI's PAL Programmable Array Logic is a subset of a Programmable Logic Array. PAL has programmable AND matrix but a fixed OR matrix.
    PLA has both AND and OR matrix programmable.

    Mask-programmed PLAs have been available since 1970, and field-programmable
    FPLAs since 1976 from a number of suppliers (e.g. Signetics).
    https://en.wikipedia.org/wiki/Programmable_logic_array

    I read somewhere that these were not used much because, in the
    beginning, they were slow, big, expensive and difficult to program.

    This is probably why they were not considered as a replacement for
    the PAL chips for the MV/8000, had MMI failed - they were not up
    to the job.

    If one was building a RISC style ISA cpu in 1975 they could be used
    for decoding and state machines for fetch, load/store, page table walker.
    I don't know the price.

    They could have been used for the same things on the VAX 11/780. Does
    anybody know if they were?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to John Levine on Thu Mar 6 16:11:15 2025
    John Levine <johnl@taugh.com> writes:
    I'm not so sure. The IBM Fortran H compiler used a lot of the 360's
    instruction set and it is my recollection that even the dmr C compiler
    would generate memory-to-memory instructions when appropriate. The PL.8
    compiler generated code for 5 architectures including S/360 and 68K, and
    I think I read somewhere that its S/360 code was considerably better
    than the native PL/I compilers.

    I get the impression that they found that once you have a reasonable number of
    registers, like 16 or more, the benefit of complex instructions drops because you can make good use of the values in the registers.


    long ago and far away ... comparing Pascal systems to a Pascal front-end
    with the PL.8 back-end (the 3033 is a 370 of about 4.5 MIPS)

    Date: 8 August 1981, 16:47:28 EDT
    To: wheeler

    the 801 group here has run a program under several different PASCAL
    "systems". The program was about 350 statements and basically
    "solved" SOMA (block puzzle..). Although this is only one test, and
    all of the usual caveats apply, I thought the numbers were
    interesting... The numbers given in each case are EXECUTION TIME ONLY
    (Virtual on 3033).

    6m 30 secs PERQ (with PERQ's Pascal compiler, of course)
    4m 55 secs 68000 with PASCAL/PL.8 compiler at OPT 2
    0m 21.5 secs 3033 PASCAL/VS with Optimization
    0m 10.5 secs 3033 with PASCAL/PL.8 at OPT 0
    0m 5.9 secs 3033 with PASCAL/PL.8 at OPT 3

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Fri Mar 7 02:27:59 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    By contrast, making good use of the complex instructions of VAX in a
    compiler consumed significant resources (e.g., Figure 2 of https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
    factor 1.5 more code in the code generator for VAX than for RISC-II). Compilers at the time did not use the CISCy features much, which is
    one reason why the IBM 801 project and later the Berkeley RISC and
    Stanford MIPS proposed replacing them with a load/store architecture.

    VAX instructions are very complex and much of that complexity
    is hard to use in compilers. But even an extremely simple compiler
    can generate load-op combinations, decreasing the number of instructions.
    A rather simple hack is enough to combine additions in address
    arithmetic into an addressing mode. Also, operations with two or three
    memory addresses are easy to generate from a compiler. I think
    that chains of pointer dereferences in C should not be hard to
    convert to indirect addressing modes.
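    (Two C fragments of the kind meant here, with the rough VAX-style code a
    simple compiler can emit sketched in the comments; the instruction
    spellings are for flavor, not verified assembler output:)

    #include <stdio.h>

    struct node { struct node *next; long val; };

    long a[100], b[100], c[100];

    long demo(struct node *p, long i) {
        /* address arithmetic folds into an indexed mode and one
           three-operand memory-to-memory add, roughly:
               ADDL3 a[Ri], b[Ri], c[Ri]                          */
        c[i] = a[i] + b[i];
        /* a chain of pointer dereferences maps onto displacement and
           deferred modes, roughly: MOVL (Rp), Rt ; MOVL 8(Rt), R0 */
        return p->next->val;
    }

    int main(void) {
        struct node tail = { 0, 42 }, head = { &tail, 1 };
        a[3] = 2; b[3] = 5;
        long r = demo(&head, 3);
        printf("%ld %ld\n", r, c[3]);   /* 42 7 */
        return 0;
    }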

    I think that the state of chip technology was more important. For
    example, the 486 has a RISC-like pipeline with load-ops, but load-ops
    take the same time as two separate instructions. Similarly,
    operations on memory take the same time as load-op-store.
    So there was no execution-time gain from combined instructions,
    and clearly some complication compared to a load/store
    architecture. The main speed gain of RISC came from having
    the pipeline on a chip (multichip processors were pipelined,
    but expensive; earlier single-chip ones had no pipeline).
    So a load/store architecture (and no microcode) meant that
    early RISC could offer a good pipeline earlier.


    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Waldek Hebisch on Fri Mar 7 04:09:13 2025
    On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:

    VAX instructions are very complex and much of that complexity is hard
    to use in compilers.

    A lot of them mapped directly to common high-level operations. E.g. MOVC3/MOVC5 for string copying, and of course POLYx for direct evaluation of polynomial functions.
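    (What POLYx computed is Horner's rule: one multiply and one add per
    coefficient. A hedged C sketch; the real instruction also specified
    operand types and kept extra precision internally:)

    #include <stdio.h>

    /* evaluate c[degree]*x^degree + ... + c[1]*x + c[0] by Horner's rule */
    static double poly(double x, const double c[], int degree) {
        double r = c[degree];
        for (int i = degree - 1; i >= 0; i--)
            r = r * x + c[i];           /* one Horner step per coefficient */
        return r;
    }

    int main(void) {
        const double c[] = { 1.0, -3.0, 2.0 };  /* 2x^2 - 3x + 1 */
        printf("%g\n", poly(2.0, c, 2));        /* prints 3 */
        return 0;
    }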

    One could say that, in many ways, VAX machine language was a higher-level language than Fortran.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to John Dallman on Fri Mar 7 06:03:13 2025
    On Wed, 5 Mar 2025 09:16 +0000 (GMT Standard Time), jgd@cix.co.uk
    (John Dallman) wrote:

    In article <vq82c8$232tl$7@dont-email.me>, ldo@nz.invalid (Lawrence >D'Oliveiro) wrote:

    How would you have done games without being able to directly
    address screen memory? I'm sure PC/IX, being a Unix-type system,
    would have disallowed that.

    How?

    There's no memory management hardware in an 8088, and PC/IX ran on a
    basic PC/XT.


    Programmatically - the compiler (and/or assembler) could disallow it.

    Programmatic isolation works quite well as long as everyone plays by
    the rules. [Which, of course, is hard to enforce.]

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Waldek Hebisch on Fri Mar 7 13:57:02 2025
    On Fri, 7 Mar 2025 02:27:59 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    By contrast, making good use of the complex instructions of VAX in a compiler consumed significant resources (e.g., Figure 2 of https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
    factor 1.5 more code in the code generator for VAX than for
    RISC-II). Compilers at the time did not use the CISCy features
    much, which is one reason why the IBM 801 project and later the
    Berkeley RISC and Stanford MIPS proposed replacing them with a
    load/store architecture.

    VAX instructions are very complex and much of that complexity
    is hard to use in compilers. But even an extremely simple compiler
    can generate load-op combinations, decreasing the number of instructions.
    A rather simple hack is enough to combine additions in address
    arithmetic into an addressing mode. Also, operations with two or three
    memory addresses are easy to generate from a compiler. I think
    that chains of pointer dereferences in C should not be hard to
    convert to indirect addressing modes.

    I think that the state of chip technology was more important. For
    example, the 486 has a RISC-like pipeline with load-ops, but load-ops
    take the same time as two separate instructions. Similarly,
    operations on memory take the same time as load-op-store.
    So there was no execution-time gain from combined instructions,
    and clearly some complication compared to a load/store
    architecture.

    In the specific case of the i486, with its small (8KB) unified I+D cache,
    you will see a good gain from load+op combining, even if, going by the
    cycle counts in the manual, they are the same.
    For the Pentium, not necessarily so.

    The main speed gain of RISC came from having
    the pipeline on a chip (multichip processors were pipelined,
    but expensive; earlier single-chip ones had no pipeline).
    So a load/store architecture (and no microcode) meant that
    early RISC could offer a good pipeline earlier.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Fri Mar 7 15:23:02 2025
    Michael S wrote:
    On Fri, 7 Mar 2025 02:27:59 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    By contrast, making good use of the complex instructions of VAX in a
    compiler consumed significant resources (e.g., Figure 2 of
    https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
    factor 1.5 more code in the code generator for VAX than for
    RISC-II). Compilers at the time did not use the CISCy features
    much, which is one reason why the IBM 801 project and later the
    Berkeley RISC and Stanford MIPS proposed replacing them with a
    load/store architecture.

    VAX instructions are very complex and much of that complexity
    is hard to use in compilers. But even an extremely simple compiler
    can generate load-op combinations, decreasing the number of instructions.
    A rather simple hack is enough to combine additions in address
    arithmetic into an addressing mode. Also, operations with two or three
    memory addresses are easy to generate from a compiler. I think
    that chains of pointer dereferences in C should not be hard to
    convert to indirect addressing modes.

    I think that the state of chip technology was more important. For
    example, the 486 has a RISC-like pipeline with load-ops, but load-ops
    take the same time as two separate instructions. Similarly,
    operations on memory take the same time as load-op-store.
    So there was no execution-time gain from combined instructions,
    and clearly some complication compared to a load/store
    architecture.

    In the specific case of the i486, with its small (8KB) unified I+D cache,
    you will see a good gain from load+op combining, even if, going by the
    cycle counts in the manual, they are the same.
    For the Pentium, not necessarily so.

    Right:

    My Pentium-optimized Word Count program ran nearly twice as fast (in
    cycle counts) on a Pentium as on a 486. The inner loop was inverted to
    maximize the load-use distance and I got close to perfect pairing:

    From memory, similar to

    REPT 64
     add ax,dx                    ;; accumulate counts
     mov dx,increment_table[bx]   ;; increment for the classified pair
     mov bl,[es:di]               ;; 64 KB table to classify a pair of chars
     mov di,[si+OFFSET]           ;; next two input bytes form the table index

     add ax,dx
     mov dx,increment_table[bx+16]
     mov bh,[es:di]
     mov di,[si+OFFSET+2]
    ENDM

    On the Pentium this was only possible with separate load and operate instructions.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Fri Mar 7 17:35:57 2025
    On Fri, 7 Mar 2025 4:09:13 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:

    VAX instructions are very complex and much of that complexity is hard
    to use in compilers.

    A lot of them mapped directly to common high-level operations. E.g.
    MOVC3/MOVC5 for string copying, and of course POLYx for direct evaluation
    of polynomial functions.

    One could say that, in many ways, VAX machine language was a higher-level language than Fortran.

    One could also say at that point in time that FORTRAN was not that high
    of a high level language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Mar 7 17:34:21 2025
    On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:

    On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
    ----------------

    So, writing things like:
    y[55:48]=x[19:12];

    2 instructions in My 66000. One extract, one insert.


    And:
    j=x[19:12];
    Also a single instruction, or 2 or 3 in the fallback case (encoded as a
    shift and mask).

    1 instruction--extract (SLL or SLA)
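    (The shift-and-mask fallback mentioned above, as a minimal C sketch of
    the x[hi:lo] notation used in this subthread; field bounds are assumed
    valid, 63 >= hi >= lo >= 0:)

    #include <stdio.h>
    #include <stdint.h>

    static uint64_t width_mask(int hi, int lo) {
        return (hi - lo == 63) ? ~0ULL : ((1ULL << (hi - lo + 1)) - 1);
    }

    /* j = x[hi:lo] */
    static uint64_t ext(uint64_t x, int hi, int lo) {
        return (x >> lo) & width_mask(hi, lo);
    }

    /* y[hi:lo] = v */
    static uint64_t ins(uint64_t y, int hi, int lo, uint64_t v) {
        uint64_t m = width_mask(hi, lo);
        return (y & ~(m << lo)) | ((v & m) << lo);
    }

    int main(void) {
        uint64_t x = 0x00ABCDEF, y = 0;
        y = ins(y, 55, 48, ext(x, 19, 12));       /* y[55:48] = x[19:12] */
        printf("%#llx\n", (unsigned long long)y); /* 0xbc000000000000 */
        return 0;
    }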

    ----------------------

    For a simple test:
    lj[ 7: 0]=li[31:24];
    lj[15: 8]=li[23:16];
    lj[23:16]=li[15: 8];
    lj[31:24]=li[ 7: 0];
    Does seem to compile down to 4 instructions.

    1 instruction:: BITR rd,rs1,<8>

    Though, looking at the compiler code, it would be subject to the "side effects in lvalue may be applied twice" bug:
    (*ct++)[19:12]=(*cs++)[15:8];

    5 instructions:: LD, LD, EXT, INS, ST; with deferred ADD to Rcs and Rct.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Fri Mar 7 18:52:43 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Fri, 7 Mar 2025 4:09:13 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:

    VAX instructions are very complex and much of that complexity is hard
    to use in compilers.

    A lot of them mapped directly to common high-level operations. E.g.
    MOVC3/MOVC5 for string copying, and of course POLYx for direct
    evaluation of polynomial functions.

    One could say that, in many ways, VAX machine language was a
    higher-level language than Fortran.

    One could also say at that point in time that FORTRAN was not that high
    of a high level language.

    It was high enough, right from the start, to abstract away a _lot_
    of the machine, while still being quite efficient.

    "Since Fortran should virtually eliminate coding and debugging"
    was rather optimistic, though; programming tasks expanded too
    fast for that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Fri Mar 7 14:26:51 2025
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Given a (very rough) estimate that each PAL replaced four standard
    logic chips of similar size, my guess would be that it saved
    them the equivalent of two to three circuit boards, not bad.

    Another striking thing is how densely the circuit boards are packed,
    compared to the VAX boards one finds. I suspect they had access
    to more layers of printed circuit board than DEC ten years earlier.
    MMI's PAL Programmable Array Logic is a subset of a Programmable Logic Array.
    PAL has programmable AND matrix but a fixed OR matrix.
    PLA has both AND and OR matrix programmable.

    Mask programmed PLA's were available since 1970, and field programmable
    FPLA's available in 1976 from a number of suppliers (e.g. Signetics).
    https://en.wikipedia.org/wiki/Programmable_logic_array

    I read somewhere that these were not used much because, in the
    beginning, they were slow, big, expensive and difficult to program.

    Looking at the Signetics 82S100: in 1976 it had a max access time of
    50 ns and dissipated 600 mW in a 28-pin DIP.

    That seems like a lot, but I looked up what it would take if you built
    the same out of discrete 2-input NORs and 8-input NANDs, and it's
    basically the same delay but a lot more parts and board space.

    This is probably why they were not considered as a replacement for
    the PAL chips for the MV/8000, had MMI failed - they were not up
    to the job.

    If one was building a RISC style ISA cpu in 1975 they could be used
    for decoding and state machines for fetch, load/store, page table walker.
    I don't know the price.

    They could have been used for the same things on the VAX 11/780. Does anybody know if they were?

    PLAs are useful when you have don't-care bits or different bit-wise
    layout formats to parse, exactly like RISC ISAs do.

    The VAX opcode was one or two whole bytes, so a PLA would have been a
    waste. The operand specifier did have a varying format, but it was
    simple enough to be decoded with discrete logic.
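    (A C sketch of why PLAs suit that kind of decode: each product term is a
    mask/value pair, so don't-care bits are simply absent from the mask. The
    encodings below are made up for illustration, not from any real ISA:)

    #include <stdio.h>
    #include <stdint.h>

    struct term { uint32_t mask, value; const char *name; };

    static const struct term decode_plane[] = {
        { 0xFC000000, 0x20000000, "load"  },  /* top 6 bits, rest don't care */
        { 0xFC000000, 0x24000000, "store" },
        { 0xFC00003F, 0x00000020, "add"   },  /* opcode plus function field */
    };

    int main(void) {
        uint32_t insn = 0x24010008;
        for (unsigned i = 0; i < sizeof decode_plane / sizeof *decode_plane; i++)
            if ((insn & decode_plane[i].mask) == decode_plane[i].value)
                printf("matched %s\n", decode_plane[i].name);  /* store */
        return 0;
    }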

    The VAX 780 decoder used 2k * 12 ROM's to look up the starting uAddress.
    I haven't checked if it actually did this but a microsequencer might execute
    a multiway microsubroutine call by jamming that ROM output into the low
    address bits of the next address field while pushing the return uAddr
    on a hardware stack.

    If anyone is interested the VAX 780 hardware designs are available
    but you have to know how to read TTL and have old hardware manuals
    to look up part numbers.

    CPU Assembly
    http://www.bitsavers.org/pdf/dec/vax/780/MP00496_KA780_197911.pdf

    Data Path Description http://www.bitsavers.org/pdf/dec/vax/780/AA-H307-TE_VAX-11_780_Data_Path_Description_197902.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Lawrence D'Oliveiro on Fri Mar 7 14:59:47 2025
    Lawrence D'Oliveiro wrote:
    On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:

    VAX instructions are very complex and much of that complexity is hard
    to use in compilers.

    A lot of them mapped directly to common high-level operations. E.g. MOVC3/MOVC5 for string copying, and of course POLYx for direct evaluation of polynomial functions.

    How the VAX Lost Its POLY (and EMOD and ACB_floating too), 2011 https://simh.trailing-edge.com/docs/vax_poly.pdf

    One could say that, in many ways, VAX machine language was a higher-level language than Fortran.

    And the decimal instructions for COBOL (also on some PDP-11's).

    The only reason to add complex instructions like MOVC3, MOVC5, and
    others (SKIPC, SPANC, etc.) is if hardware can do a better job than a
    software subroutine. And you only add those instructions when you
    know you can afford the hardware, not in anticipation that someday
    we might do a better job.

    The reason VAX and 8086 benefit from string instructions is that
    they are sequential processors. It allows them to decode once and
    sit in a tight loop doing execute. But both still move byte-by-byte
    and do not attempt to optimize memory access operations.
    Also the sequencer is sequential so the loop counting and branch testing
    each take microcycles.

    So there is some benefit when comparing a VAX MOVc3 to a VAX subroutine,
    but not compared to a pipelined TTL RISC.

    If it is a pipelined RISC then decode is overlapped with execute
    so there is no advantage to these complex instructions vs a RISC
    subroutine doing the same in a loop. And the RISC subroutine might be
    faster because it can overlap the loop count and branch with memory access.
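
    For instance, an inner loop like this sketch (alignment and tail
    handling omitted) keeps one load and one store in flight while the
    count update and branch issue in their shadow:

      #include <stddef.h>
      #include <stdint.h>

      void copy_words(uint64_t *dst, const uint64_t *src, size_t nwords)
      {
          while (nwords--)     /* count + branch overlap the memory ops */
              *dst++ = *src++; /* one load, one store per iteration     */
      }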

    In both cases the real advantage is when you can afford the HW to
    optimize bus accesses as this is where the majority of cycles are spent.
    When you can afford the HW optimizer then you add them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Fri Mar 7 20:07:35 2025
    On Fri, 7 Mar 2025 18:27:06 +0000, Robert Finch wrote:

    On 2025-03-07 12:34 p.m., MitchAlsup1 wrote:
    On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:

    On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
    ----------------

    So, writing things like:
       y[55:48]=x[19:12];

    2 instructions in My 66000. One extract, one insert.

    Ibid for Q+. The logic for an extract and insert as one operation might
    add to the timing. Extract, sign/zero extend and copy back. Fields may
    be different sizes.
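
    In C terms such a fused field move is two shifts plus masking, which is
    where the timing worry comes from; a sketch (assuming w < 64 and both
    fields in range):

      #include <stdint.h>

      /* Copy the w-bit field at src offset so into dst at offset do_;
         y[55:48]=x[19:12] above is field_move(y, x, 12, 48, 8). */
      static inline uint64_t field_move(uint64_t dst, uint64_t src,
                                        unsigned so, unsigned do_,
                                        unsigned w)
      {
          uint64_t f = (src >> so) & ((1ULL << w) - 1);  /* extract   */
          uint64_t m = ((1ULL << w) - 1) << do_;         /* dest mask */
          return (dst & ~m) | (f << do_);                /* insert    */
      }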



    And:
       j=x[19:12];
    Also a single instruction, or 2 or 3 in the fallback case (encoded as a
    shift and mask).

    1 instruction--extract (SLL or SLA)

    Q+ has EXT/EXTU which is basically a SRL or SRA with mask applied
    afterwards. PowerPC has a rotate-left-and-mask instruction. In my
    opinion it makes more sense for extracts to be shifting right.
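
    As C sketches (assuming 1 <= w, o + w <= 64, and the usual arithmetic
    behaviour of >> on signed values), the two right-shifting extracts are:

      #include <stdint.h>

      static inline uint64_t extu(uint64_t x, unsigned o, unsigned w)
      {
          return (x >> o) & (~0ULL >> (64 - w)); /* shift right, mask */
      }

      static inline int64_t ext(uint64_t x, unsigned o, unsigned w)
      {
          /* left-justify the field, arithmetic-shift back down */
          return (int64_t)(x << (64 - o - w)) >> (64 - w);
      }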

    Both SR and SL have both sign control and masking.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Mar 7 20:25:44 2025
    On Fri, 7 Mar 2025 19:59:47 +0000, EricP wrote:

    Lawrence D'Oliveiro wrote:
    On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:

    VAX instructions are very complex and much of that complexity is hard
    to use in compilers.

    A lot of them mapped directly to common high-level operations. E.g.
    MOVC3/MOVC5 for string copying, and of course POLYx for direct evaluation
    of polynomial functions.

    How the VAX Lost Its POLY (and EMOD and ACB_floating too), 2011 https://simh.trailing-edge.com/docs/vax_poly.pdf

    In a way, one could say that, in many ways, VAX machine language was a
    higher-level language than Fortran.

    And the decimal instructions for COBOL (also on some PDP-11's).

    The only reason to add complex instructions like MOVC3, MOVC5 and
    others SKIPC, SPANC, etc is if hardware can do a better job than a
    software subroutine. And you only add those instructions when you
    know you can afford the hardware, not in anticipation that someday
    we might do a better job.

    The reason VAX and 8086 benefit from string instructions is because
    they are sequential processors. It allows them to do decode once and
    sit in a tight loop doing execute. But both still move byte-by-byte
    and do not attempt to optimize memory access operations.
    Also the sequencer is sequential so the loop counting and branch testing
    each take microcycles.

    So there is some benefit when comparing a VAX MOVc3 to a VAX subroutine,
    but not compared to a pipelined TTL RISC.

    If it is a pipelined RISC then decode is overlapped with execute
    so there is no advantage to these complex instructions vs a RISC
    subroutine doing the same in a loop.

    You forgot the word "sequentially" in the previous sentence.

    And the RISC subroutine might be
    faster because it can overlap the loop count and branch with memory
    access.

    In both cases the real advantage is when you can afford the HW to
    optimize bus accesses as this is where the majority of cycles are spent.
    When you can afford the HW optimizer then you add them.

    As to MOVc3; once your cache supports wide access (in support of
    64-bit misaligned access) you can get 128-bits read or written
    per cycle per port. So, there is very little added to the HW in
    order to support doing MOVc3 stuff at 64-bits per cycle:: in cycle
    1 we read 128-bits, in cycle 2 we write 128-bits and increment the
    iterator. For startup and terminations, the incrementation of the
    iterator creates a mask shutting down the lanes byte by byte.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Mar 7 22:25:12 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:

    For a simple test:
    lj[ 7: 0]=li[31:24];
    lj[15: 8]=li[23:16];
    lj[23:16]=li[15: 8];
    lj[31:24]=li[ 7: 0];
    Does seem to compile down to 4 instructions.

    1 instruction:: BITR rd,rs1,<8>

    Isn't that just 'bswap32' on x86, or REV32 on ARM64?
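
    For reference, the portable shift-and-mask form that compilers
    typically pattern-match into those single instructions (a sketch):

      #include <stdint.h>

      static inline uint32_t bswap32(uint32_t x)
      {
          return  (x >> 24)
               | ((x >>  8) & 0x0000FF00u)
               | ((x <<  8) & 0x00FF0000u)
               |  (x << 24);
      }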

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Mar 7 22:45:56 2025
    On Fri, 7 Mar 2025 22:25:12 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:

    For a simple test:
    lj[ 7: 0]=li[31:24];
    lj[15: 8]=li[23:16];
    lj[23:16]=li[15: 8];
    lj[31:24]=li[ 7: 0];
    Does seem to compile down to 4 instructions.

    1 instruction:: BITR rd,rs1,<8>

    Isn't that just 'bswap32' on x86, or REV32 on ARM64?

    A degenerate version is:: but consider::

    BITR Rd,Rs1,<1>

    performs bit reversal, while::

    BITR Rd,Rs1,<2>

    reverses pairs of bits, ...

    BITR Rd,Rs1,<16>

    reverses halfwords.
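
    A software model of the whole family (my guess at the semantics, for a
    64-bit register and power-of-two group sizes g = 1..32): run the
    butterfly stages below the group size, which reverses the order of the
    g-bit groups:

      #include <stdint.h>

      /* g=1 reverses all bits, g=8 reverses bytes, g=16 halfwords. */
      uint64_t bitr(uint64_t x, unsigned g)
      {
          if (g <= 1)
              x = ((x & 0x5555555555555555ULL) << 1) |
                  ((x >> 1) & 0x5555555555555555ULL);
          if (g <= 2)
              x = ((x & 0x3333333333333333ULL) << 2) |
                  ((x >> 2) & 0x3333333333333333ULL);
          if (g <= 4)
              x = ((x & 0x0F0F0F0F0F0F0F0FULL) << 4) |
                  ((x >> 4) & 0x0F0F0F0F0F0F0F0FULL);
          if (g <= 8)
              x = ((x & 0x00FF00FF00FF00FFULL) << 8) |
                  ((x >> 8) & 0x00FF00FF00FF00FFULL);
          if (g <= 16)
              x = ((x & 0x0000FFFF0000FFFFULL) << 16) |
                  ((x >> 16) & 0x0000FFFF0000FFFFULL);
          return (x << 32) | (x >> 32);
      }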

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Mar 7 22:51:13 2025
    On Fri, 7 Mar 2025 21:00:09 +0000, BGB wrote:

    On 3/7/2025 11:34 AM, MitchAlsup1 wrote:
    On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:

    On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
    ----------------

    So, writing things like:
       y[55:48]=x[19:12];

    2 instructions in My 66000. One extract, one insert.


    1 instruction in this case...

    The 3 sub-fields being 36, 48, and 56.

    The way I defined things does mean adding 1 to the high bit in the
    encoding, so 63:56 would be expressed as 64:56, which nominally uses 1
    more bit of range. Though, if expressed in 6 bits, the behavior I had
    defined it as effectively causes it to be modulo.

    My 66000 uses the same trick, allowing both 64 and 0 to indicate
    64-bits.

    ----------------------

    For a simple test:
       lj[ 7: 0]=li[31:24];
       lj[15: 8]=li[23:16];
       lj[23:16]=li[15: 8];
       lj[31:24]=li[ 7: 0];
    Does seem to compile down to 4 instructions.

    1 instruction:: BITR rd,rs1,<8>


    In this particular case, there is also a SWAP.L instruction, but I was ignoring it for the sake of this example, and my compiler isn't that clever.
    ------------

    Unlike Verilog, in C mode it will currently require single-bit fetch to
    use a notation like x[17:17], but this is more because a person is much
    more likely to type "x[17]" by accident (such as by using the wrong
    variable, a missing '*', or ...).

    I use <1,16> where the first is the length of the field, and the second
    is the offset from bit<0>.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Mar 8 00:57:13 2025
    On Fri, 7 Mar 2025 17:35:57 +0000, MitchAlsup1 wrote:

    On Fri, 7 Mar 2025 4:09:13 +0000, Lawrence D'Oliveiro wrote:

    In a way, one could say that, in many ways, VAX machine language was a
    higher-level language than Fortran.

    One could also say at that point in time that FORTRAN was not that high
    of a high level language.

    It was to most people of the time, particularly in the USA.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Sat Mar 8 01:00:02 2025
    On Fri, 7 Mar 2025 18:52:43 -0000 (UTC), Thomas Koenig wrote:

    [Fortran] was high enough, right from the start, to abstract away a
    _lot_ of the machine, while still being quite efficient.

    John Backus described the techno-cultural milieu in which Fortran was
    born, in one of a collection of papers on the origins of historically-significant programming languages, that I read many decades ago.

    It was one of the first, if not the first, serious attempt at an
    optimizing compiler. He mentions the surprise people felt (including, of course, seasoned assembly-language programmers) at how far the generated
    code departed from a simple correspondence to the individual statements of
    the original source.

    As I recall, much of the competition at the time took the form of floating-point calculation engines with interpreted languages on top.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Mar 8 01:02:29 2025
    On Fri, 7 Mar 2025 16:57:31 -0600, BGB wrote:

    Like, bitfield helpers were too weird/obscure, but hard-coding parts of
    the CRC or stuff related to DES encryption and similar into the ISA is fine...

    I blame C. The fact that C does not have built-in constructs to make
    convenient use of variable bitfields seems to be the main excuse for not supporting them in hardware instruction sets.

    And then in return, the lack of efficient support in hardware becomes an
    excuse for not having such constructs in the higher-level language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Sat Mar 8 02:08:26 2025
    On Sat, 8 Mar 2025 1:02:29 +0000, Lawrence D'Oliveiro wrote:

    On Fri, 7 Mar 2025 16:57:31 -0600, BGB wrote:

    Like, bitfield helpers were too weird/obscure, but hard-coding parts of
    the CRC or stuff related to DES encryption and similar into the ISA is
    fine...

    I blame C. The fact that C does not have built-in constructs to make convenient use of variable bitfields seems to be the main excuse for not supporting them in hardware instruction sets.

    Way back in 1983, John Perry of Bell Northern Research convinced me
    to add the masker to the barrel shifter in Mc 88100, and I saw then
    and there that it was an advancement in the state of "shift
    instructions".

    C notwithstanding.

    And then in return, the lack of efficient support in hardware becomes an excuse for not having such constructs in the higher-level language.

    Same excuse as misaligned memory references--shortsightedness.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Sat Mar 8 03:28:49 2025
    On Sat, 8 Mar 2025 2:49:50 +0000, BGB wrote:

    ------------------------

    I guess, while a person could do something like (in C):
    _BitInt(1048576) bmp;
    _Bool b;
    ...
    b=(bmp>>i)&1; //*blarg* (shift here would be absurdly expensive)

    This is likely to be rare vs more traditional strategies, say:
    uint64_t *bmp;
    int b, i;
    ...
    b=(bmp[i>>6]>>(i&63))&1;

    Question: How do you handle the case where the bit vector is an odd
    number of bits in width ?? Say <3, 5, 7, 17, ...>

    As well as the traditional strategy being a whole lot more efficient in
    this case...


    I guess the case could be made for a generic dense bit array.

    Mc 68020 had instructions to access bit-fields that cross word
    boundaries.

    Though, an open question is how one would define it in a way that is consistent with existing semantics rules.

    Architecture is more about what gets left OUT than what gets left IN.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Lynn Wheeler on Sat Mar 8 07:27:46 2025
    Lynn Wheeler <lynn@garlic.com> schrieb:

    long ago and far away ... comparing pascal to pascal front-end with
    pl.8 back-end (3033 is 370 about 4.5MIPS)

    Date: 8 August 1981, 16:47:28 EDT
    To: wheeler

    the 801 group here has run a program under several different PASCAL "systems". The program was about 350 statements and basically
    "solved" SOMA (block puzzle..). Although this is only one test, and
    all of the usual caveats apply, I thought the numbers were
    interesting... The numbers given in each case are EXECUTION TIME ONLY (Virtual on 3033).

    6m 30 secs PERQ (with PERQ's Pascal compiler, of course)
    4m 55 secs 68000 with PASCAL/PL.8 compiler at OPT 2
    0m 21.5 secs 3033 PASCAL/VS with Optimization
    0m 10.5 secs 3033 with PASCAL/PL.8 at OPT 0
    0m 5.9 secs 3033 with PASCAL/PL.8 at OPT 3

    Interesting figures. There is a factor of 50 between the 68000, which
    was in the ~1 MIPS range, and the 3033, with the same compiler technology.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sat Mar 8 10:52:36 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
    in a 28 pins dip.

    The Commodore 64 used a 82S100 or compatible for various purposes,
    especially for producing various chip select and RAM control signals
    from the addresses produced by the CPU or the VIC (graphics chip).
    Thomas Giesel wrote a very detailed report <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
    the original PLAs and their behaviour, in order to replace it
    (apparently it's a chip that was failure-prone).


    He reports that the 82S100 generates the #CASRAM signal with a
    propagation delay of 35ns in one direction and 25ns in the other, and
    the #ROMH signal with a propagation delay of 25ns in both directions
    (table 3.4). I guess that the 50ns are the worst case of anything you
    can do with the 82S100.

    He reports a current consumption of 102mA for the 82S100 (table 3.3),
    which at 5V (the regular voltage at the time) is pretty close to the
    600mW given in the data sheet. The rest of the board, including
    several chips with much more logic (CPU, VIC, SID (sound), 2xCIA
    (I/O)), consumed at most 770mA in his measurements; most of the rest
    was NMOS, while the 82S100 was TTL.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Mar 8 14:36:37 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Mar 2025 22:25:12 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:

    For a simple test:
    lj[ 7: 0]=li[31:24];
    lj[15: 8]=li[23:16];
    lj[23:16]=li[15: 8];
    lj[31:24]=li[ 7: 0];
    Does seem to compile down to 4 instructions.

    1 instruction:: BITR rd,rs1,<8>

    Isn't that just 'bswap32' on x86, or REV32 on ARM64?

    A degenerate version is:: but consider::

    BITR Rd,Rs1,<1>

    performs bit reversal, while::

    BITR Rd,Rs1,<2>

    reverses pairs of bits, ...

    Is there an application for this particular variant?


    BITR Rd,Rs1,<16>

    reverses halfwords.

    Since there generally aren't higher level language
    constructs that encapsulate this behavior, how useful
    is it in the real world? Does it justify the verif
    costs, much less the engineering cost?

    Bswap32/64 are genuinely useful in real world applications
    (particularly networking) thus the presence in most modern instruction sets.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Mar 8 17:57:57 2025
    On Sat, 8 Mar 2025 14:36:37 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Mar 2025 22:25:12 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:

    For a simple test:
    lj[ 7: 0]=li[31:24];
    lj[15: 8]=li[23:16];
    lj[23:16]=li[15: 8];
    lj[31:24]=li[ 7: 0];
    Does seem to compile down to 4 instructions.

    1 instruction:: BITR rd,rs1,<8>

    Isn't that just 'bswap32' on x86, or REV32 on ARM64?

    A degenerate version is:: but consider::

    BITR Rd,Rs1,<1>

    performs bit reversal, while::

    BITR Rd,Rs1,<2>

    reverses pairs of bits, ...

    Is there an application for this particular variant?


    BITR Rd,Rs1,<16>

    reverses halfwords.

    Since there generally aren't higher level language
    constructs that encapsulate this behavior, how useful
    is it in the real world? Does it justify the verif
    costs, much less the engineering cost?

    You already have a vector of bits that is spread across
    the data path (reversing bits.)

    The only cost is the multiplexer that selects certain
    orders of bits out as a result. So, the cost in area is
    negligible.

    Bswap32/64 are genuinely useful in real world applications
    (particularly networking) thus the presence in most modern instruction
    sets.

    You might want to do a BE->LE conversion where they byte
    string contains a mixture of 8-bit and 16-bit characters.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Sat Mar 8 13:03:38 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
    in a 28 pins dip.

    The Commodore 64 used a 82S100 or compatible for various purposes,
    especially for producing various chip select and RAM control signals
    from the addresses produced by the CPU or the VIC (graphics chip).
    Thomas Giesel wrote a very detailed report <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
    the original PLAs and their behaviour, in order to replace it
    (apparently it's a chip that was failure-prone).

    It looks like the C64's circuit design was one culprit.
    Though I do remember back then hearing about failures over time with
    other fuse programmable devices like PROMs.
    Something about the sputter from the blown fuses.

    He reports that the 82S100 generates the #CASRAM signal with a
    propagation delay of 35ns in one direction and 25ns in the other, and
    the #ROMH signal with a propagation delay of 25ns in both directions
    (table 3.4). I guess that the 50ns are the worst case of anything you
    can do with the 82S100.

    Yes, and it sounds like the circuit design depends on a race condition
    between two logic paths to work. Big no-no.

    He reports a current consumption of 102mA for the 82S100 (table 3.3),
    which at 5V (the regular voltage at the time) is pretty close to the
    600mW given in the data sheet. The rest of the board, including
    several chips with much more logic (CPU, VIC, SID (sound), 2xCIA
    (I/O)) , consumed at most 770mA in his measurements; most of the rest
    was NMOS, while the 82S100 was TTL.

    - anton

    This is not a problem with the 82S100.
    Whoever designed that circuit didn't know what they were doing.
    One can't use any combinatorial logic circuit and expect exact timing.
    The manufacturer specs indicate a range of speeds which depend on
    things like variations in power supply voltage, load, temperature.
    In the case of the 82S100 it is 35 ns typical, 50 ns max.
    Also these are logic chains, so each gate adds its own variations.

    The circuit should be designed so it works across all timing variations
    which is what synchronization clocks and flip flops are for.
    And even then flip flops have their own timing variations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sat Mar 8 19:42:46 2025
    On Sat, 8 Mar 2025 18:03:38 +0000, EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
    in a 28 pins dip.

    The Commodore 64 used a 82S100 or compatible for various purposes,
    especially for producing various chip select and RAM control signals
    from the addresses produced by the CPU or the VIC (graphics chip).
    Thomas Giesel wrote a very detailed report
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
    the original PLAs and their behaviour, in order to replace it
    (apparently it's a chip that was failure-prone).

    It looks like the C64's circuit design was one culprit.
    Though I do remember back then hearing about failures over time with
    other fuse programmable devices like PROMs.
    Something about the sputter from the blown fuses.

    A laser blasts a short wire so that there is no longer any connection.
    Then, in use, the electrical forces cause the still present aluminum
    wires to reconstruct themselves making contact and changing the state.

    The blowable wire is still immersed within an oxide layer, preventing
    the blown aluminum atoms from "really going anywhere" allowing small
    forces to reassemble the wire.

    He reports that the 82S100 generates the #CASRAM signal with a
    propagation delay of 35ns in one direction and 25ns in the other, and
    the #ROMH signal with a propagation delay of 25ns in both directions
    (table 3.4). I guess that the 50ns are the worst case of anything you
    can do with the 82S100.

    Yes, and it sounds like the circuit design depends on a race condition between two logic paths to work. Big no-no.

    In general, all signals that interact with the data path must be
    clocked and driven from the same edge of the data path. But even
    here, designers must be careful to load each select line evenly
    so that the line driving operand 1 forwarding has the same "cross
    data path" delay as the line driving every other data path select
    line.

    He reports a current consumption of 102mA for the 82S100 (table 3.3),
    which at 5V (the regular voltage at the time) is pretty close to the
    600mW given in the data sheet. The rest of the board, including
    several chips with much more logic (CPU, VIC, SID (sound), 2xCIA
    (I/O)) , consumed at most 770mA in his measurements; most of the rest
    was NMOS, while the 82S100 was TTL.

    - anton

    This is not a problem with the 82S100.
    Whoever designed that circuit didn't know what they were doing.
    One can't use any combinatorial logic circuit and expect exact timing.
    The manufacturer specs indicate a range of speeds which depend on
    things like variations in power supply voltage, load, temperature.
    In the case of the 82S100 it is 35 ns typical, 50 ns max.
    Also these are logic chains, so each gate adds its own variations.

    The circuit should be designed so it works across all timing variations
    which is what synchronization clocks and flip flops are for.
    And even then flip flops have their own timing variations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Torbjorn Lindgren on Sun Mar 9 01:46:03 2025
    On Sun, 9 Mar 2025 1:27:19 +0000, Torbjorn Lindgren wrote:


    Notice the common factor here - MT/Commodore was making a lot of
    "working but only barely" chips and Commodore used them internally to
    save money and also sold them to others which used them because, well,
    they were frequently the cheapest.

    Radio Shack TRS-80 would buy every Z80 that did not make 2MHz
    operating frequency. They used something around 1.87 MHz so the
    CPU clock and the TV clock were the same clock.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Torbjorn Lindgren@21:1/5 to Anton Ertl on Sun Mar 9 01:27:19 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
    in a 28 pins dip.

    The Commodore 64 used a 82S100 or compatible for various purposes,
    especially for producing various chip select and RAM control signals
    from the addresses produced by the CPU or the VIC (graphics chip).
    Thomas Giesel wrote a very detailed report <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
    the original PLAs and their behaviour, in order to replace it
    (apparently it's a chip that was failure-prone).

    AFAIK the Signetics 82S100 isn't failure-prone in the C64; the MOS
    Technology (i.e. Commodore) *clone* that they switched to due to cost
    reasons IS known to be failure-prone. That is the reason there are
    lots of PLA replacement projects.

    If you have any actual Signetics device in a C64 it'll very likely be
    fine, unless the power supply failed and fed everything too-high
    voltages, which is unfortunately a common failure mode of the C64 power
    brick. This PSU failure usually destroys most or all of the memory
    chips, the SID, and one or, more likely, several of the CPU, PLA and ROMs.

    Other things with known high failure rates are the MOS Technology 74xx
    clones and the MT memory. These failures also include MT-branded memory
    chips of that specific type when used in non-Commodore items like PC
    clones, so it's not just the C64.

    Notice the common factor here - MT/Commodore was making a lot of
    "working but only barely" chips and Commodore used them internally to
    save money and also sold them to others which used them because, well,
    they were frequently the cheapest.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Sun Mar 9 02:55:09 2025
    On Sat, 8 Mar 2025 03:28:49 +0000, MitchAlsup1 wrote:

    Mc 68020 had instructions to access bit-fields that cross word
    boundaries.

    And, with typical big-endian quirkiness†, the numbering was completely the opposite way round from the single-bit instructions from the earlier
    members of the 680x0 family.

    †to put it politely

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sun Mar 9 07:02:06 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sun, 9 Mar 2025 1:27:19 +0000, Torbjorn Lindgren wrote:


    Notice the common factor here - MT/Commodore was making a lot of
    "working but only barely" chips and Commodore used them internally to
    save money and also sold them to others which used them because, well,
    they were frequently the cheapest.

    Radio Shack TRS-80 would buy every Z80 that did not make 2MHz
    operating frequency. They used something around 1.87 MHz so the
    CPU clock and the TV clock were the same clock.

    RaptorCS buys POWER 9 chips where a higher number of cores failed
    than permitted by IBM's specs, and then sells them as systems with
    a lower working number of cores.

    The main disadvantages are a) price and b) they are stuck with POWER
    9 (due to the binary driver blob on Power 10, among other things).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sun Mar 9 10:13:06 2025
    MitchAlsup1 wrote:
    On Sat, 8 Mar 2025 18:03:38 +0000, EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
    in a 28 pins dip.

    The Commodore 64 used a 82S100 or compatible for various purposes,
    especially for producing various chip select and RAM control signals
    from the addresses produced by the CPU or the VIC (graphics chip).
    Thomas Giesel wrote a very detailed report
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
    the original PLAs and their behaviour, in order to replace it
    (apparently it's a chip that was failure-prone).

    It looks like the C64's circuit design was one culprit.
    Though I do remember back then hearing about failures over time with
    other fuse programmable devices like PROMs.
    Something about the sputter from the blown fuses.

    A laser blasts a short wire so that there is no longer any connection.
    Then, in use, the electrical forces cause the still present aluminum
    wires to reconstruct themselves making contact and changing the state.

    The blowable wire is still immersed within an oxide layer, preventing
    the blown aluminum atoms from "really going anywhere" allowing small
    forces to reassemble the wire.

    I'm referring to the electrically programmable bipolar PROMs.
    The 1977 TI Bipolar Memory manual says theirs had titanium-tungsten fuses. These were programmed with a 10.5V pulse up to 750 mA for 1 us to 1 ms.

    This would create a metallic vapor cloud around the cell inside the chip.
    I can envision that perhaps over time the electric field on the cell
    might attract the debris to grow into dendrites that eventually
    short the cell and turn a 0 to 1.

    As these could only be programmed once and cost a bundle,
    they were soon replaced by ultraviolet light erasable EPROMs.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Torbjorn Lindgren on Sun Mar 9 10:15:00 2025
    Torbjorn Lindgren wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
    in a 28 pins dip.
    The Commodore 64 used a 82S100 or compatible for various purposes,
    especially for producing various chip select and RAM control signals
    from the addresses produced by the CPU or the VIC (graphics chip).
    Thomas Giesel wrote a very detailed report
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
    the original PLAs and their behaviour, in order to replace it
    (apparently it's a chip that was failure-prone).

    AFAIK the Signetics 82S100 isn't failure-prone in C64, the MOS
    Technology (IE Commodore) *clone* that they switched to due to cost
    reasons IS known to be failure prone. Those are the reason there's
    lots of PLA replacement projects.

    C64 could just as easily have failed if Signetics had rev'd the
    chip design to improve the yield.

    All the 82S100 spec says is that if any input changes then all the outputs
    will have settled after 50 ns. It makes no statement about the relative
    order that outputs will change for different input combinations.
    As it is NORMAL for many outputs of combinatorial circuits to glitch
    when changing state, it is wrong to depend on them not doing so.

    This is independent of failures of the MT clone which also could be true.

    If you have any actual Signetics device in a C64 it'll very likely be
    fine unless the power supply failed and fed everything too high
    voltages which is unfortunately a common failure mode on the C64 power
    brick. This PSU failure usually destroys most or all of the memory
    chips, the SID and one or more likely multiple of CPU, PLA and ROMs.

    Affordable switching power supplies were relatively new back then.
    IIRC there were two designs, the expensive one that used a transformer and failed safe to ground, and the cheap one that failed to line voltage.

    Other things with known high failure rates are the MOS Technology 74xx
    clones and the MT memory. These failures also include MT branded memory
    chips of that specific type when used in non-Commodore items like PC
    clones so it's not just C64.

    Notice the common factor here - MT/Commodore was making a lot of
    "working but only barely" chips and Commodore used them internally to
    save money and also sold them to others which used them because, well,
    they were frequently the cheapest.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Mon Mar 10 23:17:38 2025
    On Mon, 10 Mar 2025 17:40:55 -0500, BGB wrote:

    It is rare for bitmap bits to not be a power of 2...

    In MPEG, some timestamp field was 33 bits in length, as I recall.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Mar 11 00:53:31 2025
    On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:

    On 3/7/2025 9:28 PM, MitchAlsup1 wrote:
    On Sat, 8 Mar 2025 2:49:50 +0000, BGB wrote:

    ------------------------

    I guess, while a person could do something like (in C):
       _BitInt(1048576) bmp;
    _Bool b;
       ...
       b=(bmp>>i)&1;  //*blarg* (shift here would be absurdly expensive)

    This is likely to be rare vs more traditional strategies, say:
       uint64_t *bmp;
       int b, i;
       ...
       b=(bmp[i>>6]>>(i&63))&1;

    Question: How do you handle the case where the bit vector is an odd
    number of bits in width ?? Say <3, 5, 7, 17, ...>


    It is rare for bitmap bits to not be a power of 2...

    I would guess, at least for C, something like (for 3 bits):
    uint32_t *bmp;
    uint64_t bv;
    int i, b, bp;
    ...
    bp=i*3;
    bv=*(uint64_t *)(bmp+(bp>>5));
    b=(bv>>(bp&31))&7;

    Could apply to anything up to 31 bits.

    Not bad.

    Could do similar with __int128 (or uint128_t), which extends it up to 63 bits.
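
    A related sketch that sidesteps the container-width arithmetic
    entirely (assumes little-endian, w <= 57, and a buffer padded so the
    8-byte load stays in bounds):

      #include <stdint.h>
      #include <string.h>

      /* Fetch the w-bit field starting at absolute bit offset 'bit'. */
      uint64_t get_bits(const uint8_t *buf, uint64_t bit, unsigned w)
      {
          uint64_t v;
          memcpy(&v, buf + (bit >> 3), sizeof v); /* unaligned-safe */
          return (v >> (bit & 7)) & ((1ULL << w) - 1);
      }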
    ------------
    Mc 68020 had instructions to access bit-fields that cross word
    boundaries.


    I guess one could argue the use-case for adding a generic funnel shift instruction.

    My 66000 has CARRY-SL/SR which performs a double wide operand shifted
    by a single wide count (0..63) and produces a double wide result {IO}.
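
    A C sketch of the right-shifting case (count 0..63; the left-shift
    form is symmetric):

      #include <stdint.h>

      typedef struct { uint64_t hi, lo; } u128;

      u128 dshr(u128 v, unsigned n)  /* (hi:lo) >> n, 0 <= n <= 63 */
      {
          u128 r;
          if (n == 0) return v;      /* avoid the 64-bit shift below */
          r.lo = (v.lo >> n) | (v.hi << (64 - n));
          r.hi =  v.hi >> n;
          return r;
      }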

    If I added it, it would probably be a 64-bit encoding (generally needed
    for 4R).

    By placing the width in position {31..37} you can compress this down
    to 3-Operand.

    ----------
    Architecture is more about what gets left OUT than what gets left IN.

    Well, except in this case it was more a question of trying to fit it in
    with C semantics (and not consideration for more ISA features).

    Clearly, you want to support C semantics--but you can do this in a way
    that also supports languages with real bit-field support.
    ---------------
    There are still some limitations, for example:
    In my current implementation, CSR's are very limited (may only be used
    to load and store CSRs; not do RMW operations on CSRs).

    My 66000 only has 8 CPU CRs, and even these are R/W through MMI/O
    space. All the other (effective) CRs are auto loaded in line quanta.

    This mechanism allows one CPU to figure out what another CPU is up to
    simply by meandering through its CRs...

    Though, have noted that seemingly some number of actual RISC-V cores
    also have this limitation.


    A more drastic option might be to rework the hardware interfaces and
    memory map enough to make it possible to run an OS like Linux, but
    there doesn't really seem to be a standardized set of hardware
    interfaces or memory map defined.

    Some amount of SOC's though seem to use a map like:
    00000000..0000FFFF: ROM goes here.
    00010000..0XXXXXXX: RAM goes here.
    ZXXXXXXX..FFFFFFFF: Hardware / MMIO

    My 66000::
    00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
    01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
    10 0000000000000000..FFFFFFFFFFFFFFFF: config
    11 0000000000000000..FFFFFFFFFFFFFFFF: ROM

    Whatever you are trying to do, you won't run out of address space until
    64 bits becomes insufficient. Note: all HW interfaces are in config space
    and all CRs are in MMI/O space.

    ------------
    They seem to also be asking for a UEFI based boot process, but this
    would require having a bigger BootROM (can't likely fit a UEFI
    implementation into 32K). Seems that the idea is to have the UEFI BIOS
    boot the kernel directly as an ELF image (traditionally UEFI was always PE/COFF based?...).

    Boot ROM should be big enough that no BOOT ROM will ever exceed its
    size.
    --------------
    There is a probable need to move away from the "BJX2" name, which as
    noted, has some unfortunate connotations (turns out it was also used for
    a lewd act) and seems to be triggering to Google's automatic content filtering (probably for a similar reason).

    Hilarious--and reason enough to change names.

    When you do change names, can you spell LD and ST instead of MOV ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Mar 11 17:57:02 2025
    On Tue, 11 Mar 2025 4:49:16 +0000, BGB wrote:

    On 3/10/2025 7:53 PM, MitchAlsup1 wrote:
    -------------------
    I guess one could argue the use-case for adding a generic funnel shift
    instruction.

    My 66000 has CARRY-SL/SR which performs a double wide operand shifted
    by a single wide count (0..63) and produces a double wide result {IO}.


    OK.


    If I added it, it would probably be a 64-bit encoding (generally needed
    for 4R).

    By placing the width in position {31..37} you can compress this down
    to 3-Operand.


    It is 3 operand if being used as a 128-bit shift op.
    But funnel shift operators imply 3 independent inputs and 1 output.

    And 2 shifts or exotic masking. Which is why I stopped early.

    ----------
    Architecture is more about what gets left OUT than what gets left IN.

    Well, except in this case it was more a question of trying to fit it in
    with C semantics (and not consideration for more ISA features).

    Clearly, you want to support C semantics--but you can do this in a way
    that also supports languages with real bit-field support.
    ---------------

    Yeah.

    Amidst debugging and considering Verilog support...


    There are still some limitations, for example:
    In my current implementation, CSR's are very limited (may only be used
    to load and store CSRs; not do RMW operations on CSRs).

    My 66000 only has 16 CPU CRs, and even these are R/W through MMI/O
    space. All the other (effective) CRs are auto loaded in line quanta.

    This mechanism allows one CPU to figure out what another CPU is up to
    simply by meandering through its CRs...


    I had enough space for 64 CRs, but only a small subset are actually
    used. Some more had space reserved, but were related to non-implemented features.

    RISC-V has a 12-bit CSR space, of which:
    Some map to existing CRs;
    My whole CR space was stuck into an implementation-dependent range.

    My whole space is mapped by BAR registers as if they were on PCIe.

    Some read-only CSRs were mapped over to CPUID.

    I don't even have a CPUID--if you want this you go to config space
    and read the configuration lists and extended configuration lists.

    Of which, all of the CPUID indices were also mapped into CSR space.

    CPUID is soooooo pre-PCIe.


    Seemingly lacks defined user CSRs for timer or HW-RNG, which do exist in
    my case. It is very useful to be able to access a HW timer in userland,
    as otherwise it would waste a lot of clock-cycles using system calls for "clock()" and similar.

    That is why they are ALL available in MMI/O Space. If this user needs
    access to that timer, then there is a PTE that translates the LD/ST
    into an access to that device.


    Though, have noted that seemingly some number of actual RISC-V cores
    also have this limitation.


    A more drastic option might be to try to rework the hardware interfaces
    and memory map hopefully enough to try to make it possible to run an OS
    like Linux, but there doesn't really seem to be a standardized set of
    hardware interfaces or memory map defined.

    Some amount of SOC's though seem to use a map like:
       00000000..0000FFFF: ROM goes here.
       00010000..0XXXXXXX: RAM goes here.
       ZXXXXXXX..FFFFFFFF: Hardware / MMIO

    My 66000::
     00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
     01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
     10 0000000000000000..FFFFFFFFFFFFFFFF: config
     11 0000000000000000..FFFFFFFFFFFFFFFF: ROM

    Whatever you are trying to do, you won't run out of address space until
    64 bits becomes insufficient. Note: all HW interfaces are in config space
    and all CRs are in MMI/O space.


    There seems to be a lot here defined in terms of 32-bit physical spaces, including on 64-bit targets.

    Though, thus far, my existing core also has pretty much all of its physical
    map in 32-bit space.

    My 66000 does not even have a 32-bit space to map into.
    You can synthesize such a space by not using any of the
    top 32 address bits in PTEs--but why ??


    The physical ranges from 0001_00000000 .. 7FFF_FFFFFFFF currently
    contain a whole lot of nothing.


    I once speculated on the possibility of special hardware to memory-map
    the whole SDcard into physical space, but nothing has been done yet (and
    such a hardware interface would be a lot more complicated than my
    existing interface).


    An intermediate option being to expand the SPI interface to support 256
    bit bursts.

    My interconnect bus is 1 cache line (512-bits) per cycle plus
    address and command.

    Say:
    P_SPI_QDATA0..P_SPI_QDATA3

    It appears this has already been partly defined (though not fully
    implemented in the 256-bit case).

    Where, the supported XMIT sizes are:
    8 bit: Single Byte
    64 bit: 8 bytes
    256 bit: 32 bytes

    With larger bursts mostly to reduce the amount of round-trip delay over
    the bus.

    My 66000 interconnect bus can transmit a whole page in a single
    burst--that appears ATOMIC to interested 3rd parties.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Tue Mar 11 17:44:45 2025
    On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:

    On 2025-03-10 8:53 p.m., MitchAlsup1 wrote:
    On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:
    ------------------------

    My 66000::
     00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
     01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
     10 0000000000000000..FFFFFFFFFFFFFFFF: config
     11 0000000000000000..FFFFFFFFFFFFFFFF: ROM

    How does one reference DRAM vs MMI/O at the same address using a LD / ST instruction?

    The MMU translates the virtual address to a universal address.
    The PTE supplies the extra bits.
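
    As a sketch of the mechanism (the field positions here are invented
    for illustration, not taken from any My 66000 document): the PTE
    carries two space-select bits that ride along with the translated
    physical offset:

      #include <stdint.h>

      #define PAGE_MASK 0x0000000000000FFFULL /* 4K pages, for the sketch */
      #define PFN_MASK  0x3FFFFFFFFFFFF000ULL /* hypothetical PFN field   */

      typedef struct {
          unsigned space;   /* 00 DRAM, 01 MMI/O, 10 config, 11 ROM */
          uint64_t offset;  /* 64-bit offset within that space      */
      } uaddr;

      uaddr translate(uint64_t va, uint64_t pte)
      {
          uaddr ua;
          ua.space  = (unsigned)(pte >> 62) & 3;           /* invented  */
          ua.offset = (pte & PFN_MASK) | (va & PAGE_MASK); /* usual PTE */
          return ua;
      }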

    Q+ CPU just uses a 64-bit address range. The config space is specified
    in a CR defaulting to FFFFFFFFDxxxxxxx. The TLB is set up at boot to
    access ROM at FFFFFFFFFFFCxxxxx. Otherwise there is no distinction between addresses. There is a region table in the system that describes up to
    eight distinct regions.

    Every major block in my architecture has ports in config space that
    smell just like that of a device on PCIe having said control block.
    My thought was that adding all these to the config name space might
    cramp any fixed (or programmable) partition. So, the easiest thing
    is to give it its own big space.

    Then every device header gets 1 or more pages of address space for
    its own control registers. PCIe is now a 42-bit address space::
    segment, bus, device, function, xreg, reg, and likely to grow as
    AHCI can consume a whole PCIe segment by itself.

    Whatever you are trying to do, you won't run out of address space until
    64 bits becomes insufficient. Note: all HW interfaces are in config space
    and all CRs are in MMI/O space.

    Are there any CRs accessible with any instructions besides LD / ST?

    CRs accessible via HR instruction theoretically == 40
    CRs accessible via HR instruction at a privilege >= 16

    Basically, HR provides access to this threads critical CRs
    {IP, Root, ASID, CSP, exception ctrl, inst ctrl, interrupts ...}
    and has access to the CPU SW stack according to privilege.

    ------------
    They seem to also be asking for a UEFI based boot process, but this
    would require having a bigger BootROM (can't likely fit a UEFI
    implementation into 32K). Seems that the idea is to have the UEFI BIOS
    boot the kernel directly as an ELF image (traditionally UEFI was always
    PE/COFF based?...).

    Boot ROM should be big enough that no BOOT ROM will ever exceed its
    size.
    --------------
    There is a probable need to move away from the "BJX2" name, which as
    noted, has some unfortunate connotations (turns out it was also used for
    a lewd act) and seems to be triggering to Google's automatic content
    filtering (probably for a similar reason).

    Coming up with names is surprisingly difficult. I got into a discussion
    with a colleague a while ago about this. They were having difficulty
    coding something, and it turned out to be simply what names to choose for routines.

    Hilarious--and reason enough to change names.

    When you do change names, can you spell LD and ST instead of MOV ??

    Yes, please, LD / ST: it is so much clearer what is going on, and less trouble getting confused by the placement of operands.

    I always put the memory operand second, which breaks the pattern of
    having the destination operand first. Otherwise the destination is
    first.

    I go cross-eyed reading code that is a whole lot of moves.

    I agree.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Tue Mar 11 18:00:45 2025
    On Tue, 11 Mar 2025 6:14:31 +0000, BGB wrote:

    A lot of people swear by:
    movl %eax, 16(%rdi)
    ....

    More swear at it than for it.

    Most likely: those who swear by it have brain damage by x86-ism.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Mar 11 11:15:06 2025
    On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
    On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:

    On 2025-03-10 8:53 p.m., MitchAlsup1 wrote:
    On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:
    ------------------------

    My 66000::
      00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
      01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
      10 0000000000000000..FFFFFFFFFFFFFFFF: config
      11 0000000000000000..FFFFFFFFFFFFFFFF: ROM

    How does one reference DRAM vs MMI/O at the same address using a LD / ST
    instruction?

    The MMU translates the virtual address to a universal address.
    The PTE supplies the extra bits.

    Q+ CPU just uses a 64-bit address range. The config space is specified
    in a CR defaulting to FFFFFFFFDxxxxxxx The TLB is setup at boot to
    access ROM at FFFFFFFFFFFCxxxxx Otherwise there is no distinction with
    addresses. There is a region table in the system that describes up to
    eight distinct regions.

    Every major block in my architecture has ports in config space that
    smell just like that of a device on PCIe having said control block.
    My thought was that adding all these to the config name space might
    cramp any fixed (or programmable) partition. So, the easiest thing
    is to give it its own big space.

    Then every device header gets 1 or more pages of address space for
    its own control registers. PCIe is now a 42-bit address space::
    segment, bus, device, function, xreg, reg and likely to grow as
    AHCI can consume a whole PCIe segment by itself.

    Whatever you are trying to do, you won't run out of address space until
    64 bits becomes insufficient. Note: all HW interfaces are in config space
    and all CRs are in MMI/O space.

    Are there any CRs accessible with any instructions besides LD / ST?

    CRs accessible via HR instruction theoretically  == 40
    CRs accessible via HR instruction at a privilege >= 16

    Basically, HR provides access to this threads critical CRs
    {IP, Root, ASID, CSP, exception ctrl, inst ctrl, interrupts ...}
    and has access to the CPU SW stack according to privilege.

    ------------
    They seem to also be asking for a UEFI based boot process, but this
    would require having a bigger BootROM (can't likely fit a UEFI
    implementation into 32K). Seems that the idea is to have the UEFI BIOS
    boot the kernel directly as an ELF image (traditionally UEFI was always
    PE/COFF based?...).

    Boot ROM should be big enough that no BOOT ROM will ever exceed its
    size.
    --------------
    There is a probable need to move away from the "BJX2" name, which as
    noted, has some unfortunate connotations (turns out it was also used for
    a lewd act) and seems to be triggering to Google's automatic content
    filtering (probably for a similar reason).

    Coming up with names is surprisingly difficult. I got into a discussion
    with a colleague a while ago about this. They were having difficulty
    coding something an it turned out to be simply what names to choose for
    routines.

    Hilarious--and reason enough to change names.

    When you do change names, can you spell LD and ST instead of MOV ??

    Yes, please LD / ST it is so much clearer what is going on. Less trouble
    getting confused by the placement of operands.

    I always put the memory operand second, which breaks the pattern of
    having the destination operand first. Otherwise the destination is
    first.

    I go cross-eyed reading code that is a whole lot of moves.

    I agree.

    I wonder if the different preferences are at least partially due to
    whether the person has a hardware or a software background? The idea is
    that when hardware guys see the instruction, they think in terms of
    register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc. However
    software guys think of a language construct, e.g. X = Y, which is
    logically a move. I don't know if this is right, but I think it is interesting.

    A somewhat related question, if one wants to copy the contents of R3
    into R2, is that a load or a store? :-)



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Mar 11 18:56:52 2025
    On Tue, 11 Mar 2025 18:15:06 +0000, Stephen Fuld wrote:

    On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
    When you do change names, can you spell LD and ST instead of MOV ??

    Yes, please LD / ST it is so much clearer what is going on. Less trouble
    getting confused by the placement of operands.

    I always put the memory operand second, which breaks the pattern of
    having the destination operand first. Otherwise the destination is
    first.

    I go cross-eyed reading code that is a whole lot of moves.

    I agree.

    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background?

    Even when both LD and ST are written MOV there is a different OpCode
    for the inbound MOV versus the outbound MOV, so, in effect, they are
    really different instructions requiring different pipeline semantics.

    Only (O N L Y) when one has a memory to memory move instruction can
    the LDs and STs be MOVs. VAX had this, BJX* does not.

    One should argue that different pipeline semantics requires a different OpCode--and you already have said OpCode having different bit patterns,
    different signedness semantics, different translation access rights,
    ... At the HW level about the only thing LD has in common with ST is
    the way the address is generated--although MIPS did something different.

    The idea is
    that when hardware guys see the instruction, they think in terms of
    register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc. However
    software guys think of a language construct, e.g. X = Y, which is

    MARY and MARY2 used X = Y to mean the value in X is deposited into Y.
    Both were left to right only languages. This should surprise most !!
    {{Although to be fair, Mary used the =: operator to perform assign.}}

    It took me 40 years of writing specifications to get to the point where
    I can write a specification such that neither the uninformed benevolent
    reader nor the malicious engineer can misread that specification. MOV
    is one of those things that makes getting the specification perfect
    harder--and down the road, you too will figure out why I carry this
    torch ...

    logically a move. I don't know if this is right, but I think it is interesting.

    A somewhat related question, if one wants to copy the contents of R3
    into R2, is that a load or a store? :-)

    In several RISC ISAs it is an ADD #0 or OR #0, however, in My 66000
    we get a MOV instruction (reg-reg only) as a degenerate version of
    the CMOV instruction {Hey, fell out for free}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From moi@21:1/5 to Stephen Fuld on Tue Mar 11 19:07:08 2025
    On 11/03/2025 18:15, Stephen Fuld wrote:


    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background?  The idea is that when hardware guys see the instruction, they think in terms of
    register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc.  However
    software guys think of a language construct, e.g. X = Y, which is
    logically a move.  I don't know if this is right, but I think it is interesting.

    No, it is logically a copy.

    --
    Bill F.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Tue Mar 11 13:46:24 2025
    On 3/11/2025 11:56 AM, MitchAlsup1 wrote:
    On Tue, 11 Mar 2025 18:15:06 +0000, Stephen Fuld wrote:

    On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
    When you do change names, can you spell LD and ST instead of MOV ??

    Yes, please LD / ST it is so much clearer what is going on. Less trouble
    getting confused by the placement of operands.

    I always put the memory operand second, which breaks the pattern of
    having the destination operand first. Otherwise the destination is
    first.

    I go cross-eyed reading code that is a whole lot of moves.

    I agree.

    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background?

    Even when both LD and ST are written MOV there is a different OpCode
    for the inbound MOV versus the outbound MOV, so, in effect, they are
    really different instructions requiring different pipeline semantics.

    Only (O N L Y) when one has a memory to memory move instruction can
    the LDs and STs be MOVs. VAX had this, BJX* does not.

    One should argue that different pipeline semantics requires a different OpCode--and you already have said OpCode having different bit patterns,
    different signedness semantics, different translation access rights,
    ... At the HW level about the only thing LD has in common with ST is
    the way the address is generated--although MIPS did something different.

    You are making my point. No software guy talks about "pipeline
    semantics" :-) Note that I am not saying you are wrong, just noting the difference.




                                                            The idea is
    that when hardware guys see the instruction, they think in terms of
    register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc.  However
    software guys think of a language construct, e.g. X = Y, which is

    MARY and MARY2 used X = Y to mean the value in X is deposited into Y.
    Both were left to right only languages. This should surprise most !! {{Although to be fair, Mary used the =: operator to perform assign.}}

    And see my point about COBOL in the post above.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to moi on Tue Mar 11 13:39:55 2025
    On 3/11/2025 12:07 PM, moi wrote:
    On 11/03/2025 18:15, Stephen Fuld wrote:


    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background?  The idea
    is that when hardware guys see the instruction, they think in terms of
    register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc.  However
    software guys think of a language construct, e.g. X = Y, which is
    logically a move.  I don't know if this is right, but I think it is
    interesting.

    No, it is logically a copy.

    While that is true, I don't think anyone is talking about a "copy" op
    code. :-) I had thought about mentioning in the software part of the
    argument that COBOL actually has a "move" verb to accomplish that, i.e.
    "Move A to B." even though you are technically right that it is a copy.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Tue Mar 11 21:26:15 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
    On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:

    I always put the memory operand second, which breaks the pattern of
    having the destination operand first. Otherwise the destination is
    first.

    I go cross-eyed reading code that is a whole lot of moves.

    I agree.

    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background? The idea is

    I think it may depend on first experiences with assembler language; in
    my first experiences (on VAX, Burroughs mainframes, AT&T Unix) the
    first operand was always the source and the second was the
    destination. The VAX and PDP-11 also followed this paradigm, as does
    the AT&T style of x86 assembler (which I find vastly preferable to
    the over-annotated Microsoft/Intel form).

    Vax:
    ;++
    ; Initialize FAO buffer.
    ;--
    movl #80,obuf
    movab obuffer,obuf+4
    ;++

    Burroughs:

    MVN Source, Dest
    ADD src1, src2, dest

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Mar 11 21:29:12 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Mar 2025 6:14:31 +0000, BGB wrote:

    A lot of people swear by:
    movl %eax, 16(%rdi)
    ....

    More swear at it than for it.

    Most likely: those who swear by it have brain damage by x86-ism.

    It's the oververbose, bas-ackwards intel syntax that one does swear at.

    The AT&T syntax that BGB noted above is far superior.

    YMMVO.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Mar 11 21:18:34 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:

    On 2025-03-10 8:53 p.m., MitchAlsup1 wrote:
    On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:
    ------------------------

    My 66000::
     00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
     01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
     10 0000000000000000..FFFFFFFFFFFFFFFF: config
     11 0000000000000000..FFFFFFFFFFFFFFFF: ROM

    How does one reference DRAM vs MMI/O at the same address using a LD / ST
    instruction?

    The MMU translates the virtual address to a universal address.
    The PTE supplies the extra bits.

    Q+ CPU just uses a 64-bit address range. The config space is specified
    in a CR defaulting to FFFFFFFFDxxxxxxx. The TLB is set up at boot to
    access ROM at FFFFFFFFFFFCxxxxx. Otherwise there is no distinction
    between addresses. There is a region table in the system that
    describes up to eight distinct regions.

    Every major block in my architecture has ports in config space that
    smell just like that of a device on PCIe having said control block.
    My thought was that adding all these to the config name space might
    cramp any fixed (or programmable) partition. So, the easiest thing
    is to give it its own big space.

    Or make it relocatable anywhere in the address space. Assuming you
    format the config data in a standard form (e.g. PCI configuration
    space registers) to support off-the-shelf software, you can arrange
    most of your functions on a handful of buses on a single pci segment
    using ARI.


    Then every device header gets 1 or more pages of address space for
    its own control registers. PCIe is now a 42-bit address space::
    segment, bus, device, function, xreg, reg and likely to grow as
    AHCI can consume a whole PCIe segment by itself.

    AHCI consumes a single function on a bus, you can have up to
    seven functions per bus (without ARI) or 256 functions
    (with ARI). One or more of those functions may be a
    PCI-PCI bridge, which forwards to additional downstream buses.

    SRIOV capable devices, on the other hand, can consume
    an entire segment when supporting up to 64k virtual functions
    and the physical function for the device usually occupies
    function zero of the bus zero in a given segment. The
    endpoint device doesn't have any knowledge of the actual
    bus upon which it exists; if the physical function is on bus 8, function 0,
    for example, the virtual functions can occupy the rest of
    bus 8 through bus 255 on that segment for a maximum VF count
    of ((256 - 8) * 256 - PF#).
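    (Worked example, assuming a hypothetical layout with the PF at bus 8,
    function 0: buses 8..255 provide (256 - 8) * 256 = 63,488 function
    slots; subtracting the PF itself leaves 63,487 usable VFs, just under
    the 64 Ki architectural limit.)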

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Schultz@21:1/5 to Terje Mathisen on Tue Mar 11 16:57:40 2025
    On 3/11/25 4:41 PM, Terje Mathisen wrote:

    x86 asm, as used in MASM and DEBUG, was the first assembler language I
    used; I found it very familiar that

      mov ax,bx
    or
      mov ax,[bx]
    or
      mov ax,[bx+1234]

    all correspond nicely to

     a = b
     a = *b
     a = b[1234]

    Looks more like move a to b.

    --
    http://davesrocketworks.com
    David Schultz
    "The cheeper the crook, the gaudier the patter." - Sam Spade

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Scott Lurndal on Tue Mar 11 22:41:48 2025
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 11 Mar 2025 6:14:31 +0000, BGB wrote:

    A lot of people swear by:
    movl %eax, 16(%rdi)
    ....

    More swear at it than for it.

    Most likely: those who swear by it have brain damage by x86-ism.

    It's the oververbose, bas-ackwards intel syntax that one does swear at.

    The AT&T syntax that BGB noted above is far superior.

    YMMVO.

    x86 asm, as used in MASM and DEBUG, was the first assembler language I
    used; I found it very familiar that

    mov ax,bx
    or
    mov ax,[bx]
    or
    mov ax,[bx+1234]

    all correspond nicely to

    a = b
    a = *b
    a = b[1234]

    I.e. having the target on the left is the only one that makes sense to
    me. I can read the wrong one of course, but I have to internally
    translate everything before I can grok it.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Lawrence D'Oliveiro on Tue Mar 11 22:15:30 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:

    VAX instructions are very complex and much of that complexity is hard
    to use in compilers.

    A lot of them mapped directly to common high-level operations. E.g.
    MOVC3/MOVC5 for string copying, and of course POLYx for direct
    evaluation of polynomial functions.

    In a way, one could say that VAX machine language was a higher-level
    language than Fortran.

    Trouble is that such "common" operations have rather low
    frequency compared to simple stuff. They are really library
    functions. String copies, if done well in microcode, could
    give some measurable speed gain. Others probably not.

    If they had managed to make the simpler instructions faster,
    there would have been substantial gain. RISC folks understood
    this, but it is not clear if VAX folks were aware of it.
    Of course, it is possible that VAX designers understood the
    performance implications of their decisions (or rather the
    meager speed gain from complex instructions), but bet
    that a "nice" instruction set would tie programs to their
    platform.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Tue Mar 11 22:43:08 2025
    On Tue, 11 Mar 2025 13:46:24 -0700, Stephen Fuld wrote:

    No software guy talks about "pipeline semantics" :-)

    Remember what pipes are in Unix, and how they can be used to construct
    command pipelines.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Tue Mar 11 22:40:19 2025
    On Tue, 11 Mar 2025 22:41:48 +0100, Terje Mathisen wrote:

    I.e having the target on the left is the only one that makes sense to
    me.

    One of the early programming languages I came across was POP-2. This was
    fully dynamic and heap-based, like Lisp, but also had an operand stack. So
    a simple assignment statement looked like

    a -> b;

    but this could actually be written as two separate statements:

    a;
    -> b;

    The first one pushed the value of a on the stack, the second one popped it
    off and stored it in b.

    This made it easy to do things like swap variable values:

    a, b -> a -> b;

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Tue Mar 11 22:42:02 2025
    On Tue, 11 Mar 2025 11:15:06 -0700, Stephen Fuld wrote:

    A somewhat related question, if one wants to copy the contents of R3
    into R2, is that a load or a store? :-)

    The true hardware engineer knows that it is neither, it is merely a
    register rename. ;)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Stephen Fuld on Tue Mar 11 22:44:31 2025
    On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:

    On 3/11/2025 12:07 PM, moi wrote:

    No, it is logically a copy.

    While that is true, I don't think anyone is talking about a "copy" op
    code. :-) I had thought about mentioning in the software part of the argument that COBOL actually has a "move" verb to accomplish that, i.e.
    "Move A to B." even though you are technically right that it is a copy.

    There is a language (C++) which has introduced reference operators that distinguish between “move semantics” versus “copy semantics”.

    No, I haven’t got my head around it either.
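    A minimal sketch of the distinction (C++; the variable names are
    illustrative, not from any post in this thread):

        #include <string>
        #include <utility>

        int main() {
            std::string a = "hello";
            std::string b = a;             // copy: a still holds "hello"
            std::string c = std::move(a);  // move: c takes over a's buffer;
                                           // a is left valid but unspecified
            return 0;
        }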

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Waldek Hebisch on Tue Mar 11 22:48:17 2025
    On Tue, 11 Mar 2025 22:15:30 -0000 (UTC), Waldek Hebisch wrote:

    Trouble is that such "common" operations have rather low frequency
    compared to simple stuff. They are really library functions.

    Inline library functions. And they did contribute to keeping the code
    compact, as Bell said.

    One thing, though, I don’t think the POLYx instruction was all that
    useful. It is typical, when computing functions approximated by
    polynomials, for the polynomial to actually be infinite. And so you have a
    loop that computes each term in turn, accumulates it to the result, works
    out an estimate of the remaining error, and stops only when this falls
    below some threshold.

    This cannot be expressed by some fixed-length table of coefficients, as
    the POLYx instruction expects.
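    As a concrete illustration of such a loop (a hypothetical C++ sketch,
    summing the Taylor series for exp(x) until the next term no longer
    matters):

        #include <cmath>

        // Open-ended evaluation: keep adding terms until the next one
        // falls below the error threshold. A fixed-length coefficient
        // table cannot express this.
        double exp_taylor(double x, double eps = 1e-16) {
            double term = 1.0, sum = 1.0;
            for (int n = 1; std::fabs(term) > eps * std::fabs(sum); n++) {
                term *= x / n;
                sum  += term;
            }
            return sum;
        }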

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to It appears that Scott Lurndal on Wed Mar 12 00:26:50 2025
    It appears that Scott Lurndal <slp53@pacbell.net> said:
    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background? The idea is

    I think it may depend on first experiences with assembler language;

    Absolutely. My first assembler was PDP-8, where nothing has more than
    one operand. Next was S/360 where the result goes first, e.g. LR R1,R2
    copies R2 into R1. Next was PDP-11 where MOV R1,R2 copies R1 into R2.

    So I think they're all about equally bad. People should get over it.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From moi@21:1/5 to Stephen Fuld on Wed Mar 12 00:27:26 2025
    On 11/03/2025 20:39, Stephen Fuld wrote:
    On 3/11/2025 12:07 PM, moi wrote:
    On 11/03/2025 18:15, Stephen Fuld wrote:


    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background?  The idea
    is that when hardware guys see the instruction, they think in terms
    of register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc.  However
    software guys think of a language construct, e.g. X = Y, which is
    logically a move.  I don't know if this is right, but I think it is
    interesting.

    No, it is logically a copy.

    While that is true, I don't think anyone is talking about a "copy" op
    code.  :-)  I had thought about mentioning in the software part of the argument that COBOL actually has a "move" verb to accomplish that, i.e.
    "Move A to B." even though you are technically right that it is a copy.

    Being technically right is the best kind of right. 8-)

    --
    Bill F.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Wed Mar 12 00:34:08 2025
    On Tue, 11 Mar 2025 18:52:50 -0500, BGB wrote:

    Putting the destination on the right is also fairly common in general in
    Unix style command notation:
    dosomething args infile outfile

    prog1 infile | prog2 > outfile

    True enough. Except that the diff command shows changes as additions to
    the file on the right, and deletions from the file on the left. So if I
    want to change the diff command to cp, to actually replace the old file
    with the new one, I have to remember to swap the file names around.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Mar 12 00:33:06 2025
    On Tue, 11 Mar 2025 21:20:22 +0000, BGB wrote:

    On 3/11/2025 12:57 PM, MitchAlsup1 wrote:
    --------------
    My whole space is mapped by BAR registers as if they were on PCIe.


    Not a thing yet.

    But, PCIe may need to exist for Linux or similar.

    But, may still be an issue as Linux could only use known hardware IDs,
    and it is a question what IDs it would know about (and if any happen to
    map closely enough to my existing interfaces).

    Otherwise, would be necessary to write custom HW drivers, which would
    add a lot more pain to all of this.

    There is already a driver in BOOT that reads config headers for
    manufacturer and model, and uses those to look up an actual driver
    for that device.
    I simply plan on having My 66000 BOOT up code indexed by Mfg:Dev.

    --------------

       Some read-only CSRs were mapped over to CPUID.

    I don't even have a CPUID--if you want this you go to config space
    and read the configuration lists and extended configuration lists.


    Errm, so vendor/Hardware ID's for each feature flag...

    No, a manufacturer:device for every CPU-type on the die. Then all of
    the core identification is found on the [extended] configuration
    lists.
    core kind
    fetch width
    decode width
    execute width
    retire width
    cache sizes
    TLB sizes
    predictor stuff
    ..

    In practice, I expect some later phase in BOOT will read all this out
    and package it for user consumption (and likely another copy for
    supervisor consumption.) Then it is accessed as fast as any cached
    chunk of memory.

    30 and 31 give the microsecond timer and HW-RNG, which are more relevant
    to user-land.

    The timer running in virtual time or the one running in physical time ??

    32..63: Currently unused.


    There is also a cycle counter (along vaguely similar lines to x86
    RDTSC), but for many uses a microsecond counter is more useful (where
    the timer-tick count updates at 1.0 MHz, and all cores would have the
    same epoch).

    On x86, trying to use RDTSC as a timer is rather annoying as it may jump around and goes at a different rate depending on current clock speed.

    By placing the timers in MMI/O memory address space*, accesses from
    different cores necessarily get different values--so the RTC can be
    used to distinguish "who got there first".

    MMI/O space is sequentially consistent across all cores in the system.

    ------------
    This scheme will not roll over for around 400k years (for a 64-bit microsecond timer), so "good enough".

    So at 1GHz the roll over time is 400 years. Looks good enough to me.
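    (For reference: 2^64 microseconds is about 584,000 years, and 2^64
    nanoseconds -- a 64-bit counter incremented at 1 GHz -- is about 584
    years.)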

    Conceptually, this time would be in UTC, likely with time-zones handled
    by adding another bias value.

    What is UTC time when standing on the north or south poles ??

    This can in turn be used to derive the output from "clock()" and
    similar.


    Also, there are relatively few software timing tasks where we have
    much reason to care about nanoseconds. For many tasks, milliseconds
    are sufficient, but there are some things where microseconds matter.

    We used to run a benchmark 1,000,000 times in order to get accurate
    information using a timer with 1-second resolution. We do not want
    to continue on that level.


         Of which, all of the CPUID indices were also mapped into CSR space.

    CPUID is soooooo pre-PCIe.


    Dunno.

    Mine is different from x86, in that it mostly functions like read-only registers.

    x86 uses a narrow bus that runs around the whole chip so it can access
    all sorts of stuff (only some of it is available to users) and some of
    these accesses take 1,000's of cycles.

    RISC-V land seemingly exposes a microsecond timer via MMIO instead, but
    this is much less useful as this means needing to use a syscall to fetch
    the current time, which is slow.

    Or a generous MMU handler that lets some trusted low privilege level
    processes direct access.

    Doom manages to fetch the current time frequently enough that doing so
    via a syscall has a visible effect on performance.

    I had an old Timex a long time ago that I had to adjust the time
    about 3 times a day to have any chance of accuracy. Solution--
    quit wearing a watch.


    My 66000 does not even have a 32-bit space to map into.
    You can synthesize such a space by not using any of the
    top 32-address bits in PTEs--but why ??


    32-bit space is just the first 4GB of physical space.
    But, as-is, there is pretty much nothing outside of the first 4GB.


    The actually in use MMIO space is also still 28 bits.

    You are not trying to access 1,000 AHCI disks on a single rack, either;
    each disk supporting several hundred GuestOSs.

    The VRAM maps 128K in MMIO space, but in retrospect probably should have
    been more. When I designed it, I didn't figure there would have been
    more than 128K. The RAM backed framebuffer can be bigger though, but not
    too much bigger, as then screen refresh starts getting too glitchy (as
    it competes with the CPU for the L2 cache, but is more timing
    sensitive).

    One would think the very minimum would be 32-bit color (8,8,8,8)
    on an 8K display/monitor.

    -----------

    My interconnect bus is 1 cache line (512-bits) per cycle plus
    address and command.


    My bus is 128 bits, but MMIO operations are 64-bits.

    Where, for MMIO, every access involves a whole round-trip over the bus (unlike for RAM-like access, where things can be held in the L1 cache).

    In theory, MMIO operations could be widened to allow 128-bit access, but haven't done so. This would require widening the data path for MMIO
    devices.

    Can note that when the request goes onto the MMIO bus, data narrows to
    64-bit and address narrows to 28 bits. Non-MMIO range requests (from the ringbus) are not allowed onto the MMIO bus, and the MMIO bus will not
    accept any new requests until the prior request has either finished or
    timed out.

    I see a big source of timing problems here.

    We are approaching 128 cores on a die and more than 256 devices down
    the PCIe tree. Does allowing only 1 access from one core to one device
    at a time make any sense ?? No, you specify virtual channels, accesses
    down a PCIe segment remain ordered while on the Tree and serialize at
    the device (function) itself.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Wed Mar 12 00:37:13 2025
    On Tue, 11 Mar 2025 18:58:11 -0500, BGB wrote:

    I still haven't seen any good reason to move to C++.

    No disagreement here. ;)

    Some people (at the C people): use C++, it has features...

    It appears the GNU C compiler itself is written in C++ now.

    Others (at the C++ people): Use Rust, it is less of a trash fire.

    Google started a project called “Carbon” a little while back, kind of a
    C++ done right, with all the accumulated legacy crap removed.

    Wonder what happened to it ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Wed Mar 12 00:38:59 2025
    On Wed, 12 Mar 2025 00:26:50 -0000 (UTC), John Levine wrote:

    Next was PDP=11 where MOV R1,R2 copies R1 into R2.

    What about CMP (compare) versus SUB (subtract)? CMP does the subtract
    without updating the destination operand, only setting the condition
    codes. But are the operands the same way around as SUB (i.e. backwards for comparison purposes) or are they flipped?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to moi on Wed Mar 12 00:37:53 2025
    On Tue, 11 Mar 2025 19:07:08 +0000, moi wrote:

    On 11/03/2025 18:15, Stephen Fuld wrote:


    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background?  The idea is
    that when hardware guys see the instruction, they think in terms of
    register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc.  However
    software guys think of a language construct, e.g. X = Y, which is
    logically a move.  I don't know if this is right, but I think it is
    interesting.

    No, it is logically a copy.

    But does it copy X into Y or copy Y into X ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed Mar 12 00:36:00 2025
    On Tue, 11 Mar 2025 22:44:31 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:

    On 3/11/2025 12:07 PM, moi wrote:

    No, it is logically a copy.

    While that is true, I don't think anyone is talking about a "copy" op
    code. :-) I had thought about mentioning in the software part of the
    argument that COBOL actually has a "move" verb to accomplish that, i.e.
    "Move A to B." even though you are technically right that it is a copy.

    There is a language (C++) which has introduced reference operators that distinguish between “move semantics” versus “copy semantics”.

    Is the distinction between overlapping and non-overlapping memory
    to memory moves ?? Ala:: memcopy versus memmove !

    No, I haven’t got my head around it either.
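    (The memcpy/memmove distinction is indeed about overlap; a minimal
    C++ illustration, separate from the C++ move-semantics question:)

        #include <cstdio>
        #include <cstring>

        int main() {
            char buf[8] = "abcdefg";
            std::memmove(buf, buf + 1, 6);    // defined for overlapping regions
            // std::memcpy(buf, buf + 1, 6);  // undefined: src and dst overlap
            std::printf("%s\n", buf);         // prints "bcdefgg"
            return 0;
        }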

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From moi@21:1/5 to All on Wed Mar 12 00:43:24 2025
    On 12/03/2025 00:37, MitchAlsup1 wrote:
    On Tue, 11 Mar 2025 19:07:08 +0000, moi wrote:

    On 11/03/2025 18:15, Stephen Fuld wrote:


    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background?  The idea is
    that when hardware guys see the instruction, they think in terms of
    register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc.  However
    software guys think of a language construct, e.g. X = Y, which is
    logically a move.  I don't know if this is right, but I think it is
    interesting.

    No, it is logically a copy.

    But does it copy X into Y or copy Y into X ??

    Quite so.
    My preference is for:

    LD dest source - reads as load dest from source
    ST source dest - reads as store source in dest

    --
    Bill F.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed Mar 12 00:46:29 2025
    On Tue, 11 Mar 2025 22:48:17 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 11 Mar 2025 22:15:30 -0000 (UTC), Waldek Hebisch wrote:

    Trouble is that such "common" operations have rather low frequency
    compared to simple stuff. They are really library functions.

    Inline library functions. And they did contribute to keeping the code compact, as Bell said.

    One thing, though, I don’t think the POLYx instruction was all that
    useful. It is typical, when computing functions approximated by
    polynomials, for the polynomial to actually be infinite. And so you have
    a
    loop that computes each term in turn, accumulates it to the result,
    works
    out an estimate of the remaining error, and stops only when this falls
    below some threshold.

    This cannot be expressed by some fixed-length table of coefficients, as
    the POLYx instruction expects.

    People quit doing it that way because Chebyshev (and later Remez)
    coefficients can be sufficiently accurate whereas Taylor (or Maclaurin)
    cannot. By the time of the Cody and Waite book, the
    continue-the-polynomial-until-convergence approach was already long
    in the tooth.
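    For contrast, the fixed-table style that replaced it (a C++ sketch;
    the coefficient table is assumed to come from a Chebyshev/Remez fit,
    not computed here):

        #include <cstddef>

        // Horner's rule over a fixed coefficient table -- in software,
        // this loop is essentially what VAX POLYx packaged as a single
        // instruction.
        double poly_eval(double x, const double *c, std::size_t n) {
            double r = c[n - 1];           // c[0..n-1], highest degree last
            for (std::size_t i = n - 1; i-- > 0; )
                r = r * x + c[i];
            return r;
        }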

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Wed Mar 12 00:42:19 2025
    On Tue, 11 Mar 2025 22:42:02 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 11 Mar 2025 11:15:06 -0700, Stephen Fuld wrote:

    A somewhat related question, if one wants to copy the contents of R3
    into R2, is that a load or a store? :-)

    The true hardware engineer knows that it is neither, it is merely a
    register rename. ;)

    And a move of the pre-renamed register back to the free pool.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Wed Mar 12 00:51:57 2025
    On Tue, 11 Mar 2025 23:58:11 +0000, BGB wrote:

    On 3/11/2025 5:44 PM, Lawrence D'Oliveiro wrote:
    On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:

    On 3/11/2025 12:07 PM, moi wrote:

    No, it is logically a copy.

    While that is true, I don't think anyone is talking about a "copy" op
    code. :-) I had thought about mentioning in the software part of the
    argument that COBOL actually has a "move" verb to accomplish that, i.e.
    "Move A to B." even though you are technically right that it is a copy.

    There is a language (C++) which has introduced reference operators that
    distinguish between “move semantics” versus “copy semantics”.

    No, I haven’t got my head around it either.

    I still haven't seen any good reason to move to C++.

    C++ is for those situations where you want to write a small amount of
    code and have it compile into a vast string of instructions.

    C is for those situations where you want to write a small amount of
    code and have it compile into a small string of instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Lawrence D'Oliveiro on Tue Mar 11 19:28:15 2025
    On 3/11/2025 5:37 PM, Lawrence D'Oliveiro wrote:
    On Tue, 11 Mar 2025 18:58:11 -0500, BGB wrote:

    I still haven't seen any good reason to move to C++.

    No disagreement here. ;)

    Some people (at the C people): use C++, it has features...

    It appears the GNU C compiler itself is written in C++ now.

    Others (at the C++ people): Use Rust, it is less of a trash fire.

    Google started a project called “Carbon” a little while back, kind of a C++ done right, with all the accumulated legacy crap removed.

    Wonder what happened to it ...

    Still being developed

    https://en.wikipedia.org/wiki/Carbon_(programming_language)




    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Wed Mar 12 08:42:28 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Of course, it is possible that VAX designers understood
    performace implications of their decisons (or rather
    meager speed gain from complex instructions), but bet
    that "nice" instruction set will tie programs to their
    platform.

    I don't think that they fully understood the performance implications,
    but I believe that creating an appealing environment for software
    developers was a major consideration of the architects: For the assembly-language programmers, provide orthogonality; that also makes
    it easy to write compilers (optimality in some form is a different
    story). The much-criticized VAX CALL instruction is designed for a
    software ecosystem where various languages can call each other, there
    exists a common debugger for all of them, etc. I am sure that they
    were aware that this call instruction was expensive, but they expected
    that it was worth the cost, and also expected that implementors would
    reduce the cost to below what a sequence of simpler instructions would
    cost (looking at REP MOVSB in many generations of Intel and AMD CPUs,
    we see such expectations disappointed; I have not measured recent
    generations, though).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Wed Mar 12 08:02:07 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    I wonder if the different preferences is at least partially due to
    whether the person has a hardware or a software background? The idea is
    that when hardware guys see the instruction, they think in terms of
    register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc. However
    software guys think of a language construct, e.g. X = Y, which is
    logically a move. I don't know if this is right, but I think it is
    interesting.

    I am a software person. When talking about register-memory copies, I
    prefer to talk about load and store operations, whether I talk about
    assembly language (even one where the mnemonic for these operations is
    MOV) or C; in Forth the spoken names for these operations are "fetch"
    (written: @) and "store" (written: !).

    My first programming was on the TI-58C programmable calculator, which
    has RCL (recall) and STO (store).

    Then some BASIC, where you write A = Expr.

    Then some 6502 assembly language, which has, e.g., LDA (load into
    accumulator) and STA (store from accumulator).

    In his HOPL paper Ritchie describes a number of competitors of C
    (e.g., BLISS), most of which make the use of addresses explicit,
    unlike C. So copying from one variable to another in one of these
    languages was along the lines of

    a = .b

    where a and b are addresses of variables, and "." fetches the value
    from the address (i.e., C's *). C won out over these languages, and
    it seems to me that Ritchie thought that the different approach in C
    with lexprs and (value) exprs was a major contributing factor.

    Back to assembly languages:

    In PDP-11-style architectures (e.g., 8086, IA-32, AMD64, 68000), where
    you have instructions of the form

    op r ,m/r
    op m/r,r

    it's pretty obvious that you implement load, store, and
    register-register copy as special cases of this scheme. And the
    common mnemonic for these three operations in all these architectures
    is MOV. This does not come out of higher-level software
    considerations, this comes out of the architecture, and how to
    implement the assembler and disassembler for it in the simplest way.

    By contrast, RISC architectures do not follow this scheme, so they
    have mnemonics starting with L (for load) and S (for store).

    Looking at <https://www.righto.com/2023/08/datapoint-to-8086.html>,
    the Datapoint 2200 called both LAM (load A from M(emory)) and LMA
    (store A to M(emory)) a load. This continues in the 8008.
    The 8080 assembly language replaces LAM with MOV A,M and LMA with
    MOV M,A. On the 8086 they are replaced with MOV AL,[BX] and
    MOV [BX],AL.

    A somewhat related question, if one wants to copy the contents of R3
    into R2, is that a load or a store? :-)

    It's a register-register copy.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Mar 12 09:07:09 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    There is a language (C++) which has introduced reference operators that
    distinguish between “move semantics” versus “copy semantics”.

    This is the first time I read about copy semantics; the thing that is
    not reference semantics is usually called value semantics. But a web
    search indeed turns up uses of "copy semantics". Interestingly, it
    also turns up a link to a web site from the C++ standards committee:

    https://isocpp.org/wiki/faq/value-vs-ref-semantics

    The title says: "value vs ref semantics", and while there are 11
    occurences of "copy" on the page, "copy semantics" does not occur
    once. However, it does say

    |Value (or "copy") semantics

    so apparently "copy semantics" is a synonym for "value semantics".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Mar 12 08:57:19 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    One of the early programming languages I came across was POP-2. This was
    fully dynamic and heap-based, like Lisp, but also had an operand stack. So
    a simple assignment statement looked like

    a -> b;

    but this could actually be written as two separate statements:

    a;
    -> b;

    The first one pushed the value of a on the stack, the second one popped it
    off and stored it in b.

    This made it easy to do things like swap variable values:

    a, b -> a -> b;

    In Forth you can define VALUEs that work like these POP-11 variables.
    In standard Forth you can write

    3 value a
    5 value b
    a b to a to b \ swaps the contents of a and b
    a . b . \ print a and b

    In some Forth systems you can write it in a syntax even closer to that
    of POP:

    On VFX Forth you can write:

    a b -> a -> b

    In Gforth (development version) you can write

    a b ->a ->b

    That's all not very idiomatic, though. VARIABLEs (which push their
    address rather than their value) are more popular than VALUEs, but
    usage such as the following is also not idiomatic; you usually use
    variables (and values) sparingly.

    variable a 3 a !   \ create a, store 3 in it
    variable b 5 b !   \ create b, store 5 in it
    a @ b @ a ! b !    \ fetch both values, store them back swapped
    a @ . b @ .        \ prints 5 3

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Wed Mar 12 11:48:28 2025
    On Wed, 12 Mar 2025 08:42:28 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    antispam@fricas.org (Waldek Hebisch) writes:
    Of course, it is possible that VAX designers understood
    performace implications of their decisons (or rather
    meager speed gain from complex instructions), but bet
    that "nice" instruction set will tie programs to their
    platform.

    I don't think that they fully understood the performance implications,
    but I believe that creating an appealing environment for software
    developers was a major consideration of the architects: For the assembly-language programmers, provide orthogonality; that also makes
    it easy to write compilers (optimality in some form is a different
    story). The much-criticized VAX CALL instruction is designed for a
    software ecosystem where various languages can call each other, there
    exists a common debugger for all of them, etc. I am sure that they
    were aware that this call instruction was expensive, but they expected
    that it was worth the cost, and also expected that implementors would
    reduce the cost to below what a sequence of simpler instructions would
    cost (looking at REP MOVSB in many generations of Intel and AMD CPUs,
    we see such expectations disappointed; I have not measured recent generations, though).

    - anton

    It depends on what you call "a sequence of simpler instructions".
    For R/E/CX above, say, a dozen, 'rep movsb' is faster than a simple
    non-unrolled loop of single-byte loads and stores on pretty much any
    Intel or AMD CPU since the dawn of time. If we are talking about this
    century then, at least for Intel, I think we can claim that the same
    is true even relative to a simple loop of 32-bit loads and stores.
    If we replace a dozen with a hundred or three, it becomes true for a
    loop of 64-bit loads/stores as well.

    Or maybe, in your book, 5KB of elaborate code that contains unrolled
    and non-unrolled loops of YMM, XMM, Rxx, Exx, and byte memory accesses
    is still considered 'a sequence of simpler instructions'?
    If that is the case then I am not going to argue.
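    For concreteness, the two extremes being compared, as a sketch in
    GNU C++ for x86-64 (the function names are mine):

        #include <cstddef>

        // the simple non-unrolled byte loop
        void copy_bytes(char *dst, const char *src, std::size_t n) {
            for (std::size_t i = 0; i < n; i++)
                dst[i] = src[i];
        }

        // the same copy as a single rep movsb (GNU inline asm)
        void copy_rep_movsb(char *dst, const char *src, std::size_t n) {
            __asm__ volatile("rep movsb"
                             : "+D"(dst), "+S"(src), "+c"(n)
                             :
                             : "memory");
        }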

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Wed Mar 12 09:14:00 2025
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Wed, 12 Mar 2025 00:26:50 -0000 (UTC), John Levine wrote:

    Next was PDP=11 where MOV R1,R2 copies R1 into R2.

    What about CMP (compare) versus SUB (subtract)? CMP does the subtract
    without updating the destination operand, only setting the condition
    codes. But are the operands the same way around as SUB (i.e. backwards for >comparison purposes) or are they flipped?

    In AT&T syntax for IA-32 and AMD64, the operands are the same way
    round as for SUB (i.e., flipped compared to Intel syntax). This leads
    to counterintuitive combinations of compares and branches. E.g., "cmp
    %rax,%rbx; jl ..." jumps if %rax is greater than %rbx.
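    To make the order concrete (a GNU C++ sketch for x86-64; the helper
    is mine, not from the post):

        // After AT&T "cmp %2,%1" the flags reflect %1 - %2, so a
        // following "setl"/"jl" fires when the *first* operand is the
        // larger one.
        bool first_is_greater(long a, long b) {
            char r;
            __asm__("cmp %2, %1\n\t"   // computes b - a
                    "setl %0"          // r = (b - a < 0), i.e. a > b
                    : "=q"(r)
                    : "r"(b), "r"(a)
                    : "cc");
            return r;
        }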

    For the PDP-11, where the mnemonics etc. were created with that order
    in mind, the mnemonics of the flags-using instructions could be named appropriately. I.e., after "cmp x,y" or "sub x,y", "blt" could branch
    if x<y. I have not checked if that's the case for the PDP-11, though.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Wed Mar 12 11:28:36 2025
    Michael S <already5chosen@yahoo.com> writes:
    I am sure that they
    were aware that this call instruction was expensive, but they expected
    that it was worth the cost, and also expected that implementors would
    reduce the cost to below what a sequence of simpler instructions would
    cost (looking at REP MOVSB in many generations of Intel and AMD CPUs,
    we see such expectations disappointed; I have not measured recent
    generations, though).
    ...
    It depends on what you call "a sequence of simpler instructions".
    For R/E/CX above, say, a dozen, 'rep movsb' is faster than a simple
    non-unrolled loop of single-byte loads and stores on pretty much any
    Intel or AMD CPU since the dawn of time. If we are talking about this
    century then, at least for Intel, I think we can claim that the same
    is true even relative to a simple loop of 32-bit loads and stores.
    If we replace a dozen with a hundred or three, it becomes true for a
    loop of 64-bit loads/stores as well.

    Or maybe, in your book, 5KB of elaborate code that contains unrolled
    and non-unrolled loops of YMM, XMM, Rxx, Exx, and byte memory accesses
    is still considered 'a sequence of simpler instructions'?
    If that is the case then I am not going to argue.

    My experiments were with the code in
    <https://github.com/AntonErtl/move/>. I posted performance results in
    <2017Sep19.082137@mips.complang.tuwien.ac.at>
    <2017Sep20.184358@mips.complang.tuwien.ac.at>
    <2017Sep23.174313@mips.complang.tuwien.ac.at>

    My routines were generally faster than rep movsb, except for pretty
    large blocks (16KB).

    The longest of the routines is ssememmove at 275 bytes.

    I expect that an avx512memmove would be quite a bit smaller, thanks to predication, but I have not yet written that nor measured how that
    performs.
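    A sketch of why predication shrinks the code (hypothetical, not
    measured; requires AVX-512BW):

        #include <immintrin.h>
        #include <cstddef>

        // Copy n < 64 bytes with one masked load and one masked store --
        // the whole tail handling that otherwise needs a ladder of cases.
        void copy_tail(void *dst, const void *src, std::size_t n) {
            __mmask64 m = ((__mmask64)1 << n) - 1;  // low n bits set (n < 64)
            __m512i   v = _mm512_maskz_loadu_epi8(m, src);
            _mm512_mask_storeu_epi8(dst, m, v);
        }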

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Wed Mar 12 14:09:15 2025
    On Wed, 12 Mar 2025 11:28:36 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    I am sure that they
    were aware that this call instruction was expensive, but they
    expected that it was worth the cost, and also expected that
    implementors would reduce the cost to below what a sequence of
    simpler instructions would cost (looking at REP MOVSB in many
    generations of Intel and AMD CPUs, we see such expectations
    disappointed; I have not measured recent generations, though).
    ...
    It depends on what you call "a sequence of simpler instructions".
    For R/E/CX above, say, a dozen, 'rep movsb' is faster than a simple
    non-unrolled loop of single-byte loads and stores on pretty much any
    Intel or AMD CPU since the dawn of time. If we are talking about this
    century then, at least for Intel, I think we can claim that the same
    is true even relative to a simple loop of 32-bit loads and stores.
    If we replace a dozen with a hundred or three, it becomes true for a
    loop of 64-bit loads/stores as well.

    Or maybe, in your book, 5KB of elaborate code that contains unrolled
    and non-unrolled loops of YMM, XMM, Rxx, Exx, and byte memory
    accesses is still considered 'a sequence of simpler instructions'?
    If that is the case then I am not going to argue.

    My experiments were with the code in
    <https://github.com/AntonErtl/move/>.

    None of those are the simple loops that I mentioned above.

    I posted performance results in
    <2017Sep19.082137@mips.complang.tuwien.ac.at>
    <2017Sep20.184358@mips.complang.tuwien.ac.at>
    <2017Sep23.174313@mips.complang.tuwien.ac.at>

    My routines were generally faster than rep movsb, except for pretty
    large blocks (16KB).


    Idiots from corporate IT blocked http://al.howardknight.net/
    Trying to argue is not totally futile, but almost so.
    So, link to google groups or, if posts are relatively recent, to https://www.novabbs.com/devel/thread.php?group=comp.arch
    would be helpful.

    The longest of the routines is ssememmove at 275 bytes.


    I don't know why gnu memcpy is huge. I don't even know if it is
    really *that* huge. But several KB is the number that I have seen
    stated by other people.

    I expect that an avx512memmove would be quite a bit smaller, thanks to predication, but I have not yet written that nor measured how that
    performs.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to BGB on Wed Mar 12 14:14:22 2025
    BGB <cr88192@gmail.com> writes:
    On 3/11/2025 7:51 PM, MitchAlsup1 wrote:
    On Tue, 11 Mar 2025 23:58:11 +0000, BGB wrote:

    On 3/11/2025 5:44 PM, Lawrence D'Oliveiro wrote:
    On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:

    On 3/11/2025 12:07 PM, moi wrote:

    No, it is logically a copy.

    While that is true, I don't think anyone is talking about a "copy" op
    code. :-) I had thought about mentioning in the software part of the
    argument that COBOL actually has a "move" verb to accomplish that, i.e.
    "Move A to B." even though you are technically right that it is a copy.

    There is a language (C++) which has introduced reference operators that
    distinguish between “move semantics” versus “copy semantics”.

    No, I haven’t got my head around it either.

    I still haven't seen any good reason to move to C++.

    C++ is for those situations where you want to write a small amount of
    code and have it compile into a vast string of instructions.


    Yeah, one can use iostream and have a trivial "hello world" type program
    have build times and binary size like it was something quite substantial...

    You don't have to use iostream. vsnprintf/snprintf/printf all work
    fine in C++ code and are far more efficient (and far less verbose).

    Use a subset of C++ (C with classes) and the resulting code is
    quite compact, but you still get data encapsulation and
    inheritance (with a minor perf hit for virtual functions).
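    A sketch of that subset (illustrative only):

        #include <cstdio>

        // data encapsulation and inheritance, no iostream
        struct Shape {
            virtual double area() const = 0;   // the one indirect call
            virtual ~Shape() {}
        };

        struct Circle : Shape {
            double r;
            explicit Circle(double r) : r(r) {}
            double area() const override { return 3.141592653589793 * r * r; }
        };

        int main() {
            Circle c(2.0);
            std::printf("area = %f\n", c.area());
            return 0;
        }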

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Scott Lurndal on Wed Mar 12 17:08:51 2025
    Scott Lurndal wrote:
    BGB <cr88192@gmail.com> writes:
    On 3/11/2025 7:51 PM, MitchAlsup1 wrote:
    On Tue, 11 Mar 2025 23:58:11 +0000, BGB wrote:

    On 3/11/2025 5:44 PM, Lawrence D'Oliveiro wrote:
    On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:

    On 3/11/2025 12:07 PM, moi wrote:

    No, it is logically a copy.

    While that is true, I don't think anyone is talking about a "copy" op
    code. :-) I had thought about mentioning in the software part of the
    argument that COBOL actually has a "move" verb to accomplish that, i.e.
    "Move A to B." even though you are technically right that it is a copy.

    There is a language (C++) which has introduced reference operators that
    distinguish between “move semantics” versus “copy semantics”.

    No, I haven’t got my head around it either.

    I still haven't seen any good reason to move to C++.

    C++ is for those situations where you want to write a small amount of
    code and have it compile into a vast string of instructions.


    Yeah, one can use iostream and have a trivial "hello world" type program
    have build times and binary size like it was something quite substantial...

    You don't have to use iostream. vsnprintf/snprintf/printf all work
    fine in C++ code and are far more efficient (and far less verbose).

    Use a subset of C++ (C with classes) and the resulting code is
    quite compact, but you still get data encapsulation and
    inheritance (with a minor perf hit for virtual functions).


    This the "C+" language that I started to use several decades ago! Just
    getting access to local declarations and inline comments was sufficient
    back then. Now regular C has of course mostly caught up here, but it
    doesn't matter since I'm using Rust for anything time-critical.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Wed Mar 12 16:46:36 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 12 Mar 2025 11:28:36 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My experiments were with the code in
    <https://github.com/AntonErtl/move/>.

    None of those are the simple loops that I mentioned above.

    They are not. If you want short code, rep movsb is unbeatable (for
    memmove(), you have to do a little more, however).

    I posted performance results in
    <2017Sep19.082137@mips.complang.tuwien.ac.at>
    <2017Sep20.184358@mips.complang.tuwien.ac.at>
    <2017Sep23.174313@mips.complang.tuwien.ac.at>

    My routines were generally faster than rep movsb, except for pretty
    large blocks (16KB).


    Idiots from corporate IT blocked http://al.howardknight.net/

    I feel for you. In my workplace, Usenet is blocked (probably
    unintentionally). I have to post from home.

    So, link to google groups

    Sorry, I cannot provide that service. Trying to access
    groups.google.com tells me:

    |Couldn’t sign you in
    |
    |The browser you’re using doesn’t support JavaScript, or has JavaScript
    |turned off.
    |
    |To keep your Google Account secure, try signing in on a browser that
    |has JavaScript turned on.

    I certainly won't turn on JavaScript for Google, and apparently Google
    wants me to log in to a Google account to access groups.google.com. I
    don't have a Google account and I don't want one.

    But all I would do is try whether google groups finds the message-ids.
    You can do that yourself.

    or, if posts are relatively recent, to
    https://www.novabbs.com/devel/thread.php?group=comp.arch
    would be helpful.

    The posts are from 2017; these message-ids are not random-generated.

    I don't know why gnu memcpy is huge. I don't even know if it is
    really *that* huge. But several KB is number that I had seen
    stated by other people.

    I stated in one of these messages that I have seen an 11KB memmove in
    glibc. Let's see:

    objdump -t /debian8/usr/lib/x86_64-linux-gnu/libc.a|grep .text|grep 'memmove'
    00000000000001a0 g   i .text        0000000000000047 __libc_memmove
    0000000000000000 g   F .text        000000000000019f __memmove_sse2
    00000000000001a0 g   i .text        0000000000000047 memmove
    0000000000000000 g   F .text.ssse3  0000000000000009 __memmove_chk_ssse3
    0000000000000010 g   F .text.ssse3  0000000000002b67 __memmove_ssse3
    0000000000000000 g   F .text.ssse3  0000000000000009 __memmove_chk_ssse3_back
    0000000000000010 g   F .text.ssse3  0000000000002b06 __memmove_ssse3_back
    ...

    Yes, 11111 bytes for __memmove_ssse3. Debian 8 is one of the systems
    I used at the time.

    Let's see how it looks in Debian 12:

    objdump -t /usr/lib/x86_64-linux-gnu/libc.a|grep .text|grep 'memmove'|grep -v wmemmove
    0000000000000000 l   F .text         00000000000000f6 __libc_memmove_ifunc
    0000000000000000 g   i .text         00000000000000f6 __libc_memmove
    0000000000000000 g   i .text         00000000000000f6 memmove
    0000000000000010 g   F .text.avx     000000000000002f __memmove_avx_unaligned
    0000000000000080 g   F .text.avx     00000000000006de __memmove_avx_unaligned_erms
    0000000000000010 g   F .text.avx.rtm 000000000000002d __memmove_avx_unaligned_rtm
    0000000000000080 g   F .text.avx.rtm 00000000000006df __memmove_avx_unaligned_erms_rtm
    0000000000000020 g   F .text.avx512  0000000000000009 __memmove_chk_avx512_no_vzeroupper
    0000000000000030 g   F .text.avx512  000000000000073b __memmove_avx512_no_vzeroupper
    0000000000000010 g   F .text.evex512 0000000000000037 __memmove_avx512_unaligned
    0000000000000080 g   F .text.evex512 00000000000007a0 __memmove_avx512_unaligned_erms
    0000000000000020 g   F .text         0000000000000009 __memmove_chk_erms
    0000000000000030 g   F .text         000000000000002d __memmove_erms
    0000000000000010 g   F .text.evex    0000000000000034 __memmove_evex_unaligned
    0000000000000080 g   F .text.evex    00000000000007bb __memmove_evex_unaligned_erms
    0000000000000010 g   F .text         0000000000000028 __memmove_sse2_unaligned
    0000000000000080 g   F .text         0000000000000552 __memmove_sse2_unaligned_erms
    0000000000000040 g   F .text.ssse3   0000000000000f3d __memmove_ssse3
    0000000000000000 g   F .text         000000000000000e __memmove_chk

    So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
    still the biggest implementation, but many others are quite a bit
    bigger than the 0x113=275 bytes of my ssememmove.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Mar 12 21:35:47 2025
    On Wed, 12 Mar 2025 08:57:19 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    One of the early programming languages I came across was POP-2. This
    was fully dynamic and heap-based, like Lisp, but also had an operand
    stack.

    In Forth you can define VALUEs that work like these POP-11 variables.

    Never saw much point in Forth. I suppose in the early days when memories
    were smaller and CPUs were slower, it gave you something a little bit
    better than assembler, but not by much.

    Besides garbage collection, POP-2 also had macros and custom operators.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Mar 12 23:58:08 2025
    On Wed, 12 Mar 2025 08:02:07 GMT, Anton Ertl wrote:

    My first programming was on the TI-58C programmable calculator, which
    has RCL (recall) and STO (store).

    It also had RCL IND and STO IND. Pointers, no less.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Wed Mar 12 23:55:14 2025
    On Wed, 12 Mar 2025 09:07:09 GMT, Anton Ertl wrote:

    https://isocpp.org/wiki/faq/value-vs-ref-semantics

    What I was talking about was this <https://en.cppreference.com/w/cpp/language/reference>
    (described there as “lvalue” versus “rvalue” references).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Thu Mar 13 00:03:15 2025
    On Wed, 12 Mar 2025 14:09:15 +0200, Michael S wrote:

    I don't know why gnu memcpy is huge.

    Full of special cases, for speed.

    Back in the old MacOS days, there was a system routine called BlockMove
    (plus later BlockMoveData), that similarly got larger over time as it got faster.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Thu Mar 13 00:01:08 2025
    On Wed, 12 Mar 2025 08:42:28 GMT, Anton Ertl wrote:

    The much-critized VAX CALL instruction is designed for a software
    ecosystem where various languages can call each other, there exists a
    common debugger for all of them, etc.

    And don’t forget it also reserved a longword in the call frame for use
    in exception handling.

    I am sure that they were aware that this call instruction was
    expensive, but they expected that it was worth the cost ...

    For high-level languages, yes. For lower-level stuff you had BSBB/BSBW/
    JSB, which did nothing more than push a return address on the stack and
    jump.

    And remember, all kernel calls were done via CHMK/CHME instructions
    wrapped in procedures meant to be invoked via CALL.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Anton Ertl on Thu Mar 13 12:08:37 2025
    On 3/12/2025 1:02 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    I wonder if the difference in preferences is at least partially due to
    whether the person has a hardware or a software background? The idea is
    that when hardware guys see the instruction, they think in terms of
    register ports (read versus write), what is required of the memory
    system (somewhat different for loads versus stores), etc. However
    software guys think of a language construct, e.g. X = Y, which is
    logically a move. I don't know if this is right, but I think it is
    interesting.

    I am a software person. When talking about register-memory copies, I
    prefer to talk about load and store operations, whether I talk about
    assembly language (even one where the mnemonic for these operations is
    MOV) or C; in Forth the spoken names for these operations are "fetch" (written: @) and "store" (written: !).


    snipped lots of interesting history. A good counterpoint to my
    assertion. Thanks Anton.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Thu Mar 13 23:06:02 2025
    On Wed, 12 Mar 2025 16:46:36 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 12 Mar 2025 11:28:36 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My experiments were with the code in
    <https://github.com/AntonErtl/move/>.

    None of those are the simple loops that I mentioned above.

    They are not. If you want short code, rep movsb is unbeatable (for memmove(), you have to do a little more, however).
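    A minimal rep-movsb-based memmove might look like the sketch below
    (my illustration for x86-64 with GCC/Clang inline assembly, not code
    from the repository above); the "little more" is picking the copy
    direction so that overlapping blocks stay correct:

        #include <cstddef>

        void *repmovsb_memmove(void *dst, const void *src, size_t n) {
            if (n == 0) return dst;
            char *d = static_cast<char *>(dst);
            const char *s = static_cast<const char *>(src);
            if (d <= s || d >= s + n) {
                // no harmful overlap: copy forward
                asm volatile("rep movsb"
                             : "+D"(d), "+S"(s), "+c"(n) : : "memory");
            } else {
                // dst overlaps the tail of src: copy backward, last byte first
                d += n - 1;
                s += n - 1;
                asm volatile("std; rep movsb; cld"
                             : "+D"(d), "+S"(s), "+c"(n) : : "memory");
            }
            return dst;
        }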

    I posted performance results in
    <2017Sep19.082137@mips.complang.tuwien.ac.at>
    <2017Sep20.184358@mips.complang.tuwien.ac.at>
    <2017Sep23.174313@mips.complang.tuwien.ac.at>

    My routines were generally faster than rep movsb, except for pretty
    large blocks (16KB).


    Idiots from corporate IT blocked http://al.howardknight.net/

    I feel for you. In my workplace, Usenet is blocked (probably unintentionally). I have to post from home.

    So, link to google groups

    Sorry, I cannot provide that service. Trying to access
    groups.google.com tells me:

    |Couldn’t sign you in
    |
    |The browser you’re using doesn’t support JavaScript, or has JavaScript
    |turned off.
    |
    |To keep your Google Account secure, try signing in on a browser that
    |has JavaScript turned on.

    I certainly won't turn on JavaScript for Google, and apparently Google
    wants me to log in to a Google account to access groups.google.com. I
    don't have a Google account and I don't want one.


    For me it works fine without login. But not without JS.
    For those who are willing to use JS, here is the link:
    https://groups.google.com/g/comp.arch/c/ULvFgEM_ZSY/m/ysPySToGAwAJ

    But all I would do is try whether google groups finds the message-ids.
    You can do that yourself.


    GG only searches by content. It appears to have no idea about
    message-ids.

    or, if posts are relatively recent, to
    https://www.novabbs.com/devel/thread.php?group=comp.arch
    would be helpful.

    The posts are from 2017; these message-ids are not randomly generated.


    Then GG is the only place I am aware of to find it.
    http://al.howardknight.net helped me to see the start of the message,
    but not the full message.
    And eternal-september is still struggling with restoration of its
    archives after the crash of 9 months ago. More and more it looks like
    they will never be restored.

    I don't know why gnu memcpy is huge. I don't even know if it is
    really *that* huge. But several KB is the number that I had seen
    stated by other people.

    I stated in one of these messages that I have seen an 11KB memmove in
    glibc. Let's see:

    objdump -t /debian8/usr/lib/x86_64-linux-gnu/libc.a|grep .text|grep 'memmove'
    00000000000001a0 g i .text 0000000000000047 __libc_memmove
    0000000000000000 g F .text 000000000000019f __memmove_sse2
    00000000000001a0 g i .text 0000000000000047 memmove
    0000000000000000 g F .text.ssse3 0000000000000009 __memmove_chk_ssse3
    0000000000000010 g F .text.ssse3 0000000000002b67 __memmove_ssse3
    0000000000000000 g F .text.ssse3 0000000000000009 __memmove_chk_ssse3_back
    0000000000000010 g F .text.ssse3 0000000000002b06 __memmove_ssse3_back
    ...

    Yes, 11111 bytes for __memmove_ssse3. Debian 8 is one of the systems
    I used at the time.

    Let's see how it looks in Debian 12:

    objdump -t /usr/lib/x86_64-linux-gnu/libc.a|grep .text|grep 'memmove'|grep -v wmemmove
    0000000000000000 l F .text 00000000000000f6 __libc_memmove_ifunc
    0000000000000000 g i .text 00000000000000f6 __libc_memmove
    0000000000000000 g i .text 00000000000000f6 memmove
    0000000000000010 g F .text.avx 000000000000002f __memmove_avx_unaligned
    0000000000000080 g F .text.avx 00000000000006de __memmove_avx_unaligned_erms
    0000000000000010 g F .text.avx.rtm 000000000000002d __memmove_avx_unaligned_rtm
    0000000000000080 g F .text.avx.rtm 00000000000006df __memmove_avx_unaligned_erms_rtm
    0000000000000020 g F .text.avx512 0000000000000009 __memmove_chk_avx512_no_vzeroupper
    0000000000000030 g F .text.avx512 000000000000073b __memmove_avx512_no_vzeroupper
    0000000000000010 g F .text.evex512 0000000000000037 __memmove_avx512_unaligned
    0000000000000080 g F .text.evex512 00000000000007a0 __memmove_avx512_unaligned_erms
    0000000000000020 g F .text 0000000000000009 __memmove_chk_erms
    0000000000000030 g F .text 000000000000002d __memmove_erms
    0000000000000010 g F .text.evex 0000000000000034 __memmove_evex_unaligned
    0000000000000080 g F .text.evex 00000000000007bb __memmove_evex_unaligned_erms
    0000000000000010 g F .text 0000000000000028 __memmove_sse2_unaligned
    0000000000000080 g F .text 0000000000000552 __memmove_sse2_unaligned_erms
    0000000000000040 g F .text.ssse3 0000000000000f3d __memmove_ssse3
    0000000000000000 g F .text 000000000000000e __memmove_chk

    So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
    still the biggest implementation, but many others are quite a bit
    bigger than the 0x113=275 bytes of my ssememmove.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Fri Mar 14 12:43:27 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 12 Mar 2025 16:46:36 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 12 Mar 2025 11:28:36 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My experiments were with the code in
    <https://github.com/AntonErtl/move/>.
    ...
    I posted performance results in
    <2017Sep19.082137@mips.complang.tuwien.ac.at>
    <2017Sep20.184358@mips.complang.tuwien.ac.at>
    <2017Sep23.174313@mips.complang.tuwien.ac.at>
    ...
    http://al.howardknight.net helped me to see the start of the message,
    but not the full message.

    That's deplorable. The postings with the second and third Message-Id
    are delivered from http://al.howardknight.net in full. For <2017Sep19.082137@mips.complang.tuwien.ac.at>, the remaining parts
    (including a few lines still shown by http://al.howardknight.net) are:

    |K8 (Athlon 64 X2 4400+), glibc 2.3.6
    | 1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
    | 21 28 54 90 162 307 595 1171 2325 4632 9244 18467 repmovsb
    | 17 40 69 80 104 161 253 433 794 1514 2955 5836 memmove
    | 24 31 57 82 98 129 199 323 570 1064 2053 4032 memcpy
    | 21 28 53 87 155 292 566 1113 2206 4394 8768 17516 repmovsb aligned
    | 17 40 33 37 46 68 118 234 451 834 1635 3237 memmove aligned
    | 24 31 56 45 54 72 120 193 338 627 1207 2367 memcpy aligned
    | 17 27 53 89 161 306 594 1171 2325 4629 9248 18461 repmovsb blksz-1
    | 17 37 61 81 105 152 251 433 792 1513 2952 5825 memmove blksz-1
    | 20 30 56 83 100 130 202 325 572 1067 2054 4030 memcpy blksz-1
    |
    |K10 (Phenom II X2 560), glibc 2.19
    | 1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
    | 15 22 48 84 157 309 566 1080 2107 4161 8270 16487 repmovsb
    | 16 35 56 69 104 152 262 456 839 1604 3135 6201 memmove
    | 16 19 13 19 31 68 114 226 408 774 1505 2968 memcpy
    | 14 21 48 85 158 122 154 219 348 606 1122 2155 repmovsb aligned
    | 16 39 35 38 46 63 95 190 364 664 1268 2583 memmove aligned
    | 19 21 13 20 25 56 89 177 306 566 1084 2121 memcpy aligned
    | 14 21 47 83 155 300 565 1079 2106 4160 8269 16487 repmovsb blksz-1
    | 17 32 55 68 91 156 261 454 837 1602 3131 6190 memmove blksz-1
    | 17 23 13 18 30 69 114 228 411 774 1508 2966 memcpy blksz-1
    |
    |Zen (Ryzen 5 1600X), glibc 2.24
    | 1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
    | 25 33 57 105 110 119 140 184 321 599 1160 2324 repmovsb
    | 13 14 13 14 30 42 65 107 175 325 600 1222 memmove
    | 10 10 11 12 30 43 67 113 185 329 604 1226 memcpy
    | 25 33 57 83 87 95 111 143 207 335 594 1136 repmovsb aligned
    | 12 13 12 13 16 24 40 72 136 264 536 1094 memmove aligned
    | 11 11 12 11 21 27 42 74 139 267 541 1092 memcpy aligned
    | 23 32 56 90 110 120 140 184 321 600 1160 2324 repmovsb blksz-1
    | 13 13 14 13 30 42 67 108 176 325 599 1219 memmove blksz-1
    | 10 10 11 12 31 43 67 113 185 331 604 1221 memcpy blksz-1
    |
    |Zen (Ryzen 5 1600X), glibc 2.3.6 (-static)
    | 1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
    | 25 32 56 106 111 119 140 184 321 600 1161 2334 repmovsb
    | 10 18 29 36 49 77 132 263 501 940 1816 3581 memmove
    | 26 34 59 80 88 102 133 198 342 599 1114 2182 memcpy
    | 25 33 56 85 89 97 113 145 209 337 595 1145 repmovsb aligned
    | 10 18 20 19 24 40 72 137 286 542 1054 2110 memmove aligned
    | 26 34 59 50 55 70 100 165 311 567 1079 2126 memcpy aligned
    | 22 32 56 90 111 119 142 184 321 600 1161 2338 repmovsb blksz-1
    | 8 16 29 36 49 76 131 261 499 938 1814 3582 memmove blksz-1
    | 24 33 58 82 88 101 134 198 345 602 1117 2184 memcpy blksz-1

    And eternal-september is still struggling with restoration of its
    archives after the crash of 9 months ago. More and more it looks like
    they will never be restored.

    My impression pretty soon after the event was that it would not
    happen. Given that they lost the mapping between message-ids and
    article numbers, the insertion of the old messages would have been
    disruptive to clients that work with article numbers. It was bad enough as it
    was.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Anton Ertl on Sun Mar 16 01:22:09 2025
    On Wed, 12 Mar 2025 16:46:36 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    ... Trying to access groups.google.com tells me:

    |Couldn’t sign you in
    |
    |The browser you’re using doesn’t support JavaScript, or has JavaScript
    |turned off.
    |
    |To keep your Google Account secure, try signing in on a browser that
    |has JavaScript turned on.

    I certainly won't turn on JavaScript for Google, and apparently Google
    wants me to log in to a Google account to access groups.google.com. I
    don't have a Google account and I don't want one.

    But all I would do is try whether google groups finds the message-ids.
    You can do that yourself.

    You don't have to log in: just enter "group:comp.arch" (or whatever
    group you want) into your search engine [I use duckduckgo] and follow
    the link that says "Google Groups".

    Once in the group, you can move around and search within it. What
    you can't do (easily) is switch to another group or get back to the
    start point if you lose track of it ... for these things you have to
    return to your search engine and start over.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)