• Re: VAX

    From Thomas Koenig@21:1/5 to BGB on Wed Jul 30 16:24:40 2025
    BGB <cr88192@gmail.com> schrieb:

    I can't say much for or against VAX, as I don't currently have any
    compilers that target it.

    If you want to look at code, godbolt has a few gcc versions for it.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Thu Jul 31 04:26:27 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but neither the VAX designers nor their competition realized this, and commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex
    instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not expect
    compilers to improve as fast as they did, leading to a machine which was fun to
    program in assembler but full of stuff that was useless to compilers and
    instructions like POLY that should have been subroutines. The 801 project and
    PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
    presumably didn't know about it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    Related to the microcode issue they also don't seem to have anticipated how
    important pipelining would be. Some minor changes to the VAX, like not letting
    one address modify another in the same instruction, would have made it a lot
    easier to pipeline.

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    I looked into the VAX architecture handbook from 1977. The handbook claims
    that the VAX-780 used 96-bit microcode words. That is enough bits to
    control a pipelined machine at 1 instruction per cycle, provided there are
    enough execution resources (register ports, buses and 1-cycle
    execution units). However, the VAX hardware allowed only one memory
    access per cycle, so instructions with multiple memory addresses
    or using indirection through memory by necessity needed multiple
    cycles.

    I must admit that I do not understand why VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte. That
    can be done using 2 inverters and a 4-input NAND gate. For normal
    instructions the lowest bit of the opcode seems to select between 2-
    and 3-operand instructions. For a 1-byte opcode with all
    register arguments the operand specifiers are in predictable places,
    so together a modest number of gates could recognize register-only
    operand specifiers. Of course, to be sure that this is a
    register instruction one needs to look at the opcode. I am
    guessing that VAX fetches the microcode word based on the opcode,
    so this microcode word could conditionally (based on the result
    of the circuit mentioned above) pass the instruction to the pipeline
    and initiate processing of the next instruction, or start
    argument processing. Such a one-cycle conditional branch
    in general may be problematic, but I would be surprised if
    it were problematic for VAX microcode. Namely, it was
    usual for microcode to specify the address of the next microcode
    word. So with a pipeline and a small number of extra gates
    VAX should be able to do register-only instructions in
    1 cycle. Escalating a bit, with a manageable number of
    gates one should be able to recognize operands in
    "deferred mode", "autodecrement mode" and "autoincrement mode".
    For each such input operand the microcode engine could
    insert a load into the pipeline and proceed with the rest of the
    instruction. Similarly, for a write operand the microcode
    could pass the instruction to the pipeline, but also pass a
    special bit changing the destination and insert a store
    after the instruction. Once a given memory operand is
    handled, decoding gates would indicate if this was the last
    memory operand, which would allow either going to the next
    instruction or handling the next memory operand. Together,
    for normal instructions each memory operand should add
    one cycle to the execution time. Also short immediates
    could be handled in a similar way. This leaves some nasty
    cases: longer immediates, displacement and modes with
    double indirection. Displacement could probably be handled
    at the cost of an extra cycle. Other modes probably would
    cost a one or two cycle penalty.
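
    A minimal C sketch of the operand classification just described (the mode
    numbers follow the VAX operand-specifier encoding; the helper names are
    only illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* VAX operand specifier byte: high nibble = addressing mode,
       low nibble = register number. */
    enum {
        MODE_REGISTER      = 0x5,   /* Rn    */
        MODE_REG_DEFERRED  = 0x6,   /* (Rn)  */
        MODE_AUTODECREMENT = 0x7,   /* -(Rn) */
        MODE_AUTOINCREMENT = 0x8    /* (Rn)+ */
    };

    /* The "2 inverters + 4-input NAND" test: true exactly when the
       high 4 bits of the specifier byte are 0101 (register mode). */
    static bool is_register_operand(uint8_t spec)
    {
        return (spec >> 4) == MODE_REGISTER;
    }

    /* Modes that the proposed decoder would crack into one extra
       load or store micro-operation per memory operand. */
    static bool is_simple_memory_operand(uint8_t spec)
    {
        unsigned mode = spec >> 4;
        return mode == MODE_REG_DEFERRED ||
               mode == MODE_AUTODECREMENT ||
               mode == MODE_AUTOINCREMENT;
    }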

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle like in the scheme above).

    Given the actual speed of the VAX, the possibilities seem to be:
    - extra factors slowing both VAX and RISC, like cache
    misses (the VAX architecture handbook says that due to
    misses the cache had an effective access time of 290 ns),
    - VAX designers could not afford a pipeline,
    - maybe VAX designers decided to avoid a pipeline to reduce
    complexity.

    If the VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.
    Also, PDP-11 compatibility depended on microcode.
    Without a microcode engine one would need a parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing the microcode
    engine.

    To summarize, it is not clear to me if a RISC in VAX technology
    could be significantly faster than the VAX, especially given the constraint
    of PDP-11 compatibility. OTOH the VAX designers probably felt
    that the CISC nature added significant value: they understood
    that the cost of programming was significant and believed that an
    orthogonal instruction set, in particular allowing complex
    addressing on all operands, made programming simpler. They
    probably thought that providing reasonably common procedures
    as microcoded instructions made the work of programmers simpler
    even if the routines were only marginally faster than ordinary
    code. Part of this thinking was probably like the "future
    system" motivation at IBM: Digital did not want to produce
    "commodity" systems, they wanted something with unique
    features that customers will want to use. Without
    insight into the future it is hard to say that they were
    wrong.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Jul 31 16:05:14 2025
    According to Waldek Hebisch <antispam@fricas.org>:
    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    That was the plan but the people building Vaxen didn't get the memo
    so even on the original 780, it got different answers with and without
    the optional floating point accelerator.

    If they wanted more accurate results, they should have

    https://simh.trailing-edge.com/docs/vax_poly.pdf

    I must admit that I do not understand why VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte.

    It can, but autoincrement or decrement modes change the contents
    of the register so the operands have to be evaluated in strict
    order or you need a lot of logic to check for hazards and stall.

    In practice I don't think it was very common to do that, except
    for the immediate and absolute address modes which were (PC)+
    and @(PC)+, and which needed to be special cased since they took
    data from the instruction stream. The size of the immediate
    operand could be from 1 to 8 bytes depending on both the instruction
    and which operand of the instruction it was.

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle like in the scheme above).

    Right, but detecting the abnormal cases wasn't trivial.

    R's,
    John
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Thu Jul 31 19:01:36 2025
    John Levine <johnl@taugh.com> writes:
    According to Waldek Hebisch <antispam@fricas.org>:
    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    That was the plan but the people building Vaxen didn't get the memo
    so even on the original 780, it got different answers with and without
    the optional floating point accelerator.

    If they wanted more accurate results, they should have

    https://simh.trailing-edge.com/docs/vax_poly.pdf

    I must admit that I do not understand why VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte.

    It can, but autoincrement or decrement modes change the contents
    of the register so the operands have to be evaluated in strict
    order or you need a lot of logic to check for hazards and stall.

    In practice I don't think it was very common to do that, except
    for the immediate and absolute address modes which were (PC)+
    and @(PC)+, and which needed to be special cased since they took
    data from the instruction stream. The size of the immediate
    operand could be from 1 to 8 bytes depending on both the instruction
    and which operand of the instruction it was.

    Looking at the MACRO-32 source for a focal interpreter, I
    see
    CVTLF 12(SP),@(SP)+
    MOVL (SP)+, R0
    CMPL (AP)+,#1
    MOVL (AP)+,R7
    TSTL (SP)+
    MOVZBL (R8)+,R5
    BICB3 #240,(R8)+,R2
    LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
    LOCC (R8)+,S^#OPN,OPRATRS
    MOVL (SP)+,(R7)[R6]
    CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
    CASE (SP)+,<30$,20$,10$>,-
    LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
    MOVF (SP)+,@(SP)+ ;JUST DO SET

    (SP)+ was far and away the most common. (PC)+ wasn't
    used in that application.

    There were some adjacent dependencies:

    ADDB3 #48,R0,(R9)+ ;PUT AS DIGIT INTO BUFFER
    ADDB3 #48,R1,(R9)+ ;AND NEXT


    and a handful of others. Probably only a single-digit
    percentage of instructions used autoincrement/decrement and only
    a couple used the updated register in the same
    instruction.

    in some of my code from the era, I used auto-decrement frequently,
    mainly to push 8 or 16bit data onto the stack.

    ;
    ; Deallocate Virtual Memory used to buffer records in copy.
    ;
    pushl copy_in_rab+rab$l_ubf ; Record address
    movzwl copy_in_rab+rab$w_usz,-(sp) ; Record size
    pushab 4(sp)
    pushab 4(sp)
    calls #2,g^lib$free_vm ; Get rid of vm
    ret

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Jul 31 19:57:43 2025
    According to Scott Lurndal <slp53@pacbell.net>:
    Looking at the MACRO-32 source for a focal interpreter, I
    see
    CVTLF 12(SP),@(SP)+
    MOVL (SP)+, R0
    CMPL (AP)+,#1
    MOVL (AP)+,R7
    TSTL (SP)+
    MOVZBL (R8)+,R5
    BICB3 #240,(R8)+,R2
    LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
    LOCC (R8)+,S^#OPN,OPRATRS
    MOVL (SP)+,(R7)[R6]
    CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
    CASE (SP)+,<30$,20$,10$>,-
    LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
    MOVF (SP)+,@(SP)+ ;JUST DO SET

    (SP)+ was far and away the most common. (PC)+ wasn't
    used in that application.

    Wow, that's some funky code. The #240 is syntactic sugar for (PC)+
    followed by a byte with 240 (octal) in it. VAX had an immediate
    address mode that could represent 0 to 77 octal so the assembler used
    that for immediates that would fit, (PC)+ if not. The S^#OPN explicitly
    tells it to use the short immediate mode. #^A/;/ is a literal
    semicolon which fits in an immediate.
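
    A tiny C rendering of the assembler's choice described above (the helper
    name is only illustrative):

    #include <stdbool.h>

    /* Short-literal specifiers cover 0..63 (0 to 77 octal); anything
       larger is emitted as an immediate, i.e. autoincrement on PC,
       (PC)+, with the value taken from the instruction stream. */
    static bool fits_short_literal(long value)
    {
        return value >= 0 && value <= 077;
    }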

    There were some adjacent dependencies:

    ADDB3 #48,R0,(R9)+ ;PUT AS DIGIT INTO BUFFER
    ADDB3 #48,R1,(R9)+ ;AND NEXT


    and a handful of others. Probably only a single-digit
    percentage of instructions used autoincrement/decrement and only
    a couple used the updated register in the same
    instruction.

    Right, but it always had to check for it. As I said a few messages ago,
    if they didn't allow register updates to affect other operands, or changed
    the spec so the registers were all updated at the end of the instruction, it wouldn't have affected much code but would have made decoding and pipelining easier.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Thu Jul 31 21:24:29 2025
    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    Looking at the MACRO-32 source for a focal interpreter, I
    see
    CVTLF 12(SP),@(SP)+
    MOVL (SP)+, R0
    CMPL (AP)+,#1
    MOVL (AP)+,R7
    TSTL (SP)+
    MOVZBL (R8)+,R5
    BICB3 #240,(R8)+,R2
    LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
    LOCC (R8)+,S^#OPN,OPRATRS
    MOVL (SP)+,(R7)[R6]
    CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
    CASE (SP)+,<30$,20$,10$>,-
    LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
    MOVF (SP)+,@(SP)+ ;JUST DO SET

    (SP)+ was far and away the most common. (PC)+ wasn't
    used in that application.

    Wow, that's some funky code.

    .TITLE FOCAL MAIN SEGMENT
    ;FOCAL MAIN SEGMENT
    ;DAVE MONAHAN MARCH 1978

    ...

    HEADER: .ASCII /C VAX FOCAL V1.0 /
    DATE: .BLKB 24
    .ASCII / -NOT A DEC PRODUCT/

    I had it on a 9-track from 1980 that Al was nice enough to
    copy to a CD-ROM for me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to John Levine on Fri Aug 1 02:18:17 2025
    John Levine <johnl@taugh.com> wrote:
    According to Waldek Hebisch <antispam@fricas.org>:

    I must admit that I do not understand why VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte.

    It can, but autoincrement or decrement modes change the contents
    of the register so the operands have to be evaluated in strict
    order or you need a lot of logic to check for hazards and stall.

    My idea was that instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3

    where TMP is special forwarding register in the CPU. AFAICS normal
    forwarding in the pipeline would handle this. In case of

    ADDL R2, (R2)+, R3

    one would need something which we could denote

    MOV (R2)+, TMP
    ADDL R2*, TMP, R3

    where R2* denotes previous value of R2, which introduces extra
    complication, but does not look hard to handle.

    Note that I do _not_ aim at executing complex VAX instructions in
    one cycle. Rather, each memory operand is handled separately
    and they are handled in order.
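
    A C sketch of that cracking step, with a hypothetical micro-op type
    (none of this is taken from an actual VAX implementation):

    #include <stdint.h>

    enum uop_kind { UOP_LOAD, UOP_ALU };
    enum { TMP = 16 };             /* internal forwarding register, one past R15 */

    typedef struct {
        enum uop_kind kind;
        uint8_t src1, src2, dst;   /* architectural register numbers, or TMP */
        int src1_is_old_value;     /* read the pre-increment value ("R2*") */
    } uop;

    /* Crack  ADDL R2, (R2)+, R3  into the two micro-ops described above:
           LOAD (R2)+ -> TMP
           ADDL R2*, TMP -> R3
       The load also performs the autoincrement, so the ALU micro-op must
       see the old value of R2; ordinary forwarding covers TMP. */
    static int crack_addl_mem_second(uop out[2])
    {
        out[0] = (uop){ .kind = UOP_LOAD, .src1 = 2, .dst = TMP };
        out[1] = (uop){ .kind = UOP_ALU,  .src1 = 2, .src2 = TMP, .dst = 3,
                        .src1_is_old_value = 1 };
        return 2;
    }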

    In practice I don't think it was very common to do that, except
    for the immediate and absolute address modes which were (PC)+
    and @(PC)+, and which needed to be special cased since they took
    data from the instruction stream. The size of the immediate
    operand could be from 1 to 8 bytes depending on both the instruction
    and which operand of the instruction it was.

    I considered only popular integer instructions, everything else
    would be handled by microcode at the same speed as the real VAX.
    VAX had a 32-bit bus, so an 8-byte operand needed 2 cycles anyway,
    so slower decoding for such operands would not be a problem.

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle like in the scheme above).

    Right, but detecting the abnormal cases wasn't trivial.

    Maybe I was unclear, but the whole point was that distinguishing
    between normal cases and abnormal ones could be done by
    moderately complex hardware. Also, I am comparing to execution
    time for equivalent functionality: a VAX instruction with 1 memory
    operand would take 2 cycles (the same as the 2 instructions needed
    by a RISC). And I am comparing to early RISC, that is 32-bit
    integer operations. Similar speedup for floating point operations
    or for 64-bit operands would need bigger decoders; handling
    more than 1 memory operand per cycle or going superscalar
    probably would lead to too complex decoders.

    And a little correction: the proposed decoder effectively adds 1 more
    pipeline stage, so a taken jump would be 1 cycle slower than on a
    classic RISC having the same pipeline (and 2 cycles slower than a
    RISC with delayed jumps). OTOH RISC-V compressed instructions
    seem to require a similar decoding stage, so Anton's VAX-RISC-V
    would have similar timing.

    I can understand why DEC abandoned the VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against the
    superscalar machines which were on the horizon. In 1985 they
    probably realized that their features add no value in a world
    using optimizing compilers.

    But after your post I find it more likely that DEC could
    not afford a pipeline for the VAX-780: even with simple instructions
    one has to decide between accessing the register file and using a
    forwarded value, one needs interlocks to wait for cache misses,
    etc.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Fri Aug 1 17:02:33 2025
    BGB <cr88192@gmail.com> writes:
    On 7/30/2025 12:59 AM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex
    instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
    scenario that would not be a reason for going for the VAX ISA.
    ...
    But, if so, it would more speak for the weakness of VAX code density
    than the goodness of RISC-V.

    For the question at hand, what counts is that one can do a RISC that
    is more compact than the VAX.

    And neither among the Debian binaries nor among the NetBSD binaries I
    measured I have found anything consistently more compact than RISC-V
    with the C extension. There is one strong competitor, though: armhf
    (Thumb2) on Debian, which is a little smaller than RV64GC in 2 out of
    3 cases and a little larger in the third case.

    There is, however, a fairly notable size difference between RV32 and
    RV64 here, but I had usually been messing with RV64.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64

    If I were to put it on a ranking (for ISAs I have messed with), it would
    be, roughly (smallest first):
    i386 with VS2008 or GCC 3.x (*1)

    i386 has significantly larger binaries than RV64GC on both Debian and
    NetBSD, also bigger than AMD64 and ARM A64.

    For those who want to see all the numbers in one posting: <2025Jun17.161742@mips.complang.tuwien.ac.at>.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Fri Aug 1 17:25:22 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    The reproducibility did not happen.

    It actually might have been better if the ISA contained instructions
    for the individual steps. According to <http://simh.trailing-edge.com/docs/vax_poly.pdf>

    |For example, POLY specified that in the central an*x+bn step:
    |- The multiply result was truncated to 31b/63b prior to normalization.
    |- The extended-precision multiply result was added to the next coefficient.
    |- The addition result was truncated to 31b/63b prior to normalization and
    | rounding.

    One could specify an FMA instruction for that step like many recent
    ISAs have done, but I think that the reproducibility would be better
    if the truncation was a separate instruction. And of course, all of
    this would require having at least a few registers with extra bits.
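
    For reference, the building block POLY implements is a Horner step; a
    plain C sketch of such an evaluation (the coefficient ordering here is
    illustrative, not necessarily the VAX table order, and the truncation
    rules quoted above are omitted):

    /* Horner's rule: r = c[0]*x^degree + c[1]*x^(degree-1) + ... + c[degree].
       Each iteration is one multiply-add step, which is why an FMA
       instruction (or an FMA plus an explicit truncation instruction)
       could serve as the subroutine building block instead of POLY. */
    static double poly_eval(double x, const double *c, int degree)
    {
        double r = c[0];
        for (int i = 1; i <= degree; i++)
            r = r * x + c[i];
        return r;
    }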

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what
    RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a
    MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of
    conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    I looked into the VAX architecture handbook from 1977. The handbook claims
    that the VAX-780 used 96-bit microcode words. That is enough bits to
    control a pipelined machine at 1 instruction per cycle, provided there are
    enough execution resources (register ports, buses and 1-cycle
    execution units). However, the VAX hardware allowed only one memory
    access per cycle, so instructions with multiple memory addresses
    or using indirection through memory by necessity needed multiple
    cycles.

    I must admit that I do not understand why VAX needed so many
    cycles per instruction.

    It was not pipelined much. Assuming a totally unpipelined machine, an
    ADD3.L R1,R2,R3 instruction might be executed in the following steps:

    decode add3.l
    decode first operand (r1)
    read r1 from the register file | decode second operand (r2)
    read r2 from the register file
    add r1 and r2 | decode r3
    write the result to r3

    That's 6 cycles, and without any cycles for instruction fetching.

    For a 1-byte opcode with all
    register arguments the operand specifiers are in predictable places,
    so together a modest number of gates could recognize register-only
    operand specifiers.

    Yes, but they wanted to implement the VAX, where every operand can be
    anything. If they thought that focusing on register-only instructions
    was the way to go, they would not have designed the VAX, but the IBM
    801. The ISA was designed for a non-pipelined microcoded
    implementation, obviously without any thought given to future
    pipelined implementations, and that's how the VAX 11/780 was
    implemented.

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle like in the scheme above).

    The VAX 11/780 was not pipelined. The VAX 8700/8800 (introduced 1986,
    but apparently the 8800 replaced the 8600 as high-end VAX only
    starting from 1987) was pipelined at the microcode level, like you
    suggest, but despite having a 4.4 times higher clock rate, the 8700
    achieved only 6 VUP, i.e., 6 times the VAX 11/780 performance (the
    8800 just had two CPUs, but each CPU with the same speed). So if the
    VAX 11/780 takes 10 cycles/instruction on average, the VAX 8700 still
    takes 7.4 cycles per instruction on average, whereas typical RISCs
    contemporary with the VAX 8700 required <2 CPI. They needed
    more instructions, but the bottom line was still a big speed
    advantage for the RISCs.

    A few years later, there was the pipelined 91MHz NVAX+ with 35 VUP,
    and, implemented in the same process, the 200MHz 21064 with 106.9
    SPECint92 and 159.6 SPECfp92 (https://ntrs.nasa.gov/api/citations/19960008936/downloads/19960008936.pdf). Note that both VUP and SPEC92 scale relative to the VAX 11/780 (i.e.,
    the 11/780 has 1 VUP and SPEC92 int and fp results of 1). So we see
    that they did not manage to get the NVAX+ up to the same clock rate as
    the 21064 in the same process, and that the performance disadvantage
    of the VAX is even higher than the clock rate disadvantage.

    Given the actual speed of the VAX, the possibilities seem to be:
    - extra factors slowing both VAX and RISC, like cache
    misses (the VAX architecture handbook says that due to
    misses the cache had an effective access time of 290 ns),
    - VAX designers could not afford a pipeline,
    - maybe VAX designers decided to avoid a pipeline to reduce
    complexity.

    Yes to all. And even when they finally pipelined the VAX, it was far
    less effective than for RISCs.

    If the VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    RISCs like the ARM, MIPS R2000, and SPARC implemented a pipelined
    integer instruction set in one chip in 1985/86, with the R2000 running
    at up to 12.5 MHz. At around the same time the MicroVAX 78032 appeared
    with a similar number of transistors (R2000 110,000, 78032 125,000).
    The 78032 runs at 5MHz and has a similar performance to the VAX
    11/780. So for these single-chip implementations, the RISC could be
    pipelined (and clocked higher), whereas the VAX could not*. I expect
    that with the resources needed for the VAX 11/780, a pipelined RISC
    could be implemented.

    * And did the 78032 implement the whole integer instruction set? I
    have certainly read about MicroVAXen that trapped rare instructions
    and implemented them in software.

    Also, PDP-11 compatibility depended on microcode.
    Without a microcode engine one would need a parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing the microcode
    engine.

    The PDP-11 instruction set is relatively simple. I expect that the
    effort for decoding it to the RISC-VAX (whether in hardware or with
    microcode) would not take that many resources.

    To summarize, it is not clear to me if a RISC in VAX technology
    could be significantly faster than the VAX

    They were significantly faster in later technologies, and the IBM 801 demonstrates the superiority of RISC at around the time of the VAX, so
    it is very likely that a pipelined and faster RISC-VAX would have been
    doable with the resources of the VAX.

    Without
    insight into the future it is hard to say that they were
    wrong.

    It's now the past. And now we have all the data to see that the
    result was certainly not very future-proof, and very likely not even
    the best-performing design possible at the time. But ok, they did not
    know better, that's why there's a time-machine involved in my
    scenario.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Sat Aug 2 09:02:37 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    I can understand why DEC abandoned the VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against the
    superscalar machines which were on the horizon. In 1985 they
    probably realized that their features add no value in a world
    using optimizing compilers.

    Optimizing compilers increase the advantages of RISCs, but even with a
    simple compiler Berkeley RISC II (which was made by hardware people,
    not compiler people) has between 85% and 256% of VAX (11/780) speed.
    It also has 16-bit and 32-bit instructions for improved code density
    and (apparently from memory bandwidth issues) performance.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sat Aug 2 15:33:07 2025
    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret

    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.

    gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following AMD64 code:

    000000001139 <arrays>:
    1139: 48 85 f6 test %rsi,%rsi
    113c: 74 13 je 1151 <arrays+0x18>
    113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
    1142: 31 c0 xor %eax,%eax
    1144: 48 03 07 add (%rdi),%rax
    1147: 48 83 c7 08 add $0x8,%rdi
    114b: 48 39 d7 cmp %rdx,%rdi
    114e: 75 f4 jne 1144 <arrays+0xb>
    1150: c3 ret
    1151: 31 c0 xor %eax,%eax
    1153: c3 ret

    000000001154 <globals>:
    1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
    115b: 56 34 12
    115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
    1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
    116c: 12 ef cd
    116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
    1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
    117d: 90 78 56
    1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
    1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
    118e: 90 78 56
    1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
    1198: c3 ret

    gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following ARM A64 code:

    0000000000000734 <arrays>:
    734: b4000121 cbz x1, 758 <arrays+0x24>
    738: aa0003e2 mov x2, x0
    73c: d2800000 mov x0, #0x0 // #0
    740: 8b010c43 add x3, x2, x1, lsl #3
    744: f8408441 ldr x1, [x2], #8
    748: 8b010000 add x0, x0, x1
    74c: eb03005f cmp x2, x3
    750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
    754: d65f03c0 ret
    758: d2800000 mov x0, #0x0 // #0
    75c: d65f03c0 ret

    0000000000000760 <globals>:
    760: d299bde2 mov x2, #0xcdef // #52719
    764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
    768: f2b21562 movk x2, #0x90ab, lsl #16
    76c: 9100e020 add x0, x1, #0x38
    770: f2cacf02 movk x2, #0x5678, lsl #32
    774: d2921563 mov x3, #0x90ab // #37035
    778: f2e24682 movk x2, #0x1234, lsl #48
    77c: f9001c22 str x2, [x1, #56]
    780: d2824682 mov x2, #0x1234 // #4660
    784: d299bde1 mov x1, #0xcdef // #52719
    788: f2aacf03 movk x3, #0x5678, lsl #16
    78c: f2b9bde2 movk x2, #0xcdef, lsl #16
    790: f2a69561 movk x1, #0x34ab, lsl #16
    794: f2c24683 movk x3, #0x1234, lsl #32
    798: f2d21562 movk x2, #0x90ab, lsl #32
    79c: f2d20241 movk x1, #0x9012, lsl #32
    7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
    7a4: f2eacf02 movk x2, #0x5678, lsl #48
    7a8: f2eacf01 movk x1, #0x5678, lsl #48
    7ac: a9008803 stp x3, x2, [x0, #8]
    7b0: f9000c01 str x1, [x0, #24]
    7b4: d65f03c0 ret

    So, the overall sizes (including data size for globals() on RV64GC) are:

    arrays globals Architecture
    28 66 (34+32) RV64GC
    27 69 AMD64
    44 84 ARM A64

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
    advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64


    I guess it can be noted, is the overhead of any ELF metadata being excluded?...

    These are sizes of the .text section extracted with objdump -h. So
    no, these numbers do not include ELF metadata, nor the sizes of other
    sections. The latter may be relevant, because RV64GC has "immediates"
    in .sdata that other architectures have in .text; however, .sdata can
    contain other things than just "immediates", so one cannot just add the
    .sdata size to the .text size.

    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.

    The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
    from auto-vectorization).

    There is one other improvement: gcc register allocation has improved
    in recent years to a point where we 1) no longer need explicit
    register allocation for Gforth on AMD64, and 2) with a lot of manual
    help, we could increase the number of stack cache registers from 1 to
    3 on AMD64, which gives some speedups typically in the 0%-20% range in
    Gforth.

    But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
    which is vectorizable, I still have not been able to get gcc to
    auto-vectorize it, even with some transformations which should help.
    I have not measured the scalar versions again, but given that there
    were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
    I doubt that I will see consistent speedups with newer gcc (or clang)
    versions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Tue Aug 5 22:17:00 2025
    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX. Some
    modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc', but
    when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on the value of rbx from the previous iteration, but
    the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8] and
    [r9+rcx*8]. It does not depend on the previous value of rbx, except for a
    control dependency that hopefully would be speculated around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...
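
    A scalar C sketch of that three-way add, just to make the cross-iteration
    carry chain explicit (word order and the helper name are illustrative;
    this uses the gcc/clang __builtin_add_overflow builtin):

    #include <stddef.h>
    #include <stdint.h>

    /* r = a + b + c over n 64-bit words, least significant word first.
       Each word's result needs the carry produced by the previous word,
       which is the dependency chain in question; the carry between
       iterations can be 0, 1 or 2. */
    static unsigned bigint_add3(uint64_t *r, const uint64_t *a,
                                const uint64_t *b, const uint64_t *c,
                                size_t n)
    {
        unsigned carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t s = a[i];
            unsigned cout = 0;
            cout += __builtin_add_overflow(s, b[i], &s);
            cout += __builtin_add_overflow(s, c[i], &s);
            cout += __builtin_add_overflow(s, (uint64_t)carry, &s);
            r[i] = s;
            carry = cout;
        }
        return carry;
    }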

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Scott Lurndal on Tue Aug 5 20:34:27 2025
    Scott Lurndal <scott@slp53.sl.home> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    I can understand why DEC abandoned the VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against the
    superscalar machines which were on the horizon. In 1985 they
    probably realized that their features add no value in a world
    using optimizing compilers.

    Optimizing compilers increase the advantages of RISCs, but even with a
    simple compiler Berkeley RISC II (which was made by hardware people,
    not compiler people) has between 85% and 256% of VAX (11/780) speed.
    It also has 16-bit and 32-bit instructions for improved code density
    and (apparently from memory bandwidth issues) performance.

    The basic question is if VAX could afford the pipeline. VAX had a
    rather complex memory and bus interface, and the cache added complexity
    too. Ditching microcode could allow more resources for the execution
    path. Clearly VAX could afford and probably had a 1-cycle 32-bit
    ALU. I doubt that they could afford a 1-cycle multiply or
    even a barrel shifter. So they needed a sequencer for sane
    assembly programming. I am not sure what technology they used
    for the register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port. A multiported register file
    probably would need a lot of separate register chips and a
    multiplexer. Alternatively, they could try some very fast
    RAM and run it at a multiple of the base clock frequency (66 ns
    cycle time caches were available at that time, so 3 ports
    via multiplexing seem possible). But any of this adds
    considerable complexity. A sane pipeline needs interlocks
    and forwarding.

    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Using the terminology of the late seventies, VAX was a mixture of SSI,
    MSI and LSI chips. I am not sure if VAX used them, but there
    were 4-bit TTL ALU chips; 8 such chips would give a 32-bit ALU
    (for better speed one would add carry propagation chips,
    which would increase the chip count).

    Probably only memory used LSI chips. That could add a bias
    toward microcode: microcode used the densest MOS chips (memory) and
    replaced less dense random TTL logic. After switching to CMOS,
    on-chip logic was more comparable to memory, so the balance
    shifted.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Wed Aug 6 00:21:25 2025
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on the value of rbx from the previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for a control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...

    Terje



    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay the
    heavy price of a branch misprediction. But outside of specially crafted
    inputs it is extremely rare.
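
    A scalar C rendering of that prediction (purely illustrative, again using
    the gcc/clang __builtin_add_overflow builtin): the common path assumes the
    incoming carry does not produce a new carry, and the rarely taken branch
    corresponds to the misprediction case.

    #include <stdint.h>

    /* One word of the three-way add. The carry-out is predicted from
       a + b + c alone; adding carry_in generates an extra carry only
       on the rarely taken branch. */
    static unsigned add3_word_predicted(uint64_t *dst, uint64_t a, uint64_t b,
                                        uint64_t c, unsigned carry_in)
    {
        uint64_t s = a;
        unsigned carry_out = 0;
        carry_out += __builtin_add_overflow(s, b, &s);
        carry_out += __builtin_add_overflow(s, c, &s);
        if (__builtin_add_overflow(s, (uint64_t)carry_in, &s))
            carry_out += 1;        /* rare: the prediction was wrong */
        *dst = s;
        return carry_out;
    }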

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Michael S on Tue Aug 5 21:13:50 2025
    XPost: comp.lang.c

    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 2025-08-04 15:03, Michael S wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    ...
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt
    to use it has undefined behavior. That's exactly why new keywords
    are often defined with that ugly syntax.


    That is a language lawyer's type of reasoning. Normally gcc
    maintainers are wiser than that because, well, by chance gcc
    happens to be a widely used production compiler. I don't know why
    this time they chose the less conservative road.

    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    (I don't know that to be true; an extension has to be documented other
    than by omission. But anyway, if the GCC documentation says somewhere
    something like, "no other identifier is reserved in this version of
    GCC", then it means that the remaining portions of the reserved
    namespaces are available to the program. Since it is undefined behavior
    to use those identifiers (or in certain ways in certain circumstances,
    as the case may be), being able to use them with the documentation's
    blessing constitutes use of a documented extension.)

    I would guess, up until this calendar year.
    Introducing a new extension without a way to disable it is different from supporting gradually introduced extensions, typically with names that
    start with a double underscore and often starting with __builtin.

    __builtin is also in a standard-defined reserved namespace: the double
    underscore namespace. It is no more or less conservative to name
    something __BitInt than to name it _BitInt.


    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Keith Thompson on Tue Aug 5 21:25:17 2025
    XPost: comp.lang.c

    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a
    library. Libraries are provided by other projects: Glibc, Musl,
    ucLibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    It would be unthinkable for GCC to introduce, say, an extension
    using the identifier __libc_malloc.

    In addition to libraries, if some other important project that serves as
    a base package in many distributions happens to claim identifiers in
    those spaces, it wouldn't be wise for GCC (or the C libraries) to start
    taking them away.

    You can't just rename the identifier out of the way in the offending
    package, because that only fixes the issue going forward. Older versions
    of the package can't be compiled with the new compiler without a patch. Compiling older things with newer GCC happens.

    There are always the questions:

    1. Is there an issue? Is anything broken?

    2. If so, is what is broken important such that it becomes a showstopper
    if the compiler change is rolled out (major distros are on fire?)

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tkoenig@netcologne.de on Tue Aug 5 17:41:30 2025
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for the register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.

    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Keith Thompson on Wed Aug 6 04:31:59 2025
    XPost: comp.lang.c

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will break
    any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library using _CreamPuff as an identifier, or of a compiler which misbehaves when a
    program uses it, on grounds of it being undefined behavior. Someone
    using _CreamPuff in their code is taking a risk that is vanishingly
    small, the same way that introducing _BitInt is a risk that is
    vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced some identifier is vastly larger than the audience of implementations that a
    given program will face that has introduced some funny identifier.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Wed Aug 6 05:53:22 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Kaz Kylheku on Wed Aug 6 11:48:09 2025
    XPost: comp.lang.c

    On Wed, 6 Aug 2025 04:31:59 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com>
    wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring
    to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will
    break any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.


    Exactly.
    The World is a very big place. Even nowadays it is not completely
    transparent. Even those parts that are publicly visible in theory have
    not necessarily been observed recently by any single person, even if
    the person in question is Keith.
    Besides, as far as I understand, the majority of gcc users haven't yet
    migrated to gcc 14 or 15.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library
    using _CreamPuff as an identifier, or of a compiler which misbehaves
    when a program uses it, on grounds of it being undefined behavior.
    Someone using _CreamPuff in their code is taking a risk that is
    vanishingly small, the same way that introducing _BitInt is a risk
    that is vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced
    some identifier is vastly larger than the audience of implementations
    that a given program will face that has introduced some funny
    identifier.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Wed Aug 6 11:10:46 2025
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Wed Aug 6 16:19:11 2025
    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?

    No, you are not. I skipped pretty much all the setup code. :-)


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when it is coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on value of rbx from previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...


    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay the
    heavy price of a branch misprediction. But outside of specially crafted
    inputs it is extremely rare.

    Aha!

    That's _very_ nice.
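
    For concreteness, a minimal C sketch of the trick (invented names,
    not the actual loop under discussion): the carry handed to the next
    limb is computed from a[i]+b[i]+c[i] alone, and the incoming carry
    only changes it through a rarely taken branch that the hardware can
    speculate past.

        #include <stddef.h>
        #include <stdint.h>

        /* dst = a + b + c over n 64-bit limbs, carry chain broken by
         * predicting that the incoming carry adds no extra carry-out */
        void add3(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                  const uint64_t *c, size_t n)
        {
            unsigned carry = 0;               /* 0..2 after a 3-way add */
            for (size_t i = 0; i < n; i++) {
                uint64_t s = a[i] + b[i];
                unsigned cout = (s < a[i]);   /* carry out of a+b       */
                uint64_t t = s + c[i];
                cout += (t < s);              /* carry out of a+b+c     */
                uint64_t r = t + carry;       /* fold in incoming carry */
                if (r < t)                    /* the rare misprediction */
                    cout += 1;
                dst[i] = r;
                carry = cout;
            }
        }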

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to George Neuner on Wed Aug 6 10:23:26 2025
    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for the register file. To me, fast RAM seems most likely, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.
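
    A small C model of that arrangement, just to make it concrete (a
    sketch only, with invented names; 16 registers of 32 bits as on the
    780): two identical banks provide the two read ports, and the single
    write at the end of the cycle goes to both, so the copies never
    diverge.

        #include <stdint.h>

        enum { NREGS = 16 };

        typedef struct {
            uint32_t bank_a[NREGS];   /* copy read by port A */
            uint32_t bank_b[NREGS];   /* copy read by port B */
        } regfile_2r1w;

        /* read phase: the two ports may address different registers */
        static uint32_t read_a(const regfile_2r1w *rf, unsigned r)
        { return rf->bank_a[r]; }
        static uint32_t read_b(const regfile_2r1w *rf, unsigned r)
        { return rf->bank_b[r]; }

        /* write phase: one write port, applied to both copies */
        static void write_reg(regfile_2r1w *rf, unsigned r, uint32_t v)
        { rf->bank_a[r] = v; rf->bank_b[r] = v; }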

    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780's second read port is likely used the same as on later VAXen:
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode
    bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDL3 (r2)+, (r2)+, (r2)+

    the first (left) operand reads r2 then adds 4, which the second r2 reads
    and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.
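
    To make that guess concrete, a rough C model of such a scheme
    (invented structure, not DEC's actual microcode): a register modified
    by autoincrement is shadowed in the temp bank, later operand decodes
    naming the same register read the shadow, and the shadows are copied
    back only if the instruction completes without exception.

        #include <stdbool.h>
        #include <stdint.h>

        enum { NGPR = 16 };

        typedef struct {
            uint32_t gpr[NGPR];       /* architectural registers             */
            uint32_t temp[NGPR];      /* microcode temp bank                 */
            bool     in_temp[NGPR];   /* "comparator" hit: use the temp copy */
        } vax_regs;

        static uint32_t read_reg(vax_regs *s, int r)
        { return s->in_temp[r] ? s->temp[r] : s->gpr[r]; }

        /* (Rn)+ operand: return the current value, stash the incremented
         * copy in the temp bank instead of the real register */
        static uint32_t autoinc(vax_regs *s, int r, uint32_t size)
        {
            uint32_t addr = read_reg(s, r);
            s->temp[r] = addr + size;
            s->in_temp[r] = true;
            return addr;
        }

        /* end of instruction, no exception: make side effects architectural */
        static void commit(vax_regs *s)
        {
            for (int r = 0; r < NGPR; r++)
                if (s->in_temp[r]) { s->gpr[r] = s->temp[r]; s->in_temp[r] = false; }
        }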

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Kaz Kylheku on Wed Aug 6 11:54:57 2025
    XPost: comp.lang.c

    On 2025-08-05 17:13, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    ...
    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    No, I meant very specifically that if, as reported, _BitInt was
    supported even in earlier versions, then it was supported as an extension.
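
    For readers following along, a small illustration of what is at stake
    (the first declaration is C23, and per the discussion above is also
    accepted by gcc 14 as an extension in earlier language modes; the
    second is the kind of pre-existing code that could in principle
    clash):

        /* C23 keyword for a bit-precise integer type: a 12-bit counter */
        unsigned _BitInt(12) counter = 0;

        /* Pre-C23 code that had claimed the reserved name for itself.
         * This was always undefined behavior (identifiers starting with
         * an underscore and an upper-case letter are reserved), and it
         * becomes a hard error once _BitInt is a keyword:
         *
         *     int _BitInt = 42;
         */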

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Kaz Kylheku on Wed Aug 6 11:56:04 2025
    XPost: comp.lang.c

    On 2025-08-05 17:25, Kaz Kylheku wrote:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a library. Libraries are provided by other projects: Glibc, Musl,
    uClibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    GCC cannot be implemented in such a way as to create a fully conforming implementation of C when used in connection with an arbitrary
    implementation of the C standard library. This is just one example of a
    more general potential problem: Both gcc and the library must use some
    reserved identifiers, and they might have made conflicting choices.
    That's just one example of the many things that might prevent them from
    being combined to form a conforming implementation of C. It doesn't mean
    that either one is defective. It does mean that the two groups of
    implementors should consider working together to resolve the conflicts.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Wed Aug 6 20:06:00 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs that were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but ran at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Wed Aug 6 17:00:03 2025
    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.
    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Wed Aug 6 21:14:07 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs
    would have been far too expensive to use to build a RISC CPU,
    especially for one of the BUNCH, for whom backward compatibility was
    paramount.

    [*] The machine (Unisys V530) sold for well over a megabuck in
    a single processor configuration.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Wed Aug 6 17:57:03 2025
    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent state machines, each with its own logic sequencer.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to EricP on Wed Aug 6 23:43:12 2025
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:

    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA, which
    used PALs which were not available to the VAX 11/780 designers, so it
    could be clocked a bit higher, but at a multiple of the performance
    than the VAX.

    So, Anton visiting DEC or me visiting Data General could have brought
    them a technology which would significantly outperformed the VAX
    (especially if we brought along the algorithm for graph coloring. Some
    people at IBM would have been peeved at having somebody else "develop"
    this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR
    matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Wed Aug 6 20:41:44 2025
    EricP wrote:

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
    ^^^^
    Oops... typo. Should be FPLA.
    PAL or Programmable Array Logic was a slightly different thing,
    also an AND-OR matrix from Monolithic Memories.

    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    And PAL's too. Whatever works and is cheapest.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Thomas Koenig on Thu Aug 7 11:16:20 2025
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    Russians in the late sixties proposed graph coloring as a way of
    doing memory allocation (and proved that optimal allocation is
    equivalent to graph coloring). They also proposed heuristics
    for graph coloring and experimentally showed that they
    are reasonably effective. This is not the same thing as
    register allocation, but the connection is rather obvious.
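
    As a reminder of how little machinery the core idea needs, a minimal
    greedy colouring of an interference graph in C (a sketch only;
    Chaitin-style allocators such as the one in the 801's PL.8 compiler
    add simplification, spill-cost estimation and coalescing on top):

        #include <stdbool.h>

        enum { NVARS = 6, NCOLORS = 3 };  /* 6 live ranges, 3 registers */

        /* returns -1 on success, else the index of a live range to spill */
        int color_greedy(bool interferes[NVARS][NVARS], int color[NVARS])
        {
            for (int v = 0; v < NVARS; v++) {
                bool used[NCOLORS] = { false };
                for (int u = 0; u < v; u++)        /* colours of neighbours */
                    if (interferes[v][u] && color[u] >= 0)
                        used[color[u]] = true;
                color[v] = -1;
                for (int c = 0; c < NCOLORS; c++)
                    if (!used[c]) { color[v] = c; break; }
                if (color[v] < 0)
                    return v;                      /* no colour left: spill */
            }
            return -1;
        }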

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Robert Swindells on Thu Aug 7 10:47:50 2025
    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lars Poulsen on Thu Aug 7 11:21:56 2025
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type would make it very hard to adapt portable code, such as TCP/IP protocol processing.
    ...
    My concern is how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16 bit?
    Or did the compiler have native types __int16 etc?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.
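
    A sketch of what that looks like in practice (hypothetical code for a
    Cray-like implementation where char is 8 bits and every other integer
    type is 64 bits): a big-endian 16-bit header field is dug out of a
    64-bit word with shifts and masks, no 16-bit type required.

        typedef unsigned long word64;   /* 64 bits on such a machine */

        /* Extract the big-endian 16-bit field starting byte_off bytes
         * into a buffer of 64-bit words, assuming (to keep the sketch
         * short) that the field does not straddle a word boundary. */
        static word64 get_be16(const word64 *buf, unsigned byte_off)
        {
            word64   w  = buf[byte_off / 8];
            unsigned sh = (6 - byte_off % 8) * 8;  /* shift field to bits 15..0 */
            return (w >> sh) & 0xffff;
        }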

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to EricP on Thu Aug 7 11:29:46 2025
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    IIUC the description of the IBM 360-85, it had a pipeline which was much
    more aggressively clocked than the VAX. The 360-85 probably used ECL, but
    at VAX clock speeds it should be easily doable in Schottky TTL
    (as used in the VAX).

    The question is could one build this at a commercially competitive price?

    Yes.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Thu Aug 7 11:59:35 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs
    would have been far too expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to backwards compatibility?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Thu Aug 7 11:38:54 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now
    become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent
    state machines, each with its own logic sequencer.

    I am confused. You gave a possible answer in the posting you are
    replying to.

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.
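
    Roughly, the software path on a TLB miss looks like this (a sketch
    with invented names and a made-up two-level table layout, not the
    actual MIPS handler, which lives in assembly and assumes the tables
    are mapped where the code can reach them):

        #include <stdint.h>

        #define PAGE_SHIFT 12
        #define L1_SHIFT   22
        #define PTE_VALID  0x1u

        extern uint32_t *l1_table;                         /* root of the page table */
        extern void tlb_write(uint32_t vpn, uint32_t pte); /* hypothetical TLB insert */

        void tlb_miss_handler(uint32_t bad_vaddr)
        {
            uint32_t l1e = l1_table[bad_vaddr >> L1_SHIFT];
            if (!(l1e & PTE_VALID))
                return;                      /* hand off to the page-fault path */

            uint32_t *l2 = (uint32_t *)(uintptr_t)(l1e & ~0xfffu);
            uint32_t pte = l2[(bad_vaddr >> PAGE_SHIFT) & 0x3ffu];
            if (!(pte & PTE_VALID))
                return;                      /* page fault */

            tlb_write(bad_vaddr >> PAGE_SHIFT, pte);  /* refill, then retry access */
        }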

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 7 13:34:26 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:

    ...
    My concern is how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16 bit?
    Or did the compiler have native types __int16 etc?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.

    The more likely solution would be to push the protocol processing
    into an attached I/O processor, in those days.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 7 15:03:23 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs would have been far too expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to backwards compatibility?

    Because you need to sell it. Without disrupting your existing
    customer base.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Michael S on Tue Aug 5 21:08:53 2025
    XPost: comp.lang.c

    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

    They invented an identifier which lands in the _[A-Z].* namespace
    designated as reserved by the standard.

    What would be an example of a more conservative way to name the
    identifier?

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Fri Aug 8 10:08:43 2025
    Anton Ertl wrote:
    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.
    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    I don't know why they think these are problems with the 82S100.
    These complaints sound like they come from a hobbyist.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton

    Yes. This risc-VAX would have to decode 1 instruction per clock to
    keep a pipeline full, so I envision running the fetch buffer
    through a bank of those PLAs and generating a uOp out.

    I don't know whether the instructions can be byte aligned variable size
    or have to be fixed 32-bits in order to meet performance requirements.
    I would prefer the flexibility of variable size but
    the Fetch byte alignment shifter adds a lot of logic.

    If variable, then high-frequency instructions like MOV rd,rs
    and ADD rsd,rs fit into two bytes. The longest instruction looks like
    12 bytes: a 4-byte operation specifier (opcode plus registers)
    plus an 8-byte FP64 immediate.

    If a variable size instruction arranges that all the critical parse
    information is located in the first 8-16 bits then we can just run
    those bits through PLAs in parallel and have that control the
    alignment shifter as well as generate the uOp.
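
    In C terms the PLA's job is then no more than this (an invented
    encoding, purely to illustrate the mechanism): the leading bits alone
    determine the length, so the alignment shifter can advance without
    looking at the rest of the instruction.

        #include <stddef.h>
        #include <stdint.h>

        /* invented format field: top two bits of the first byte give the size */
        static size_t insn_length(uint8_t first)
        {
            switch (first >> 6) {
            case 0:  return 2;     /* two-byte reg-reg forms: MOV rd,rs etc.   */
            case 1:  return 4;     /* opcode + registers + short operands      */
            case 2:  return 8;     /* 4-byte specifier + 32-bit immediate      */
            default: return 12;    /* 4-byte specifier + 8-byte FP64 immediate */
            }
        }

        size_t decode_step(const uint8_t *fetch_buf, size_t pos)
        {
            size_t len = insn_length(fetch_buf[pos]);  /* drives the shifter */
            /* ...the same leading bits feed the uOp-generating PLAs... */
            return pos + len;
        }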

    I envision this Fetch buffer alignment shifter built from tri-state
    buffers rather than muxes as TTL muxes are very slow and we would
    need a lot of them.

    The whole fetch-parse-decode should fit on a single board.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Anton Ertl on Fri Aug 8 19:48:59 2025
    On Fri, 08 Aug 2025 06:16:51 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    George Neuner <gneuner2@comcast.net> writes:

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache.

    No such cache in the P6 or any of its descendants until Sandy
    Bridge (2011). The Pentium 4 has a microop cache, but it was eventually
    (with Core Duo, Core 2 Duo) replaced with P6 descendants that have
    no microop cache. Actually, the Core 2 Duo has a loop buffer which
    might be seen as a tiny microop cache. Microop caches and loop
    buffers still have to contain information about which microops belong
    to the same CISC instruction, because otherwise the reorder buffer
    could not commit/execute* CISC instructions.

    * OoO microarchitecture terminology calls what the reorder buffer does
    "retire" or "commit". But this is where the speculative execution
    becomes architecturally visible ("commit"), so from an architectural
    view it is execution.

    Followups set to comp.arch

    - anton

    Thanks for the correction. I did fair amount of SIMD coding for
    Pentium II, III and IV, so was more aware of their architecture. After
    the IV, I moved on to other things so haven't kept up.

    Question:
    It would seem that, lacking the microop cache, the decoder would need
    to be involved for, e.g., every iteration of a loop, and there would
    be more pressure on $I1. Did these prove to be a bottleneck for the
    models lacking that cache? [either? or something else?]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to ThatWouldBeTelling@thevillage.com on Fri Aug 8 21:43:11 2025
    On Wed, 06 Aug 2025 10:23:26 -0400, EricP
    <ThatWouldBeTelling@thevillage.com> wrote:

    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for the register file. To me, fast RAM seems most likely, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where
    destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.

    I was aware of this (thank you), but I was trying to figure out why
    the VAX - particularly early ones - would need it. And also it does
    not mesh with Waldek's comment [at top] about 3 copies.


    The VAX did have one (pathological?) address mode:

    displacement deferred indexed @dis(Rn)[Rx]

    in which Rn and Rx could be the same register. It is the only mode
    for which a single operand could reference a given register more than
    once. I never saw any code that actually did this, but the manual
    does say it was possible.

    But even with this situation, it seems that the register would only
    need to be read once (per operand, at least) and the value could be
    used twice.
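
    Spelled out (a sketch of the architectural definition, not of any
    particular implementation): the base mode fetches a longword address
    from memory at Rn plus the displacement, and the index register,
    scaled by the operand length, is added to it; reading Rn once and
    reusing the value covers the Rn == Rx case.

        #include <stdint.h>

        extern uint32_t mem_read32(uint32_t addr);  /* hypothetical memory read */

        /* effective address of @dis(Rn)[Rx] for an operand of oplen bytes */
        uint32_t ea_disp_deferred_indexed(const uint32_t r[16], int n, int x,
                                          int32_t dis, uint32_t oplen)
        {
            uint32_t rn   = r[n];                   /* one read, even if n == x */
            uint32_t rx   = (n == x) ? rn : r[x];   /* value simply reused      */
            uint32_t base = mem_read32((uint32_t)(rn + dis));
            return base + rx * oplen;
        }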


    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780's second read port is likely used the same as on later VAXen:
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode
    bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDL3 (r2)+, (r2)+, (r2)+

    You mean exceptions? Exceptions were handled between instructions.
    VAX had no iterating string-copy/move instructions so every
    instruction logically could stand alone.

    VAX separately identified the case where the instruction completed
    with a problem (trap) from where the instruction could not complete
    because of the problem (fault), but in both cases it indicated the
    offending instruction.


    the first (left) operand reads r2 then adds 4, which the second r2 reads
    and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Aug 9 08:07:12 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Sat Aug 9 09:04:40 2025
    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not, but that's a LOT of
    speculation with hindsight-colored glasses. Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not. But as with all alternate history, this is
    completely unknowable.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Sat Aug 9 10:00:54 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level) before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line, >>>>although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    Sure. IBM was in less than no hurry to make a product out of
    the 801.


    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performance RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Sat Aug 9 10:03:29 2025
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,
    - with 8 optional XOR output invertors,
    - driving 8 tri-state or open collector buffers.

    So I count roughly 7 or 8 equivalent gate delays.
    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.
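
    Going back to the logic-equivalent structure listed above, here is a
    minimal C model of such a 16-input/48-term/8-output PLA (the fuse-map
    representation and all names are mine, not Signetics'; a real part is
    programmed once, not evaluated in a loop):

        #include <stdint.h>

        /* Hypothetical fuse map for an 82S100-style PLA: 48 product terms
           over 16 inputs (true or complemented), 8 sum outputs, each with
           an optional XOR output inversion. */
        struct pla {
            uint16_t and_true[48];   /* inputs that must be 1 for term i */
            uint16_t and_false[48];  /* inputs that must be 0 for term i */
            uint8_t  or_mask[48];    /* which of the 8 outputs term i feeds */
            uint8_t  xor_invert;     /* per-output XOR inversion mask */
        };

        static uint8_t pla_eval(const struct pla *p, uint16_t in)
        {
            uint8_t out = 0;
            for (int i = 0; i < 48; i++) {
                /* AND plane: the term fires iff all its required true and
                   complemented inputs match. */
                if ((in & p->and_true[i]) == p->and_true[i] &&
                    ((uint16_t)~in & p->and_false[i]) == p->and_false[i])
                    out |= p->or_mask[i];        /* OR plane */
            }
            return out ^ p->xor_invert;          /* optional output inversion */
        }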

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Sat Aug 9 20:54:07 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.


    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Al Kossow@21:1/5 to Thomas Koenig on Sat Aug 9 14:57:03 2025
    On 8/9/25 1:54 PM, Thomas Koenig wrote:

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs?

    using typicals was a rookie mistake
    also not comparing delay times across vendors

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Sun Aug 10 12:06:46 2025
    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    [snip]
    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    Ok. Sure.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Heh. :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    Absolutely.

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Similarly for other minicomputer companies.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.

    Plausibility is orthogonal to whether a thing is knowable.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Dan Cross on Sun Aug 10 15:18:23 2025
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    <snip>

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer. Considerable
    internal resources were being applied to the Jupiter project
    at the end of the 1970s to support a wider range of applications.

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    Fundamentally, 36-bit words ended up being a dead-end.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Sun Aug 10 21:01:50 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started,

    That is true. Reading https://acg.cis.upenn.edu/milom/cis501-Fall11/papers/cocke-RISC.pdf
    (I liked the potential tongue-in-cheek "Regular Instruction
    Set-Computer" name for their instruction set).

    and even
    fewer would have believed it absent a working prototype,

    The simulation approach that IBM took is interesting. They built
    a fast simulator, translating one 801 instruction into one (or
    several) /370 instructions on the fly, with a fixed 32-bit size.
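
    A toy C sketch of that translate-as-you-go idea (entirely my own
    construction, with an invented encoding; IBM's simulator emitted /370
    instructions where this sketch merely interprets):

        #include <stdint.h>

        enum { NREGS = 32 };
        static uint32_t reg[NREGS];          /* toy register file, not the 801's */

        /* Invented fixed 32-bit encoding: 6-bit opcode, three 5-bit registers. */
        #define OPC(w) ((w) >> 26)
        #define RD(w)  (((w) >> 21) & 31)
        #define RS1(w) (((w) >> 16) & 31)
        #define RS2(w) (((w) >> 11) & 31)

        static void simulate(const uint32_t *code, int n)
        {
            for (int pc = 0; pc < n; pc++) {
                uint32_t w = code[pc];
                /* The real simulator generated one or a few /370 instructions
                   per guest instruction here; the fixed-size guest
                   instructions make the decode step trivial. */
                switch (OPC(w)) {
                case 0: reg[RD(w)] = reg[RS1(w)] + reg[RS2(w)]; break; /* add */
                case 1: reg[RD(w)] = reg[RS1(w)] - reg[RS2(w)]; break; /* sub */
                default: return;             /* unhandled opcode: stop */
                }
            }
        }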


    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Mon Aug 11 08:17:48 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the
    hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    The reasons why this once-common architectural style died out are:

    * 18-bit addresses

    * word addressing

    Sure, one could add 36-bit byte addresses to such an architecture
    (probably with 9-bit bytes to make it easy to deal with words), but it
    would force a completely different ABI and API, so the legacy code
    would still have no good upgrade path and would be limited to its
    256KW address space no matter how much actual RAM there is available.
    IBM decided to switch from this 36-bit legacy to the 32-bit
    byte-addressed S/360 in the early 1960s (with support for their legacy
    lines built into various S/360 implementations); DEC did so when they introduced the VAX.

    Concerning other manufacturers:

    <https://en.wikipedia.org/wiki/36-bit_computing> tells me that the
    GE-600 series was also 36-bit. It continued as Honeywell 6000 series <https://en.wikipedia.org/wiki/Honeywell_6000_series>. Honeywell
    introduced the DPS-88 in 1982; the architecture is described as
    supporting the usual 256KW, but apparently the DPS-88 could be bought
    with up to 128MB; programming that probably was no fun. Honeywell
    later sold the NEC S1000 as DPS-90, which does not sound like the
    Honeywell 6000 line was a growing business. And that's the last I
    read about the Honeywell 6000 line.

    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems. <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:

    |In addition to the IX (1100/2200) CPUs [...], the architecture had
    |Xeon [...] CPUs. Unisys' goal was to provide an orderly transition for
    |their 1100/2200 customers to a more modern architecture.

    So they continued to support it for a long time, but it's a legacy
    thing, not a future-oriented architecture.

    The Wikipedia article also mentions the Symbolics 3600 as 36-bit
    machine, but that was quite different from the 36-bit architectures of
    the 1950s and 1960s: The Symbolics 3600 has 28-bit addresses (the rest apparently taken by tags) and its successor Ivory has 32-bit addresses
    and a 40-bit word. Here the reason for its demise was the AI winter
    of the late 1980s and early 1990s.

    DEC did the right thing when they decided to support VAX as *the*
    future architecture, and the success of the VAX compared to the
    Honeywell 6000 and Univac 1100/2200 series demonstrates this.

    RISC-VAX would have been better than the PDP-10, for the same reasons:
    32-bit addresses and byte addressing. And in addition, the
    performance advantage of RISC-VAX would have made the position of
    RISC-VAX compared to PDP-10 even stronger.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Mon Aug 11 14:51:20 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    In a sense, they still live in the Unisys Clearpath systems.


    The reasons why this once-common architectural style died out are:

    * 18-bit addresses

    An issue for PDP-10, certainly. Not so much for the Univac
    systems.



    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems. <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:


    I spent 14 years at Burroughs/Unisys (on the Burroughs side, mainly).

    Yes, two of the six mainframe lines still exist (albeit in emulation);
    one 48-bit, the other 36-bit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Mon Aug 11 17:27:30 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting link, thanks!


    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    They were considering byte-addressability; interesting. It is also
    slightly funny that a 9-bit byte address would be made up of
    30 bits of virtual address and 2 bits of byte address, i.e.
    a 32-bit address in total.

    Fundamentally, 36-bit words ended up being a dead-end.

    Pretty much so. It was a pity for floating-point, where they had
    more precision than the 32-bit words (and especially the horrible
    IBM format).

    But byte addressability and power of two won.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Tue Aug 12 15:02:04 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be a nice improvement, but not as dramatic as an increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that instruction fetching happens concurrently with the microcoded execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780,
    11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Tue Aug 12 15:59:32 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs    MHz    CPI    Machine
      1      5     10     11/780
      4     12.5    6.25  8600
      6     22.2    7.4   8700
     35     90.9    5.1   NVAX+

    SPEC92     MHz   VAX CPI   Machine
      1/1        5    10/10    VAX 11/780
    133/200    200     3/2     Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 is anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.
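
    Spelling out the arithmetic: with the 11/780's 5 MHz and anecdotal 10
    CPI defining 1 VUP as roughly 0.5 native MIPS, each CPI above is just
    MHz divided by (VUPs * 0.5). A small C sketch of that calculation (the
    0.5 MIPS-per-VUP factor is the assumption just stated):

        #include <stdio.h>

        /* CPI from clock rate and relative performance, taking the VAX
           11/780 (5 MHz, ~10 CPI, i.e. ~0.5 MIPS) as 1 VUP. */
        static double cpi(double mhz, double vups)
        {
            return mhz / (vups * 0.5);
        }

        int main(void)
        {
            printf("11/780: %.2f\n", cpi(5.0,   1.0));  /* 10.00 */
            printf("8600:   %.2f\n", cpi(12.5,  4.0));  /*  6.25 */
            printf("8700:   %.2f\n", cpi(22.2,  6.0));  /*  7.40 */
            printf("NVAX+:  %.2f\n", cpi(90.9, 35.0));  /* ~5.19; rounds to 5.1 */
            return 0;
        }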

    Note that the NVAX+ was made in the same process as the 21064, the
    21064 has about twice the clock rate, and has 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    I doubt that they could afford 1-cycle multiply

    Yes, one might do a multiplier and divider with its own sequencer (and
    more sophisticated ones in later implementations), with any user of the
    result stalling the pipeline until that is complete, and any
    following user of the multiplier or divider stalling the pipeline
    until it is free again.

    The idea of providing multiply-step instructions and using a bunch of
    them was short-lived; already the MIPS R2000 included a multiply
    instruction (with its own sequencer); HPPA has multiply-step as well
    as an FPU-based multiply from the start. The idea of avoiding divide instructions had a longer life. MIPS has divide right from the start,
    but Alpha and even IA-64 avoided it. RISC-V includes divide in the M
    extension that also gives multiply.
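
    For readers who have not met multiply-step instructions: each step
    looks at one multiplier bit, conditionally adds the multiplicand into
    the upper half of the partial product, and shifts the pair right by
    one, so a 32x32->64 multiply is 32 such steps plus loop overhead. A C
    rendering of the generic shift-and-add step (my own sketch, not the
    exact semantics of HPPA's or any other ISA's multiply-step):

        #include <stdint.h>

        static uint64_t mul32(uint32_t a, uint32_t b)
        {
            uint64_t hi = 0, lo = b;   /* (hi:lo) is the 64-bit partial product */
            for (int i = 0; i < 32; i++) {
                /* one "multiply step" */
                if (lo & 1)
                    hi += a;                          /* 33-bit sum fits in hi */
                lo = (lo >> 1) | ((hi & 1) << 31);    /* shift the pair right */
                hi >>= 1;
            }
            return (hi << 32) | lo;
        }

    Each loop iteration is what one multiply-step instruction would do in
    a single cycle.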

    or
    even a barrel shifter.

    Five levels of 32-bit 2->1 muxes might be doable, but would that be cost-effective?
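
    For concreteness, the five levels would be conditional shifts by 1, 2,
    4, 8, and 16; a small C model of that structure (each loop iteration
    stands in for one row of 32 two-input muxes selecting shifted or
    unshifted data):

        #include <stdint.h>

        static uint32_t barrel_shl(uint32_t x, unsigned count)
        {
            count &= 31;
            for (unsigned k = 0; k < 5; k++)      /* stages shifting by 1,2,4,8,16 */
                if (count & (1u << k))
                    x <<= (1u << k);              /* mux selects the shifted input */
            return x;
        }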

    It is accepted in this era that using more hardware could
    give a substantial speedup. IIUC IBM used a quadratic rule:
    performance was supposed to be proportional to the square of
    CPU price. That was partly marketing, but partly due to
    compromises needed in smaller machines.

    That's more of a 1960s thing, probably because low-end S/360
    implementations used all (slow) tricks to minimize hardware. In the
    VAX 11/780 environment, I very much doubt that it is true. Looking at
    the early VAXen, you get the 11/730 with 0.3 VUPs up to the 11/784
    with 3.5 VUPs (from 4 11/780 CPUs). sqrt(3.5/0.3)=3.4. I very much
    doubt that you could get an 11/784 for 3.4 times the price of an
    11/730.

    Searching a little, I find

    |[11/730 is] to be a quarter the price and a quarter the performance of
    |a grown-up VAX (11/780) <https://retrocomputingforum.com/t/price-of-vax-730-with-vms-the-11-730-from-dec/3286>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Wed Aug 13 11:25:24 2025
    In article <107b1bu$252qo$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    Sure. I wasn't disputing that, just saying that I don't think
    it mattered that much.

    [snip]
    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Well, then we're definitely into the unknowable. :-)

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Wed Aug 13 14:18:06 2025
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns cycle time, so yes, one could have used that for the VAX.
    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.
    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    I didn't use the typical values. Yes, it would be dangerous to use them.
    I never understood why they even quoted those typical numbers.
    I always considered them marketing fluff.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?

    I wasn't suggesting that. People used to modern CMOS speeds might not appreciate how slow TTL was. I was showing that its 50 ns speed number
    was not out of line with other MSI parts of that day, and I just happened
    to have a PDF TTL manual opened on that part, so I used it as an example.
    A 74181 4-bit ALU is also of similar complexity and 62 ns max.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Wed Aug 13 14:40:01 2025
    Anton Ertl wrote:

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780, 11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton

    Yes I saw the Patt paper recently. He has written many microarchitecture papers. I was surprised that in 1990 he would say on page 2:

    "All VAXes are microcoded. The richness of the instruction set urges that
    the flexibility of microcoded control be employed, notwithstanding the conventional mythology that hardwired control is somehow faster than
    microcode. It is instructive to point out that (1) hardwired control
    produces higher performance execution only in situations where the
    critical path is in the microsequencing function, and (2) that this
    should not occur in VAX implementations if one designs with the
    well-understood (to microarchitects) technique that the next control
    store address must be obtained from information available at the start
    of the current microcycle. A variation of this basic old technique is
    the recently popularized delayed branch present in many ISA architectures introduced in the last few years."

    When he refers to the "mythology that hardwired control is somehow faster"
    he appears to still be using the monolithic "eyes" I referred to earlier
    in that everything must go through a single microsequencer.
    He compares a hardwired sequential controller to a microcoded sequential controller and notes that in that case hardwired is no faster.

    What he is not doing is comparing multiple parallel hardware stages
    to a sequential controller, hardwired or microcoded.

    RISC brings with it the concurrent hardware stages view.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Wed Aug 13 20:23:53 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375s
    (presumably more expensive) or a 74LS377 and an eight-bit
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Fri Aug 15 03:20:56 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be a nice improvement, but not as dramatic as an increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that instruction fetching happens concurrently with the microcoded execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 the VAX's 512-byte page was close to optimal.
    Namely, IIUC the smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for a sophisticated system with
    fine-grained access control. Bigger pages would reduce the
    number of pages. For example, 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.

    _For current machines_ there are reasons to use bigger pages, but
    in VAX times bigger pages almost surely would have led to higher memory
    use and consequently to a higher price for the end user. In effect the
    machine would have been much less competitive.

    BTW: Long ago I saw a message about porting an application from
    VAX to Linux. On the VAX the application ran OK in 1 GB of memory.
    On 32-bit Intel architecture Linux with 1 GB there was excessive
    paging. The reason was the much smaller number of bigger pages.
    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Fri Aug 15 05:07:01 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Fri Aug 15 12:57:35 2025
    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    The VAX was built to be a commercial product. As such, it was
    designed to be successful in the market. But in order to be
    successful in the market, it was important that the designers be
    informed by the business landscape at both the time they were
    designing it, and what they could project would be the lifetime
    of the product. Those are considerations that extend beyond
    the purely technical aspects of the design, and are both more
    speculative and more abstract.

    Consider how the business criteria might influence the technical
    design, and how these might play off of one another: obviously,
    DEC understood that the PDP-11 was growing ever more constrained
    by its 16-bit address space, and that any successor would have
    to have a larger address space. From a business perspective, it
    made no sense to create a VAX with a 16-bit address space.
    Similarly, they could have chosen (say) a 20, 24, or 28 bit
    address space, or used segmented memory, or any number of other
    such decisions, but the model that they did choose (basically a
    flat 32-bit virtual address space: at least as far as the
    hardware was concerned; I know VMS did things differently) was
    ultimately the one that "won".

    Of course, those are obvious examples. What I'm contending is
    that the business<->technical relationship is probably deeper
    and that business has more influence on technology than we
    realize, up to and including the ISA design. I'm not saying
    that the business folks are looking over the engineers'
    shoulders telling them how the opcode space should be arranged,
    but I am saying that they're probably going to engineering with
    broad-strokes requirements based on market analysis and customer
    demand. Indeed, we see examples of this now, with the addition
    of vector instructions to most major ISAs. That's driven by the
    market, not merely engineers saying to each other, "you know
    what would be cool? AVX-512!"

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT
    instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    Of course, they messed some of it up; EDITPC was like the
    punchline of a bad joke, and the ways that POLY was messed up
    are well-known.

    Anyway, I apologize for the length of the post, but that's the
    sort of thing I mean.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to Dan Cross on Fri Aug 15 13:36:12 2025
    On Fri, 15 Aug 2025 12:57:35 -0000 (UTC), Dan Cross wrote:

    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in the
    mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm saying
    that business needs must have, at least in part, influenced the ISA
    design. That is, while mistaken, it was part of the business decision
    process regardless.

    It's not clear to me what the distinction of technical vs. business is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [snip]

    There are also bits of the business requirements in each of the
    descriptions of DEC microprocessor projects on Bob Supnik's site
    that Al Kossow linked to earlier:

    <http://simh.trailing-edge.com/dsarchive.html>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Waldek Hebisch on Fri Aug 15 15:10:58 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be a nice improvement, but not as dramatic as an increase
    from 2 KB to 12 KB.

    The handbook is:
    https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.
    Section 2.7 also mentions an 8-byte instruction buffer, and that the
    instruction fetching happens concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 the VAX's 512-byte page was close to optimal.
    Namely, IIUC the smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for a sophisticated system with
    fine-grained access control. Bigger pages would reduce the
    number of pages. For example, 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.

    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sun Aug 17 06:16:08 2025
    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.

    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to BGB on Sun Aug 17 10:00:56 2025
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."
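
    The work such a software refill handler does is itself small; the
    expense discussed above is the trap, drain and reload around it. A
    hedged C sketch of a two-level walk (the table layout, field names and
    the tlb_write() hook are hypothetical, not MIPS's actual handler,
    which is a handful of assembly instructions):

        #include <stdint.h>

        #define PAGE_SHIFT    12
        #define PTES_PER_PAGE 1024            /* assumed: 4 KB pages, 4-byte PTEs */
        #define PTE_VALID     0x1u

        /* Hypothetical privileged operation that writes one TLB entry. */
        extern void tlb_write(uint32_t vpn, uint32_t pte);

        /* Two-level table walk as a software TLB-miss handler would do it. */
        void tlb_refill(uint32_t *root, uint32_t miss_va)
        {
            uint32_t vpn = miss_va >> PAGE_SHIFT;
            uint32_t l1 = root[vpn / PTES_PER_PAGE];     /* first memory access */
            if (!(l1 & PTE_VALID))
                return;                  /* a real handler raises a page fault */

            /* For the sketch, treat the entry as a directly usable pointer. */
            uint32_t *l2 = (uint32_t *)(uintptr_t)(l1 & ~(uint32_t)0xFFF);
            uint32_t pte = l2[vpn % PTES_PER_PAGE];      /* second memory access */
            if (!(pte & PTE_VALID))
                return;

            tlb_write(vpn, pte);         /* install, then return from the trap */
        }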

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sun Aug 17 15:21:38 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similarly to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.
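
    The kind of loop that produces this looks roughly as follows (one
    plausible pessimal ordering for row-major double arrays, not
    necessarily the exact code measured): the innermost index strides a
    whole 8000-byte row in both c and a, so each iteration touches two
    pages that a small TLB is unlikely to hold.

        #include <stddef.h>

        #define N 1000

        void matmul_pessimal(double c[N][N], const double a[N][N],
                             const double b[N][N])
        {
            /* Innermost loop over i: c[i][j] and a[i][k] both stride a full
               row (8000 bytes, i.e. about two 4 KB pages) per iteration,
               while b[k][j] stays put -- hence ~2 TLB misses per iteration. */
            for (size_t j = 0; j < N; j++)
                for (size_t k = 0; k < N; k++)
                    for (size_t i = 0; i < N; i++)
                        c[i][j] += a[i][k] * b[k][j];
        }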

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Sun Aug 17 13:35:03 2025
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed test it and partially function test it.
    It's programmed by blowing internal fuses, which is a one-shot thing,
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375s
    (presumably more expensive) or a 74LS377 and an eight-bit
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
    wired-AND logic driving an invertor for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an invertor,
    to form the second output NAND layer.

    inst buf <15:8> <7:0>
    | | | |
    4:16 4:16 4:16 4:16
    vvvv vvvv vvvv vvvv
    10k ---|---|---|---|------>INV->
    10k ---------------------->INV->
    10k ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory was so expensive that
    we could not afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103, 1k*1b.
    The 4 kb DRAMs were just making it to customers; 16 kb parts were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Sun Aug 17 19:10:21 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there are not enough resources left for
    the TLB miss handler.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jakob Bohm@21:1/5 to Kaz Kylheku on Sun Aug 17 20:18:36 2025
    XPost: comp.lang.c

    On 2025-08-05 23:08, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

    They invented an identifier which lands in the _[A-Z].* namespace
    designated as reserved by the standard.

    What would be an example of a more conservative way to name the
    identifier?


    What is actually going on is GCC offering its users a gradual way to
    transition from C17 to C23, by applying the C23 meaning of any C23
    construct that has no conflicting meaning in C17. In particular, this
    allows installed library headers to use the new types as part of
    logically opaque (but compiler visible) implementation details, even
    when those libraries are used by pure C17 programs. For example, the
    ISO POSIX datatype struct stat could contain a _BitInt(128) type for
    st_dev or st_ino if the kernel needs that, as was the case with the 1996
    NT kernel. Or a _BitInt(512) for st_uid as used by that same kernel.
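
    A minimal sketch of the kind of header this enables, assuming the
    compiler advertises _BitInt support through the predefined
    __BITINT_MAXWIDTH__ macro (as recent GCC and clang do); the names
    below are made up for illustration and are not the real <sys/stat.h>:

    /* mystat.h -- illustrative sketch only */
    #if defined(__BITINT_MAXWIDTH__) && __BITINT_MAXWIDTH__ >= 128
    typedef _BitInt(128) my_dev_t;               /* wide device number   */
    #else
    typedef struct { unsigned long long lo, hi; } my_dev_t;  /* fallback */
    #endif

    struct my_stat {
        my_dev_t st_dev;   /* logically opaque to portable C17 callers */
        /* ... other fields ... */
    };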

    GCC --pedantic is an option to check if a program is a fully conforming portable C program, with the obvious exception of the contents of any
    used "system" headers (including installed libc headers), as those are
    allowed to implement standard or non-standard features in implementation specific ways, and might even include implementation specific logic to
    report the use of non-standard extensions to the library standards when
    the compiler is invoked with --pedantic and no contrary options.

    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C instead
    of GNUC reverts those to the standard definition.

    Enjoy

    Jakob

    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Mon Aug 18 05:48:00 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [...]

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    I had actually forgotten that the VAX also had decimal
    instructions. But the 11/780 also had one really important
    restriction: It could only do one write every six cycles, see https://dl.acm.org/doi/pdf/10.1145/800015.808199 , so that
    severely limited their throughput there (assuming they did
    things bytewise). So yes, decimal arithmetic was important
    in the day for COBOL and related commercial applications.

    So, what to do with decimal arithmetic, which was important
    at the time (and a business consideration)?

    Something like Power's addg6s instruction could have been
    introduced, it adds two numbers together, generating only the
    decimal carries, and puts a nibble "6" into the corresponding
    nibble if there is one, and "0" otherwise. With 32 bits, that
    would allow addition of eight-digit decimal numbers in four
    instructions (see one of the POWER ISA documents for details),
    but the cycle of "read ASCII digits, do arithmetic, write
    ASCII digits" would have needed some extra shifts and masks,
    so it might have been more beneficial to use four digits per
    register.
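
    To make that concrete, here is a minimal C sketch of the carry trick
    that addg6s accelerates, for eight packed BCD digits in a 32-bit word.
    This is plain portable C, not the POWER instruction itself, and the
    function name is made up; a carry out of the top digit is simply
    dropped:

    #include <stdint.h>

    static uint32_t bcd_add8(uint32_t a, uint32_t b)
    {
        uint64_t t   = (uint64_t)a + 0x66666666u;    /* bias every digit by 6 */
        uint64_t sum = t + b;                        /* plain binary add      */
        /* bit 4*(k+1) of (t ^ b ^ sum) is the carry out of digit k */
        uint64_t carries = (t ^ b ^ sum) & 0x111111110ull;
        uint64_t nocarry = carries ^ 0x111111110ull; /* digits still biased   */
        uint64_t fix = (nocarry >> 2) | (nocarry >> 3);  /* 0x6 in each one   */
        return (uint32_t)(sum - fix);                /* e.g. 0x19 + 0x23 = 0x42 */
    }

    addg6s essentially delivers those 0x6 correction nibbles in one
    instruction, which is what makes the short POWER sequence possible.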

    The article above is also extremely interesting otherwise. It does
    not give cycle timings for each individual instruction and address
    mode, but it gives statistics on how they were used, and a good
    explanation of the timing implications of their microcode design.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Heathfield@21:1/5 to Keith Thompson on Mon Aug 18 08:02:30 2025
    XPost: comp.lang.c

    On 18/08/2025 06:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    $ cat so.c
    #include <stdio.h>

    int main(void)
    {
        int foo = 42;
        size_t soa = sizeof (foo, 'C');
        size_t sob = sizeof foo;
        printf("%s.\n", (soa == sob) ? "Yes" : "No");
        return 0;
    }
    $ gcc -o so so.c
    $ ./so
    Yes.
    $ gcc --version
    gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Mon Aug 18 11:34:49 2025
    XPost: comp.lang.c

    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?


    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Mon Aug 18 11:03:15 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.

    Yeah, this approach works a lot better than people seem to give it
    credit for...
    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.
    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    None of those research papers that I have seen consider the possibility
    that OoO can make use of multiple concurrent HW walkers if the
    cache supports hit-under-miss and multiple pending miss buffers.

    While instruction fetch only needs to translate a VA occasionally, one
    at a time, with more aggressive alternate-path prefetching all those VAs
    have to be translated first before the buffers can be prefetched.
    The LSQ could also potentially be translating as many VAs as there are entries.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.

    - anton

    I'm looking for papers that separate out the common cost of loading a PTE
    from the extra cost of just the SW-miss handler. I had a paper a while
    back but can't find it now. IIRC in that paper the extra cost of the
    SW miss handler on Alpha was measured at 5-25%.

    One thing to mention about some of these papers looking at TLB performance:
    some papers on virtual address translation appear NOT to be aware
    that Intel's HW walker, on its downward walk, caches the interior-node
    PTE's in auxiliary TLB's and checks those for hits in bottom-to-top order
    (called a bottom-up walk), thereby avoiding many HW walks from the root.

    A SW walker can accomplish the same bottom-up walk by locating
    the different page table levels at *virtual* base addresses,
    and adding each VA of those interior PTE's to the TLB.
    This is what VAX VA translate did, probably Alpha too but I didn't check.
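
    A minimal C sketch of such a bottom-up handler, assuming the page
    tables are self-mapped at a fixed virtual base, an 8 KB page size, and
    a tlb_insert() fill primitive; all of these names and numbers are
    illustrative, and the root level is assumed to be pinned by a wired
    TLB entry so the nesting bottoms out:

    #include <stdint.h>

    #define PAGE_SHIFT 13
    #define PT_BASE    0xFFFFFE0000000000ull   /* hypothetical self-map base */

    typedef uint64_t pte_t;

    extern void tlb_insert(uint64_t va, pte_t pte);  /* assumed TLB-fill op */

    void tlb_miss(uint64_t miss_va)
    {
        /* virtual address of the leaf PTE that maps miss_va */
        uint64_t vpn    = miss_va >> PAGE_SHIFT;
        uint64_t pte_va = PT_BASE + vpn * sizeof(pte_t);

        /* This load goes through the TLB.  If the interior PTE mapping
           pte_va is already cached (the common case), the walk is a single
           load; if not, this load misses and the same handler runs again
           one level up the tree. */
        pte_t pte = *(volatile pte_t *)(uintptr_t)pte_va;

        tlb_insert(miss_va, pte);
    }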

    This interior PTE node caching is critical for optimal performance
    and some of their stats don't take it into account
    and give much worse numbers than they should.

    Also many papers were written before ASID's were in common use
    so the TLB got invalidated with each address space switch.
    This would penalize any OS which had separate user and kernel space.

    So all these numbers need to be taken with a grain of salt.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Mon Aug 18 15:35:36 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal
    reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Mon Aug 18 17:19:13 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want >atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0; they were
    independent hardware features added in V8.1.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Wed Aug 20 03:47:17 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs    MHz    CPI    Machine
      1      5     10     11/780
      4     12.5    6.25  8600
      6     22.2    7.4   8700
     35     90.9    5.1   NVAX+

    SPEC92     MHz   VAX CPI   Machine
      1/1        5   10/10     VAX 11/780
    133/200    200    3/2      Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 is anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.

    Note that the NVAX+ was made in the same process as the 21064; the
    21064 has about twice the clock rate and 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    Prism paper says the following about RISC versus VAX performance:

    : 1. Shorter cycle time. VAX chips have more, and longer, critical
    : paths than RISC chips. The worst VAX paths are the control store
    : loop and the variable length instruction decode loop, both of
    : which are absent in RISC chips.

    : 2. Fewer cycles per function. Although VAX chips require fewer
    : instructions than RISC chips (1:2.3) to implement a given
    : function, VAX instructions take so many more cycles than RISC
    : instructions (5-10:1-1.5) that VAX chips require many more cycles
    : per function than RISC chips.

    : 3. Increased pipelining. VAX chips have more inter- and
    : intra-instruction dependencies, architectural irregularities,
    : instruction formats, address modes, and ordering requirements
    : than RISC chips. This makes VAX chips harder and more
    : complicated to pipeline.

    Point 1 above for me means that VAX chips were microcoded. Point
    2 above suggests that there were limited changes compared to the
    VAX-780 microcode.

    IIUC attempts to create better hardware for the VAX were canceled
    just after the PRISM memos, so later VAXes used essentially the same
    logic, just rescaled to a better process.

    I think that the VAX had a problem with hardware decoders because of
    gate delay: in 1987 a hardware decoder would probably have slowed down
    the clock. But the 1977 design looks quite relaxed to me: the main
    logic was Schottky TTL, which nominally has 3 ns of inverter delay.
    With a 200 ns cycle this means about 66 gate delays per cycle. And in
    critical paths the VAX used ECL. I do not know exactly which ECL, but
    AFAIK 2 ns ECL was commonly available in 1970 and 1 ns ECL was leading
    edge in 1970.

    That is why I think that in 1977 a hardware decoder could give a
    speedup, assuming that the execution units could keep up: the gate
    delay and cycle time mean that a rather deep circuit could fit within
    the cycle time. IIUC 1987 designs were much more aggressive and the
    decoder delay probably could not fit within a single cycle.

    Quite possibly the hardware designers attempting VAX hardware
    decoders were too ambitious and wanted to decode too-complicated
    instructions in one cycle. AFAICS for instructions that cannot
    be executed in one cycle the decode can also be slower than one
    cycle; all one needs is to recognize within one cycle
    that decode will take multiple cycles.

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Wed Aug 20 14:36:43 2025
    Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    Not quite.
    My idea was to have two HW threads HT1 and HT2 which are like x86 HW
    threads except when HT1 gets a TLB miss it stalls its execution and
    injects the TLB miss handler at the front of HT2 pipeline,
    and a HT2 TLB miss stalls itself and injects its handler into HT1.
    The TLB miss handler never itself TLB misses as it explicitly checks
    the TLB for any VA it needs to translate so recursion is not possible.

    As the handler is injected at the front of the pipeline, no drain occurs.
    The only possible problem is if, between when HT1 injects its miss handler
    into HT2, HT2's existing pipeline code then also does a TLB miss.
    As this would cause a deadlock, if it occurs the core detects it
    and both HTs fault and run their TLB miss handlers themselves.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want
    atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0, they were independent hardware features added to V8.1.

    Yes. A memory recycler can periodically clear the Accessed bit
    so it can detect page usage, and that might be a different core.
    But it might skip sending TLB shootdowns to all other cores
    to lower the overhead (maybe a lazy usage detector).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Wed Aug 20 16:41:39 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    Hmmm... I don't think that is possible, or if it is then it's really hairy.
    The miss handler needs to LD the memory PTE's, which can happen OoO.
    But it also needs to do things like writing control registers
    (e.g. the TLB) or setting the Accessed or Dirty bits on the in-memory PTE, things that usually only occur at retire. But those handler instructions
    can't get to retire because the older instructions that triggered the
    miss are stalled.

    The miss handler needs general registers so it needs to
    stash the current content someplace and it can't use memory.
    Then add a nested miss handler on top of that.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    As Scott said, to avoid race conditions with software clearing those bits.
    Plus there might be PTE modifications that an OS could perform on other
    PTE fields concurrently without first acquiring the normal mutexes
    and doing a TLB shoot down of the PTE on all the other cores,
    provided they are done atomically so the updates of one core
    don't clobber the changes of another.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    A HW walker looks simple to me.
    It has a few bits of state number and a couple of registers.
    It needs to detect memory read errors if they occur and abort.
    Otherwise it checks each TLB level in backwards order using the
    appropriate VA bits, and if it gets a hit walks back down the tree
    reading PTE's for each level and adding them to their level TLB,
    checking it is marked present, and performing an atomic OR to set
    the Accessed and Dirty flags if they are clear.

    The HW walker is even simpler if the atomic OR is implemented directly
    in the cache controller as part of the Atomic Fetch And OP series.

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton

    Yes, and it seems to me that one would spend a lot more time trying to
    fix the SW walker than doing the simple HW walker that just works.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to BGB on Wed Aug 20 19:17:01 2025
    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4 kbit DRAMs were just making it to customers; 16 kbit parts were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, needing a lot less logic for shifting prefetch buffers
    compared to, say, variable lengths of 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    A 4-bit register field allows many 2-byte accumulate-style instructions
    (where a register is both source and dest):
    an 8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate of 1-8 bytes.

    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest
    conditional branch would be 3 bytes as it needs a register specifier.

    If one is doing variable byte-length instructions, then the
    highest-frequency instructions can be made as compact as possible.
    E.g. an ADD with a 32-bit immediate in 6 bytes.

    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.
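
    As a rough software model of what that shifter has to do, here is a
    behavioural C sketch of a byte-granular collapsing fetch buffer; the
    names and sizes are illustrative, and the memmove stands in for the
    byte cross-bar whose gate count is at issue (a fixed 32-bit ISA only
    ever discards whole aligned words):

    #include <stdint.h>
    #include <string.h>

    struct ibuf {
        uint8_t  bytes[16];   /* 16-byte fetch window         */
        unsigned valid;       /* number of valid bytes, 0..16 */
    };

    /* drop one decoded instruction of 'len' bytes and slide the rest down */
    static void ibuf_consume(struct ibuf *b, unsigned len)
    {
        memmove(b->bytes, b->bytes + len, b->valid - len);
        b->valid -= len;
        /* a real front end would now refill bytes[valid..15] from the I-cache */
    }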

    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less encoding
    space than a 4-bit R8..R23 field (saving 1 bit in the relevant scenarios).



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Thu Aug 21 16:21:37 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Consider "virgin" page, that is neither accessed nor modified.
    Intruction 1 reads the page, instruction 2 modifies it. After
    both are done you should have both bits set. But if miss handling
    for instruction 1 reads page table entry first, but stores after
    store fomr instruction 2 handler, then you get only accessed bit
    and modified flag is lost. Symbolically we could have

    read PTE for instruction 1
    read PTE for instruction 2
    store PTE for instruction 2 (setting Accessed and Modified)
    store PTE for instruction 1 (setting Accessed but clearing Modified)
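
    A minimal C11 sketch of the difference, with made-up flag bit
    positions: the plain read-modify-write is exactly the lost-update
    sequence above, while an atomic OR can only add bits, so neither
    walker's update is lost:

    #include <stdatomic.h>
    #include <stdint.h>

    #define PTE_ACCESSED  (UINT64_C(1) << 0)   /* illustrative bit layout */
    #define PTE_MODIFIED  (UINT64_C(1) << 1)

    static void pte_set_flags_racy(uint64_t *pte, uint64_t flags)
    {
        uint64_t old = *pte;   /* both walkers may read the same old value      */
        *pte = old | flags;    /* the later store wins, dropping the other's bit */
    }

    static void pte_set_flags_atomic(_Atomic uint64_t *pte, uint64_t flags)
    {
        atomic_fetch_or_explicit(pte, flags, memory_order_relaxed);
    }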

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Waldek Hebisch on Thu Aug 21 19:26:47 2025
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...), but it would have
    run rings around the VAX.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Thomas Koenig on Fri Aug 22 16:36:09 2025
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (altough it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But the low numbers given
    for early RISC chips are IMO misleading: RISC became commercially
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions. Also, to fit the
    design into a single chip, designers moved some functionality
    like the bus interface to support chips. A RISC processor with
    mixed 16-32 bit instructions (needed to get reasonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller, is much more than the
    100 thousand transistors cited for early workstation chips.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Fri Aug 22 16:45:56 2025
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    It also seems rather high for the /91. I can't find any authoritative
    numbers, but 100K seems more likely. It was SLT, individual transistors
    mounted a few to a package. The /91 was big, but it wasn't *that* big.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Waldek Hebisch on Fri Aug 22 17:21:17 2025
    Waldek Hebisch <antispam@fricas.org> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (altough it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But low numbers given
    for early RISC chips are IMO misleading: RISC become comercialy
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions.

    Like the multiply instruction in ARM2.

    Also, to fit
    design into a single chip designers moved some functionality
    like bus interface to support chips. RISC processor with
    mixed 16-32 bit instructions (needed to get resonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller is much more than
    100 thousend transitors cited for early workstation chips.

    Yep, FP support can be expensive and was an extra option
    on the VAX, which also included integer multiply.

    However, I maintain that a ~1977 supermini with a similar sort
    of bus, MMU, floating point unit etc. to the VAX, but with an
    architecture similar to ARM2, plus separate icache and dcache, would
    have beaten the VAX hands-down in performance - it would have taken
    fewer chips to implement, less power, and possibly less time to develop.
    HP showed this was possible some time later.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to John Levine on Sat Aug 23 16:38:47 2025
    John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    It's also seems rather high for the /91. I can't find any authoritative numbers but 100K seems more likely. It was SLT, individual transistors mounted a few to a package. The /91 was big but it wasn't *that* big.

    I remember this number, but do not remember where I found it. So
    it may be wrong.

    However, one can estimate possible density in a different way: a package,
    probably of similar dimensions as the VAX package, can hold about 100 TTL
    chips. I do not have detailed data about chip usage and transistor
    counts for each chip. A simple NAND gate is 4 transistors, but the input
    transistor has two emitters and really works like two transistors,
    so it is probably better to count it as 2 transistors, and consequently
    consider a 2-input NAND gate as having 5 transistors. So a 74S00 gives
    20 transistors. A D-flop is probably about 20-30 transistors, so a
    74S74 is probably around 40-60. A quad D-flop brings us close to 100.
    I suspect that in VAX times octal D-flops were available. There
    were 4-bit ALU slices. Also multiplexers need a nontrivial number
    of transistors. So I think that 50 transistors is a reasonable (maybe
    low) estimate of average density. Assuming 50 transistors per chip,
    that would be 5000 transistors per package. Packages were rather
    flat, so when mounted vertically one probably could allocate 1 cm
    of horizontal space for each. That would allow 30 packages at a
    single level. With 7 levels we get 210 packages, enough for
    1 mln transistors.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Mon Aug 25 00:56:26 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
        long i, r;
        for (i=0, r=0; i<n; i++)
            r+=v[i];
        return r;
    }
    arrays:
        MOV   Ri,#0
        MOV   Rr,#0
        VEC   Rt,{}
        LDD   Rl,[Rv,Ri<<3]
        ADD   Rr,Rr,Rl
        LOOP  LT,Ri,Rn,#1
        MOV   R1,Rr
        RET

    7 instructions, 1 instruction-modifier; 8 words.

    long a, b, c, d;

    void globals(void)
    {
        a = 0x1234567890abcdefL;
        b = 0xcdef1234567890abL;
        c = 0x567890abcdef1234L;
        d = 0x5678901234abcdefL;
    }

    globals:
        STD   0x1234567890abcdef,[IP,a]
        STD   0xcdef1234567890ab,[IP,b]
        STD   0x567890abcdef1234,[IP,c]
        STD   0x5678901234abcdef,[IP,d]
        RET

    5 instructions, 13 words, 0 .data, 0 .bss

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret

    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.

    gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following AMD64 code:

    000000001139 <arrays>:
    1139: 48 85 f6 test %rsi,%rsi
    113c: 74 13 je 1151 <arrays+0x18>
    113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
    1142: 31 c0 xor %eax,%eax
    1144: 48 03 07 add (%rdi),%rax
    1147: 48 83 c7 08 add $0x8,%rdi
    114b: 48 39 d7 cmp %rdx,%rdi
    114e: 75 f4 jne 1144 <arrays+0xb>
    1150: c3 ret
    1151: 31 c0 xor %eax,%eax
    1153: c3 ret

    000000001154 <globals>:
    1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
    115b: 56 34 12
    115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
    1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
    116c: 12 ef cd
    116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
    1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
    117d: 90 78 56
    1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
    1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
    118e: 90 78 56
    1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
    1198: c3 ret

    gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following ARM A64 code:

    0000000000000734 <arrays>:
    734: b4000121 cbz x1, 758 <arrays+0x24>
    738: aa0003e2 mov x2, x0
    73c: d2800000 mov x0, #0x0 // #0
    740: 8b010c43 add x3, x2, x1, lsl #3
    744: f8408441 ldr x1, [x2], #8
    748: 8b010000 add x0, x0, x1
    74c: eb03005f cmp x2, x3
    750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
    754: d65f03c0 ret
    758: d2800000 mov x0, #0x0 // #0
    75c: d65f03c0 ret

    0000000000000760 <globals>:
    760: d299bde2 mov x2, #0xcdef // #52719
    764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
    768: f2b21562 movk x2, #0x90ab, lsl #16
    76c: 9100e020 add x0, x1, #0x38
    770: f2cacf02 movk x2, #0x5678, lsl #32
    774: d2921563 mov x3, #0x90ab // #37035
    778: f2e24682 movk x2, #0x1234, lsl #48
    77c: f9001c22 str x2, [x1, #56]
    780: d2824682 mov x2, #0x1234 // #4660
    784: d299bde1 mov x1, #0xcdef // #52719
    788: f2aacf03 movk x3, #0x5678, lsl #16
    78c: f2b9bde2 movk x2, #0xcdef, lsl #16
    790: f2a69561 movk x1, #0x34ab, lsl #16
    794: f2c24683 movk x3, #0x1234, lsl #32
    798: f2d21562 movk x2, #0x90ab, lsl #32
    79c: f2d20241 movk x1, #0x9012, lsl #32
    7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
    7a4: f2eacf02 movk x2, #0x5678, lsl #48
    7a8: f2eacf01 movk x1, #0x5678, lsl #48
    7ac: a9008803 stp x3, x2, [x0, #8]
    7b0: f9000c01 str x1, [x0, #24]
    7b4: d65f03c0 ret

    So, the overall sizes (including data size for globals() on RV64GC) are:

    arrays  globals       Architecture
      28    66 (34+32)    RV64GC
      27    69            AMD64
      44    84            ARM A64

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
    advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

       libc     ksh    pax     ed
    1102054  124726  66218  26226  riscv-riscv32
    1077192  127050  62748  26550  riscv-riscv64


    I guess it can be noted, is the overhead of any ELF metadata being excluded?...

    These are sizes of the .text section extracted with objdump -h. So
    no, these numbers do not include ELF metadata, nor the sizes of other sections. The latter may be relevant, because RV64GC has "immediates"
    in .sdata that other architectures have in .text; however, .sdata can
    contain other things than just "immediates", so one cannot just add the .sdata size to the .text size.

    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.

    The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
    from auto-vectorization).

    There is one other improvement: gcc register allocation has improved
    in recent years to a point where we 1) no longer need explicit
    register allocation for Gforth on AMD64, and 2) with a lot of manual
    help, we could increase the number of stack cache registers from 1 to
    3 on AMD64, which gives some speedups typically in the 0%-20% range in Gforth.

    But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
    which is vectorizable, I still have not been able to get gcc to auto-vectorize it, even with some transformations which should help.
    I have not measured the scalar versions again, but given that there
    were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
    I doubt that I will see consistent speedups with newer gcc (or clang) versions.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Wed Aug 27 00:56:58 2025
    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die area inside the pins.

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but you run out of area when
    you put microcode on the same die. Thus, RISC was
    born. The Mc88100 had a decoder and sequencer that was 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD}, all pipelined.

    Also, PDP-11 compatibility depended on microcode.

    Different address modes mainly.

    Without a microcode engine one would need a parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing the microcode
    engine.

    To summarize, it is not clear to me if RISC in VAX technology
    could be significantly faster than the VAX, especially given the
    constraint of PDP-11 compatibility.

    RISC in MSI TTL logic would not have worked all that well.

    OTOH VAX designers probably felt
    that the CISC nature added significant value: they understood
    that the cost of programming was significant and believed that an
    orthogonal instruction set, in particular allowing complex
    addressing on all operands, made programming simpler.

    Some of us RISC designers believe similarly {about orthogonal
    ISA not about address modes.}

    They
    probably thought that providing reasonably common procedures
    as microcoded instructions made the work of programmers simpler,
    even if the routines were only marginally faster than ordinary
    code.

    We think similarly--but we do not accept µCode being slower
    than SW ISA, or especially compiled HLL.

    Part of this thinking was probably like the "future
    system" motivation at IBM: Digital did not want to produce
    "commodity" systems, they wanted something with unique
    features that customers will want to use.

    s/used/get locked in on/

    Without
    insight into the future it is hard to say that they were
    wrong.

    It is hard to argue that they made ANY mistakes with
    what we know about the world of computers circa 1977.

    It is not hard in 2025.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Wed Aug 27 10:56:31 2025
    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.
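
    The arithmetic behind the two figures is just clock rate divided by CPI; a
    minimal check, using only the 200 ns cycle time and the CPI values above:

    /* 200 ns per cycle is 5 MHz, so the quoted "1 MIP" assumes 5 cycles per
       instruction, while Emer and Clark's 10.5 CPI gives roughly 0.5 MIPS. */
    #include <stdio.h>

    int main(void)
    {
      double cycle_ns  = 200.0;                 /* VAX-11/780 cycle time */
      double clock_mhz = 1000.0 / cycle_ns;     /* = 5 MHz               */
      printf("at  5.0 CPI: %.2f MIPS\n", clock_mhz / 5.0);   /* 1.00 */
      printf("at 10.5 CPI: %.2f MIPS\n", clock_mhz / 10.5);  /* 0.48 */
      return 0;
    }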

    Going through the VAX 780 hardware schematics and various performance
    papers, as near as I can tell it took *at least* 1 clock per instruction byte
    for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte
    immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline stalls.
    The microsequencer has some pipelining, overlapping the read of the next uWord
    with the execute of the current one, which would introduce a branch delay slot
    into the microcode. As it uses the opcode and operand bytes to do N-way
    jumps/calls to uSubroutines, each of those dispatches might have a branch
    delay slot too.

    (Similar issues appear in the MV-8000 uSequencer, except it appears to
    have 2 or maybe 3 microcode branch delay slots.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to MitchAlsup on Thu Aug 28 07:49:31 2025
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If the VAX designers could not afford a pipeline, then it is
    not clear that RISC could have afforded one: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness the Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die area inside the pins.

    Note that most of this is microcode ROM. They complicated the
    logic to get a smaller ROM. For the VAX it was quite different:
    the microcode memory (and the cache) were built from LSI chips;
    LSI was not suitable for logic at that time. Assuming 6-transistor
    static RAM cells, the VAX had 590000 transistors in its microcode memory
    chips (and another 590000 transistors in its cache chips).
    By comparison, one can estimate the VAX logic chips at between 20000
    and 100000 transistors, with the low end looking more likely
    to me. IIUC at least the early VAXes on a "single" chip were slowed
    down by having to go to off-chip microcode memory.
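
    As a quick sanity check on that estimate, the sketch below just divides the
    quoted transistor count back into bits; how that storage was organized into
    microwords is not stated here.

    /* Back-of-envelope check of the 6T-SRAM figure above. */
    #include <stdio.h>

    int main(void)
    {
      const long transistors = 590000;   /* figure quoted above           */
      const int  per_cell    = 6;        /* 6-transistor static RAM cell  */
      long bits = transistors / per_cell;
      printf("%ld transistors / %d per cell = %ld bits (about %ld Kbit)\n",
             transistors, per_cell, bits, bits / 1024);
      return 0;
    }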

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but one runs out of area when
    the microcode is put on the same die. Thus, RISC was
    born. The Mc88100 had a decoder and sequencer that took 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD}, all pipelined.

    Yes, but IIUC the big item was on-chip microcode memory (or the pins
    needed to go to external microcode memory).

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Thu Aug 28 13:39:54 2025
    EricP wrote:
    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.

    Going through the VAX 780 hardware schematics and various performance
    papers, as near as I can tell it took *at least* 1 clock per instruction byte
    for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte
    immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline stalls.
    The microsequencer has some pipelining, overlapping the read of the next uWord
    with the execute of the current one, which would introduce a branch delay slot
    into the microcode. As it uses the opcode and operand bytes to do N-way
    jumps/calls to uSubroutines, each of those dispatches might have a branch
    delay slot too.

    (Similar issues appear in the MV-8000 uSequencer, except it appears to
    have 2 or maybe 3 microcode branch delay slots.)

    I found a description of the 780 instruction buffer parser
    in the Data Path description on bitsavers, and
    it does in fact pull one operand specifier from the IB per clock.
    There is a mux network to handle the various immediate formats in parallel.

    There are conflicting descriptions as to exactly how it handles the
    first operand, whether that is pulled with the opcode or in a separate
    clock, as the IB shifter can only do 1- to 5-byte shifts but an opcode with
    a first operand carrying a 32-bit displacement would be 6 bytes.

    But basically it takes 1 clock for the opcode byte and the first operand
    specifier byte, a second clock if the first opspec has an immediate,
    then 1 clock for each subsequent operand specifier.
    If an operand has an immediate, it is extracted in parallel with its opspec.

    If that is correct, a MOV rs,rd or ADD rs,rd would take 2 clocks to decode,
    and a MOV offset(rs),rd would take 3 clocks to decode.
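
    A minimal sketch of that decode-cost rule, assuming the above reading of the
    780 data-path description is right (the function name and argument encoding
    are only for illustration):

    #include <stdbool.h>
    #include <stdio.h>

    /* has_imm[i] marks operand specifiers that carry an immediate/displacement. */
    int decode_clocks(int n_opspecs, const bool has_imm[])
    {
      if (n_opspecs == 0)
        return 1;                      /* opcode byte alone                      */
      int clocks = 1;                  /* opcode + first operand specifier byte  */
      if (has_imm[0])
        clocks++;                      /* extra clock for the first opspec's imm */
      clocks += n_opspecs - 1;         /* one clock per remaining opspec;
                                          immediates extracted in parallel       */
      return clocks;
    }

    int main(void)
    {
      bool mov_rr[] = { false, false };   /* MOV rs,rd         */
      bool mov_mr[] = { true,  false };   /* MOV offset(rs),rd */
      printf("MOV rs,rd        : %d clocks\n", decode_clocks(2, mov_rr)); /* 2 */
      printf("MOV offset(rs),rd: %d clocks\n", decode_clocks(2, mov_mr)); /* 3 */
      return 0;
    }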

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Sun Aug 31 18:04:44 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
      long i, r;
      for (i=0, r=0; i<n; i++)
        r += v[i];
      return r;
    }

    arrays:
        MOV     R3,#0
        MOV     R4,#0
        VEC     R5,{}
        LDD     R6,[R1,R3<<3]
        ADD     R4,R4,R6
        LOOP    LT,R3,#1,R2
        MOV     R1,R4
        RET


    long a, b, c, d;

    void globals(void)
    {
      a = 0x1234567890abcdefL;
      b = 0xcdef1234567890abL;
      c = 0x567890abcdef1234L;
      d = 0x5678901234abcdefL;
    }

    globals:
        STD     #0x1234567890abcdef,[ip,a-.]
        STD     #0xcdef1234567890ab,[ip,b-.]
        STD     #0x567890abcdef1234,[ip,c-.]
        STD     #0x5678901234abcdef,[ip,d-.]
        RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
             Bytes                             Instructions
    arrays   globals        Architecture      arrays   globals
      28     66 (34+32)     RV64GC              12        9
      27     69             AMD64               11        9
      44     84             ARM A64             11       22
      32     68             My 66000             8        5

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test.

    Size is one thing; sooner or later one has to execute the instructions,
    and here My 66000 needs to execute fewer, while being within spitting
    distance in code size.

    Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions,

    3 for My 66000

    so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.
    * My 66000 uses ST immediate for globals

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)