• Re: VAX

    From Thomas Koenig@21:1/5 to BGB on Wed Jul 30 16:24:40 2025
    BGB <cr88192@gmail.com> schrieb:

    I can't say much for or against VAX, as I don't currently have any
    compilers that target it.

    If you want to look at code, godbolt has a few gcc versions for it.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Thu Jul 31 04:26:27 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but neither the VAX designers nor their competition realized this, and commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex
    instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not expect
    compilers to improve as fast as they did, leading to a machine which was fun to
    program in assembler but full of stuff that was useless to compilers and
    instructions like POLY that should have been subroutines. The 801 project and
    PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
    presumably didn't know about it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    Related to the microcode issue they also don't seem to have anticipated how
    important pipelining would be. Some minor changes to the VAX, like not letting
    one address modify another in the same instruction, would have made it a lot
    easier to pipeline.

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    I looked into the VAX architecture handbook from 1977. The handbook claims
    that the VAX-780 used 96-bit microcode words. That is enough bits to
    control a pipelined machine at 1 instruction per cycle, provided there are
    enough execution resources (register ports, buses and 1-cycle
    execution units). However, the VAX hardware allowed only one memory
    access per cycle, so instructions with multiple memory addresses
    or using indirection through memory by necessity needed multiple
    cycles.

    I must admit that I do not understand why VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte. That
    can be done using 2 inverters and a 4-input NAND gate. For normal
    instructions the lowest bit of the opcode seems to select between 2-
    and 3-operand instructions. For a 1-byte opcode with all
    register arguments the operand specifiers are in predictable places,
    so together a modest number of gates could recognize register-only
    operand specifiers. Of course, to be sure that this is a
    register instruction one needs to look at the opcode. I am
    guessing that VAX fetches the microcode word based on the opcode,
    so this microcode word could conditionally (based on the result
    of the circuit mentioned above) pass the instruction to the pipeline
    and initiate processing of the next instruction, or start
    argument processing. Such a one-cycle conditional branch
    in general may be problematic, but I would be surprised if
    it were problematic for VAX microcode. Namely, it was
    usual for microcode to specify the address of the next microcode
    word. So with a pipeline and a small number of extra gates
    VAX should be able to do register-only instructions in
    1 cycle. Escalating a bit, with a manageable number of
    gates one should be able to recognize operands in
    "deferred mode", "autodecrement mode" and "autoincrement mode".
    For each such input operand the microcode engine could
    insert a load into the pipeline and proceed with the rest of the
    instruction. Similarly, for a write operand the microcode
    could pass the instruction to the pipeline, but also pass a
    special bit changing the destination and insert a store
    after the instruction. Once a given memory operand is
    handled, decoding gates would indicate if this was the last
    memory operand, which would allow either going to the next
    instruction or handling the next memory operand. Together,
    for normal instructions each memory operand should add
    one cycle to the execution time. Also short immediates
    could be handled in a similar way. This leaves some nasty
    cases: longer immediates, displacement and modes with
    double indirection. Displacement could probably be handled
    at the cost of an extra cycle. Other modes probably would
    cost a one or two cycle penalty.
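
    A minimal C sketch of the operand classification just described (the mode
    numbers follow the VAX operand-specifier encoding; the helper names are
    only illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* VAX operand specifier byte: high nibble = addressing mode,
       low nibble = register number. */
    enum {
        MODE_REGISTER      = 0x5,   /* Rn    */
        MODE_REG_DEFERRED  = 0x6,   /* (Rn)  */
        MODE_AUTODECREMENT = 0x7,   /* -(Rn) */
        MODE_AUTOINCREMENT = 0x8    /* (Rn)+ */
    };

    /* The "2 inverters + 4-input NAND" test: true exactly when the
       high 4 bits of the specifier byte are 0101 (register mode). */
    static bool is_register_operand(uint8_t spec)
    {
        return (spec >> 4) == MODE_REGISTER;
    }

    /* Modes that the proposed decoder would crack into one extra
       load or store micro-operation per memory operand. */
    static bool is_simple_memory_operand(uint8_t spec)
    {
        unsigned mode = spec >> 4;
        return mode == MODE_REG_DEFERRED ||
               mode == MODE_AUTODECREMENT ||
               mode == MODE_AUTOINCREMENT;
    }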

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle like in the scheme above).

    Given the actual speed of the VAX, the possibilities seem to be:
    - extra factors slowing both VAX and RISC, like cache
    misses (the VAX architecture handbook says that due to
    misses the cache had an effective access time of 290 ns),
    - VAX designers could not afford a pipeline,
    - maybe VAX designers decided to avoid a pipeline to reduce
    complexity.

    If the VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.
    Also, PDP-11 compatibility depended on microcode.
    Without a microcode engine one would need a parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing the microcode
    engine.

    To summarize, it is not clear to me if a RISC in VAX technology
    could be significantly faster than the VAX, especially given the constraint
    of PDP-11 compatibility. OTOH the VAX designers probably felt
    that the CISC nature added significant value: they understood
    that the cost of programming was significant and believed that an
    orthogonal instruction set, in particular allowing complex
    addressing on all operands, made programming simpler. They
    probably thought that providing reasonably common procedures
    as microcoded instructions made the work of programmers simpler
    even if the routines were only marginally faster than ordinary
    code. Part of this thinking was probably like the "future
    system" motivation at IBM: Digital did not want to produce
    "commodity" systems, they wanted something with unique
    features that customers will want to use. Without
    insight into the future it is hard to say that they were
    wrong.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Jul 31 16:05:14 2025
    According to Waldek Hebisch <antispam@fricas.org>:
    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    That was the plan but the people building Vaxen didn't get the memo
    so even on the original 780, it got different answers with and without
    the optional floating point accelerator.

    If they wanted more accurate results, they should have

    https://simh.trailing-edge.com/docs/vax_poly.pdf

    I must admit that I do not understand why VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte.

    It can, but autoincrement or decrement modes change the contents
    of the register so the operands have to be evaluated in strict
    order or you need a lot of logic to check for hazards and stall.

    In practice I don't think it was very common to do that, except
    for the immediate and absolute address modes which were (PC)+
    and @(PC)+, and which needed to be special cased since they took
    data from the instruction stream. The size of the immediate
    operand could be from 1 to 8 bytes depending on both the instruction
    and which operand of the instruction it was.

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle like in the scheme above).

    Right, but detecting the abnormal cases wasn't trivial.

    R's,
    John
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Thu Jul 31 19:01:36 2025
    John Levine <johnl@taugh.com> writes:
    According to Waldek Hebisch <antispam@fricas.org>:
    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    That was the plan but the people building Vaxen didn't get the memo
    so even on the original 780, it got different answers with and without
    the optional floating point accelerator.

    If they wanted more accurate results, they should have

    https://simh.trailing-edge.com/docs/vax_poly.pdf

    I must admit that I do not understand why VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte.

    It can, but autoincrement or decrement modes change the contents
    of the register so the operands have to be evaluated in strict
    order or you need a lot of logic to check for hazards and stall.

    In practice I don't think it was very common to do that, except
    for the immediate and absolute address modes which were (PC)+
    and @(PC)+, and which needed to be special cased since they took
    data from the instruction stream. The size of the immediate
    operand could be from 1 to 8 bytes depending on both the instruction
    and which operand of the instruction it was.

    Looking at the MACRO-32 source for a focal interpreter, I
    see
    CVTLF 12(SP),@(SP)+
    MOVL (SP)+, R0
    CMPL (AP)+,#1
    MOVL (AP)+,R7
    TSTL (SP)+
    MOVZBL (R8)+,R5
    BICB3 #240,(R8)+,R2
    LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
    LOCC (R8)+,S^#OPN,OPRATRS
    MOVL (SP)+,(R7)[R6]
    CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
    CASE (SP)+,<30$,20$,10$>,-
    LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
    MOVF (SP)+,@(SP)+ ;JUST DO SET

    (SP)+ was far and away the most common. (PC)+ wasn't
    used in that application.

    There were some adjacent dependencies:

    ADDB3 #48,R0,(R9)+ ;PUT AS DIGIT INTO BUFFER
    ADDB3 #48,R1,(R9)+ ;AND NEXT


    and a handful of others. Probably only a single-digit
    percentage of instructions used autoincrement/decrement and only
    a couple used the updated register in the same
    instruction.

    in some of my code from the era, I used auto-decrement frequently,
    mainly to push 8 or 16bit data onto the stack.

    ;
    ; Deallocate Virtual Memory used to buffer records in copy.
    ;
    pushl copy_in_rab+rab$l_ubf ; Record address
    movzwl copy_in_rab+rab$w_usz,-(sp) ; Record size
    pushab 4(sp)
    pushab 4(sp)
    calls #2,g^lib$free_vm ; Get rid of vm
    ret

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Jul 31 19:57:43 2025
    According to Scott Lurndal <slp53@pacbell.net>:
    Looking at the MACRO-32 source for a focal interpreter, I
    see
    CVTLF 12(SP),@(SP)+
    MOVL (SP)+, R0
    CMPL (AP)+,#1
    MOVL (AP)+,R7
    TSTL (SP)+
    MOVZBL (R8)+,R5
    BICB3 #240,(R8)+,R2
    LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
    LOCC (R8)+,S^#OPN,OPRATRS
    MOVL (SP)+,(R7)[R6]
    CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
    CASE (SP)+,<30$,20$,10$>,-
    LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
    MOVF (SP)+,@(SP)+ ;JUST DO SET

    (SP)+ was far and away the most common. (PC)+ wasn't
    used in that application.

    Wow, that's some funky code. The #240 is syntactic sugar for (PC)+
    followed by a byte with 240 (octal) in it. VAX had an immediate
    address mode that could represent 0 to 77 octal so the assembler used
    that for immediates that would fit, (PC)+ if not. The S^#OPN explicitly
    tells it to use the short immediate mode. #^A/;/ is a literal
    semicolon which fits in an immediate.
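
    A tiny C rendering of the assembler's choice described above (the helper
    name is only illustrative):

    #include <stdbool.h>

    /* Short-literal specifiers cover 0..63 (0 to 77 octal); anything
       larger is emitted as an immediate, i.e. autoincrement on PC,
       (PC)+, with the value taken from the instruction stream. */
    static bool fits_short_literal(long value)
    {
        return value >= 0 && value <= 077;
    }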

    There were some adjacent dependencies:

    ADDB3 #48,R0,(R9)+ ;PUT AS DIGIT INTO BUFFER
    ADDB3 #48,R1,(R9)+ ;AND NEXT


    and a handful of others. Probably only a single-digit
    percentage of instructions used autoincrement/decrement and only
    a couple used the updated register in the same
    instruction.

    Right, but it always had to check for it. As I said a few messages ago,
    if they didn't allow register updates to affect other operands, or changed
    the spec so the registers were all updated at the end of the instruction, it wouldn't have affected much code but would have made decoding and pipelining easier.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Thu Jul 31 21:24:29 2025
    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    Looking at the MACRO-32 source for a focal interpreter, I
    see
    CVTLF 12(SP),@(SP)+
    MOVL (SP)+, R0
    CMPL (AP)+,#1
    MOVL (AP)+,R7
    TSTL (SP)+
    MOVZBL (R8)+,R5
    BICB3 #240,(R8)+,R2
    LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
    LOCC (R8)+,S^#OPN,OPRATRS
    MOVL (SP)+,(R7)[R6]
    CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
    CASE (SP)+,<30$,20$,10$>,-
    LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
    MOVF (SP)+,@(SP)+ ;JUST DO SET

    (SP)+ was far and away the most common. (PC)+ wasn't
    used in that application.

    Wow, that's some funky code.

    .TITLE FOCAL MAIN SEGMENT
    ;FOCAL MAIN SEGMENT
    ;DAVE MONAHAN MARCH 1978

    ...

    HEADER: .ASCII /C VAX FOCAL V1.0 /
    DATE: .BLKB 24
    .ASCII / -NOT A DEC PRODUCT/

    I had it on a 9-track from 1980 that Al was nice enough to
    copy to a CD-ROM for me.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to John Levine on Fri Aug 1 02:18:17 2025
    John Levine <johnl@taugh.com> wrote:
    According to Waldek Hebisch <antispam@fricas.org>:

    I must admit that I do not understand why VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte.

    It can, but autoincrement or decrement modes change the contents
    of the register so the operands have to be evaluated in strict
    order or you need a lot of logic to check for hazards and stall.

    My idea was that instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3

    where TMP is special forwarding register in the CPU. AFAICS normal
    forwarding in the pipeline would handle this. In case of

    ADDL R2, (R2)+, R3

    one would need something which we could denote

    MOV (R2)+, TMP
    ADDL R2*, TMP, R3

    where R2* denotes previous value of R2, which introduces extra
    complication, but does not look hard to handle.

    Note that I do _not_ aim at executing complex VAX instructions in
    one cycle. Rather, each memory operand is handled separately
    and they are handled in order.
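
    A C sketch of that cracking step, with a hypothetical micro-op type
    (none of this is taken from an actual VAX implementation):

    #include <stdint.h>

    enum uop_kind { UOP_LOAD, UOP_ALU };
    enum { TMP = 16 };             /* internal forwarding register, one past R15 */

    typedef struct {
        enum uop_kind kind;
        uint8_t src1, src2, dst;   /* architectural register numbers, or TMP */
        int src1_is_old_value;     /* read the pre-increment value ("R2*") */
    } uop;

    /* Crack  ADDL R2, (R2)+, R3  into the two micro-ops described above:
           LOAD (R2)+ -> TMP
           ADDL R2*, TMP -> R3
       The load also performs the autoincrement, so the ALU micro-op must
       see the old value of R2; ordinary forwarding covers TMP. */
    static int crack_addl_mem_second(uop out[2])
    {
        out[0] = (uop){ .kind = UOP_LOAD, .src1 = 2, .dst = TMP };
        out[1] = (uop){ .kind = UOP_ALU,  .src1 = 2, .src2 = TMP, .dst = 3,
                        .src1_is_old_value = 1 };
        return 2;
    }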

    In practice I don't think it was very common to do that, except
    for the immediate and absolute address modes which were (PC)+
    and @(PC)+, and which needed to be special cased since they took
    data from the instruction stream. The size of the immediate
    operand could be from 1 to 8 bytes depending on both the instruction
    and which operand of the instruction it was.

    I considered only popular integer instructions, everything else
    would be handled by microcode at the same speed as the real VAX.
    VAX had a 32-bit bus, so an 8-byte operand needed 2 cycles anyway,
    so slower decoding for such operands would not be a problem.

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle like in the scheme above).

    Right, but detecting the abnormal cases wasn't trivial.

    Maybe I was unclear, but the whole point was that distinguishing
    between normal cases and abnormal ones could be done by
    moderately complex hardware. Also, I am comparing to execution
    time for equivalent functionality: a VAX instruction with 1 memory
    operand would take 2 cycles (the same as the 2 instructions needed
    by a RISC). And I am comparing to early RISC, that is 32-bit
    integer operations. Similar speedup for floating point operations
    or for 64-bit operands would need bigger decoders; handling
    more than 1 memory operand per cycle or going superscalar
    probably would lead to too complex decoders.

    And a little correction: the proposed decoder effectively adds 1 more
    pipeline stage, so a taken jump would be 1 cycle slower than on a
    classic RISC having the same pipeline (and 2 cycles slower than a
    RISC with delayed jumps). OTOH RISC-V compressed instructions
    seem to require a similar decoding stage, so Anton's VAX-RISC-V
    would have similar timing.

    I can understand why DEC abandoned the VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against the
    superscalar machines which were on the horizon. In 1985 they
    probably realized that their features add no value in a world
    using optimizing compilers.

    But after your post I find it more likely that DEC could
    not afford a pipeline for the VAX-780: even with simple instructions
    one has to decide between accessing the register file and using a
    forwarded value, one needs interlocks to wait for cache misses,
    etc.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Fri Aug 1 17:02:33 2025
    BGB <cr88192@gmail.com> writes:
    On 7/30/2025 12:59 AM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex
    instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
    scenario that would not be a reason for going for the VAX ISA.
    ...
    But, if so, it would more speak for the weakness of VAX code density
    than the goodness of RISC-V.

    For the question at hand, what counts is that one can do a RISC that
    is more compact than the VAX.

    And neither among the Debian binaries nor among the NetBSD binaries I
    measured I have found anything consistently more compact than RISC-V
    with the C extension. There is one strong competitor, though: armhf
    (Thumb2) on Debian, which is a little smaller than RV64GC in 2 out of
    3 cases and a little larger in the third case.

    There is, however, a fairly notable size difference between RV32 and
    RV64 here, but I had usually been messing with RV64.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64

    If I were to put it on a ranking (for ISAs I have messed with), it would
    be, roughly (smallest first):
    i386 with VS2008 or GCC 3.x (*1)

    i386 has significantly larger binaries than RV64GC on both Debian and
    NetBSD, also bigger than AMD64 and ARM A64.

    For those who want to see all the numbers in one posting: <2025Jun17.161742@mips.complang.tuwien.ac.at>.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Fri Aug 1 17:25:22 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    The reproducibility did not happen.

    It actually might have been better if the ISA contained instructions
    for the individual steps. According to <http://simh.trailing-edge.com/docs/vax_poly.pdf>

    |For example, POLY specified that in the central an*x+bn step:
    |- The multiply result was truncated to 31b/63b prior to normalization.
    |- The extended-precision multiply result was added to the next coefficient.
    |- The addition result was truncated to 31b/63b prior to normalization and
    | rounding.

    One could specify an FMA instruction for that step like many recent
    ISAs have done, but I think that the reproducibility would be better
    if the truncation was a separate instruction. And of course, all of
    this would require having at least a few registers with extra bits.
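
    For reference, the building block POLY implements is a Horner step; a
    plain C sketch of such an evaluation (the coefficient ordering here is
    illustrative, not necessarily the VAX table order, and the truncation
    rules quoted above are omitted):

    /* Horner's rule: r = c[0]*x^degree + c[1]*x^(degree-1) + ... + c[degree].
       Each iteration is one multiply-add step, which is why an FMA
       instruction (or an FMA plus an explicit truncation instruction)
       could serve as the subroutine building block instead of POLY. */
    static double poly_eval(double x, const double *c, int degree)
    {
        double r = c[0];
        for (int i = 1; i <= degree; i++)
            r = r * x + c[i];
        return r;
    }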

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what
    RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a
    MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of
    conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    I looked into the VAX architecture handbook from 1977. The handbook claims
    that the VAX-780 used 96-bit microcode words. That is enough bits to
    control a pipelined machine at 1 instruction per cycle, provided there are
    enough execution resources (register ports, buses and 1-cycle
    execution units). However, the VAX hardware allowed only one memory
    access per cycle, so instructions with multiple memory addresses
    or using indirection through memory by necessity needed multiple
    cycles.

    I must admit that I do not understand why VAX needed so many
    cycles per instruction.

    It was not pipelined much. Assuming a totally unpipelined machine, an
    ADD3.L R1,R2,R3 instruction might be executed in the following steps:

    decode add3.l
    decode first operand (r1)
    read r1 from the register file | decode second operand (r2)
    read r2 from the register file
    add r1 and r2 | decode r3
    write the result to r3

    That's 6 cycles, and without any cycles for instruction fetching.

    For a 1-byte opcode with all
    register arguments the operand specifiers are in predictable places,
    so together a modest number of gates could recognize register-only
    operand specifiers.

    Yes, but they wanted to implement the VAX, where every operand can be
    anything. If they thought that focusing on register-only instructions
    was the way to go, they would not have designed the VAX, but the IBM
    801. The ISA was designed for a non-pipelined microcoded
    implementation, obviously without any thought given to future
    pipelined implementations, and that's how the VAX 11/780 was
    implemented.

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle like in the scheme above).

    The VAX 11/780 was not pipelined. The VAX 8700/8800 (introduced 1986,
    but apparently the 8800 replaced the 8600 as high-end VAX only
    starting from 1987) was pipelined at the microcode level, like you
    suggest, but despite having a 4.4 times higher clock rate, the 8700
    achieved only 6 VUP, i.e., 6 times the VAX 11/780 performance (the
    8800 just had two CPUs, but each CPU with the same speed). So if the
    VAX 11/780 takes 10 cycles/instruction on average, the VAX 8700 still
    takes 7.4 cycles per instruction on average, whereas typical RISCs
    contemporary with the VAX 8700 required <2 CPI. They needed
    more instructions, but the bottom line was still a big speed
    advantage for the RISCs.

    A few years later, there was the pipelined 91MHz NVAX+ with 35 VUP,
    and, implemented in the same process, the 200MHz 21064 with 106.9
    SPECint92 and 159.6 SPECfp92 (https://ntrs.nasa.gov/api/citations/19960008936/downloads/19960008936.pdf). Note that both VUP and SPEC92 scale relative to the VAX 11/780 (i.e.,
    the 11/780 has 1 VUP and SPEC92 int and fp results of 1). So we see
    that they did not manage to get the NVAX+ up to the same clock rate as
    the 21064 in the same process, and that the performance disadvantage
    of the VAX is even higher than the clock rate disadvantage.

    Given the actual speed of the VAX, the possibilities seem to be:
    - extra factors slowing both VAX and RISC, like cache
    misses (the VAX architecture handbook says that due to
    misses the cache had an effective access time of 290 ns),
    - VAX designers could not afford a pipeline,
    - maybe VAX designers decided to avoid a pipeline to reduce
    complexity.

    Yes to all. And even when they finally pipelined the VAX, it was far
    less effective than for RISCs.

    If the VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    RISCs like the ARM, MIPS R2000, and SPARC implemented a pipelined
    integer instruction set in one chip in 1985/86, with the R2000 running
    at up to 12.5 MHz. At around the same time the MicroVAX 78032 appeared
    with a similar number of transistors (R2000 110,000, 78032 125,000).
    The 78032 runs at 5MHz and has a similar performance to the VAX
    11/780. So for these single-chip implementations, the RISC could be
    pipelined (and clocked higher), whereas the VAX could not*. I expect
    that with the resources needed for the VAX 11/780, a pipelined RISC
    could be implemented.

    * And did the 78032 implement the whole integer instruction set? I
    have certainly read about MicroVAXen that trapped rare instructions
    and implemented them in software.

    Also, PDP-11 compatibility depended on microcode.
    Without a microcode engine one would need a parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing the microcode
    engine.

    The PDP-11 instruction set is relatively simple. I expect that the
    effort for decoding it to the RISC-VAX (whether in hardware or with
    microcode) would not take that many resources.

    To summarize, it is not clear to me if a RISC in VAX technology
    could be significantly faster than the VAX

    They were significantly faster in later technologies, and the IBM 801 demonstrates the superiority of RISC at around the time of the VAX, so
    it is very likely that a pipelined and faster RISC-VAX would have been
    doable with the resources of the VAX.

    Without
    insight into the future it is hard to say that they were
    wrong.

    It's now the past. And now we have all the data to see that the
    result was certainly not very future-proof, and very likely not even
    the best-performing design possible at the time. But ok, they did not
    know better, that's why there's a time-machine involved in my
    scenario.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Sat Aug 2 09:02:37 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    I can understand why DEC abandoned the VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against the
    superscalar machines which were on the horizon. In 1985 they
    probably realized that their features add no value in a world
    using optimizing compilers.

    Optimizing compilers increase the advantages of RISCs, but even with a
    simple compiler Berkeley RISC II (which was made by hardware people,
    not compiler people) has between 85% and 256% of VAX (11/780) speed.
    It also has 16-bit and 32-bit instructions for improved code density
    and (apparently from memory bandwidth issues) performance.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sat Aug 2 15:33:07 2025
    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret

    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.

    gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following AMD64 code:

    000000001139 <arrays>:
    1139: 48 85 f6 test %rsi,%rsi
    113c: 74 13 je 1151 <arrays+0x18>
    113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
    1142: 31 c0 xor %eax,%eax
    1144: 48 03 07 add (%rdi),%rax
    1147: 48 83 c7 08 add $0x8,%rdi
    114b: 48 39 d7 cmp %rdx,%rdi
    114e: 75 f4 jne 1144 <arrays+0xb>
    1150: c3 ret
    1151: 31 c0 xor %eax,%eax
    1153: c3 ret

    000000001154 <globals>:
    1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
    115b: 56 34 12
    115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
    1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
    116c: 12 ef cd
    116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
    1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
    117d: 90 78 56
    1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
    1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
    118e: 90 78 56
    1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
    1198: c3 ret

    gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following ARM A64 code:

    0000000000000734 <arrays>:
    734: b4000121 cbz x1, 758 <arrays+0x24>
    738: aa0003e2 mov x2, x0
    73c: d2800000 mov x0, #0x0 // #0
    740: 8b010c43 add x3, x2, x1, lsl #3
    744: f8408441 ldr x1, [x2], #8
    748: 8b010000 add x0, x0, x1
    74c: eb03005f cmp x2, x3
    750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
    754: d65f03c0 ret
    758: d2800000 mov x0, #0x0 // #0
    75c: d65f03c0 ret

    0000000000000760 <globals>:
    760: d299bde2 mov x2, #0xcdef // #52719
    764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
    768: f2b21562 movk x2, #0x90ab, lsl #16
    76c: 9100e020 add x0, x1, #0x38
    770: f2cacf02 movk x2, #0x5678, lsl #32
    774: d2921563 mov x3, #0x90ab // #37035
    778: f2e24682 movk x2, #0x1234, lsl #48
    77c: f9001c22 str x2, [x1, #56]
    780: d2824682 mov x2, #0x1234 // #4660
    784: d299bde1 mov x1, #0xcdef // #52719
    788: f2aacf03 movk x3, #0x5678, lsl #16
    78c: f2b9bde2 movk x2, #0xcdef, lsl #16
    790: f2a69561 movk x1, #0x34ab, lsl #16
    794: f2c24683 movk x3, #0x1234, lsl #32
    798: f2d21562 movk x2, #0x90ab, lsl #32
    79c: f2d20241 movk x1, #0x9012, lsl #32
    7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
    7a4: f2eacf02 movk x2, #0x5678, lsl #48
    7a8: f2eacf01 movk x1, #0x5678, lsl #48
    7ac: a9008803 stp x3, x2, [x0, #8]
    7b0: f9000c01 str x1, [x0, #24]
    7b4: d65f03c0 ret

    So, the overall sizes (including data size for globals() on RV64GC) are:

    arrays globals Architecture
    28 66 (34+32) RV64GC
    27 69 AMD64
    44 84 ARM A64

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
    advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64


    I guess it can be noted, is the overhead of any ELF metadata being excluded?...

    These are sizes of the .text section extracted with objdump -h. So
    no, these numbers do not include ELF metadata, nor the sizes of other
    sections. The latter may be relevant, because RV64GC has "immediates"
    in .sdata that other architectures have in .text; however, .sdata can
    contain other things than just "immediates", so one cannot just add the
    .sdata size to the .text size.

    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.

    The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
    from auto-vectorization).

    There is one other improvement: gcc register allocation has improved
    in recent years to a point where we 1) no longer need explicit
    register allocation for Gforth on AMD64, and 2) with a lot of manual
    help, we could increase the number of stack cache registers from 1 to
    3 on AMD64, which gives some speedups typically in the 0%-20% range in
    Gforth.

    But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
    which is vectorizable, I still have not been able to get gcc to
    auto-vectorize it, even with some transformations which should help.
    I have not measured the scalar versions again, but given that there
    were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
    I doubt that I will see consistent speedups with newer gcc (or clang)
    versions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Tue Aug 5 22:17:00 2025
    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX. Some
    modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc', but
    when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on the value of rbx from the previous iteration, but
    the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8] and
    [r9+rcx*8]. It does not depend on the previous value of rbx, except for a
    control dependency that hopefully would be speculated around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...
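
    A scalar C sketch of that three-way add, just to make the cross-iteration
    carry chain explicit (word order and the helper name are illustrative;
    this uses the gcc/clang __builtin_add_overflow builtin):

    #include <stddef.h>
    #include <stdint.h>

    /* r = a + b + c over n 64-bit words, least significant word first.
       Each word's result needs the carry produced by the previous word,
       which is the dependency chain in question; the carry between
       iterations can be 0, 1 or 2. */
    static unsigned bigint_add3(uint64_t *r, const uint64_t *a,
                                const uint64_t *b, const uint64_t *c,
                                size_t n)
    {
        unsigned carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t s = a[i];
            unsigned cout = 0;
            cout += __builtin_add_overflow(s, b[i], &s);
            cout += __builtin_add_overflow(s, c[i], &s);
            cout += __builtin_add_overflow(s, (uint64_t)carry, &s);
            r[i] = s;
            carry = cout;
        }
        return carry;
    }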

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Scott Lurndal on Tue Aug 5 20:34:27 2025
    Scott Lurndal <scott@slp53.sl.home> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    I can understand why DEC abandoned the VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against the
    superscalar machines which were on the horizon. In 1985 they
    probably realized that their features add no value in a world
    using optimizing compilers.

    Optimizing compilers increase the advantages of RISCs, but even with a
    simple compiler Berkeley RISC II (which was made by hardware people,
    not compiler people) has between 85% and 256% of VAX (11/780) speed.
    It also has 16-bit and 32-bit instructions for improved code density
    and (apparently from memory bandwidth issues) performance.

    The basic question is if VAX could afford the pipeline. VAX had a
    rather complex memory and bus interface, and the cache added complexity
    too. Ditching microcode could allow more resources for the execution
    path. Clearly VAX could afford and probably had a 1-cycle 32-bit
    ALU. I doubt that they could afford a 1-cycle multiply or
    even a barrel shifter. So they needed a sequencer for sane
    assembly programming. I am not sure what technology they used
    for the register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port. A multiported register file
    probably would need a lot of separate register chips and a
    multiplexer. Alternatively, they could try some very fast
    RAM and run it at a multiple of the base clock frequency (66 ns
    cycle time caches were available at that time, so 3 ports
    via multiplexing seem possible). But any of this adds
    considerable complexity. A sane pipeline needs interlocks
    and forwarding.

    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Using the terminology of the late seventies, VAX was a mixture of SSI,
    MSI and LSI chips. I am not sure if VAX used them, but there
    were 4-bit TTL ALU chips; 8 such chips would give a 32-bit ALU
    (for better speed one would add carry propagation chips,
    which would increase the chip count).

    Probably only memory used LSI chips. That could add a bias
    toward microcode: microcode used the densest MOS chips (memory) and
    replaced less dense random TTL logic. After switching to CMOS,
    on-chip logic was more comparable to memory, so the balance
    shifted.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Wed Aug 6 00:21:25 2025
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on the value of rbx from the previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for a control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...

    Terje



    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay the
    heavy price of a branch misprediction. But outside of specially crafted
    inputs it is extremely rare.
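
    A scalar C rendering of that prediction (purely illustrative, again using
    the gcc/clang __builtin_add_overflow builtin): the common path assumes the
    incoming carry does not produce a new carry, and the rarely taken branch
    corresponds to the misprediction case.

    #include <stdint.h>

    /* One word of the three-way add. The carry-out is predicted from
       a + b + c alone; adding carry_in generates an extra carry only
       on the rarely taken branch. */
    static unsigned add3_word_predicted(uint64_t *dst, uint64_t a, uint64_t b,
                                        uint64_t c, unsigned carry_in)
    {
        uint64_t s = a;
        unsigned carry_out = 0;
        carry_out += __builtin_add_overflow(s, b, &s);
        carry_out += __builtin_add_overflow(s, c, &s);
        if (__builtin_add_overflow(s, (uint64_t)carry_in, &s))
            carry_out += 1;        /* rare: the prediction was wrong */
        *dst = s;
        return carry_out;
    }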

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Michael S on Tue Aug 5 21:13:50 2025
    XPost: comp.lang.c

    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 2025-08-04 15:03, Michael S wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    ...
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt
    to use it has undefined behavior. That's exactly why new keywords
    are often defined with that ugly syntax.


    That is a language lawyer's type of reasoning. Normally gcc
    maintainers are wiser than that because, well, by chance gcc
    happens to be a widely used production compiler. I don't know why
    this time they chose the less conservative road.

    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    (I don't know that to be true; an extension has to be documented other
    than by omission. But anyway, if the GCC documentation says somewhere
    something like, "no other identifier is reserved in this version of
    GCC", then it means that the remaining portions of the reserved
    namespaces are available to the program. Since it is undefined behavior
    to use those identifiers (or in certain ways in certain circumstances,
    as the case may be), being able to use them with the documentation's
    blessing constitutes use of a documented extension.)

    I would guess, up until this calendar year.
    Introducing a new extension without a way to disable it is different from supporting gradually introduced extensions, typically with names that
    start with a double underscore and often starting with __builtin.

    __builtin is also in a standard-defined reserved namespace: the double
    underscore namespace. It is no more or less conservative to name
    something __BitInt than to name it _BitInt.


    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Keith Thompson on Tue Aug 5 21:25:17 2025
    XPost: comp.lang.c

    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a
    library. Libraries are provided by other projects: Glibc, Musl,
    ucLibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    It would be unthinkable for GCC to introduce, say, an extension
    using the identifier __libc_malloc.

    In addition to libraries, if some other important project that serves as
    a base package in many distributions happens to claim identifiers in
    those spaces, it wouldn't be wise for GCC (or the C libraries) to start
    taking them away.

    You can't just rename the identifier out of the way in the offending
    package, because that only fixes the issue going forward. Older versions
    of the package can't be compiled with the new compiler without a patch. Compiling older things with newer GCC happens.

    There are always the questions:

    1. Is there an issue? Is anything broken?

    2. If so, is what is broken important such that it becomes a showstopper
    if the compiler change is rolled out (major distros are on fire?)

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tkoenig@netcologne.de on Tue Aug 5 17:41:30 2025
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for the register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.

    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Keith Thompson on Wed Aug 6 04:31:59 2025
    XPost: comp.lang.c

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will break
    any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library using _CreamPuff as an identifier, or of a compiler which misbehaves when a
    program uses it, on grounds of it being undefined behavior. Someone
    using _CreamPuff in their code is taking a risk that is vanishingly
    small, the same way that introducing _BitInt is a risk that is
    vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced some identifier is vastly larger than the audience of implementations that a
    given program will face that has introduced some funny identifier.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Wed Aug 6 05:53:22 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Kaz Kylheku on Wed Aug 6 11:48:09 2025
    XPost: comp.lang.c

    On Wed, 6 Aug 2025 04:31:59 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com>
    wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring
    to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will
    break any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.


    Exactly.
    The World is a very big place. Even nowadays it is not completely
    transparent. Even those parts that are publicly visible in theory have
    not necessarily been observed recently by any single person, even if
    the person in question is Keith.
    Besides, as far as I understand, the majority of gcc users haven't yet
    migrated to gcc 14 or 15.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library
    using _CreamPuff as an identifier, or of a compiler which misbehaves
    when a program uses it, on grounds of it being undefined behavior.
    Someone using _CreamPuff in their code is taking a risk that is
    vanishingly small, the same way that introducing _BitInt is a risk
    that is vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced
    some identifier is vastly larger than the audience of implementations
    that a given program will face that has introduced some funny
    identifier.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Wed Aug 6 11:10:46 2025
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Wed Aug 6 16:19:11 2025
    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?

    No, you are not. I skipped pretty much all the setup code. :-)


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when it is coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on value of rbx from previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...


    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay the
    heavy price of a branch misprediction. But outside of specially crafted
    inputs it is extremely rare.

    Aha!

    That's _very_ nice.
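
    For concreteness, a minimal C sketch of the trick (invented names,
    not the actual loop under discussion): the carry handed to the next
    limb is computed from a[i]+b[i]+c[i] alone, and the incoming carry
    only changes it through a rarely taken branch that the hardware can
    speculate past.

        #include <stddef.h>
        #include <stdint.h>

        /* dst = a + b + c over n 64-bit limbs, carry chain broken by
         * predicting that the incoming carry adds no extra carry-out */
        void add3(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                  const uint64_t *c, size_t n)
        {
            unsigned carry = 0;               /* 0..2 after a 3-way add */
            for (size_t i = 0; i < n; i++) {
                uint64_t s = a[i] + b[i];
                unsigned cout = (s < a[i]);   /* carry out of a+b       */
                uint64_t t = s + c[i];
                cout += (t < s);              /* carry out of a+b+c     */
                uint64_t r = t + carry;       /* fold in incoming carry */
                if (r < t)                    /* the rare misprediction */
                    cout += 1;
                dst[i] = r;
                carry = cout;
            }
        }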

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to George Neuner on Wed Aug 6 10:23:26 2025
    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for the register file. To me, fast RAM seems most likely, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.
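
    A small C model of that arrangement, just to make it concrete (a
    sketch only, with invented names; 16 registers of 32 bits as on the
    780): two identical banks provide the two read ports, and the single
    write at the end of the cycle goes to both, so the copies never
    diverge.

        #include <stdint.h>

        enum { NREGS = 16 };

        typedef struct {
            uint32_t bank_a[NREGS];   /* copy read by port A */
            uint32_t bank_b[NREGS];   /* copy read by port B */
        } regfile_2r1w;

        /* read phase: the two ports may address different registers */
        static uint32_t read_a(const regfile_2r1w *rf, unsigned r)
        { return rf->bank_a[r]; }
        static uint32_t read_b(const regfile_2r1w *rf, unsigned r)
        { return rf->bank_b[r]; }

        /* write phase: one write port, applied to both copies */
        static void write_reg(regfile_2r1w *rf, unsigned r, uint32_t v)
        { rf->bank_a[r] = v; rf->bank_b[r] = v; }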

    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780's second read port is likely used the same as on later VAXen:
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode
    bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDL3 (r2)+, (r2)+, (r2)+

    the first (left) operand reads r2 then adds 4, which the second r2 reads
    and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.
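
    To make that guess concrete, a rough C model of such a scheme
    (invented structure, not DEC's actual microcode): a register modified
    by autoincrement is shadowed in the temp bank, later operand decodes
    naming the same register read the shadow, and the shadows are copied
    back only if the instruction completes without exception.

        #include <stdbool.h>
        #include <stdint.h>

        enum { NGPR = 16 };

        typedef struct {
            uint32_t gpr[NGPR];       /* architectural registers             */
            uint32_t temp[NGPR];      /* microcode temp bank                 */
            bool     in_temp[NGPR];   /* "comparator" hit: use the temp copy */
        } vax_regs;

        static uint32_t read_reg(vax_regs *s, int r)
        { return s->in_temp[r] ? s->temp[r] : s->gpr[r]; }

        /* (Rn)+ operand: return the current value, stash the incremented
         * copy in the temp bank instead of the real register */
        static uint32_t autoinc(vax_regs *s, int r, uint32_t size)
        {
            uint32_t addr = read_reg(s, r);
            s->temp[r] = addr + size;
            s->in_temp[r] = true;
            return addr;
        }

        /* end of instruction, no exception: make side effects architectural */
        static void commit(vax_regs *s)
        {
            for (int r = 0; r < NGPR; r++)
                if (s->in_temp[r]) { s->gpr[r] = s->temp[r]; s->in_temp[r] = false; }
        }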

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Kaz Kylheku on Wed Aug 6 11:54:57 2025
    XPost: comp.lang.c

    On 2025-08-05 17:13, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    ...
    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    No, I meant very specifically that if, as reported, _BitInt was
    supported even in earlier versions, then it was supported as an extension.
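
    For readers following along, a small illustration of what is at stake
    (the first declaration is C23, and per the discussion above is also
    accepted by gcc 14 as an extension in earlier language modes; the
    second is the kind of pre-existing code that could in principle
    clash):

        /* C23 keyword for a bit-precise integer type: a 12-bit counter */
        unsigned _BitInt(12) counter = 0;

        /* Pre-C23 code that had claimed the reserved name for itself.
         * This was always undefined behavior (identifiers starting with
         * an underscore and an upper-case letter are reserved), and it
         * becomes a hard error once _BitInt is a keyword:
         *
         *     int _BitInt = 42;
         */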

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Kaz Kylheku on Wed Aug 6 11:56:04 2025
    XPost: comp.lang.c

    On 2025-08-05 17:25, Kaz Kylheku wrote:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a library. Libraries are provided by other projects: Glibc, Musl,
    uClibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    GCC cannot be implemented in such a way as to create a fully conforming implementation of C when used in connection with an arbitrary
    implementation of the C standard library. This is just one example of a
    more general potential problem: Both gcc and the library must use some
    reserved identifiers, and they might have made conflicting choices.
    That's just one example of the many things that might prevent them from
    being combined to form a conforming implementation of C. It doesn't mean
    that either one is defective. It does mean that the two groups of
    implementors should consider working together to resolve the conflicts.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Wed Aug 6 20:06:00 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs that were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but ran at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Wed Aug 6 17:00:03 2025
    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.
    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Wed Aug 6 21:14:07 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs
    would have been far too expensive to use to build a RISC CPU,
    especially for one of the BUNCH, for whom backward compatibility was
    paramount.

    [*] The machine (Unisys V530) sold for well over a megabuck in
    a single processor configuration.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Wed Aug 6 17:57:03 2025
    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent state machines, each with its own logic sequencer.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to EricP on Wed Aug 6 23:43:12 2025
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:

    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA, which
    used PALs which were not available to the VAX 11/780 designers, so it
    could be clocked a bit higher, but at a multiple of the performance
    than the VAX.

    So, Anton visiting DEC or me visiting Data General could have brought
    them a technology which would significantly outperformed the VAX
    (especially if we brought along the algorithm for graph coloring. Some
    people at IBM would have been peeved at having somebody else "develop"
    this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR
    matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Wed Aug 6 20:41:44 2025
    EricP wrote:

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
    ^^^^
    Oops... typo. Should be FPLA.
    PAL or Programmable Array Logic was a slightly different thing,
    also an AND-OR matrix from Monolithic Memories.

    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    And PAL's too. Whatever works and is cheapest.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Thomas Koenig on Thu Aug 7 11:16:20 2025
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    Russians in the late sixties proposed graph coloring as a way of
    doing memory allocation (and proved that optimal allocation is
    equivalent to graph coloring). They also proposed heuristics
    for graph coloring and experimentally showed that they
    are reasonably effective. This is not the same thing as
    register allocation, but the connection is rather obvious.
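
    As a reminder of how little machinery the core idea needs, a minimal
    greedy colouring of an interference graph in C (a sketch only;
    Chaitin-style allocators such as the one in the 801's PL.8 compiler
    add simplification, spill-cost estimation and coalescing on top):

        #include <stdbool.h>

        enum { NVARS = 6, NCOLORS = 3 };  /* 6 live ranges, 3 registers */

        /* returns -1 on success, else the index of a live range to spill */
        int color_greedy(bool interferes[NVARS][NVARS], int color[NVARS])
        {
            for (int v = 0; v < NVARS; v++) {
                bool used[NCOLORS] = { false };
                for (int u = 0; u < v; u++)        /* colours of neighbours */
                    if (interferes[v][u] && color[u] >= 0)
                        used[color[u]] = true;
                color[v] = -1;
                for (int c = 0; c < NCOLORS; c++)
                    if (!used[c]) { color[v] = c; break; }
                if (color[v] < 0)
                    return v;                      /* no colour left: spill */
            }
            return -1;
        }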

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Robert Swindells on Thu Aug 7 10:47:50 2025
    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lars Poulsen on Thu Aug 7 11:21:56 2025
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type would make it very hard to adapt portable code, such as TCP/IP protocol processing.
    ...
    My concern is how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16 bit?
    Or did the compiler have native types __int16 etc?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.
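
    A sketch of what that looks like in practice (hypothetical code for a
    Cray-like implementation where char is 8 bits and every other integer
    type is 64 bits): a big-endian 16-bit header field is dug out of a
    64-bit word with shifts and masks, no 16-bit type required.

        typedef unsigned long word64;   /* 64 bits on such a machine */

        /* Extract the big-endian 16-bit field starting byte_off bytes
         * into a buffer of 64-bit words, assuming (to keep the sketch
         * short) that the field does not straddle a word boundary. */
        static word64 get_be16(const word64 *buf, unsigned byte_off)
        {
            word64   w  = buf[byte_off / 8];
            unsigned sh = (6 - byte_off % 8) * 8;  /* shift field to bits 15..0 */
            return (w >> sh) & 0xffff;
        }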

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to EricP on Thu Aug 7 11:29:46 2025
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    IIUC the description of the IBM 360-85, it had a pipeline which was much
    more aggressively clocked than the VAX. The 360-85 probably used ECL, but
    at VAX clock speeds it should be easily doable in Schottky TTL
    (as used in the VAX).

    The question is could one build this at a commercially competitive price?

    Yes.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Thu Aug 7 11:59:35 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs
    would have been far too expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to backwards compatibility?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Thu Aug 7 11:38:54 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now
    become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent
    state machines, each with its own logic sequencer.

    I am confused. You gave a possible answer in the posting you are
    replying to.

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.
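
    Roughly, the software path on a TLB miss looks like this (a sketch
    with invented names and a made-up two-level table layout, not the
    actual MIPS handler, which lives in assembly and assumes the tables
    are mapped where the code can reach them):

        #include <stdint.h>

        #define PAGE_SHIFT 12
        #define L1_SHIFT   22
        #define PTE_VALID  0x1u

        extern uint32_t *l1_table;                         /* root of the page table */
        extern void tlb_write(uint32_t vpn, uint32_t pte); /* hypothetical TLB insert */

        void tlb_miss_handler(uint32_t bad_vaddr)
        {
            uint32_t l1e = l1_table[bad_vaddr >> L1_SHIFT];
            if (!(l1e & PTE_VALID))
                return;                      /* hand off to the page-fault path */

            uint32_t *l2 = (uint32_t *)(uintptr_t)(l1e & ~0xfffu);
            uint32_t pte = l2[(bad_vaddr >> PAGE_SHIFT) & 0x3ffu];
            if (!(pte & PTE_VALID))
                return;                      /* page fault */

            tlb_write(bad_vaddr >> PAGE_SHIFT, pte);  /* refill, then retry access */
        }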

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 7 13:34:26 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:

    ...
    My concern is how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16 bit?
    Or did the compiler have native types __int16 etc?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.

    The more likely solution would be to push the protocol processing
    into an attached I/O processor, in those days.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Aug 7 15:03:23 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs would have been far too expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to backwards compatibility?

    Because you need to sell it. Without disrupting your existing
    customer base.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Michael S on Tue Aug 5 21:08:53 2025
    XPost: comp.lang.c

    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

    They invented an identifier which lands in the _[A-Z].* namespace
    designated as reserved by the standard.

    What would be an example of a more conservative way to name the
    identifier?

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Fri Aug 8 10:08:43 2025
    Anton Ertl wrote:
    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.
    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    I don't know why they think these are problems with the 82S100.
    These complaints sound like they come from a hobbyist.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton

    Yes. This risc-VAX would have to decode 1 instruction per clock to
    keep a pipeline full, so I envision running the fetch buffer
    through a bank of those PLAs and generating a uOp out.

    I don't know whether the instructions can be byte aligned variable size
    or have to be fixed 32-bits in order to meet performance requirements.
    I would prefer the flexibility of variable size but
    the Fetch byte alignment shifter adds a lot of logic.

    If variable, then high-frequency instructions like MOV rd,rs
    and ADD rsd,rs fit into two bytes. The longest instruction looks like
    12 bytes: a 4-byte operation specifier (opcode plus registers)
    plus an 8-byte FP64 immediate.

    If a variable size instruction arranges that all the critical parse
    information is located in the first 8-16 bits then we can just run
    those bits through PLAs in parallel and have that control the
    alignment shifter as well as generate the uOp.
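
    In C terms the PLA's job is then no more than this (an invented
    encoding, purely to illustrate the mechanism): the leading bits alone
    determine the length, so the alignment shifter can advance without
    looking at the rest of the instruction.

        #include <stddef.h>
        #include <stdint.h>

        /* invented format field: top two bits of the first byte give the size */
        static size_t insn_length(uint8_t first)
        {
            switch (first >> 6) {
            case 0:  return 2;     /* two-byte reg-reg forms: MOV rd,rs etc.   */
            case 1:  return 4;     /* opcode + registers + short operands      */
            case 2:  return 8;     /* 4-byte specifier + 32-bit immediate      */
            default: return 12;    /* 4-byte specifier + 8-byte FP64 immediate */
            }
        }

        size_t decode_step(const uint8_t *fetch_buf, size_t pos)
        {
            size_t len = insn_length(fetch_buf[pos]);  /* drives the shifter */
            /* ...the same leading bits feed the uOp-generating PLAs... */
            return pos + len;
        }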

    I envision this Fetch buffer alignment shifter built from tri-state
    buffers rather than muxes as TTL muxes are very slow and we would
    need a lot of them.

    The whole fetch-parse-decode should fit on a single board.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Anton Ertl on Fri Aug 8 19:48:59 2025
    On Fri, 08 Aug 2025 06:16:51 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    George Neuner <gneuner2@comcast.net> writes:

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache.

    No such cache in the P6 or any of its descendants until Sandy
    Bridge (2011). The Pentium 4 has a microop cache, but it was eventually
    (with Core Duo, Core 2 Duo) replaced with P6 descendants that have
    no microop cache. Actually, the Core 2 Duo has a loop buffer which
    might be seen as a tiny microop cache. Microop caches and loop
    buffers still have to contain information about which microops belong
    to the same CISC instruction, because otherwise the reorder buffer
    could not commit/execute* CISC instructions.

    * OoO microarchitecture terminology calls what the reorder buffer does
    "retire" or "commit". But this is where the speculative execution
    becomes architecturally visible ("commit"), so from an architectural
    view it is execution.

    Followups set to comp.arch

    - anton

    Thanks for the correction. I did fair amount of SIMD coding for
    Pentium II, III and IV, so was more aware of their architecture. After
    the IV, I moved on to other things so haven't kept up.

    Question:
    It would seem that, lacking the microop cache, the decoder would need
    to be involved for, e.g., every iteration of a loop, and there would
    be more pressure on $I1. Did these prove to be a bottleneck for the
    models lacking that cache? [either? or something else?]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to ThatWouldBeTelling@thevillage.com on Fri Aug 8 21:43:11 2025
    On Wed, 06 Aug 2025 10:23:26 -0400, EricP
    <ThatWouldBeTelling@thevillage.com> wrote:

    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for the register file. To me, fast RAM seems most likely, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where
    destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.

    I was aware of this (thank you), but I was trying to figure out why
    the VAX - particularly early ones - would need it. And also it does
    not mesh with Waldek's comment [at top] about 3 copies.


    The VAX did have one (pathological?) address mode:

    displacement deferred indexed @dis(Rn)[Rx]

    in which Rn and Rx could be the same register. It is the only mode
    for which a single operand could reference a given register more than
    once. I never saw any code that actually did this, but the manual
    does say it was possible.

    But even with this situation, it seems that the register would only
    need to be read once (per operand, at least) and the value could be
    used twice.
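
    Spelled out (a sketch of the architectural definition, not of any
    particular implementation): the base mode fetches a longword address
    from memory at Rn plus the displacement, and the index register,
    scaled by the operand length, is added to it; reading Rn once and
    reusing the value covers the Rn == Rx case.

        #include <stdint.h>

        extern uint32_t mem_read32(uint32_t addr);  /* hypothetical memory read */

        /* effective address of @dis(Rn)[Rx] for an operand of oplen bytes */
        uint32_t ea_disp_deferred_indexed(const uint32_t r[16], int n, int x,
                                          int32_t dis, uint32_t oplen)
        {
            uint32_t rn   = r[n];                   /* one read, even if n == x */
            uint32_t rx   = (n == x) ? rn : r[x];   /* value simply reused      */
            uint32_t base = mem_read32((uint32_t)(rn + dis));
            return base + rx * oplen;
        }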


    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780's second read port is likely used the same as on later VAXen:
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode
    bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDL3 (r2)+, (r2)+, (r2)+

    You mean exceptions? Exceptions were handled between instructions.
    VAX had no iterating string-copy/move instructions so every
    instruction logically could stand alone.

    VAX separately identified the case where the instruction completed
    with a problem (trap) from where the instruction could not complete
    because of the problem (fault), but in both cases it indicated the
    offending instruction.


    the first (left) operand reads r2 then adds 4, which the second r2 reads
    and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Aug 9 08:07:12 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Sat Aug 9 09:04:40 2025
    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not, but that's a LOT of
    speculation with hindsight-colored glasses. Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not. But as with all alternate history, this is
    completely unknowable.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Sat Aug 9 10:00:54 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level) before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line, >>>>although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    Sure. IBM was in less than no hurry to make a product out of
    the 801.


    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performance RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Sat Aug 9 10:03:29 2025
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,
    - with 8 optional XOR output invertors,
    - driving 8 tri-state or open collector buffers.

    So I count roughly 7 or 8 equivalent gate delays.
    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.
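
    Going back to the logic-equivalent structure listed above, here is a
    minimal C model of such a 16-input/48-term/8-output PLA (the fuse-map
    representation and all names are mine, not Signetics'; a real part is
    programmed once, not evaluated in a loop):

        #include <stdint.h>

        /* Hypothetical fuse map for an 82S100-style PLA: 48 product terms
           over 16 inputs (true or complemented), 8 sum outputs, each with
           an optional XOR output inversion. */
        struct pla {
            uint16_t and_true[48];   /* inputs that must be 1 for term i */
            uint16_t and_false[48];  /* inputs that must be 0 for term i */
            uint8_t  or_mask[48];    /* which of the 8 outputs term i feeds */
            uint8_t  xor_invert;     /* per-output XOR inversion mask */
        };

        static uint8_t pla_eval(const struct pla *p, uint16_t in)
        {
            uint8_t out = 0;
            for (int i = 0; i < 48; i++) {
                /* AND plane: the term fires iff all its required true and
                   complemented inputs match. */
                if ((in & p->and_true[i]) == p->and_true[i] &&
                    ((uint16_t)~in & p->and_false[i]) == p->and_false[i])
                    out |= p->or_mask[i];        /* OR plane */
            }
            return out ^ p->xor_invert;          /* optional output inversion */
        }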

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Sat Aug 9 20:54:07 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.


    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Al Kossow@21:1/5 to Thomas Koenig on Sat Aug 9 14:57:03 2025
    On 8/9/25 1:54 PM, Thomas Koenig wrote:

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs?

    using typicals was a rookie mistake
    also not comparing delay times across vendors

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Sun Aug 10 12:06:46 2025
    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    [snip]
    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    Ok. Sure.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Heh. :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    Absolutely.

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Similarly for other minicomputer companies.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.

    Plausibility is orthogonal to whether a thing is knowable.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Dan Cross on Sun Aug 10 15:18:23 2025
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    <snip>

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer. Considerable
    internal resources were being applied to the Jupiter project
    at the end of the 1970s to support a wider range of applications.

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    Fundamentally, 36-bit words ended up being a dead-end.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Sun Aug 10 21:01:50 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started,

    That is true. Reading https://acg.cis.upenn.edu/milom/cis501-Fall11/papers/cocke-RISC.pdf
    (I liked the potential tongue-in-cheek "Regular Instruction
    Set-Computer" name for their instruction set).

    and even
    fewer would have believed it absent a working prototype,

    The simulation approach that IBM took is interesting. They built
    a fast simulator, translating one 801 instruction into one (or
    several) /370 instructions on the fly, with a fixed 32-bit size.
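
    A toy C sketch of that translate-as-you-go idea (entirely my own
    construction, with an invented encoding; IBM's simulator emitted /370
    instructions where this sketch merely interprets):

        #include <stdint.h>

        enum { NREGS = 32 };
        static uint32_t reg[NREGS];          /* toy register file, not the 801's */

        /* Invented fixed 32-bit encoding: 6-bit opcode, three 5-bit registers. */
        #define OPC(w) ((w) >> 26)
        #define RD(w)  (((w) >> 21) & 31)
        #define RS1(w) (((w) >> 16) & 31)
        #define RS2(w) (((w) >> 11) & 31)

        static void simulate(const uint32_t *code, int n)
        {
            for (int pc = 0; pc < n; pc++) {
                uint32_t w = code[pc];
                /* The real simulator generated one or a few /370 instructions
                   per guest instruction here; the fixed-size guest
                   instructions make the decode step trivial. */
                switch (OPC(w)) {
                case 0: reg[RD(w)] = reg[RS1(w)] + reg[RS2(w)]; break; /* add */
                case 1: reg[RD(w)] = reg[RS1(w)] - reg[RS2(w)]; break; /* sub */
                default: return;             /* unhandled opcode: stop */
                }
            }
        }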


    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Scott Lurndal on Mon Aug 11 08:17:48 2025
    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the
    hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    The reasons why this once-common architectural style died out are:

    * 18-bit addresses

    * word addressing

    Sure, one could add 36-bit byte addresses to such an architecture
    (probably with 9-bit bytes to make it easy to deal with words), but it
    would force a completely different ABI and API, so the legacy code
    would still have no good upgrade path and would be limited to its
    256KW address space no matter how much actual RAM there is available.
    IBM decided to switch from this 36-bit legacy to the 32-bit
    byte-addressed S/360 in the early 1960s (with support for their legacy
    lines built into various S/360 implementations); DEC did so when they introduced the VAX.

    Concerning other manufacturers:

    <https://en.wikipedia.org/wiki/36-bit_computing> tells me that the
    GE-600 series was also 36-bit. It continued as Honeywell 6000 series <https://en.wikipedia.org/wiki/Honeywell_6000_series>. Honeywell
    introduced the DPS-88 in 1982; the architecture is described as
    supporting the usual 256KW, but apparently the DPS-88 could be bought
    with up to 128MB; programming that probably was no fun. Honeywell
    later sold the NEC S1000 as DPS-90, which does not sound like the
    Honeywell 6000 line was a growing business. And that's the last I
    read about the Honeywell 6000 line.

    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems. <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:

    |In addition to the IX (1100/2200) CPUs [...], the architecture had
    |Xeon [...] CPUs. Unisys' goal was to provide an orderly transition for
    |their 1100/2200 customers to a more modern architecture.

    So they continued to support it for a long time, but it's a legacy
    thing, not a future-oriented architecture.

    The Wikipedia article also mentions the Symbolics 3600 as 36-bit
    machine, but that was quite different from the 36-bit architectures of
    the 1950s and 1960s: The Symbolics 3600 has 28-bit addresses (the rest apparently taken by tags) and its successor Ivory has 32-bit addresses
    and a 40-bit word. Here the reason for its demise was the AI winter
    of the late 1980s and early 1990s.

    DEC did the right thing when they decided to support VAX as *the*
    future architecture, and the success of the VAX compared to the
    Honeywell 6000 and Univac 1100/2200 series demonstrates this.

    RISC-VAX would have been better than the PDP-10, for the same reasons:
    32-bit addresses and byte addressing. And in addition, the
    performance advantage of RISC-VAX would have made the position of
    RISC-VAX compared to PDP-10 even stronger.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Mon Aug 11 14:51:20 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    In a sense, they still live in the Unisys Clearpath systems.


    The reasons why this once-common architectural style died out are:

    * 18-bit addresses

    An issue for PDP-10, certainly. Not so much for the Univac
    systems.



    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems. <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:


    I spent 14 years at Burroughs/Unisys (on the Burroughs side, mainly).

    Yes, two of the six mainframe lines still exist (albeit in emulation);
    one 48-bit, the other 36-bit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Mon Aug 11 17:27:30 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting link, thanks!


    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    They were considering byte-addressability; interesting. It is also
    slightly funny that a 9-bit byte address would be made up of
    30 bits of virtual address and 2 bits of byte address, i.e.
    a 32-bit address in total.

    Fundamentally, 36-bit words ended up being a dead-end.

    Pretty much so. It was a pity for floating-point, where they had
    more precision than the 32-bit words (and especially the horrible
    IBM format).

    But byte addressability and power of two won.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Tue Aug 12 15:02:04 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be a nice improvement, but not as dramatic as an increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that instruction fetching happens concurrently with the microcoded execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780,
    11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Tue Aug 12 15:59:32 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs    MHz    CPI    Machine
      1      5     10     11/780
      4     12.5    6.25  8600
      6     22.2    7.4   8700
     35     90.9    5.1   NVAX+

    SPEC92     MHz   VAX CPI   Machine
      1/1        5    10/10    VAX 11/780
    133/200    200     3/2     Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 is anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.
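
    Spelling out the arithmetic: with the 11/780's 5 MHz and anecdotal 10
    CPI defining 1 VUP as roughly 0.5 native MIPS, each CPI above is just
    MHz divided by (VUPs * 0.5). A small C sketch of that calculation (the
    0.5 MIPS-per-VUP factor is the assumption just stated):

        #include <stdio.h>

        /* CPI from clock rate and relative performance, taking the VAX
           11/780 (5 MHz, ~10 CPI, i.e. ~0.5 MIPS) as 1 VUP. */
        static double cpi(double mhz, double vups)
        {
            return mhz / (vups * 0.5);
        }

        int main(void)
        {
            printf("11/780: %.2f\n", cpi(5.0,   1.0));  /* 10.00 */
            printf("8600:   %.2f\n", cpi(12.5,  4.0));  /*  6.25 */
            printf("8700:   %.2f\n", cpi(22.2,  6.0));  /*  7.40 */
            printf("NVAX+:  %.2f\n", cpi(90.9, 35.0));  /* ~5.19; rounds to 5.1 */
            return 0;
        }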

    Note that the NVAX+ was made in the same process as the 21064, the
    21064 has about twice the clock rate, and has 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    I doubt that they could afford 1-cycle multiply

    Yes, one might do a multiplier and divider with its own sequencer (and
    more sophisticated ones in later implementations), with any user of the
    result stalling the pipeline until that is complete, and any
    following user of the multiplier or divider stalling the pipeline
    until it is free again.

    The idea of providing multiply-step instructions and using a bunch of
    them was short-lived; already the MIPS R2000 included a multiply
    instruction (with its own sequencer); HPPA has multiply-step as well
    as an FPU-based multiply from the start. The idea of avoiding divide instructions had a longer life. MIPS has divide right from the start,
    but Alpha and even IA-64 avoided it. RISC-V includes divide in the M
    extension that also gives multiply.
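
    For readers who have not met multiply-step instructions: each step
    looks at one multiplier bit, conditionally adds the multiplicand into
    the upper half of the partial product, and shifts the pair right by
    one, so a 32x32->64 multiply is 32 such steps plus loop overhead. A C
    rendering of the generic shift-and-add step (my own sketch, not the
    exact semantics of HPPA's or any other ISA's multiply-step):

        #include <stdint.h>

        static uint64_t mul32(uint32_t a, uint32_t b)
        {
            uint64_t hi = 0, lo = b;   /* (hi:lo) is the 64-bit partial product */
            for (int i = 0; i < 32; i++) {
                /* one "multiply step" */
                if (lo & 1)
                    hi += a;                          /* 33-bit sum fits in hi */
                lo = (lo >> 1) | ((hi & 1) << 31);    /* shift the pair right */
                hi >>= 1;
            }
            return (hi << 32) | lo;
        }

    Each loop iteration is what one multiply-step instruction would do in
    a single cycle.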

    or
    even a barrel shifter.

    Five levels of 32-bit 2->1 muxes might be doable, but would that be cost-effective?
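
    For concreteness, the five levels would be conditional shifts by 1, 2,
    4, 8, and 16; a small C model of that structure (each loop iteration
    stands in for one row of 32 two-input muxes selecting shifted or
    unshifted data):

        #include <stdint.h>

        static uint32_t barrel_shl(uint32_t x, unsigned count)
        {
            count &= 31;
            for (unsigned k = 0; k < 5; k++)      /* stages shifting by 1,2,4,8,16 */
                if (count & (1u << k))
                    x <<= (1u << k);              /* mux selects the shifted input */
            return x;
        }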

    It is accepted in this era that using more hardware could
    give a substantial speedup. IIUC IBM used a quadratic rule:
    performance was supposed to be proportional to the square of
    CPU price. That was partly marketing, but partly due to
    compromises needed in smaller machines.

    That's more of a 1960s thing, probably because low-end S/360
    implementations used all (slow) tricks to minimize hardware. In the
    VAX 11/780 environment, I very much doubt that it is true. Looking at
    the early VAXen, you get the 11/730 with 0.3 VUPs up to the 11/784
    with 3.5 VUPs (from 4 11/780 CPUs). sqrt(3.5/0.3)=3.4. I very much
    doubt that you could get an 11/784 for 3.4 times the price of an
    11/730.

    Searching a little, I find

    |[11/730 is] to be a quarter the price and a quarter the performance of
    |a grown-up VAX (11/780) <https://retrocomputingforum.com/t/price-of-vax-730-with-vms-the-11-730-from-dec/3286>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Wed Aug 13 11:25:24 2025
    In article <107b1bu$252qo$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    Sure. I wasn't disputing that, just saying that I don't think
    it mattered that much.

    [snip]
    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Well, then we're definitely into the unknowable. :-)

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Wed Aug 13 14:18:06 2025
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns cycle time, so yes, one could have used that for the VAX.
    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.
    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    I didn't use the typical values. Yes, it would be dangerous to use them.
    I never understood why they even quoted those typical numbers.
    I always considered them marketing fluff.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?

    I wasn't suggesting that. People used to modern CMOS speeds might not appreciate how slow TTL was. I was showing that its 50 ns speed number
    was not out of line with other MSI parts of that day, and I just happened
    to have a PDF TTL manual opened on that part, so I used it as an example.
    A 74181 4-bit ALU is also of similar complexity and 62 ns max.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Wed Aug 13 14:40:01 2025
    Anton Ertl wrote:

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780, 11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton

    Yes I saw the Patt paper recently. He has written many microarchitecture papers. I was surprised that in 1990 he would say on page 2:

    "All VAXes are microcoded. The richness of the instruction set urges that
    the flexibility of microcoded control be employed, notwithstanding the conventional mythology that hardwired control is somehow faster than
    microcode. It is instructive to point out that (1) hardwired control
    produces higher performance execution only in situations where the
    critical path is in the microsequencing function, and (2) that this
    should not occur in VAX implementations if one designs with the
    well-understood (to microarchitects) technique that the next control
    store address must be obtained from information available at the start
    of the current microcycle. A variation of this basic old technique is
    the recently popularized delayed branch present in many ISA architectures introduced in the last few years."

    When he refers to the "mythology that hardwired control is somehow faster"
    he appears to still be using the monolithic "eyes" I referred to earlier
    in that everything must go through a single microsequencer.
    He compares a hardwired sequential controller to a microcoded sequential controller and notes that in that case hardwired is no faster.

    What he is not doing is comparing multiple parallel hardware stages
    to a sequential controller, hardwired or microcoded.

    RISC brings with it the concurrent hardware stages view.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Wed Aug 13 20:23:53 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375s
    (presumably more expensive) or a 74LS377 and an eight-bit
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Fri Aug 15 03:20:56 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be a nice improvement, but not as dramatic as an increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that instruction fetching happens concurrently with the microcoded execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 the VAX's 512-byte page was close to optimal.
    Namely, IIUC the smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for a sophisticated system with
    fine-grained access control. Bigger pages would reduce the
    number of pages. For example, 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.

    _For current machines_ there are reasons to use bigger pages, but
    in VAX times bigger pages almost surely would have led to higher memory
    use and consequently to a higher price for the end user. In effect the
    machine would have been much less competitive.

    BTW: Long ago I saw a message about porting an application from
    VAX to Linux. On the VAX the application ran OK in 1 GB of memory.
    On 32-bit Intel architecture Linux with 1 GB there was excessive
    paging. The reason was the much smaller number of bigger pages.
    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Fri Aug 15 05:07:01 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Cross@21:1/5 to tkoenig@netcologne.de on Fri Aug 15 12:57:35 2025
    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    The VAX was built to be a commercial product. As such, it was
    designed to be successful in the market. But in order to be
    successful in the market, it was important that the designers be
    informed by the business landscape at both the time they were
    designing it, and what they could project would be the lifetime
    of the product. Those are considerations that extend beyond
    the purely technical aspects of the design, and are both more
    speculative and more abstract.

    Consider how the business criteria might influence the technical
    design, and how these might play off of one another: obviously,
    DEC understood that the PDP-11 was growing ever more constrained
    by its 16-bit address space, and that any successor would have
    to have a larger address space. From a business perspective, it
    made no sense to create a VAX with a 16-bit address space.
    Similarly, they could have chosen (say) a 20, 24, or 28 bit
    address space, or used segmented memory, or any number of other
    such decisions, but the model that they did choose (basically a
    flat 32-bit virtual address space: at least as far as the
    hardware was concerned; I know VMS did things differently) was
    ultimately the one that "won".

    Of course, those are obvious examples. What I'm contending is
    that the business<->technical relationship is probably deeper
    and that business has more influence on technology than we
    realize, up to and including the ISA design. I'm not saying
    that the business folks are looking over the engineers'
    shoulders telling them how the opcode space should be arranged,
    but I am saying that they're probably going to engineering with
    broad-strokes requirements based on market analysis and customer
    demand. Indeed, we see examples of this now, with the addition
    of vector instructions to most major ISAs. That's driven by the
    market, not merely engineers saying to each other, "you know
    what would be cool? AVX-512!"

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT
    instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    Of course, they messed some of it up; EDITPC was like the
    punchline of a bad joke, and the ways that POLY was messed up
    are well-known.

    Anyway, I apologize for the length of the post, but that's the
    sort of thing I mean.

    - Dan C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to Dan Cross on Fri Aug 15 13:36:12 2025
    On Fri, 15 Aug 2025 12:57:35 -0000 (UTC), Dan Cross wrote:

    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in the
    mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm saying
    that business needs must have, at least in part, influenced the ISA
    design. That is, while mistaken, it was part of the business decision
    process regardless.

    It's not clear to me what the distinction of technical vs. business is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [snip]

    There are also bits of the business requirements in each of the
    descriptions of DEC microprocessor projects on Bob Supnik's site
    that Al Kossow linked to earlier:

    <http://simh.trailing-edge.com/dsarchive.html>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Waldek Hebisch on Fri Aug 15 15:10:58 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be a nice improvement, but not as dramatic as an increase
    from 2 KB to 12 KB.

    The handbook is:
    https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.
    Section 2.7 also mentions an 8-byte instruction buffer, and that the
    instruction fetching happens concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 the VAX's 512-byte page was close to optimal.
    Namely, IIUC the smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for a sophisticated system with
    fine-grained access control. Bigger pages would reduce the
    number of pages. For example, 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.

    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to BGB on Sun Aug 17 06:16:08 2025
    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.

    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to BGB on Sun Aug 17 10:00:56 2025
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."
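
    The work such a software refill handler does is itself small; the
    expense discussed above is the trap, drain and reload around it. A
    hedged C sketch of a two-level walk (the table layout, field names and
    the tlb_write() hook are hypothetical, not MIPS's actual handler,
    which is a handful of assembly instructions):

        #include <stdint.h>

        #define PAGE_SHIFT    12
        #define PTES_PER_PAGE 1024            /* assumed: 4 KB pages, 4-byte PTEs */
        #define PTE_VALID     0x1u

        /* Hypothetical privileged operation that writes one TLB entry. */
        extern void tlb_write(uint32_t vpn, uint32_t pte);

        /* Two-level table walk as a software TLB-miss handler would do it. */
        void tlb_refill(uint32_t *root, uint32_t miss_va)
        {
            uint32_t vpn = miss_va >> PAGE_SHIFT;
            uint32_t l1 = root[vpn / PTES_PER_PAGE];     /* first memory access */
            if (!(l1 & PTE_VALID))
                return;                  /* a real handler raises a page fault */

            /* For the sketch, treat the entry as a directly usable pointer. */
            uint32_t *l2 = (uint32_t *)(uintptr_t)(l1 & ~(uint32_t)0xFFF);
            uint32_t pte = l2[vpn % PTES_PER_PAGE];      /* second memory access */
            if (!(pte & PTE_VALID))
                return;

            tlb_write(vpn, pte);         /* install, then return from the trap */
        }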

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sun Aug 17 15:21:38 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similarly to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.
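
    The kind of loop that produces this looks roughly as follows (one
    plausible pessimal ordering for row-major double arrays, not
    necessarily the exact code measured): the innermost index strides a
    whole 8000-byte row in both c and a, so each iteration touches two
    pages that a small TLB is unlikely to hold.

        #include <stddef.h>

        #define N 1000

        void matmul_pessimal(double c[N][N], const double a[N][N],
                             const double b[N][N])
        {
            /* Innermost loop over i: c[i][j] and a[i][k] both stride a full
               row (8000 bytes, i.e. about two 4 KB pages) per iteration,
               while b[k][j] stays put -- hence ~2 TLB misses per iteration. */
            for (size_t j = 0; j < N; j++)
                for (size_t k = 0; k < N; k++)
                    for (size_t i = 0; i < N; i++)
                        c[i][j] += a[i][k] * b[k][j];
        }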

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Sun Aug 17 13:35:03 2025
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed test it and partially function test it.
    It's programmed by blowing internal fuses, which is a one-shot thing,
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375s
    (presumably more expensive) or a 74LS377 and an eight-bit
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
    wired-AND logic driving an invertor for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an invertor,
    to form the second output NAND layer.

    inst buf <15:8> <7:0>
    | | | |
    4:16 4:16 4:16 4:16
    vvvv vvvv vvvv vvvv
    10k ---|---|---|---|------>INV->
    10k ---------------------->INV->
    10k ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory was so expensive that
    we could not afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103, 1k*1b.
    The 4 kb DRAMs were just making it to customers; 16 kb parts were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Sun Aug 17 19:10:21 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there are not enough resources left for
    the TLB miss handler.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jakob Bohm@21:1/5 to Kaz Kylheku on Sun Aug 17 20:18:36 2025
    XPost: comp.lang.c

    On 2025-08-05 23:08, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

    They invented an identifier which lands in the _[A-Z].* namespace
    designated as reserved by the standard.

    What would be an example of a more conservative way to name the
    identifier?


    What is actually going on is GCC offering its users a gradual way to
    transition from C17 to C23, by applying the C23 meaning of any C23
    construct that has no conflicting meaning in C17. In particular, this
    allows installed library headers to use the new types as part of
    logically opaque (but compiler visible) implementation details, even
    when those libraries are used by pure C17 programs. For example, the
    ISO POSIX datatype struct stat could contain a _BitInt(128) type for
    st_dev or st_ino if the kernel needs that, as was the case with the 1996
    NT kernel. Or a _BitInt(512) for st_uid as used by that same kernel.
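
    A minimal sketch of the kind of header this enables, assuming the
    compiler advertises _BitInt support through the predefined
    __BITINT_MAXWIDTH__ macro (as recent GCC and clang do); the names
    below are made up for illustration and are not the real <sys/stat.h>:

    /* mystat.h -- illustrative sketch only */
    #if defined(__BITINT_MAXWIDTH__) && __BITINT_MAXWIDTH__ >= 128
    typedef _BitInt(128) my_dev_t;               /* wide device number   */
    #else
    typedef struct { unsigned long long lo, hi; } my_dev_t;  /* fallback */
    #endif

    struct my_stat {
        my_dev_t st_dev;   /* logically opaque to portable C17 callers */
        /* ... other fields ... */
    };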

    GCC --pedantic is an option to check if a program is a fully conforming portable C program, with the obvious exception of the contents of any
    used "system" headers (including installed libc headers), as those are
    allowed to implement standard or non-standard features in implementation specific ways, and might even include implementation specific logic to
    report the use of non-standard extensions to the library standards when
    the compiler is invoked with --pedantic and no contrary options.

    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C instead
    of GNUC reverts those to the standard definition.

    Enjoy

    Jakob

    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Dan Cross on Mon Aug 18 05:48:00 2025
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [...]

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    I had actually forgotten that the VAX also had decimal
    instructions. But the 11/780 also had one really important
    restriction: It could only do one write every six cycles, see https://dl.acm.org/doi/pdf/10.1145/800015.808199 , so that
    severely limited their throughput there (assuming they did
    things bytewise). So yes, decimal arithmetic was important
    in the day for COBOL and related commercial applications.

    So, what to do with decimal arithmetic, which was important
    at the time (and a business consideration)?

    Something like Power's addg6s instruction could have been
    introduced, it adds two numbers together, generating only the
    decimal carries, and puts a nibble "6" into the corresponding
    nibble if there is one, and "0" otherwise. With 32 bits, that
    would allow addition of eight-digit decimal numbers in four
    instructions (see one of the POWER ISA documents for details),
    but the cycle of "read ASCII digits, do arithmetic, write
    ASCII digits" would have needed some extra shifts and masks,
    so it might have been more beneficial to use four digits per
    register.
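
    To make that concrete, here is a minimal C sketch of the carry trick
    that addg6s accelerates, for eight packed BCD digits in a 32-bit word.
    This is plain portable C, not the POWER instruction itself, and the
    function name is made up; a carry out of the top digit is simply
    dropped:

    #include <stdint.h>

    static uint32_t bcd_add8(uint32_t a, uint32_t b)
    {
        uint64_t t   = (uint64_t)a + 0x66666666u;    /* bias every digit by 6 */
        uint64_t sum = t + b;                        /* plain binary add      */
        /* bit 4*(k+1) of (t ^ b ^ sum) is the carry out of digit k */
        uint64_t carries = (t ^ b ^ sum) & 0x111111110ull;
        uint64_t nocarry = carries ^ 0x111111110ull; /* digits still biased   */
        uint64_t fix = (nocarry >> 2) | (nocarry >> 3);  /* 0x6 in each one   */
        return (uint32_t)(sum - fix);                /* e.g. 0x19 + 0x23 = 0x42 */
    }

    addg6s essentially delivers those 0x6 correction nibbles in one
    instruction, which is what makes the short POWER sequence possible.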

    The article above is also extremely interesting otherwise. It does
    not give cycle timings for each individual instruction and address
    mode, but it gives statistics on how they were used, and a good
    explanation of the timing implications of their microcode design.

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Heathfield@21:1/5 to Keith Thompson on Mon Aug 18 08:02:30 2025
    XPost: comp.lang.c

    On 18/08/2025 06:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    $ cat so.c
    #include <stdio.h>

    int main(void)
    {
        int foo = 42;
        size_t soa = sizeof (foo, 'C');
        size_t sob = sizeof foo;
        printf("%s.\n", (soa == sob) ? "Yes" : "No");
        return 0;
    }
    $ gcc -o so so.c
    $ ./so
    Yes.
    $ gcc --version
    gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Mon Aug 18 11:34:49 2025
    XPost: comp.lang.c

    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?


    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Mon Aug 18 11:03:15 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.

    Yeah, this approach works a lot better than people seem to give it
    credit for...
    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.
    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    None of those research papers that I have seen consider the possibility
    that OoO can make use of multiple concurrent HW walkers if the
    cache supports hit-under-miss and multiple pending miss buffers.

    While instruction fetch only needs to translate a VA occasionally, one
    at a time, with more aggressive alternate-path prefetching all those VAs
    have to be translated first before the buffers can be prefetched.
    The LSQ could also potentially be translating as many VAs as there are entries.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.

    - anton

    I'm looking for papers that separate out the common cost of loading a PTE
    from the extra cost of just the SW-miss handler. I had a paper a while
    back but can't find it now. IIRC in that paper the extra cost of the
    SW miss handler on Alpha was measured at 5-25%.

    One thing to mention about some of these papers looking at TLB performance:
    some papers on virtual address translation appear NOT to be aware
    that Intel's HW walker, on its downward walk, caches the interior-node
    PTE's in auxiliary TLB's and checks those for hits in bottom-to-top order
    (called a bottom-up walk), thereby avoiding many HW walks from the root.

    A SW walker can accomplish the same bottom-up walk by locating
    the different page table levels at *virtual* base addresses,
    and adding each VA of those interior PTE's to the TLB.
    This is what VAX VA translate did, probably Alpha too but I didn't check.
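
    A minimal C sketch of such a bottom-up handler, assuming the page
    tables are self-mapped at a fixed virtual base, an 8 KB page size, and
    a tlb_insert() fill primitive; all of these names and numbers are
    illustrative, and the root level is assumed to be pinned by a wired
    TLB entry so the nesting bottoms out:

    #include <stdint.h>

    #define PAGE_SHIFT 13
    #define PT_BASE    0xFFFFFE0000000000ull   /* hypothetical self-map base */

    typedef uint64_t pte_t;

    extern void tlb_insert(uint64_t va, pte_t pte);  /* assumed TLB-fill op */

    void tlb_miss(uint64_t miss_va)
    {
        /* virtual address of the leaf PTE that maps miss_va */
        uint64_t vpn    = miss_va >> PAGE_SHIFT;
        uint64_t pte_va = PT_BASE + vpn * sizeof(pte_t);

        /* This load goes through the TLB.  If the interior PTE mapping
           pte_va is already cached (the common case), the walk is a single
           load; if not, this load misses and the same handler runs again
           one level up the tree. */
        pte_t pte = *(volatile pte_t *)(uintptr_t)pte_va;

        tlb_insert(miss_va, pte);
    }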

    This interior PTE node caching is critical for optimal performance
    and some of their stats don't take it into account
    and give much worse numbers than they should.

    Also many papers were written before ASID's were in common use
    so the TLB got invalidated with each address space switch.
    This would penalize any OS which had separate user and kernel space.

    So all these numbers need to be taken with a grain of salt.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Mon Aug 18 15:35:36 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal
    reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Mon Aug 18 17:19:13 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want >atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0; they were
    independent hardware features added in V8.1.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Wed Aug 20 03:47:17 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs    MHz    CPI    Machine
      1      5     10     11/780
      4     12.5    6.25  8600
      6     22.2    7.4   8700
     35     90.9    5.1   NVAX+

    SPEC92     MHz   VAX CPI   Machine
      1/1        5   10/10     VAX 11/780
    133/200    200    3/2      Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 is anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.

    Note that the NVAX+ was made in the same process as the 21064; the
    21064 has about twice the clock rate and 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    Prism paper says the following about RISC versus VAX performance:

    : 1. Shorter cycle time. VAX chips have more, and longer, critical
    : paths than RISC chips. The worst VAX paths are the control store
    : loop and the variable length instruction decode loop, both of
    : which are absent in RISC chips.

    : 2. Fewer cycles per function. Although VAX chips require fewer
    : instructions than RISC chips (1:2.3) to implement a given
    : function, VAX instructions take so many more cycles than RISC
    : instructions (5-10:1-1.5) that VAX chips require many more cycles
    : per function than RISC chips.

    : 3. Increased pipelining. VAX chips have more inter- and
    : intra-instruction dependencies, architectural irregularities,
    : instruction formats, address modes, and ordering requirements
    : than RISC chips. This makes VAX chips harder and more
    : complicated to pipeline.

    Point 1 above for me means that VAX chips were microcoded. Point
    2 above suggests that there were limited changes compared to the
    VAX-780 microcode.

    IIUC attempts to create better hardware for the VAX were canceled
    just after the PRISM memos, so later VAXes used essentially the same
    logic, just rescaled to a better process.

    I think that the VAX had a problem with hardware decoders because of
    gate delay: in 1987 a hardware decoder would probably have slowed down
    the clock. But the 1977 design looks quite relaxed to me: the main
    logic was Schottky TTL, which nominally has 3 ns of inverter delay.
    With a 200 ns cycle this means about 66 gate delays per cycle. And in
    critical paths the VAX used ECL. I do not know exactly which ECL, but
    AFAIK 2 ns ECL was commonly available in 1970 and 1 ns ECL was leading
    edge in 1970.

    That is why I think that in 1977 a hardware decoder could give a
    speedup, assuming that the execution units could keep up: the gate
    delay and cycle time mean that a rather deep circuit could fit within
    the cycle time. IIUC 1987 designs were much more aggressive and the
    decoder delay probably could not fit within a single cycle.

    Quite possibly the hardware designers attempting VAX hardware
    decoders were too ambitious and wanted to decode too-complicated
    instructions in one cycle. AFAICS for instructions that cannot
    be executed in one cycle the decode can also be slower than one
    cycle; all one needs is to recognize within one cycle
    that decode will take multiple cycles.

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Wed Aug 20 14:36:43 2025
    Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    Not quite.
    My idea was to have two HW threads HT1 and HT2 which are like x86 HW
    threads except when HT1 gets a TLB miss it stalls its execution and
    injects the TLB miss handler at the front of HT2 pipeline,
    and a HT2 TLB miss stalls itself and injects its handler into HT1.
    The TLB miss handler never itself TLB misses as it explicitly checks
    the TLB for any VA it needs to translate so recursion is not possible.

    As the handler is injected at the front of the pipeline, no drain occurs.
    The only possible problem is if, between when HT1 injects its miss handler
    into HT2, HT2's existing pipeline code then also does a TLB miss.
    As this would cause a deadlock, if it occurs the core detects it
    and both HTs fault and run their TLB miss handlers themselves.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want
    atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0, they were independent hardware features added to V8.1.

    Yes. A memory recycler can periodically clear the Accessed bit
    so it can detect page usage, and that might be a different core.
    But it might skip sending TLB shootdowns to all other cores
    to lower the overhead (maybe a lazy usage detector).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Wed Aug 20 16:41:39 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    Hmmm... I don't think that is possible, or if it is then it's really hairy.
    The miss handler needs to LD the memory PTE's, which can happen OoO.
    But it also needs to do things like writing control registers
    (e.g. the TLB) or setting the Accessed or Dirty bits on the in-memory PTE, things that usually only occur at retire. But those handler instructions
    can't get to retire because the older instructions that triggered the
    miss are stalled.

    The miss handler needs general registers so it needs to
    stash the current content someplace and it can't use memory.
    Then add a nested miss handler on top of that.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    As Scott said, to avoid race conditions with software clearing those bits.
    Plus there might be PTE modifications that an OS could perform on other
    PTE fields concurrently without first acquiring the normal mutexes
    and doing a TLB shoot down of the PTE on all the other cores,
    provided they are done atomically so the updates of one core
    don't clobber the changes of another.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    A HW walker looks simple to me.
    It has a few bits of state number and a couple of registers.
    It needs to detect memory read errors if they occur and abort.
    Otherwise it checks each TLB level in backwards order using the
    appropriate VA bits, and if it gets a hit walks back down the tree
    reading PTE's for each level and adding them to their level TLB,
    checking it is marked present, and performing an atomic OR to set
    the Accessed and Dirty flags if they are clear.

    The HW walker is even simpler if the atomic OR is implemented directly
    in the cache controller as part of the Atomic Fetch And OP series.

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton

    Yes, and it seems to me that one would spend a lot more time trying to
    fix the SW walker than doing the simple HW walker that just works.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to BGB on Wed Aug 20 19:17:01 2025
    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4 kbit DRAMs were just making it to customers; 16 kbit parts were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, needing a lot less logic for shifting prefetch buffers
    compared to, say, variable lengths of 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    A 4-bit register field allows many 2-byte accumulate-style instructions
    (where a register is both source and dest):
    an 8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate of 1-8 bytes.

    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest
    conditional branch would be 3 bytes as it needs a register specifier.

    If one is doing variable byte-length instructions, then the
    highest-frequency instructions can be made as compact as possible.
    E.g. an ADD with a 32-bit immediate in 6 bytes.

    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.
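
    As a rough software model of what that shifter has to do, here is a
    behavioural C sketch of a byte-granular collapsing fetch buffer; the
    names and sizes are illustrative, and the memmove stands in for the
    byte cross-bar whose gate count is at issue (a fixed 32-bit ISA only
    ever discards whole aligned words):

    #include <stdint.h>
    #include <string.h>

    struct ibuf {
        uint8_t  bytes[16];   /* 16-byte fetch window         */
        unsigned valid;       /* number of valid bytes, 0..16 */
    };

    /* drop one decoded instruction of 'len' bytes and slide the rest down */
    static void ibuf_consume(struct ibuf *b, unsigned len)
    {
        memmove(b->bytes, b->bytes + len, b->valid - len);
        b->valid -= len;
        /* a real front end would now refill bytes[valid..15] from the I-cache */
    }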

    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less encoding
    space than a 4-bit R8..R23 field (saving 1 bit in the relevant scenarios).



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Thu Aug 21 16:21:37 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Consider "virgin" page, that is neither accessed nor modified.
    Intruction 1 reads the page, instruction 2 modifies it. After
    both are done you should have both bits set. But if miss handling
    for instruction 1 reads page table entry first, but stores after
    store fomr instruction 2 handler, then you get only accessed bit
    and modified flag is lost. Symbolically we could have

    read PTE for instruction 1
    read PTE for instruction 2
    store PTE for instruction 2 (setting Accessed and Modified)
    store PTE for instruction 1 (setting Accessed but clearing Modified)
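
    A minimal C11 sketch of the difference, with made-up flag bit
    positions: the plain read-modify-write is exactly the lost-update
    sequence above, while an atomic OR can only add bits, so neither
    walker's update is lost:

    #include <stdatomic.h>
    #include <stdint.h>

    #define PTE_ACCESSED  (UINT64_C(1) << 0)   /* illustrative bit layout */
    #define PTE_MODIFIED  (UINT64_C(1) << 1)

    static void pte_set_flags_racy(uint64_t *pte, uint64_t flags)
    {
        uint64_t old = *pte;   /* both walkers may read the same old value      */
        *pte = old | flags;    /* the later store wins, dropping the other's bit */
    }

    static void pte_set_flags_atomic(_Atomic uint64_t *pte, uint64_t flags)
    {
        atomic_fetch_or_explicit(pte, flags, memory_order_relaxed);
    }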

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Waldek Hebisch on Thu Aug 21 19:26:47 2025
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...), but it would have
    run rings around the VAX.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Thomas Koenig on Fri Aug 22 16:36:09 2025
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (altough it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But the low numbers given
    for early RISC chips are IMO misleading: RISC became commercially
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions. Also, to fit the
    design into a single chip, designers moved some functionality
    like the bus interface to support chips. A RISC processor with
    mixed 16-32 bit instructions (needed to get reasonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller, is much more than the
    100 thousand transistors cited for early workstation chips.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Fri Aug 22 16:45:56 2025
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    It also seems rather high for the /91. I can't find any authoritative
    numbers, but 100K seems more likely. It was SLT, individual transistors
    mounted a few to a package. The /91 was big, but it wasn't *that* big.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Waldek Hebisch on Fri Aug 22 17:21:17 2025
    Waldek Hebisch <antispam@fricas.org> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (altough it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But low numbers given
    for early RISC chips are IMO misleading: RISC become comercialy
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions.

    Like the multiply instruction in ARM2.

    Also, to fit
    design into a single chip designers moved some functionality
    like bus interface to support chips. RISC processor with
    mixed 16-32 bit instructions (needed to get resonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller is much more than
    100 thousend transitors cited for early workstation chips.

    Yep, FP support can be expensive and was an extra option
    on the VAX, which also included integer multiply.

    However, I maintain that a ~1977 supermini with a similar sort
    of bus, MMU, floating point unit etc. to the VAX, but with an
    architecture similar to ARM2, plus separate icache and dcache, would
    have beaten the VAX hands-down in performance - it would have taken
    fewer chips to implement, less power, and possibly less time to develop.
    HP showed this was possible some time later.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to John Levine on Sat Aug 23 16:38:47 2025
    John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    It's also seems rather high for the /91. I can't find any authoritative numbers but 100K seems more likely. It was SLT, individual transistors mounted a few to a package. The /91 was big but it wasn't *that* big.

    I remember this number, but do not remember where I found it. So
    it may be wrong.

    However, one can estimate possible density in a different way: a package,
    probably of similar dimensions as the VAX package, can hold about 100 TTL
    chips. I do not have detailed data about chip usage and transistor
    counts for each chip. A simple NAND gate is 4 transistors, but the input
    transistor has two emitters and really works like two transistors,
    so it is probably better to count it as 2 transistors, and consequently
    consider a 2-input NAND gate as having 5 transistors. So a 74S00 gives
    20 transistors. A D-flop is probably about 20-30 transistors, so a
    74S74 is probably around 40-60. A quad D-flop brings us close to 100.
    I suspect that in VAX times octal D-flops were available. There
    were 4-bit ALU slices. Also multiplexers need a nontrivial number
    of transistors. So I think that 50 transistors is a reasonable (maybe
    low) estimate of average density. Assuming 50 transistors per chip,
    that would be 5000 transistors per package. Packages were rather
    flat, so when mounted vertically one probably could allocate 1 cm
    of horizontal space for each. That would allow 30 packages at a
    single level. With 7 levels we get 210 packages, enough for
    1 mln transistors.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Mon Aug 25 00:56:26 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
        long i, r;
        for (i=0, r=0; i<n; i++)
            r+=v[i];
        return r;
    }
    arrays:
        MOV   Ri,#0
        MOV   Rr,#0
        VEC   Rt,{}
        LDD   Rl,[Rv,Ri<<3]
        ADD   Rr,Rr,Rl
        LOOP  LT,Ri,Rn,#1
        MOV   R1,Rr
        RET

    7 instructions, 1 instruction-modifier; 8 words.

    long a, b, c, d;

    void globals(void)
    {
        a = 0x1234567890abcdefL;
        b = 0xcdef1234567890abL;
        c = 0x567890abcdef1234L;
        d = 0x5678901234abcdefL;
    }

    globals:
        STD   0x1234567890abcdef,[IP,a]
        STD   0xcdef1234567890ab,[IP,b]
        STD   0x567890abcdef1234,[IP,c]
        STD   0x5678901234abcdef,[IP,d]
        RET

    5 instructions, 13 words, 0 .data, 0 .bss

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret

    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.

    gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following AMD64 code:

    000000001139 <arrays>:
    1139: 48 85 f6 test %rsi,%rsi
    113c: 74 13 je 1151 <arrays+0x18>
    113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
    1142: 31 c0 xor %eax,%eax
    1144: 48 03 07 add (%rdi),%rax
    1147: 48 83 c7 08 add $0x8,%rdi
    114b: 48 39 d7 cmp %rdx,%rdi
    114e: 75 f4 jne 1144 <arrays+0xb>
    1150: c3 ret
    1151: 31 c0 xor %eax,%eax
    1153: c3 ret

    000000001154 <globals>:
    1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
    115b: 56 34 12
    115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
    1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
    116c: 12 ef cd
    116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
    1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
    117d: 90 78 56
    1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
    1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
    118e: 90 78 56
    1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
    1198: c3 ret

    gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following ARM A64 code:

    0000000000000734 <arrays>:
    734: b4000121 cbz x1, 758 <arrays+0x24>
    738: aa0003e2 mov x2, x0
    73c: d2800000 mov x0, #0x0 // #0
    740: 8b010c43 add x3, x2, x1, lsl #3
    744: f8408441 ldr x1, [x2], #8
    748: 8b010000 add x0, x0, x1
    74c: eb03005f cmp x2, x3
    750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
    754: d65f03c0 ret
    758: d2800000 mov x0, #0x0 // #0
    75c: d65f03c0 ret

    0000000000000760 <globals>:
    760: d299bde2 mov x2, #0xcdef // #52719
    764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
    768: f2b21562 movk x2, #0x90ab, lsl #16
    76c: 9100e020 add x0, x1, #0x38
    770: f2cacf02 movk x2, #0x5678, lsl #32
    774: d2921563 mov x3, #0x90ab // #37035
    778: f2e24682 movk x2, #0x1234, lsl #48
    77c: f9001c22 str x2, [x1, #56]
    780: d2824682 mov x2, #0x1234 // #4660
    784: d299bde1 mov x1, #0xcdef // #52719
    788: f2aacf03 movk x3, #0x5678, lsl #16
    78c: f2b9bde2 movk x2, #0xcdef, lsl #16
    790: f2a69561 movk x1, #0x34ab, lsl #16
    794: f2c24683 movk x3, #0x1234, lsl #32
    798: f2d21562 movk x2, #0x90ab, lsl #32
    79c: f2d20241 movk x1, #0x9012, lsl #32
    7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
    7a4: f2eacf02 movk x2, #0x5678, lsl #48
    7a8: f2eacf01 movk x1, #0x5678, lsl #48
    7ac: a9008803 stp x3, x2, [x0, #8]
    7b0: f9000c01 str x1, [x0, #24]
    7b4: d65f03c0 ret

    So, the overall sizes (including data size for globals() on RV64GC) are:

    arrays  globals       Architecture
      28    66 (34+32)    RV64GC
      27    69            AMD64
      44    84            ARM A64

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
    advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

       libc     ksh    pax     ed
    1102054  124726  66218  26226  riscv-riscv32
    1077192  127050  62748  26550  riscv-riscv64


    I guess it can be noted, is the overhead of any ELF metadata being excluded?...

    These are sizes of the .text section extracted with objdump -h. So
    no, these numbers do not include ELF metadata, nor the sizes of other sections. The latter may be relevant, because RV64GC has "immediates"
    in .sdata that other architectures have in .text; however, .sdata can
    contain other things than just "immediates", so one cannot just add the .sdata size to the .text size.

    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.

    The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
    from auto-vectorization).

    There is one other improvement: gcc register allocation has improved
    in recent years to a point where we 1) no longer need explicit
    register allocation for Gforth on AMD64, and 2) with a lot of manual
    help, we could increase the number of stack cache registers from 1 to
    3 on AMD64, which gives some speedups typically in the 0%-20% range in Gforth.

    But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
    which is vectorizable, I still have not been able to get gcc to auto-vectorize it, even with some transformations which should help.
    I have not measured the scalar versions again, but given that there
    were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
    I doubt that I will see consistent speedups with newer gcc (or clang) versions.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Wed Aug 27 00:56:58 2025
    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die area inside the pins.

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but you run out of area when
    you put microcode on the same die. Thus, RISC was
    born. The Mc88100 had a decoder and sequencer that was 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD}, all pipelined.

    Also, PDP-11 compatibility depended on microcode.

    Different address modes mainly.

    Without a microcode engine one would need a parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing the microcode
    engine.

    To summarize, it is not clear to me if RISC in VAX technology
    could be significantly faster than the VAX, especially given the
    constraint of PDP-11 compatibility.

    RISC in MSI TTL logic would not have worked all that well.

    OTOH VAX designers probably felt
    that the CISC nature added significant value: they understood
    that the cost of programming was significant and believed that an
    orthogonal instruction set, in particular allowing complex
    addressing on all operands, made programming simpler.

    Some of us RISC designers believe similarly {about orthogonal
    ISA not about address modes.}

    They
    probably thought that providing reasonably common procedures
    as microcoded instructions made the work of programmers simpler,
    even if the routines were only marginally faster than ordinary
    code.

    We think similarly--but we do not accept µCode being slower
    than SW ISA, or especially compiled HLL.

    Part of this thinking was probably like the "future
    system" motivation at IBM: Digital did not want to produce
    "commodity" systems, they wanted something with unique
    features that customers will want to use.

    s/used/get locked in on/

    Without
    insight into the future it is hard to say that they were
    wrong.

    It is hard to argue that they made ANY mistakes with
    what we know about the world of computers circa 1977.

    It is not hard in 2025.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Wed Aug 27 10:56:31 2025
    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.
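
    The arithmetic behind the two figures is just clock rate divided by CPI; a
    minimal check, using only the 200 ns cycle time and the CPI values above:

    /* 200 ns per cycle is 5 MHz, so the quoted "1 MIP" assumes 5 cycles per
       instruction, while Emer and Clark's 10.5 CPI gives roughly 0.5 MIPS. */
    #include <stdio.h>

    int main(void)
    {
      double cycle_ns  = 200.0;                 /* VAX-11/780 cycle time */
      double clock_mhz = 1000.0 / cycle_ns;     /* = 5 MHz               */
      printf("at  5.0 CPI: %.2f MIPS\n", clock_mhz / 5.0);   /* 1.00 */
      printf("at 10.5 CPI: %.2f MIPS\n", clock_mhz / 10.5);  /* 0.48 */
      return 0;
    }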

    Going through the VAX 780 hardware schematics and various performance
    papers, as near as I can tell it took *at least* 1 clock per instruction byte
    for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte
    immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline stalls.
    The microsequencer has some pipelining, overlapping the read of the next uWord
    with the execute of the current one, which would introduce a branch delay slot
    into the microcode. As it uses the opcode and operand bytes to do N-way
    jumps/calls to uSubroutines, each of those dispatches might have a branch
    delay slot too.

    (Similar issues appear in the MV-8000 uSequencer, except it appears to
    have 2 or maybe 3 microcode branch delay slots.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to MitchAlsup on Thu Aug 28 07:49:31 2025
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If the VAX designers could not afford a pipeline, then it is
    not clear that RISC could have afforded one: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness the Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die area inside the pins.

    Note that most of this is microcode ROM. They complicated the
    logic to get a smaller ROM. For the VAX it was quite different:
    the microcode memory (and the cache) were built from LSI chips;
    LSI was not suitable for logic at that time. Assuming 6-transistor
    static RAM cells, the VAX had 590000 transistors in its microcode memory
    chips (and another 590000 transistors in its cache chips).
    By comparison, one can estimate the VAX logic chips at between 20000
    and 100000 transistors, with the low end looking more likely
    to me. IIUC at least the early VAXes on a "single" chip were slowed
    down by having to go to off-chip microcode memory.
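
    As a quick sanity check on that estimate, the sketch below just divides the
    quoted transistor count back into bits; how that storage was organized into
    microwords is not stated here.

    /* Back-of-envelope check of the 6T-SRAM figure above. */
    #include <stdio.h>

    int main(void)
    {
      const long transistors = 590000;   /* figure quoted above           */
      const int  per_cell    = 6;        /* 6-transistor static RAM cell  */
      long bits = transistors / per_cell;
      printf("%ld transistors / %d per cell = %ld bits (about %ld Kbit)\n",
             transistors, per_cell, bits, bits / 1024);
      return 0;
    }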

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but one runs out of area when
    the microcode is put on the same die. Thus, RISC was
    born. The Mc88100 had a decoder and sequencer that took 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD}, all pipelined.

    Yes, but IIUC the big item was on-chip microcode memory (or the pins
    needed to go to external microcode memory).

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Thu Aug 28 13:39:54 2025
    EricP wrote:
    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.

    Going through the VAX 780 hardware schematics and various performance
    papers, as near as I can tell it took *at least* 1 clock per instruction byte
    for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte
    immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline stalls.
    The microsequencer has some pipelining, overlapping the read of the next uWord
    with the execute of the current one, which would introduce a branch delay slot
    into the microcode. As it uses the opcode and operand bytes to do N-way
    jumps/calls to uSubroutines, each of those dispatches might have a branch
    delay slot too.

    (Similar issues appear in the MV-8000 uSequencer, except it appears to
    have 2 or maybe 3 microcode branch delay slots.)

    I found a description of the 780 instruction buffer parser
    in the Data Path description on bitsavers, and
    it does in fact pull one operand specifier from the IB per clock.
    There is a mux network to handle the various immediate formats in parallel.

    There are conflicting descriptions as to exactly how it handles the
    first operand, whether that is pulled with the opcode or in a separate
    clock, as the IB shifter can only do 1- to 5-byte shifts but an opcode with
    a first operand carrying a 32-bit displacement would be 6 bytes.

    But basically it takes 1 clock for the opcode byte and the first operand
    specifier byte, a second clock if the first opspec has an immediate,
    then 1 clock for each subsequent operand specifier.
    If an operand has an immediate, it is extracted in parallel with its opspec.

    If that is correct, a MOV rs,rd or ADD rs,rd would take 2 clocks to decode,
    and a MOV offset(rs),rd would take 3 clocks to decode.
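
    A minimal sketch of that decode-cost rule, assuming the above reading of the
    780 data-path description is right (the function name and argument encoding
    are only for illustration):

    #include <stdbool.h>
    #include <stdio.h>

    /* has_imm[i] marks operand specifiers that carry an immediate/displacement. */
    int decode_clocks(int n_opspecs, const bool has_imm[])
    {
      if (n_opspecs == 0)
        return 1;                      /* opcode byte alone                      */
      int clocks = 1;                  /* opcode + first operand specifier byte  */
      if (has_imm[0])
        clocks++;                      /* extra clock for the first opspec's imm */
      clocks += n_opspecs - 1;         /* one clock per remaining opspec;
                                          immediates extracted in parallel       */
      return clocks;
    }

    int main(void)
    {
      bool mov_rr[] = { false, false };   /* MOV rs,rd         */
      bool mov_mr[] = { true,  false };   /* MOV offset(rs),rd */
      printf("MOV rs,rd        : %d clocks\n", decode_clocks(2, mov_rr)); /* 2 */
      printf("MOV offset(rs),rd: %d clocks\n", decode_clocks(2, mov_mr)); /* 3 */
      return 0;
    }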

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Sun Aug 31 18:04:44 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
      long i, r;
      for (i=0, r=0; i<n; i++)
        r += v[i];
      return r;
    }

    arrays:
        MOV     R3,#0
        MOV     R4,#0
        VEC     R5,{}
        LDD     R6,[R1,R3<<3]
        ADD     R4,R4,R6
        LOOP    LT,R3,#1,R2
        MOV     R1,R4
        RET


    long a, b, c, d;

    void globals(void)
    {
      a = 0x1234567890abcdefL;
      b = 0xcdef1234567890abL;
      c = 0x567890abcdef1234L;
      d = 0x5678901234abcdefL;
    }

    globals:
        STD     #0x1234567890abcdef,[ip,a-.]
        STD     #0xcdef1234567890ab,[ip,b-.]
        STD     #0x567890abcdef1234,[ip,c-.]
        STD     #0x5678901234abcdef,[ip,d-.]
        RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
             Bytes                             Instructions
    arrays   globals        Architecture      arrays   globals
      28     66 (34+32)     RV64GC              12        9
      27     69             AMD64               11        9
      44     84             ARM A64             11       22
      32     68             My 66000             8        5

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test.

    Size is one thing; sooner or later one has to execute the instructions,
    and here My 66000 needs to execute fewer, while being within spitting
    distance in code size.

    Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions,

    3 for My 66000

    so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.
    * My 66000 uses ST immediate for globals

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)