I can't say much for or against VAX, as I don't currently have any
compilers that target it.
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but neither the VAX designers nor their competition realized this, and commercial RISCs only appeared in 1986.
That is certainly true but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex
instructions and address modes and the tiny 512 byte page size.
Concerning code density, while VAX code is compact, RISC-V code with the
C extension is more compact
<2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.
Another aspect from those measurements is that the 68k instruction set
(with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.
Another, which is not entirely their fault, is that they did not expect
compilers to improve as fast as they did, leading to a machine which was fun to
program in assembler but full of stuff that was useless to compilers and
instructions like POLY that should have been subroutines. The 801 project and
PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
presumably didn't know about it.
DEC probably was aware from the work of William Wulf and his students
what optimizing compilers can do and how to write them. After all,
they used his language BLISS and its compiler themselves.
POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
stone for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
Related to the microcode issue they also don't seem to have anticipated how
important pipelining would be. Some minor changes to the VAX, like not letting
one address modify another in the same instruction, would have made it a lot
easier to pipeline.
My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
given us.
Another issue would be how to implement the PDP-11 emulation mode.
I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
prefer an ARM/SPARC/HPPA-like handling of conditions.
POLY would have made sense in a world where microcode makes sense: If
microcode can be executed faster than subroutines, put a building
stone for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
IIUC the original idea was that POLY should be more accurate than a
sequence of separate instructions and reproducible between models.
I must admit that I do not understand why the VAX needed so many
cycles per instruction. Namely, a register argument can be
recognized by looking at the 4 high bits of the operand byte.
To summarize, a VAX with a pipeline and a modest number of operand
decoders should be able to execute "normal" instructions
at RISC speed (in a RISC each memory operand would require a
load or store, so an extra cycle, as in the scheme above).
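To make the register-operand test concrete, here is a minimal C sketch of it (an illustration only, based on the operand-specifier format in the VAX architecture handbook: addressing mode in the high four bits, register number in the low four, mode 5 meaning register):

#include <stdbool.h>
#include <stdint.h>

/* VAX operand specifier byte: bits 7:4 = addressing mode,
   bits 3:0 = register number.  Mode 5 is register mode (Rn). */
static bool is_register_operand(uint8_t spec)
{
    return (spec >> 4) == 0x5;
}

static unsigned operand_register(uint8_t spec)
{
    return spec & 0x0F;
}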
According to Waldek Hebisch <antispam@fricas.org>:
POLY would have made sense in a world where microcode makes sense: If
microcode can be executed faster than subroutines, put a building
stone for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
IIUC the original idea was that POLY should be more accurate than a
sequence of separate instructions and reproducible between models.
That was the plan but the people building Vaxen didn't get the memo
so even on the original 780, it got different answers with and without
the optional floating point accelerator.
If they wanted more accurate results, they should have. See
https://simh.trailing-edge.com/docs/vax_poly.pdf
I must admit that I do not understand why the VAX needed so many
cycles per instruction. Namely, a register argument can be
recognized by looking at the 4 high bits of the operand byte.
It can, but autoincrement or decrement modes change the contents
of the register so the operands have to be evaluated in strict
order or you need a lot of logic to check for hazards and stall.
In practice I don't think it was very common to do that, except
for the immediate and absolute address modes which were (PC)+
and @(PC)+, and which needed to be special cased since they took
data from the instruction stream. The size of the immediate
operand could be from 1 to 8 bytes depending on both the instruction
and which operand of the instruction it was.
Looking at the MACRO-32 source for a FOCAL interpreter, I
see
CVTLF 12(SP),@(SP)+
MOVL (SP)+, R0
CMPL (AP)+,#1
MOVL (AP)+,R7
TSTL (SP)+
MOVZBL (R8)+,R5
BICB3 #240,(R8)+,R2
LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
LOCC (R8)+,S^#OPN,OPRATRS
MOVL (SP)+,(R7)[R6]
CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
CASE (SP)+,<30$,20$,10$>,-
LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
MOVF (SP)+,@(SP)+ ;JUST DO SET
(SP)+ was far and away the most common. (PC)+ wasn't
used in that application.
There were some adjacent dependencies:
ADDB3 #48,R0,(R9)+ ;PUT AS DIGIT INTO BUFFER
ADDB3 #48,R1,(R9)+ ;AND NEXT
and a handful of others. Probably only a single-digit
percentage of instructions used autoincrement/decrement and only
a couple used the updated register in the same
instruction.
According to Scott Lurndal <slp53@pacbell.net>:
Looking at the MACRO-32 source for a FOCAL interpreter, I
see
CVTLF 12(SP),@(SP)+
MOVL (SP)+, R0
CMPL (AP)+,#1
MOVL (AP)+,R7
TSTL (SP)+
MOVZBL (R8)+,R5
BICB3 #240,(R8)+,R2
LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
LOCC (R8)+,S^#OPN,OPRATRS
MOVL (SP)+,(R7)[R6]
CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
CASE (SP)+,<30$,20$,10$>,-
LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
MOVF (SP)+,@(SP)+ ;JUST DO SET
(SP)+ was far and away the most common. (PC)+ wasn't
used in that application.
Wow, that's some funky code.
According to Waldek Hebisch <antispam@fricas.org>:
I must admit that I do not understand why the VAX needed so many
cycles per instruction. Namely, a register argument can be
recognized by looking at the 4 high bits of the operand byte.
It can, but autoincrement or decrement modes change the contents
of the register so the operands have to be evaluated in strict
order or you need a lot of logic to check for hazards and stall.
In practice I don't think it was very common to do that, except
for the immediate and absolute address modes which were (PC)+
and @(PC)+, and which needed to be special cased since they took
data from the instruction stream. The size of the immediate
operand could be from 1 to 8 bytes depending on both the instruction
and which operand of the instruction it was.
To summarize, a VAX with a pipeline and a modest number of operand
decoders should be able to execute "normal" instructions
at RISC speed (in a RISC each memory operand would require a
load or store, so an extra cycle, as in the scheme above).
Right, but detecting the abnormal cases wasn't trivial.
On 7/30/2025 12:59 AM, Anton Ertl wrote:...
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but
neither the VAX designers nor their competition realized this, and
commercial RISCs only appeared in 1986.
That is certainly true but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex
instructions and address modes and the tiny 512 byte page size.
Concerning code density, while VAX code is compact, RISC-V code with the
C extension is more compact
<2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
scenario that would not be a reason for going for the VAX ISA.
But, if so, it would speak more to the weakness of VAX code density
than to the goodness of RISC-V.
There is, however, a fairly notable size difference between RV32 and
RV64 here, but I had usually been messing with RV64.
If I were to put it on a ranking (for ISAs I have messed with), it would
be, roughly (smallest first):
i386 with VS2008 or GCC 3.x (*1)
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
POLY would have made sense in a world where microcode makes sense: If
microcode can be executed faster than subroutines, put a building
stone for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
IIUC the original idea was that POLY should be more accurate than a
sequence of separate instructions and reproducible between models.
Another issue would be how to implement the PDP-11 emulation mode.
I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
that would decode PDP-11 code into RISC-VAX instructions, or into what
RISC-VAX instructions are decoded into. The cost of that is probably
similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a
MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
would have to support both the PDP-11 and the RISC-VAX handling of
conditions; probably not that expensive, but maybe one still would
prefer an ARM/SPARC/HPPA-like handling of conditions.
I looked into the VAX architecture handbook from 1977. The handbook claims
that the VAX-780 used 96-bit microcode words. That is enough bits to
control a pipelined machine at 1 instruction per cycle, provided
enough execution resources (register ports, buses and 1-cycle
execution units). However, the VAX hardware allowed only one memory
access per cycle, so instructions with multiple memory addresses
or using indirection through memory by necessity needed multiple
cycles.
I must admit that I do not understand why VAX needed so many
cycles per instruction.
For a 1-byte opcode with all
register arguments the operand specifiers are in predictable places,
so a modest number of gates could recognize register-only
operand specifiers.
To summarize, a VAX with a pipeline and a modest number of operand
decoders should be able to execute "normal" instructions
at RISC speed (in a RISC each memory operand would require a
load or store, so an extra cycle, as in the scheme above).
Given the actual speed of the VAX, the possibilities seem to be:
- extra factors slowing both the VAX and a RISC, like cache
misses (the VAX architecture handbook says that due to
misses the cache had an effective access time of 290 ns),
- the VAX designers could not afford a pipeline,
- maybe the VAX designers decided to avoid a pipeline to reduce
complexity.
If the VAX designers could not afford a pipeline, then it is
not clear if a RISC could afford it: removing the microcode
engine would reduce complexity and cost and give some
free space. But microcode engines tend to be simple.
Also, PDP-11 compatibility depended on microcode.
Without a microcode engine one would need a parallel set
of hardware instruction decoders, which could add
more complexity than was saved by removing the microcode
engine.
To summarize, it is not clear to me if a RISC in VAX technology
could be significantly faster than the VAX. Without
insight into the future it is hard to say that they were
wrong.
I can understand why DEC abandoned VAX: already in 1985 they
had some disadvantage and they saw no way to compete against
superscalar machines which were on the horizon. In 1985 they
probably realized that their features added no value in a world
using optimizing compilers.
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
advantage of RV32GC over RV64GC there:
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
I guess it can be noted: is the overhead of any ELF metadata being excluded?...
Granted, newer compilers do support newer versions of the C standard,
and also typically get better performance.
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroized a few lines above.
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX.
Some modern processors are already capable of eliminating this sort of
dependency in the renamer. Probably not yet when it is coded as 'inc', but
when coded as 'add' or 'lea'.
The dependency through RDX/RBX does not form a chain. The next value
of [rdi+rcx*8] does depend on the value of rbx from the previous iteration, but
the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8] and
[r9+rcx*8]. It does not depend on the previous value of rbx, except for
a control dependency that hopefully would be speculated around.
antispam@fricas.org (Waldek Hebisch) writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
I can understand why DEC abandoned VAX: already in 1985 they
had some disadvantage and they saw no way to compete against
superscalar machines which were on the horizon. In 1985 they
probably realized that their features added no value in a world
using optimizing compilers.
Optimizing compilers increase the advantages of RISCs, but even with a
simple compiler Berkeley RISC II (which was made by hardware people,
not compiler people) has between 85% and 256% of VAX (11/780) speed.
It also has 16-bit and 32-bit instructions for improved code density
and (apparently from memory bandwidth issues) performance.
The basic question is if the VAX could afford the pipeline. The VAX had
a rather complex memory and bus interface, and the cache added complexity
too. Ditching microcode could allow more resources for the execution
path. Clearly the VAX could afford, and probably had, a 1-cycle 32-bit
ALU. I doubt that they could afford a 1-cycle multiply or
even a barrel shifter. So they needed a sequencer for sane
assembly programming. I am not sure what technology they used
for the register file. To me the most likely answer is fast RAM, but that
normally would give 1 R/W port. A multiported register file
probably would need a lot of separate register chips and a
multiplexer. Alternatively, they could try some very fast
RAM and run it at a multiple of the base clock frequency (66 ns
cycle time caches were available at that time, so 3 ports
via multiplexing seem possible). But any of this adds
considerable complexity. A sane pipeline needs interlocks
and forwarding.
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroized a few lines above.
OK, nice.
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX.
Some modern processors are already capable of eliminating this sort of
dependency in the renamer. Probably not yet when it is coded as 'inc',
but when coded as 'add' or 'lea'.
The dependency through RDX/RBX does not form a chain. The next value
of [rdi+rcx*8] does depend on the value of rbx from the previous iteration,
but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
and [r9+rcx*8]. It does not depend on the previous value of rbx,
except for a control dependency that hopefully would be speculated
around.
I believe we are doing a bigint three-way add, so each result word
depends on the three corresponding input words, plus any carries from
the previous round.
This is the carry chain that I don't see any obvious way to break...
Terje
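For reference, a minimal C sketch of such a three-way bigint add (an illustration only, not anyone's actual code); the serial dependency is the carry variable handed from one word to the next:

#include <stddef.h>
#include <stdint.h>

/* dst = a + b + c, n 64-bit words each, least significant word first.
   The per-word carry can be 0, 1 or 2. */
void add3(uint64_t *dst, const uint64_t *a, const uint64_t *b,
          const uint64_t *c, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        unsigned cy = s < a[i];          /* carry out of the first add */
        uint64_t t = s + c[i];
        cy += t < s;                     /* carry out of the second add */
        uint64_t r = t + carry;
        cy += r < t;                     /* carry out of adding the carry-in */
        dst[i] = r;
        carry = cy;                      /* the loop-carried dependency */
    }
}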
On Mon, 4 Aug 2025 15:25:54 -0400
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
On 2025-08-04 15:03, Michael S wrote:
On Mon, 04 Aug 2025 09:53:51 -0700...
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt
to use it has undefined behavior. That's exactly why new keywords
are often defined with that ugly syntax.
That is a language lawyer's type of reasoning. Normally gcc
maintainers are wiser than that because, well, by chance gcc
happens to be a widely used production compiler. I don't know why
this time they chose a less conservative road.
If _BitInt is accepted by older versions of gcc, that means it was
supported as a fully-conforming extension to C. Allowing
implementations to support extensions in a fully-conforming manner is
one of the main purposes for which the standard reserves identifiers.
If you thought that gcc was too conservative to support extensions,
you must be thinking of the wrong organization.
I know that gcc supports extensions.
I also know that gcc didn't support *this particular extension* up
until quite recently.
I would guess, up until this calendar year.
Introducing a new extension without a way to disable it is different from
supporting gradually introduced extensions, typically with names that
start with a double underscore and often start with __builtin.
Breaking existing code that uses "_BitInt" as an identifier is
a non-issue. There very probably is no such code.
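For readers who have not met it yet, a minimal sketch of the C23 feature under discussion (assuming a compiler recent enough to accept _BitInt, e.g. a current gcc or clang):

#include <stdio.h>

int main(void)
{
    unsigned _BitInt(24) u = 0xFFFFFF;   /* exactly 24 value bits */
    _BitInt(12) s = -1000;               /* 12-bit signed integer */
    u += 1;                              /* wraps modulo 2^24, so u becomes 0 */
    printf("%u %d\n", (unsigned)u, (int)s);
    return 0;
}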
Waldek Hebisch <antispam@fricas.org> schrieb:
I am not sure what technology they used
for the register file. To me the most likely answer is fast RAM, but that
normally would give 1 R/W port.
They used fast SRAM and had three copies of their registers,
for 2R1W.
Kaz Kylheku <643-408-1753@kylheku.com> writes:
On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
Breaking existing code that uses "_BitInt" as an identifier is
a non-issue. There very probably is no such code.
However, that doesn't mean GCC can carelessly introduce identifiers
in this namespace.
Agreed -- and gcc did not do that in this case. I was referring to
_BitInt, not to other identifiers in the reserved namespace.
Do you have any reason to believe that gcc's use of _BitInt will break
any existing code?
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" doesn't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
Kaz Kylheku <643-408-1753@kylheku.com> writes:
On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com>
wrote:
Breaking existing code that uses "_BitInt" as an identifier is
a non-issue. There very probably is no such code.
However, that doesn't mean GCC can carelessly introduce identifiers
in this namespace.
Agreed -- and gcc did not do that in this case. I was referring
to _BitInt, not to other identifiers in the reserved namespace.
Do you have any reason to believe that gcc's use of _BitInt will
break any existing code?
It has landed, and we don't hear reports that the sky is falling.
If it does break someone's obscure project with few users, unless that
person makes a lot of noise in some forums I read, I will never know.
My position has always been to think about the threat of real,
or at least probable clashes.
I can turn it around: I have not heard of any compiler or library
using _CreamPuff as an identifier, or of a compiler which misbehaves
when a program uses it, on grounds of it being undefined behavior.
Someone using _CreamPuff in their code is taking a risk that is
vanishingly small, the same way that introducing _BitInt is a risk
that is vanishingly small.
In fact, in some sense the risk is smaller because the audience of
programs facing an implementation (or language) that has introduced
some identifier is vastly larger than the audience of implementations
that a given program will face that has introduced some funny
identifier.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" doesn't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever built one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
On Tue, 5 Aug 2025 22:17:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroized a few lines above.
OK, nice.
BTW, it seems that in your code fragment above you forgot to zeroize EDX
at the beginning of the iteration. Or am I missing something?
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX.
Some modern processors are already capable of eliminating this sort of
dependency in the renamer. Probably not yet when it is coded as 'inc',
but when coded as 'add' or 'lea'.
The dependency through RDX/RBX does not form a chain. The next value
of [rdi+rcx*8] does depend on the value of rbx from the previous iteration,
but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
and [r9+rcx*8]. It does not depend on the previous value of rbx,
except for a control dependency that hopefully would be speculated
around.
I believe we are doing a bigint three-way add, so each result word
depends on the three corresponding input words, plus any carries from
the previous round.
This is the carry chain that I don't see any obvious way to break...
You break the chain by *predicting* that
carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay the
heavy price of a branch misprediction. But outside of specially crafted
inputs it is extremely rare.
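In C the idea might look roughly like this (a sketch under the same three-way-add assumption as the earlier example, not anyone's actual code): the carry-out kept for the next iteration is computed from a[i], b[i], c[i] only, and the rare case where the incoming carry itself ripples out is handled by a branch the predictor almost always gets right:

#include <stddef.h>
#include <stdint.h>

void add3_predicted(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                    const uint64_t *c, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        unsigned cy = s < a[i];
        uint64_t t = s + c[i];
        cy += t < s;              /* carry-out "predicted" from a, b, c alone */
        uint64_t r = t + carry;
        dst[i] = r;
        if (r < t)                /* rare: the carry-in rippled all the way out */
            cy += 1;              /* a misprediction pays its price here */
        carry = cy;
    }
}

Whether a compiler keeps that test as a branch rather than turning it into a conditional move is, of course, up to the compiler.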
On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
I am not sure what technology they used
for the register file. To me the most likely answer is fast RAM, but that
normally would give 1 R/W port.
They used fast SRAM and had three copies of their registers,
for 2R1W.
I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
a software person, so please forgive this stupid question.
Why three copies?
Also did you mean 3 total? Or 3 additional copies (4 total)?
Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
don't see a need ever to sync them - you just keep track of which was
updated most recently, read that one and - if applicable - write the
other and toggle.
Since (at least) the early models evaluated operands sequentially,
there doesn't seem to be a need for more. Later models had some
semblance of pipeline, but it seems that if the /same/ value was
needed multiple times, it could be routed internally to all users
without requiring additional reads of the source.
Or do I completely misunderstand? [Definitely possible.]
On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:...
On Mon, 4 Aug 2025 15:25:54 -0400
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
If _BitInt is accepted by older versions of gcc, that means it was
supported as a fully-conforming extension to C. Allowing
implementations to support extensions in a fully-conforming manner is
one of the main purposes for which the standard reserves identifiers.
If you thought that gcc was too conservative to support extensions,
you must be thinking of the wrong organization.
I know that gcc supports extensions.
I also know that gcc didn't support *this particular extension* up
until quite recently.
I think what James means is that GCC supports, as an extension,
the use of any _[A-Z].* identifier whatsoever that it has not claimed
for its purposes.
On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
Breaking existing code that uses "_BitInt" as an identifier is
a non-issue. There very probably is no such code.
However, that doesn't mean GCC can carelessly introduce identifiers
in this namespace.
GCC does not define a complete C implementation; it doesn't provide a library. Libraries are provided by other projects: Glibc, Musl,
uClibc, ...
Those libraries are C implementors also, and get to name things
in the reserved namespace.
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" doesn't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever built one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of thought that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" doesn't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever built one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of thought that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
There are a few intermediate steps.
The 801 demonstrated that a RISC, including caches and pipelining,
would have been feasible at the time. It also demonstrated that
somebody had thought of graph coloring algorithms.
There can also be no doubt that a RISC-type machine would have
exhibited the same performance advantages (at least in integer
performance) as a RISC vs CISC 10 years later. The 801 did so
vs. the /370, as did the RISC processors vs, for example, the
680x0 family of processors (just compare ARM vs. 68000).
Or look at the performance of the TTL implementation of HP-PA,
which used PALs which were not available to the VAX 11/780
designers, so it could be clocked a bit higher, but at
a multiple of the performance of the VAX.
So, Anton visiting DEC or me visiting Data General could have
brought them a technology which would have significantly outperformed
the VAX (especially if we brought along the algorithm for graph
coloring). Some people at IBM would have been peeved at having
somebody else "develop" this at the same time, but OK.
Thomas Koenig wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" doesn't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask-programmable PLAs were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
Thomas Koenig wrote:
Or look at the performance of the TTL implementation of HP-PA,
which used PALs which were not available to the VAX 11/780
designers, so it could be clocked a bit higher, but at
a multiple of the performance of the VAX.
So, Anton visiting DEC or me visiting Data General could have
brought them a technology which would have significantly outperformed
the VAX (especially if we brought along the algorithm for graph
coloring). Some people at IBM would have been peeved at having
somebody else "develop" this at the same time, but OK.
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask-programmable PLAs were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
Thomas Koenig wrote:
Or look at the performance of the TTL implementation of HP-PA, which
used PALs which were not available to the VAX 11/780 designers, so it
could be clocked a bit higher, but at a multiple of the performance
of the VAX.
So, Anton visiting DEC or me visiting Data General could have brought
them a technology which would have significantly outperformed the VAX
(especially if we brought along the algorithm for graph coloring). Some
people at IBM would have been peeved at having somebody else "develop"
this at the same time, but OK.
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask-programmable PLAs were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" doesn't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever built one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of thought that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
There are a few intermediate steps.
The 801 demonstrated that a RISC, including caches and pipelining,
would have been feasible at the time. It also demonstrated that
somebody had thought of graph coloring algorithms.
On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
The DG MV/8000 used PALs but The Soul of a New Machine hints that there
were supply problems with them at the time.
["Followup-To:" header set to comp.arch.]...
On 2025-08-06, John Levine <johnl@taugh.com> wrote:
AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
Cray-1 and successors implemented, as far as I can determine
type bits
char 8
short int 64
int 64
long int 64
pointer 64
Not having a 16-bit integer type and not having a 32-bit integer type
would make it very hard to adapt portable code, such as TCP/IP protocol
processing.
My concern is how do you express your desire for having e.g. an int16?
All the portable code I know defines int8, int16, int32 by means of a
typedef that adds an appropriate alias for each of these back to a
native type. If "short" is 64 bits, how do you define a 16-bit type?
Or did the compiler have native types __int16 etc.?
EricP wrote:
Thomas Koenig wrote:
Or look at the performance of the TTL implementation of HP-PA,
which used PALs which were not available to the VAX 11/780
designers, so it could be clocked a bit higher, but at
a multiple of the performance of the VAX.
So, Anton visiting DEC or me visiting Data General could have
brought them a technology which would have significantly outperformed
the VAX (especially if we brought along the algorithm for graph
coloring). Some people at IBM would have been peeved at having
somebody else "develop" this at the same time, but OK.
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask-programmable PLAs were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
The question isn't could one build a modern risc-style pipelined cpu
from TTL in 1975 - of course one could. Nor do I see any question of
could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.
The question is could one build this at a commercially competitive price?
EricP <ThatWouldBeTelling@thevillage.com> writes:
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask-programmable PLAs were available from TI
circa 1970 but masks would be too expensive.
Burroughs mainframers started designing with ECL gate arrays circa
1981, and they shipped in 1987[*]. I suspect even FPLAs or other PLAs
would have been far too expensive to use to build a RISC CPU,
especially for one of the BUNCH, for whom backward compatibility was
paramount.
EricP wrote:
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask-programmable PLAs were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
The question isn't could one build a modern risc-style pipelined cpu
from TTL in 1975 - of course one could. Nor do I see any question of
could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.
I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline
running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
(using some in-order superscalar ideas and two reg file write ports
to "catch up" after pipeline bubbles).
TTL risc would also be much cheaper to design and prototype.
VAX took hundreds of people many many years.
The question is could one build this at a commercially competitive price?
There is a reason people did things sequentially in microcode.
All those control decisions that used to be stored as bits in microcode now
become real logic gates. And in SSI TTL you don't get many to the $.
And many of those sequential microcode states become independent concurrent
state machines, each with its own logic sequencer.
Lars Poulsen <lars@cleo.beagle-ears.com> writes:
["Followup-To:" header set to comp.arch.]
On 2025-08-06, John Levine <johnl@taugh.com> wrote:
...
My concern is how do you express your desire for having e.g. an int16?
All the portable code I know defines int8, int16, int32 by means of a
typedef that adds an appropriate alias for each of these back to a
native type. If "short" is 64 bits, how do you define a 16-bit type?
Or did the compiler have native types __int16 etc.?
I doubt it. If you want to implement TCP/IP protocol processing on a
Cray-1 or its successors, better use shifts for picking apart or
assembling the headers. One might also think about using C's bit
fields, but, at least if you want the result to be portable, AFAIK bit
fields are too laxly defined to be usable for that.
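A minimal sketch of the shift-and-mask approach in C (an illustration only; the big-endian packing of four 16-bit halfwords per 64-bit word is just an assumption, and on the Cray C being discussed unsigned long would stand in for uint64_t):

#include <stddef.h>
#include <stdint.h>

/* Extract the index-th 16-bit field from an array of 64-bit words,
   using only shifts and masks - no 16-bit type needed. */
static unsigned get16(const uint64_t *words, size_t index)
{
    uint64_t w = words[index / 4];
    unsigned shift = (unsigned)(3 - index % 4) * 16;  /* big-endian halfword order */
    return (unsigned)((w >> shift) & 0xFFFF);
}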
scott@slp53.sl.home (Scott Lurndal) writes:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask-programmable PLAs were available from TI
circa 1970 but masks would be too expensive.
Burroughs mainframers started designing with ECL gate arrays circa
1981, and they shipped in 1987[*]. I suspect even FPLAs or other PLAs
would have been far too expensive to use to build a RISC CPU,
The Signetics 82S100 was used in early Commodore 64s, so it could not
have been expensive (at least in 1982, when these early C64s were
built). PLAs were also used by HP when building the first HPPA CPU.
especially for one of the BUNCH, for whom backward compatibility was
paramount.
Why should the cost of building a RISC CPU depend on whether you are
in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
and Honeywell)? And how is the cost of building a RISC CPU related to
backwards compatibility?
On Mon, 04 Aug 2025 09:53:51 -0700
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
use it has undefined behavior. That's exactly why new keywords are
often defined with that ugly syntax.
That is a language lawyer's type of reasoning. Normally gcc maintainers
are wiser than that because, well, by chance gcc happens to be a widely
used production compiler. I don't know why this time they chose
a less conservative road.
Robert Swindells <rjs@fdy2.co.uk> writes:
On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
The DG MV/8000 used PALs but The Soul of a New Machine hints that there
were supply problems with them at the time.
The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
very recent when the MV/8000 was designed), addressed shortcomings of
the PLA Signetics 82S100 that had been available since 1975, and the
PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.
Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
reports propagation delays of 25ns-35ns for specific signals in Table
3.4, and EricP found 50ns "max access" in the data sheet of the
82S100. That does not sound too slow to be usable in a CPU with 200ns
cycle time, so yes, one could have used that for the VAX.
- anton
George Neuner <gneuner2@comcast.net> writes:
The decoder converts x86 instructions into traces of equivalent wide
micro instructions which are directly executable by the core. The
traces then are cached separately [there is a $I0 "microcache" below
$I1] and can be re-executed (e.g., for loops) as long as they remain
in the microcache.
No such cache in the P6 or any of its descendants until the Sandy
Bridge (2011). The Pentium 4 has a microop cache, but it was eventually
(with Core Duo, Core 2 Duo) replaced with P6 descendants that have
no microop cache.
might be seen as a tiny microop cache. Microop caches and loop
buffers still have to contain information about which microops belong
to the same CISC instruction, because otherwise the reorder buffer
could not commit/execute* CISC instructions.
* OoO microarchitecture terminology calls what the reorder buffer does
"retire" or "commit". But this is where the speculative execution
becomes architecturally visible ("commit"), so from an architectural
view it is execution.
Followups set to comp.arch
- anton
George Neuner wrote:
On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
I am not sure what technology they used
for the register file. To me the most likely answer is fast RAM, but that
normally would give 1 R/W port.
They used fast SRAM and had three copies of their registers,
for 2R1W.
I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
a software person, so please forgive this stupid question.
Why three copies?
Also did you mean 3 total? Or 3 additional copies (4 total)?
Given 1 R/W port each I can see needing a pair to handle cases where
destination is also a source (including autoincrement modes). But I
don't see a need ever to sync them - you just keep track of which was
updated most recently, read that one and - if applicable - write the
other and toggle.
Since (at least) the early models evaluated operands sequentially,
there doesn't seem to be a need for more. Later models had some
semblance of pipeline, but it seems that if the /same/ value was
needed multiple times, it could be routed internally to all users
without requiring additional reads of the source.
Or do I completely misunderstand? [Definitely possible.]
To make a 2R 1W port reg file from single-port SRAM you use two banks
which can be addressed separately during the read phase at the start of
the clock cycle, and at the end of the cycle you write both banks
at the same time on the same port number.
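A toy C model of that banking trick (a sketch, not DEC's implementation): the two single-port banks provide two independent read ports because they always hold identical contents, and the single write is simply mirrored into both:

#include <stdint.h>

#define NREGS 16

static uint32_t bank_a[NREGS], bank_b[NREGS];

/* read phase: the two banks serve two independent read addresses */
static void regfile_read(unsigned ra, unsigned rb,
                         uint32_t *va, uint32_t *vb)
{
    *va = bank_a[ra];
    *vb = bank_b[rb];
}

/* write phase: one write port, mirrored into both banks
   so that they stay identical */
static void regfile_write(unsigned rw, uint32_t value)
{
    bank_a[rw] = value;
    bank_b[rw] = value;
}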
The 780 wiring parts list shows Nat Semi 85S68 which are
16*4b 1RW port, 40 ns access SRAMS, tri-state output,
with latched read output to eliminate data race through on write.
So they have two 16 * 32b banks for the 16 general registers.
The third 16 * 32b bank was likely for microcode temp variables.
The thing is, yes, they only needed 1R port for instruction operands
because sequential decode could only produce one operand at a time.
Even on later machines circa 1990 like 8700/8800 or NVAX the general
register file is only 1R1W port, the temp register bank is 2R1W.
So the 780 second read port is likely used the same as in later VAXen:
it's for reading the temp values concurrently with an operand register.
The operand registers were read one at a time because of the decode
bottleneck.
I'm wondering how they handled modifying address modes like autoincrement
and still had precise interrupts.
ADDL3 (r2)+, (r2)+, (r2)+
the first (left) operand reads r2 then adds 4, which the second r2 reads
and also adds 4, then the third again. It doesn't have a renamer so
it has to stash the first modified r2 in the temp registers,
and (somehow) pass that info to decode of the second operand
so Decode knows to read the temp r2 not the general r2,
and same for the third operand.
At the end of the instruction if there is no exception then
temp r2 is copied to general r2 and memory value is stored.
I'm guessing in Decode someplace there are comparators to detect when
the operand registers are the same so microcode knows to switch to the
temp bank for a modified register.
Concerning the speed of the 82S100 PLA,
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
reports propagation delays of 25ns-35ns for specific signals in Table
3.4, and EricP found 50ns "max access" in the data sheet of the
82S100. That does not sound too slow to be usable in a CPU with 200ns
cycle time, so yes, one could have used that for the VAX.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" doesn't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever built one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of thought that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
There are a few intermediate steps.
The 801 demonstrated that a RISC, including caches and pipelining,
would have been feasible at the time. It also demonstrated that
somebody had thought of graph coloring algorithms.
There can also be no doubt that a RISC-type machine would have
exhibited the same performance advantages (at least in integer
performance) as a RISC vs CISC 10 years later. The 801 did so
vs. the /370, as did the RISC processors vs, for example, the
680x0 family of processors (just compare ARM vs. 68000).
Or look at the performance of the TTL implementation of HP-PA,
which used PALs which were not available to the VAX 11/780
designers, so it could be clocked a bit higher, but at
a multiple of the performance of the VAX.
So, Anton visiting DEC or me visiting Data General could have
brought them a technology which would have significantly outperformed
the VAX (especially if we brought along the algorithm for graph
coloring). Some people at IBM would have been peeved at having
somebody else "develop" this at the same time, but OK.
In article <1070cj8$3jivq$1@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" doesn't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever built one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of thought that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
There are a few intermediate steps.
The 801 demonstrated that a RISC, including caches and pipelining,
would have been feasible at the time. It also demonstrated that
somebody had thought of graph coloring algorithms.
This is the part where the argument breaks down. VAX and 801
were roughly contemporaneous, with VAX being commercially
available around the time the first 801 prototypes were being
developed. There's simply no way in which the 801,
specifically, could have had significant impact on VAX
development.
If you're just talking about RISC design techniques generically,
then I dunno, maybe, sure, why not,
but that's a LOT of
speculation with hindsight-colored glasses.
Furthermore, that
speculation focuses solely on technology, and ignores the
business realities that VAX was born into. Maybe you're right,
maybe you're wrong, we can never _really_ say, but there was a
lot more that went into the decisions around the VAX design than
just technology.
While it's always fun to speculate about alternate timelines, if
all you are talking about is a hypothetical that someone at DEC
could have independently used the same techniques, producing a
more performance RISC-y VAX with better compilers, then sure, I
guess, why not.
But as with all alternate history, this is
completely unknowable.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Concerning the speed of the 82S100 PLA,
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
reports propagation delays of 25ns-35ns for specific signals in Table
3.4, and EricP found 50ns "max access" in the data sheet of the
82S100. That does not sound too slow to be usable in a CPU with 200ns
cycle time, so yes, one could have used that for the VAX.
Were there different versions, maybe?
https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
gives an I/O propagation delay of 80 ns max.
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H30), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
Thomas Koenig wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Concerning the speed of the 82S100 PLA,
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
reports propagation delays of 25ns-35ns for specific signals in Table
3.4, and EricP found 50ns "max access" in the data sheet of the
82S100. That does not sound too slow to be usable in a CPU with 200ns
cycle time, so yes, one could have used that for the VAX.
Were there different versions, maybe?
https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
gives an I/O propagation delay of 80 ns max.
Yes, must be different versions.
I'm looking at this 1976 datasheet which says 50 ns max access:
http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H30), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input invertor,
- optionally wired to 48 16-input AND's,
- optionally wired to 8 48-input OR's,
- with 8 optional XOR output invertors,
- driving 8 tri-state or open collector buffers.
So I count roughly 7 or 8 equivalent gate delays.
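As a rough C model of that structure (a sketch with made-up field names, not the real programming format of the part): each of the 48 product terms requires some inputs to be 1 and some to be 0, feeds whichever of the 8 OR outputs it is connected to, and each output can finally be inverted:

#include <stdbool.h>
#include <stdint.h>

struct pla {
    uint16_t need1[48];    /* per term: inputs that must be 1 */
    uint16_t need0[48];    /* per term: inputs that must be 0 */
    uint8_t  or_mask[48];  /* per term: which of the 8 outputs it feeds */
    uint8_t  invert;       /* per output: optional XOR inversion */
};

static uint8_t pla_eval(const struct pla *p, uint16_t in)
{
    uint8_t out = 0;
    for (int t = 0; t < 48; t++) {
        bool hit = (in & p->need1[t]) == p->need1[t] &&
                   ((uint16_t)~in & p->need0[t]) == p->need0[t];
        if (hit)
            out |= p->or_mask[t];      /* product term feeds its OR outputs */
    }
    return (uint8_t)(out ^ p->invert);
}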
Also the decoder would need a lot of these so I doubt we can afford the
power and heat for H series. That 74H30 typical is 22 mW but the max
looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
74LS30 is 20 ns max, 44 mW max.
Looking at a TI Bipolar Memory Data Manual from 1977,
it was about the same speed as say a 256b mask programmable TTL ROM,
7488A 32w * 8b, 45 ns max access.
One question: Did TTL people actually use the "typical" delays
from the handbooks, or did they use the maximum delays for their
designs?
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
[snip]
If you're just talking about RISC design techniques generically,
then I dunno, maybe, sure, why not,
Absolutely. The 801 demonstrated that it was a feasible
development _at the time_.
but that's a LOT of
speculation with hindsight-colored glasses.
Graph-colored glasses, for the register allocation, please :-)
Furthermore, that
speculation focuses solely on technology, and ignores the
business realities that VAX was born into. Maybe you're right,
maybe you're wrong, we can never _really_ say, but there was a
lot more that went into the decisions around the VAX design than
just technology.
I'm not sure what you mean here. Do you include the ISA design
in "technology" or not?
[...]
While it's always fun to speculate about alternate timelines, if
all you are talking about is a hypothetical that someone at DEC
could have independently used the same techniques, producing a
more performant RISC-y VAX with better compilers, then sure, I
guess, why not.
Yep, that would have been possible, either as an alternate
VAX or a competitor.
But as with all alternate history, this is
completely unknowable.
We know it was feasible, we know that there were a large
number of minicomputer companies at the time. We cannot
predict what a successful minicomputer implementation with
two or three times the performance of the VAX could have
done. We do know that this was the performance advantage
that Fountainhead from DG aimed for via programmable microcode
(which failed to deliver on time due to complexity), and
we can safely assume that DG would have given DEC a run
for its money if they had a system which significantly
outperformed the VAX.
So, "completely unknowable" isn't true, "quite plausible"
would be a more accurate description.
In article <107768m$17rul$1@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
While it's always fun to speculate about alternate timelines, if
all you are talking about is a hypothetical that someone at DEC
could have independently used the same techniques, producing a
more performant RISC-y VAX with better compilers, then sure, I
guess, why not.
Yep, that would have been possible, either as an alternate
VAX or a competitor.
But as with all alternate history, this is
completely unknowable.
Sure.
We know it was feasible, we know that there were a large
number of minicomputer companies at the time. We cannot
predict what a successful minicomputer implementation with
two or three times the performance of the VAX could have
done. We do know that this was the performance advantage
that Fountainhead from DG aimed for via programmable microcode
(which failed to deliver on time due to complexity), and
we can safely assume that DG would have given DEC a run
for its money if they had a system which significantly
outperformed the VAX.
My contention is that while it was _feasible_ to build a
RISC-style machine for what became the VAX, that by itself is
only a part of the puzzle. One must also take into account
market and business contexts; perhaps such a machine would have
been faster, but I don't think anyone _really_ knew that to be
the case in 1975 when design work on the VAX started, and even
fewer would have believed it absent a working prototype, which
wouldn't arrive with the 801 for several years after the VAX had
shipped commercially. Furthermore, Digital would have
understood that many customers would have expected to be able to
program their new machine in macro assembler.
My contention is that while it was _feasible_ to build a
RISC-style machine for what became the VAX,
that by itself is
only a part of the puzzle. One must also take into account
market and business contexts; perhaps such a machine would have
been faster,
but I don't think anyone _really_ knew that to be
the case in 1975 when design work on the VAX started,
and even
fewer would have believed it absent a working prototype,
which
wouldn't arrive with the 801 for several years after the VAX had
shipped commercially.
Furthermore, Digital would have
understood that many customers would have expected to be able to
program their new machine in macro assembler.
One must also keep in mind that the VAX group was competing
internally with the PDP-10 minicomputer.
Fundamentally, 36-bit words ended up being a dead-end.
scott@slp53.sl.home (Scott Lurndal) writes:
One must also keep in mind that the VAX group was competing
internally with the PDP-10 minicomputer.
This does not make the actual VAX more attractive relative to the hypothetical RISC-VAX IMO.
Fundamentally, 36-bit words ended up being a dead-end.
The reasons why this once-common architectural style died out are:
* 18-bit addresses
Univac sold the 1100/2200 series, and later Unisys continued to
support that in the Unisys ClearPath systems.
<https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
says:
http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf
Interesting quote that indicates the direction they were looking:
"Many of the instructions in this specification could only
be used by COBOL if 9-bit ASCII were supported. There is currently
no plan for COBOL to support 9-bit ASCII".
"The following goals were taken into consideration when deriving an
address scheme for addressing 9-bit byte strings:"
Fundamentally, 36-bit words ended up being a dead-end.
VAX-780 architecture handbook says cache was 8 KB and used 8-byte
lines. So extra 12KB of fast RAM could double cache size.
That would be nice improvement, but not as dramatic as increase
from 2 KB to 12 KB.
The basic question is if VAX could afford the pipeline.
I doubt that they could afford 1-cycle multiply
or
even a barrel shifter.
It was accepted in that era that using more hardware could
give a substantial speedup. IIUC IBM used a quadratic rule:
performance was supposed to be proportional to the square of
the CPU price. That was partly marketing, but partly due to
compromises needed in smaller machines.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
[Snipping the previous long discussion]
My contention is that while it was _feasible_ to build a
RISC-style machine for what became the VAX,
There, we agree.
that by itself is
only a part of the puzzle. One must also take into account
market and business contexts; perhaps such a machine would have
been faster,
With certainty, if they followed RISC principles.
[snip]
which
wouldn't arrive with the 801 for several years after the VAX had
shipped commercially.
That is clear. It was the premise of this discussion that the
knowledge had been made available (via time travel or some other
strange means) to a company, which would then have used the
knowledge.
Furthermore, Digital would have
understood that many customers would have expected to be able to
program their new machine in macro assembler.
Programming a RISC in assembler is not so hard, at least in my
experience. Plus, people overestimated use of assembler even in
the mid-1970s, and underestimated the use of compilers.
[...]
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Concerning the speed of the 82S100 PLA,
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
reports propagation delays of 25ns-35ns for specific signals in Table
3.4, and EricP found 50ns "max access" in the data sheet of the
82S100. That does not sound too slow to be usable in a CPU with 200ns
cycle time, so yes, one could have used that for the VAX.
Were there different versions, maybe?
https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
gives an I/O propagation delay of 80 ns max.
Yes, must be different versions.
I'm looking at this 1976 datasheet which says 50 ns max access:
http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf
That is strange. Why would they make the chip worse?
Unless... maybe somebody (a customer, or they themselves)
discovered that there may have been conditions where they could
only guarantee 80 ns. Maybe a combination of tolerances to one
side and a certain logic programming, and they changed the
data sheet.
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H30), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input inverter,
Should be free coming from a Flip-Flop.
- optionally wired to 48 16-input AND's,
- optionally wired to 8 48-input OR's,
Those would be the two layers of NAND gates, so depending
on which ones you chose, you have to add those.
- with 8 optional XOR output inverters,
I don't find that in the diagrams (but I might be missing that,
I am not an expert at reading them).
- driving 8 tri-state or open collector buffers.
A 74265 had switching times of max. 18 ns, driving 30
output loads, so that would be on top.
One question: Did TTL people actually use the "typical" delays
from the handbooks, or did they use the maximum delays for their
designs? Using anything below the maximum would sound dangerous to
me, but maybe this was possible to a certain extent.
So I count roughly 7 or 8 equivalent gate delays.
Another point... if you don't need 16 inputs or 8 outputs, you
are also paying a lot more. If you have a 6-bit primary opcode,
you don't need a full 16 bits of input.
Also the decoder would need a lot of these so I doubt we can afford the
power and heat for H series. That 74H30 typical is 22 mW but the max
looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
74LS30 is 20 ns max, 44 mW max.
Looking at a TI Bipolar Memory Data Manual from 1977,
it was about the same speed as say a 256b mask programmable TTL ROM,
7488A 32w * 8b, 45 ns max access.
Hmm... did the VAX, for example, actually use them, or were they
using logic built from conventional chips?
While looking for the handbook, I also found
http://hps.ece.utexas.edu/pub/patt_micro22.pdf
which describes some parts of the microarchitecture of the VAX 11/780, 11/750, 8600, and 8800.
Interestingly, Patt wrote this in 1990, after participating in the HPS
papers on an OoO implementation of the VAX architecture.
- anton
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Unlesss... maybe somebody (a customer, or they themselves)
discovered that there may have been conditions where they could
only guarantee 80 ns. Maybe a combination of tolerances to one
side and a certain logic programming, and they changed the
data sheet.
Manufacturing process variation leads to timing differences that
testing sorts into speed bins. The faster bins sell at higher price.
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H30), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input inverter,
Should be free coming from a Flip-Flop.
Depends on what chips you use for registers.
If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.
Another point... if you don't need 16 inputs or 8 outputs, you
are also paying a lot more. If you have a 6-bit primary opcode,
you don't need a full 16 bits of input.
I'm just showing why it was more than just an AND gate.
I'm still exploring whether it can be variable length instructions or
has to be fixed 32-bit. In either case all the instruction "code" bits
(as in op code or function code or whatever) should be checked,
even if just to verify that should-be-zero bits are zero.
There would also be instruction buffer Valid bits and other state bits
like Fetch exception detected, interrupt request, that might feed into
a bank of PLA's multiple wide and deep.
antispam@fricas.org (Waldek Hebisch) writes:
VAX-780 architecture handbook says cache was 8 KB and used 8-byte
lines. So extra 12KB of fast RAM could double cache size.
That would be nice improvement, but not as dramatic as increase
from 2 KB to 12 KB.
The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf
The cache is indeed 8KB in size, two-way set associative and write-through.
Section 2.7 also mentions an 8-byte instruction buffer, and that instruction fetching happens concurrently with the microcoded execution. So here we have a little bit of pipelining.
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
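As a back-of-the-envelope illustration of why larger pages would help (my arithmetic, not from the handbook), the 128-entry TLB covers

$$128 \times 512\,\mathrm{B} = 64\,\mathrm{KB} \quad\text{of address space, versus}\quad 128 \times 4\,\mathrm{KB} = 512\,\mathrm{KB}$$

with 4 KB pages, so the same working set would miss far less often.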
In article <107b1bu$252qo$1@dont-email.me>,
Programming a RISC in assembler is not so hard, at least in my
experience. Plus, people overestimated use of assembler even in
the mid-1970s, and underestimated the use of compilers.
[...]
They certainly did! I'm not saying that they're right; I'm
saying that business needs must have, at least in part,
influenced the ISA design. That is, while mistaken, it was part
of the business decision process regardless.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <107b1bu$252qo$1@dont-email.me>,
Programming a RISC in assembler is not so hard, at least in my
experience. Plus, people overestimated use of assembler even in
the mid-1970s, and underestimated the use of compilers.
[...]
They certainly did! I'm not saying that they're right; I'm
saying that business needs must have, at least in part,
influenced the ISA design. That is, while mistaken, it was part
of the business decision process regardless.
It's not clear to me what the distinction of technical vs. business
is supposed to be in the context of ISA design. Could you explain?
In article <107mf9l$u2si$1@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <107b1bu$252qo$1@dont-email.me>,
Programming a RISC in assembler is not so hard, at least in my
experience. Plus, people overestimated use of assembler even in the
mid-1970s, and underestimated the use of compilers.
[...]
They certainly did! I'm not saying that they're right; I'm saying
that business needs must have, at least in part, influenced the ISA
design. That is, while mistaken, it was part of the business decision
process regardless.
It's not clear to me what the distinction of technical vs. business is
supposed to be in the context of ISA design. Could you explain?
I can attempt to, though I'm not sure if I can be successful.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
VAX-780 architecture handbook says cache was 8 KB and used 8-byte
lines. So extra 12KB of fast RAM could double cache size.
That would be nice improvement, but not as dramatic as increase
from 2 KB to 12 KB.
The handbook is:
https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf
The cache is indeed 8KB in size, two-way set associative and write-through.
Section 2.7 also mentions an 8-byte instruction buffer, and that the
instruction fetching happens concurrently with the microcoded
execution. So here we have a little bit of pipelining.
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 the VAX's 512-byte page was close to optimal.
Namely, IIUC the smallest supported configuration was 128 KB RAM.
That gives 256 pages, enough for a sophisticated system with
fine-grained access control. Bigger pages would reduce the
number of pages. For example, 4 KB pages would mean 32 pages
in the minimal configuration, significantly reducing the usefulness
of such a machine.
It is maybe pushing it a little if one wants to use an AVL-tree or
B-Tree for virtual memory vs a page-table
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
BGB wrote:
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
Both HW and SW table walkers incur the cost of reading the PTE's.
The pipeline drain and load of the software TLB miss handler,
then a drain and reload of the original code on return
are a large expense that HW walkers do not have.
"For example, Anderson, et al. [1] show TLB miss handlers to be among
the most commonly executed OS primitives; Huck and Hays [10] show that
TLB miss handling can account for more than 40% of total run time;
and Rosenblum, et al. [18] show that TLB miss handling can account
for more than 80% of the kernel’s computation time."
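To make the comparison concrete, a MIPS-style software refill is little more than a short load/insert sequence. Here is a minimal C sketch (not any real OS's handler), assuming a two-level page table rooted at page_table_root and a hypothetical tlb_write() primitive; all names are invented for illustration:

#include <stdint.h>

#define PAGE_SHIFT    12              /* 4 KB pages, for the sketch only */
#define PTES_PER_PAGE 1024

typedef uint32_t pte_t;

extern pte_t *page_table_root;                  /* top-level table */
extern void tlb_write(uint32_t vpn, pte_t pte); /* hypothetical TLB insert */
extern void page_fault(uint32_t va);            /* full fault path, not shown */

void tlb_miss_handler(uint32_t faulting_va)
{
    uint32_t vpn  = faulting_va >> PAGE_SHIFT;
    uint32_t idx1 = vpn / PTES_PER_PAGE;        /* index into top level  */
    uint32_t idx2 = vpn % PTES_PER_PAGE;        /* index into leaf table */

    pte_t top = page_table_root[idx1];
    if (!(top & 1)) {                           /* valid bit assumed at bit 0 */
        page_fault(faulting_va);
        return;
    }
    pte_t *leaf = (pte_t *)(uintptr_t)(top & ~(uint32_t)0xFFF);
    pte_t  pte  = leaf[idx2];
    if (!(pte & 1)) {
        page_fault(faulting_va);
        return;
    }
    tlb_write(vpn, pte);                        /* refill, then retry the access */
}

The cost being pointed to above is not in these few loads but in getting into and out of such a handler with the pipeline drained.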
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Unless... maybe somebody (a customer, or they themselves)
discovered that there may have been conditions where they could
only guarantee 80 ns. Maybe a combination of tolerances to one
side and a certain logic programming, and they changed the
data sheet.
Manufacturing process variation leads to timing differences that
testing sorts into speed bins. The faster bins sell at higher price.
Is that possible with a PAL before it has been programmed?
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H30), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input inverter,
Should be free coming from a Flip-Flop.
Depends on what chips you use for registers.
If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
For a wide instruction or stage register I'd look at chips such as a 74LS377
with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.
So if you need eight outputs, your choice is to use two 74LS375
(presumably more expensive) or a 74LS377 and an eight-gate inverter
chip (a bit slower, but inverters should be fast).
Another point... if you don't need 16 inputs or 8 outputs, you
are also paying a lot more. If you have a 6-bit primary opcode,
you don't need a full 16 bits of input.
I'm just showing why it was more than just an AND gate.
Two layers of NAND :-)
I'm still exploring whether it can be variable length instructions or
has to be fixed 32-bit. In either case all the instruction "code" bits
(as in op code or function code or whatever) should be checked,
even if just to verify that should-be-zero bits are zero.
There would also be instruction buffer Valid bits and other state bits
like Fetch exception detected, interrupt request, that might feed into
a bank of PLA's multiple wide and deep.
Agreed, the logic has to go somewhere. Regularity in the
instruction set would have been even more important then than now
to reduce the logic requirements for decoding.
On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 04 Aug 2025 09:53:51 -0700
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
use it has undefined behavior. That's exactly why new keywords are
often defined with that ugly syntax.
That is language lawyer's type of reasoning. Normally gcc maintainers
are wiser than that because, well, by chance gcc happens to be widely
used production compiler. I don't know why this time they had chosen
less conservative road.
They invented an identifer which lands in the _[A-Z].* namespace
designated as reserved by the standard.
What would be an exmaple of a more conservative way to name the
identifier?
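For reference, this is what the new keyword looks like in use; a minimal C23 sketch (requires a compiler with _BitInt support, e.g. a recent gcc or clang):

#include <stdio.h>

int main(void)
{
    unsigned _BitInt(12) counter = 4095;  /* 12-bit bit-precise unsigned type */
    counter += 1;                         /* unsigned arithmetic wraps to 0 */
    printf("%u\n", (unsigned)counter);
    return 0;
}

Under C17 rules the identifier _BitInt was already reserved for the implementation, which is the point being made above.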
In article <107mf9l$u2si$1@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
It's not clear to me what the distinction of technical vs. business
is supposed to be in the context of ISA design. Could you explain?
I can attempt to, though I'm not sure if I can be successful.
And so with the VAX, I can imagine the work (which started in,
what, 1975?) being informed by a business landscape that saw an
increasing trend towards favoring high-level languages, but also
saw the continued development of large, bespoke, business
applications for another five or more years, and with customers
wanting to be able to write (say) complex formatting sequences
easily in assembler (the EDIT instruction!), in a way that was
compatible with COBOL (so make the COBOL compiler emit the EDIT instruction!), while also trying to accommodate the scientific
market (POLYF/POLYG!) who would be writing primarily in FORTRAN
but jumping to assembler for the fuzz-busting speed boost (so
stabilize what amounts to an ABI very early on!), and so forth.
Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
[...]
I am unsure how GCC --pedantic deals with the standards-contrary
features in the GNUC89 language, such as the different type of (foo,
'C') (GNUC says char, C89 says int), maybe specifying standard C
instead of GNUC reverts those to the standard definition .
I'm not sure what you're referring to. You didn't say what foo is.
I believe that in all versions of C, the result of a comma operator has
the type and value of its right operand, and the type of an unprefixed character constant is int.
Can you show a complete example where `sizeof (foo, 'C')` yields
sizeof (int) in any version of GNUC?
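For what it's worth, a complete program showing the standard-C behaviour being discussed (the comma expression takes the type of its right operand, and an unprefixed character constant is int in C; compile as C, not C++, where 'C' would be char):

#include <stdio.h>

int main(void)
{
    double foo = 0.0;   /* any object will do */
    /* both sizes are equal under the standard rule */
    printf("%zu %zu\n", sizeof (foo, 'C'), sizeof (int));
    return 0;
}

Whether any GNUC89 mode ever treated the result as char is exactly the question raised above.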
EricP <ThatWouldBeTelling@thevillage.com> writes:
BGB wrote:
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
Both HW and SW table walkers incur the cost of reading the PTE's.
The pipeline drain and load of the software TLB miss handler,
then a drain and reload of the original code on return
are a large expense that HW walkers do not have.
Why not treat the SW TLB miss handler as similar to a call as
possible? Admittedly, calls occur as part of the front end, while (in
an OoO core) the TLB miss comes from the execution engine or the
reorder buffer, but still: could it just be treated like a call
inserted in the instruction stream at the time when it is noticed,
with the instructions running in a special context (read access to
page tables allowed). You may need to flush the pipeline anyway,
though, if the TLB miss
"For example, Anderson, et al. [1] show TLB miss handlers to be among
the most commonly executed OS primitives; Huck and Hays [10] show that
TLB miss handling can account for more than 40% of total run time;
and Rosenblum, et al. [18] show that TLB miss handling can account
for more than 80% of the kernel’s computation time."
I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
with hardware table walking, on a 1000x1000 matrix multiply with
pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
cost about 20 cycles.
- anton
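For illustration, the classic way to get that kind of pessimal locality is a column walk; this is a sketch of the access pattern, not the exact benchmark referred to above:

#include <stddef.h>

enum { N = 1000 };

void matmul(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];  /* b[k][j] strides by a whole
                                              8000-byte row, so it touches
                                              a new page on nearly every
                                              iteration */
            c[i][j] = sum;
        }
}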
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
All of these are attempts to fix inherent drawbacks and limitations
in the SW-miss approach, and all of them run counter to the only
advantage SW-miss had: its simplicity.
The SW approach is inherently synchronous and serial -
it can only handle one TLB miss at a time, one PTE read at a time.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
Each PTE read can cache miss and stall that walker.
As most OoO caches support multiple pending misses and hit-under-miss,
you can create as many HW walkers as you can afford.
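A minimal sketch of the one-way update being discussed, in C11 atomics; the bit positions and names are invented for illustration, not any particular architecture's PTE layout:

#include <stdatomic.h>
#include <stdint.h>

#define PTE_ACCESSED  (UINT64_C(1) << 5)
#define PTE_MODIFIED  (UINT64_C(1) << 6)

typedef _Atomic uint64_t pte_t;

/* A walker (or several, concurrently) only ever sets these bits. */
void mark_referenced(pte_t *pte, int is_store)
{
    uint64_t bits = PTE_ACCESSED | (is_store ? PTE_MODIFIED : 0);
    atomic_fetch_or_explicit(pte, bits, memory_order_relaxed);
}

Because fetch-or only ever sets bits, concurrent walkers race harmlessly; the atomicity matters mainly against software that clears the bits, which is the point taken up a few posts below.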
antispam@fricas.org (Waldek Hebisch) writes:
The basic question is if VAX could afford the pipeline.
VAX 11/780 only performed instruction fetching concurrently with the
rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
NVAX applied more pipelining, but CPI remained high.
VUPs     MHz    CPI    Machine
  1        5    10     11/780
  4       12.5   6.25  8600
  6       22.2   7.4   8700
 35       90.9   5.1   NVAX+

SPEC92    MHz   VAX CPI  Machine
  1/1       5    10/10   VAX 11/780
133/200   200     3/2    Alpha 21064 (DEC 7000 model 610)
VUPs and SPEC numbers from
<https://pghardy.net/paul/programs/vms_cpus.html>.
The 10 CPI (cycles per instruction) of the VAX 11/780 is anecdotal.
The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
is probably somewhat off (due to the anecdotal base being off), but if
you relate them to each other, the offness cancels itself out.
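Spelled out, the computation referred to above (taking the anecdotal base of 1 VUP = 0.5 native MIPS on the 11/780):

$$\mathrm{CPI} \approx \frac{f_{\mathrm{clk}}}{\mathrm{VUPs} \times 0.5\,\mathrm{MIPS}}, \qquad \text{e.g. } \frac{12.5\,\mathrm{MHz}}{4 \times 0.5\,\mathrm{MIPS}} = 6.25 \text{ for the 8600}.$$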
Note that the NVAX+ was made in the same process as the 21064, the
21064 has about twice the clock rate, and has 4-6 times the performance,
resulting not just in a lower native CPI, but also in a lower "VAX
CPI" (the CPI a VAX would have needed to achieve the same performance
at this clock rate).
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
EricP <ThatWouldBeTelling@thevillage.com> writes:
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want
atomicity there.
To avoid race conditions with software clearing those bits, presumably.
ARM64 originally didn't support hardware updates in V8.0, they were independent hardware features added to V8.1.
EricP <ThatWouldBeTelling@thevillage.com> writes:
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
All of these are attempts to fix inherent drawbacks and limitations
in the SW-miss approach, and all of them run counter to the only
advantage SW-miss had: its simplicity.
Another advantage is the flexibility: you can implement any
translation scheme you want: hierarchical page tables, inverted page
tables, search trees, .... However, given that hierarchical page
tables have won, this is no longer an advantage anyone cares for.
The SW approach is inherently synchronous and serial -
it can only handle one TLB miss at a time, one PTE read at a time.
On an OoO engine, I don't see that. The table walker software is
called in its special context and the instructions in the table walker
are then run through the front end and the OoO engine. Another table
walk could be started at any time (even when the first table walk has
not yet finished feeding its instructions to the front end), and once
inside the OoO engine, the execution is OoO and concurrent anyway. It
would be useful to avoid two searches for the same page at the same
time, but hardware walkers have the same problem.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want atomicity there.
Each PTE read can cache miss and stall that walker.
As most OoO caches support multiple pending misses and hit-under-miss,
you can create as many HW walkers as you can afford.
Which poses the question: is it cheaper to implement n table walkers,
or to add some resources and mechanism that allows doing SW table
walks until the OoO engine runs out of resources, and a recovery
mechanism in that case.
I see other performance and conceptual disadvantages for the envisioned
SW walkers, however:
1) The SW walker is inserted at the front end and there may be many
ready instructions ahead of it before the instructions of the SW
walker get their turn. By contrast, a hardware walker sits in the
load/store unit and can do its own loads and stores with priority over
the program-level loads and stores. However, it's not clear that
giving priority to table walking is really a performance advantage.
2) Some decisions will have to be implemented as branches, resulting
in branch misses, which cost time and lead to all kinds of complexity
if you want to avoid resetting the whole pipeline (which is the normal reaction to a branch misprediction).
3) The reorder buffer processes instructions in architectural order.
If the table walker's instructions get their sequence numbers from
where they are inserted into the instruction stream, they will not
retire until after the memory access that waits for the table walker
is retired. Deadlock!
It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
it's probably easier to stay with hardware walkers.
- anton
On 8/17/2025 12:35 PM, EricP wrote:
The question is whether in 1975 main memory is so expensive that
we cannot afford the wasted space of a fixed 32-bit ISA.
In 1975 the widely available DRAM was the Intel 1103 1k*1b.
The 4kb drams were just making to customers, 16kb were preliminary.
Looking at the instruction set usage of VAX in
Measurement and Analysis of Instruction Use in VAX 780, 1982
https://dl.acm.org/doi/pdf/10.1145/1067649.801709
we see that the top 25 instructions covers about 80-90% of the usage,
and many of them would fit into 2 or 3 bytes.
A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.
But a fixed 32-bit instruction is very much easier to fetch and
decode, and needs a lot less logic for shifting prefetch buffers,
compared to, say, variable length 1 to 12 bytes.
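As an aside, a toy C sketch (invented encodings, from neither poster, and much simpler than real VAX decoding) of why the two cases differ: with a fixed format the fetch pointer always advances by 4 before any decoding, while a byte-variable format must examine the opcode (and, on the VAX, every operand specifier) just to know how far to shift the prefetch buffer:

#include <stdint.h>
#include <stddef.h>

static inline size_t fixed_len(const uint8_t *ip)
{
    (void)ip;
    return 4;                        /* length known without decoding */
}

static inline size_t variable_len(const uint8_t *ip)
{
    /* hypothetical format: low 2 bits of the first byte select the
       total length; the real VAX is far worse, since the length also
       depends on each operand specifier in turn */
    static const size_t len[4] = { 1, 2, 4, 8 };
    return len[ip[0] & 3];
}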
When code density is the goal, a 16/32 RISC can do well.
Can note:
Maximizing code density often prefers fewer registers;
For 16-bit instructions, 8 or 16 registers is good;
8 is rather limiting;
32 registers uses too many bits.
Can note ISAs with 16 bit encodings:
PDP-11: 8 registers
M68K : 2x 8 (A and D)
MSP430: 16
Thumb : 8|16
RV-C : 8|32
SuperH: 16
XG1 : 16|32 (Mostly 16)
In my recent fiddling for trying to design a pair encoding for XG3, can
note the top-used instructions are mostly, it seems (non Ld/St):
ADD Rs, 0, Rd //MOV Rs, Rd
ADD X0, Imm, Rd //MOV Imm, Rd
ADDW Rs, 0, Rd //EXTS.L Rs, Rd
ADDW Rd, Imm, Rd //ADDW Imm, Rd
ADD Rd, Imm, Rd //ADD Imm, Rd
Followed by:
ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
ADDW Rd, Rs, Rd //ADDW Rs, Rd
ADD Rd, Rs, Rd //ADD Rs, Rd
ADDWU Rd, Rs, Rd //ADDWU Rs, Rd
Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.
For Load/Store:
SD Rn, Disp(SP)
LD Rn, Disp(SP)
LW Rn, Disp(SP)
SW Rn, Disp(SP)
LD Rn, Disp(Rm)
LW Rn, Disp(Rm)
SD Rn, Disp(Rm)
SW Rn, Disp(Rm)
For registers, there is a split:
Leaf functions:
R10..R17, R28..R31 dominate.
Non-Leaf functions:
R10, R18..R27, R8/R9
For 3-bit configurations:
R8..R15 Reg3A
R18/R19, R20/R21, R26/R27, R10/R11 Reg3B
Reg3B was a bit hacky, but had similar hit rates and uses less encoding
space than using a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
Waldek Hebisch <antispam@fricas.org> schrieb:
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
HUH? That is more than one order of magnitude more than what is needed
for a RISC chip.
Consider ARM2, which had 27000 transistors and which is sort of
the minimum RISC design you can manage (although it had a Booth
multiplier).
An ARMv2 implementation with added I and D cache, plus virtual
memory, would have been the ideal design (too few registers, too
many bits wasted on conditional execution, ...) but it would have
run rings around the VAX.
Thomas Koenig <tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
HUH? That is more than one order of magnitude more than what is needed
for a RISC chip.
Consider ARM2, which had 27000 transistors and which is sort of
the minimum RISC design you can manage (although it had a Booth
multiplier).
An ARMv2 implementation with added I and D cache, plus virtual
memory, would have been the ideal design (too few registers, too
many bits wasted on conditional execution, ...) but it would have
run rings around the VAX.
1 mln transistors is an upper estimate. But low numbers given
for early RISC chips are IMO misleading: RISC became commercially
viable for high-end machines only in later generations, when
designers added a few "expensive" instructions.
Also, to fit
design into a single chip designers moved some functionality
like bus interface to support chips. RISC processor with
mixed 16-32 bit instructions (needed to get reasonable code
density), hardware multiply and FPU, including cache controller,
paging hardware and memory controller is much more than the
100 thousand transistors cited for early workstation chips.
According to Thomas Koenig <tkoenig@netcologne.de>:
Waldek Hebisch <antispam@fricas.org> schrieb:
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
HUH? That is more than one order of magnitude more than what is needed
for a RISC chip.
It also seems rather high for the /91. I can't find any authoritative numbers but 100K seems more likely. It was SLT, individual transistors mounted a few to a package. The /91 was big but it wasn't *that* big.
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:
0000000000010434 <arrays>:
10434: cd81 beqz a1,1044c <arrays+0x18>
10436: 058e slli a1,a1,0x3
10438: 87aa mv a5,a0
1043a: 00b506b3 add a3,a0,a1
1043e: 4501 li a0,0
10440: 6398 ld a4,0(a5)
10442: 07a1 addi a5,a5,8
10444: 953a add a0,a0,a4
10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
1044a: 8082 ret
1044c: 4501 li a0,0
1044e: 8082 ret
0000000000010450 <globals>:
10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
10470: 8082 ret
When using -Os, arrays becomes 2 bytes shorter, but the inner loop
becomes longer.
gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
compiles this to the following AMD64 code:
000000001139 <arrays>:
1139: 48 85 f6 test %rsi,%rsi
113c: 74 13 je 1151 <arrays+0x18>
113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
1142: 31 c0 xor %eax,%eax
1144: 48 03 07 add (%rdi),%rax
1147: 48 83 c7 08 add $0x8,%rdi
114b: 48 39 d7 cmp %rdx,%rdi
114e: 75 f4 jne 1144 <arrays+0xb>
1150: c3 ret
1151: 31 c0 xor %eax,%eax
1153: c3 ret
000000001154 <globals>:
1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
115b: 56 34 12
115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
116c: 12 ef cd
116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
117d: 90 78 56
1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
118e: 90 78 56
1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
1198: c3 ret
gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
compiles this to the following ARM A64 code:
0000000000000734 <arrays>:
734: b4000121 cbz x1, 758 <arrays+0x24>
738: aa0003e2 mov x2, x0
73c: d2800000 mov x0, #0x0 // #0
740: 8b010c43 add x3, x2, x1, lsl #3
744: f8408441 ldr x1, [x2], #8
748: 8b010000 add x0, x0, x1
74c: eb03005f cmp x2, x3
750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
754: d65f03c0 ret
758: d2800000 mov x0, #0x0 // #0
75c: d65f03c0 ret
0000000000000760 <globals>:
760: d299bde2 mov x2, #0xcdef // #52719
764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
768: f2b21562 movk x2, #0x90ab, lsl #16
76c: 9100e020 add x0, x1, #0x38
770: f2cacf02 movk x2, #0x5678, lsl #32
774: d2921563 mov x3, #0x90ab // #37035
778: f2e24682 movk x2, #0x1234, lsl #48
77c: f9001c22 str x2, [x1, #56]
780: d2824682 mov x2, #0x1234 // #4660
784: d299bde1 mov x1, #0xcdef // #52719
788: f2aacf03 movk x3, #0x5678, lsl #16
78c: f2b9bde2 movk x2, #0xcdef, lsl #16
790: f2a69561 movk x1, #0x34ab, lsl #16
794: f2c24683 movk x3, #0x1234, lsl #32
798: f2d21562 movk x2, #0x90ab, lsl #32
79c: f2d20241 movk x1, #0x9012, lsl #32
7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
7a4: f2eacf02 movk x2, #0x5678, lsl #48
7a8: f2eacf01 movk x1, #0x5678, lsl #48
7ac: a9008803 stp x3, x2, [x0, #8]
7b0: f9000c01 str x1, [x0, #24]
7b4: d65f03c0 ret
So, the overall sizes (including data size for globals() on RV64GC) are:
arrays globals Architecture
28 66 (34+32) RV64GC
27 69 AMD64
44 84 ARM A64
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test. Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions, so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:
* RV64GC uses a compare-and-branch instruction.
* AMD64 uses a load-and-add instruction.
* ARM A64 uses an auto-increment instruction.
NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
advantage of RV32GC over RV64GC there:
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
I guess it can be noted, is the overhead of any ELF metadata being excluded?...
These are sizes of the .text section extracted with objdump -h. So
no, these numbers do not include ELF metadata, nor the sizes of other sections. The latter may be relevant, because RV64GC has "immediates"
in .sdata that other architectures have in .text; however, .sdata can
contain other things than just "immediates", so one cannot just add the .sdata size to the .text size.
Granted, newer compilers do support newer versions of the C standard,
and also typically get better performance.
The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
from auto-vectorization).
There is one other improvement: gcc register allocation has improved
in recent years to a point where we 1) no longer need explicit
register allocation for Gforth on AMD64, and 2) with a lot of manual
help, we could increase the number of stack cache registers from 1 to
3 on AMD64, which gives some speedups typically in the 0%-20% range in Gforth.
But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
which is vectorizable, I still have not been able to get gcc to auto-vectorize it, even with some transformations which should help.
I have not measured the scalar versions again, but given that there
were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
I doubt that I will see consistent speedups with newer gcc (or clang) versions.
- anton
If VAX designers could not afford a pipeline, then it is
not clear if RISC could afford it: removing microcode
engine would reduce complexity and cost and give some
free space. But microcode engines tend to be simple.
Also, PDP-11 compatibility depended on microcode.
Without microcode engine one would need parallel set
of hardware instruction decoders, which could add
more complexity than was saved by removing microcode
engine.
To summarize, it is not clear to me if RISC in VAX technology
could be significantly faster than the VAX, especially given the
constraint of PDP-11 compatibility.
OTOH VAX designers probably felt
that the CISC nature added significant value: they understood
that the cost of programming was significant and believed that an
orthogonal instruction set, in particular allowing complex
addressing on all operands, made programming simpler.
They
probably thought that providing reasonably common procedures
as microcoded instructions made the work of programmers simpler
even if routines were only marginally faster than ordinary
code.
Part of this thinking was probably like the "Future
System" motivation at IBM: Digital did not want to produce
"commodity" systems, they wanted something with unique
features that customers would want to use. Without
insight into the future it is hard to say that they were
wrong.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
10.5 on a characteristic mix, actually.
See "A Characterization of Processor Performance in the VAX-11/780"
by Emer and Clark, their Table 8.
antispam@fricas.org (Waldek Hebisch) posted:
-----------snip--------------
If VAX designers could not afford a pipeline, then it is
not clear if RISC could afford it: removing microcode
engine would reduce complexity and cost and give some
free space. But microcode engines tend to be simple.
Witness Mc 68000, Mc 68010, and Mc 68020. In all these
designs, the microcode and its surrounding engine took
1/2 of the die-area inside the pins.
In 1980 it was possible to put the data path of a 32-bit
ISA on one die and pipeline it, but runs out of area when
you put microcode on the same die (area). Thus, RISC was
born. Mc88100 had a decoder and sequencer that was 1/8
of the interior area of the chip and had 4 FUs {Int,
Mem, MUL, and FADD} all pipelined.
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
10.5 on a characteristic mix, actually.
See "A Characterization of Processor Performance in the VAX-11/780"
by Emer and Clark, their Table 8.
Going through the VAX 780 hardware schematics and various performance
papers, near as I can tell it took *at least* 1 clock per instruction byte for decode, plus any I&D cache miss and execute time, as it appears to
use microcode to pull bytes from the 8-byte instruction buffer (IB)
*one at a time*.
So far I have not found any parallel pathway that could pull a multi-byte immediate operand from the IB in 1 clock.
And I say "at least" 1 C/IB as I am not including any micro-pipeline
stalls.
The microsequencer has some pipelining, overlapping the read of the next uWord
with execution of the current one, which would introduce a branch delay slot into
the microcode. As it uses the opcode and operand bytes to do N-way
jump/call
to uSubroutines, each of those dispatches might have a branch delay slot
too.
(Similar issues appear in the MV-8000 uSequencer except it appears to
have 2 or maybe 3 microcode branch delay slots).
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
So, the overall sizes (including data size for globals() on RV64GC) are:
          Bytes                        Instructions
arrays  globals       Architecture     arrays  globals
  28    66 (34+32)    RV64GC             12       9
  27    69            AMD64              11       9
  44    84            ARM A64            11      22
  32    68            My 66000            8       5
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test.
Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions,
so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four
instructions (rather than five) are different on these architectures:
* RV64GC uses a compare-and-branch instruction.
* AMD64 uses a load-and-add instruction.
* ARM A64 uses an auto-increment instruction.
* My 66000 uses ST immediate for globals
- anton