Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Could the VAX have been designed as a
RISC architecture to begin with? Because not doing so meant that, just
over a decade later, RISC architectures took over the “real computer” market and wiped the floor with DEC’s flagship architecture, performance-wise.
The answer was no, the VAX could not have been done as a RISC
architecture. RISC wasn’t actually price-performance competitive until the latter 1980s:
RISC didn’t cross over CISC until 1985. This occurred with the
availability of large SRAMs that could be used for caches.
Like other USA-based computer architects, Bell ignores ARM, which outperformed the VAX without using caches and was much easier to
design.
As for code size, we see significantly smaller code for RISC
instruction sets with 16/32-bit encodings such as ARM T32/A32 and
RV64GC than for all CISCs, including AMD64, i386, and S390x <2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
so much better in this respect that its code is significantly smaller
than for these CPUs.
Bottom line: If you sent, e.g., me and the needed documents back in
time to the start of the VAX project, and gave me a magic wand that
would convince the DEC management and workforce
that I know how to
design their next architecture, and how to compile for it, I would
give the implementation team RV32GC as architecture to implement, and
that they should use pipelining for that, and of course also give that
to the software people.
As a result, DEC would have had an architecture that would have given
them superior performance, they would not have suffered from the
infighting of VAX9000 vs. PRISM etc. (and not from the wrong decision
to actually build the VAX9000), and might still be going strong to
this day. They would have been able to extend RV32GC to RV64GC
without problems, and produce superscalar and OoO implementations.
OTOH, DEC had great success with the VAX for a while, and their demise
may have been unavoidable given their market position: Their customers (especially the business customers of VAXen) went to them instead of
IBM, because they wanted something less costly, and they continued
onwards to PCs running Linux when they provided something less costly.
So DEC would also have needed to outcompete Intel and the PC market to succeed (and IBM eventually got out of that market).
- anton
On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:
Like other USA-based computer architects, Bell ignores ARM, which
outperformed the VAX without using caches and was much easier to
design.
Was ARM around when VAX was being designed (~1973) ??
Found this paper <https://gordonbell.azurewebsites.net/Digital/Bell_Retrospective_PDP11_paper_c1998.htm>
at Gordon Bell’s website. Talking about the VAX, which was designed as
the ultimate “kitchen-sink” architecture, with every conceivable
feature to make it easy for compilers (and humans) to generate code,
he explains:
The VAX was designed to run programs using the same amount of
memory as they occupied in a PDP-11. The VAX-11/780 memory range
was 256 Kbytes to 2 Mbytes. Thus, the pressure on the design was
to have very efficient encoding of programs. Very efficient
encoding of programs was achieved by having a large number of
instructions, including those for decimal arithmetic, string
handling, queue manipulation, and procedure calls. In essence, any
frequent operation, such as the instruction address calculations,
was put into the instruction-set. VAX became known as the
ultimate, Complex (Complete) Instruction Set Computer. The Intel
x86 architecture followed a similar evolution through various
address sizes and architectural fads.
The VAX project started roughly around the time the first RISC
concepts were being researched. Could the VAX have been designed as a
RISC architecture to begin with? Because not doing so meant that, just
over a decade later, RISC architectures took over the “real computer” market and wiped the floor with DEC’s flagship architecture, performance-wise.
The answer was no, the VAX could not have been done as a RISC
architecture. RISC wasn’t actually price-performance competitive until
the latter 1980s:
RISC didn’t cross over CISC until 1985. This occurred with the
availability of large SRAMs that could be used for caches. It
should be noted at the time the VAX-11/780 was introduced, DRAMs
were 4 Kbits and the 8 Kbyte cache used 1 Kbits SRAMs. Memory
sizes continued to improve following Moore’s Law, but it wasn’t
till 1985, that Reduced Instruction Set Computers could be built
in a cost-effective fashion using SRAM caches. In essence RISC
traded off cache memories built from SRAMs for the considerably
faster, and less expensive Read Only Memories that held the more
complex instructions of VAX (Bell, 1986).
Lawrence D'Oliveiro wrote:
Found this paper
<https://gordonbell.azurewebsites.net/Digital/Bell_Retrospective_PDP11_paper_c1998.htm>
at Gordon Bell’s website. Talking about the VAX, which was designed as
the ultimate “kitchen-sink” architecture, with every conceivable
feature to make it easy for compilers (and humans) to generate code,
he explains:
The VAX was designed to run programs using the same amount of
memory as they occupied in a PDP-11. The VAX-11/780 memory range
was 256 Kbytes to 2 Mbytes. Thus, the pressure on the design was
to have very efficient encoding of programs. Very efficient
encoding of programs was achieved by having a large number of
instructions, including those for decimal arithmetic, string
handling, queue manipulation, and procedure calls. In essence, any
frequent operation, such as the instruction address calculations,
was put into the instruction-set. VAX became known as the
ultimate, Complex (Complete) Instruction Set Computer. The Intel
x86 architecture followed a similar evolution through various
address sizes and architectural fads.
The VAX project started roughly around the time the first RISC
concepts were being researched. Could the VAX have been designed as a
RISC architecture to begin with? Because not doing so meant that, just
over a decade later, RISC architectures took over the “real computer”
market and wiped the floor with DEC’s flagship architecture,
performance-wise.
The answer was no, the VAX could not have been done as a RISC
architecture. RISC wasn’t actually price-performance competitive until
the latter 1980s:
RISC didn’t cross over CISC until 1985. This occurred with the
availability of large SRAMs that could be used for caches. It
should be noted at the time the VAX-11/780 was introduced, DRAMs
were 4 Kbits and the 8 Kbyte cache used 1 Kbits SRAMs. Memory
sizes continued to improve following Moore’s Law, but it wasn’t
till 1985, that Reduced Instruction Set Computers could be built
in a cost-effective fashion using SRAM caches. In essence RISC
traded off cache memories built from SRAMs for the considerably
faster, and less expensive Read Only Memories that held the more
complex instructions of VAX (Bell, 1986).
If you look at the VAX 8800 or NVAX uArch you see that even in 1990 it
was still taking multiple clocks to serially decode each instruction and
that basically stalls away any benefits a pipeline might have given.
If they had just put in *the things they actually use*
(as shown by DEC's own instruction usage stats from 1982),
and left out all the things that they rarely or never use,
it would have had 50 or so opcodes instead of 305,
at most one operand that addressed memory on arithmetic and logic
opcodes
with 3 address modes (register, register address, register offset
address)
instead of 0 to 5 variable length operands with 13 address modes each
(most combinations of which are either silly, redundant, or illegal).
Then they would have been able to parse instructions in one clock,
which makes pipelining a possible consideration,
and simplifies the uArch so now it can all fit on one chip,
which allows it to compete with RISC.
The reason it was designed the way it was, was because DEC had
microcode and microprogramming on the brain.
In this 1975 paper Bell and Strecker say it over and over and over.
They were looking at the cpu design as one large parsing machine
and not as a set of parallel hardware tasks.
This was their mental mindset just before they started the VAX design:
What Have We Learned From PDP11, Bell Strecker, 1975 https://gordonbell.azurewebsites.net/Digital/Bell_Strecker_What_we%20_learned_fm_PDP-11c%207511.pdf
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
The answer was no, the VAX could not have been done as a RISC architecture. RISC wasn’t actually price-performance competitive until the latter 1980s:
RISC didn’t cross over CISC until 1985. This occurred with the
availability of large SRAMs that could be used for caches.
Like other USA-based computer architects, Bell ignores ARM, which >>outperformed the VAX without using caches and was much easier to
design.
That's not a fair comparison. VAX design started in 1975 and shipped in 1978. The first ARM design started in 1983 with working silicon in 1985. It was a decade later.
On 3/1/2025 5:58 AM, Anton Ertl wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Would likely need some new internal operators to deal with bit-array operations and similar, with bit-ranges allowed as a pseudo-value type
(may exist in constant expressions but will not necessarily exist as an actual value type at runtime).
Say:
val[63:32]
Has the (63:32) as a BitRange type, which then has special semantics
when used as an array index on an integer type, ...
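For reference, here is a plain-C rendering of what the val[63:32] case
above would have to do, assuming val is a uint64_t; the helper names are
made up for illustration:

  #include <stdint.h>

  /* read val[63:32] */
  static uint32_t get_63_32(uint64_t val)
  {
      return (uint32_t)(val >> 32);
  }

  /* write val[63:32] = x, leaving the low half untouched */
  static uint64_t set_63_32(uint64_t val, uint32_t x)
  {
      return (val & 0x00000000FFFFFFFFull) | ((uint64_t)x << 32);
  }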
The previous idea for bitfield extract/insert had turned into a
composite BITMOV instruction that could potentially do both operations
in a single instruction (along with moving a bitfield directly between
two instructions).
The idea here is that it does, essentially, a combination of a shift and a masked bit-select, say:
Low 8 bits of immediate encode a shift in the usual format:
Signed 8-bit shift amount, negative is right shift.
High bits give a pair of bit-offsets used to compose a bit-mask.
These will MUX between the shifted value and another input value.
I am still not sure whether this would make sense in hardware, but it is
not entirely implausible to implement in the Verilog (a rough C model of
the operation is sketched below, after the pipeline outline).
Would likely be a 2 or 3 cycle operation, say:
EX1: Do a Shift and Mask Generation;
May reuse the normal SHAD unit for the shift;
Mask-Gen will be specialized logic;
EX2:
Do the MUX.
EX3:
Present MUX result as output (passed over from EX2).
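Here is a rough C model of the BITMOV operation as described above; the
immediate field layout is only my guess from the description, not a
definitive encoding:

  #include <stdint.h>

  /* Shift one source, build a mask from two bit offsets, then MUX
     between the shifted value and a second input (the insert target). */
  static uint64_t bitmov(uint64_t src, uint64_t other, uint32_t imm)
  {
      int8_t   sh = (int8_t)(imm & 0xFF);        /* signed shift, negative = right */
      unsigned lo = (imm >>  8) & 0x3F;          /* low  end of the selected field */
      unsigned hi = (imm >> 16) & 0x3F;          /* high end of the selected field */

      uint64_t shifted = (sh >= 0) ? (src << sh) : (src >> -sh);

      /* mask covering bits hi..lo inclusive (assumes hi >= lo) */
      uint64_t mask = (hi - lo >= 63) ? ~0ull
                    : ((1ull << (hi - lo + 1)) - 1) << lo;

      return (shifted & mask) | (other & ~mask); /* the MUX step */
  }

  /* e.g. y[55:48]=x[19:12] would be bitmov(x, y, imm) with sh=+36, lo=48, hi=55 */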
The other thing is that the VAX 11/780 (released 1977) had a 2KB cache,
so Bell's argument that caches were only available around 1985 does not
hold water on that end, either.
IBM tried to commercialize it in the ROMP in the IBM RT PC; Wikipedia
says: "The architectural work on the ROMP began in late spring of
1977, as a spin-off of IBM Research's 801 RISC processor ... The first examples became available in 1981, and it was first used commercially
in the IBM RT PC announced in January 1986. ... The delay between the completion of the ROMP design, and introduction of the RT PC was
caused by overly ambitious software plans for the RT PC and its
operating system (OS)." And IBM then designed a new RISC, the
RS/6000, which was released in 1990.
It almost seems like they could have tried making a PDP-11 based PC.
DEC could have maybe had a marketing advantage in, say, "Hey, our crap
can run UNIX" and "UNIX is better than DOS".
How many clocks did Alpha take to process each instruction?
On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:
As for code size, we see significantly smaller code for RISC
instruction sets with 16/32-bit encodings such as ARM T32/A32 and
RV64GC than for all CISCs, including AMD64, i386, and S390x
<2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
so much better in this respect that its code is significantly smaller
than for these CPUs.
VAX's advantage was it executed fewer instructions (VAX only executed
65% of the number of instructions R2000 executed.)
Bottom line: If you sent, e.g., me and the needed documents back in
time to the start of the VAX project, and gave me a magic wand that
would convince the DEC management and workforce
You would also have to convince the Computer Science department at
CMU; Where a lot of VAX ideas were dreamed up based on the success
of the PDP-11.
A pipelined machine in 1978 would have had 50% to 100% more circuit
boards than VAX 11/780, making it a lot more expensive.
The design point you target for the original VAX would have taken significantly longer to design, debug, and ship.
On Sat, 01 Mar 2025 11:58:17 GMT, Anton Ertl wrote:
Like other USA-based computer architects, Bell ignores ARM, which
outperformed the VAX without using caches and was much easier to design.
While those ARM chips were legendary for their low power consumption (and
low transistor count), those Archimedes machines were not exactly low-
cost, as I recall.
Without caches, did they have to use faster (and therefore more expensive) memory?
Or did they fall back on the classic “wait states”?
On Sat, 01 Mar 2025 22:25:26 GMT, Anton Ertl wrote:
The other thing is that the VAX 11/780 (released 1977) had a 2KB cache,
so Bell's argument that caches were only available around 1985 does not
hold water on that end, either.
It was about the sizes of the caches and hence their contribution to the cost.
Not sure about what instruction scheduling was like on the Alpha,
On Sun, 02 Mar 2025 09:34:37 GMT, Anton Ertl wrote:...
mitchalsup@aol.com (MitchAlsup1) writes:
A pipelined machine in 1978 would have had 50% to 100% more circuit boards than VAX 11/780, making it a lot more expensive.
You could look at the MIT Lisp Machine, it used basically the same chips
as a VAX 11/780 but was a pipelined load/store architecture internally.
That's not a fair comparison. VAX design started in 1975 and shipped in 1978. The first ARM design started in 1983 with working silicon in 1985. It was a decade later.
The point is that ARM outperformed VAX without using caches. DRAM
with 800ns cycle time was available in 1971 (the Nova 800 used it).
By 1977, when the VAX 11/780 was released, certainly faster DRAM was available.
On 3/2/2025 5:46 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
It almost seems like they could have tried making a PDP-11 based PC.
I dimly remember that there were efforts in that direction. But the
PDP-11 does not even have the cumbersome support for more than 64KB
that the 8086 has (there were PDP-11s with more, but that was even
more cumbersome to use).
I had thought it apparently used a model similar to the 65C816.
Namely, that you could address 64K code + 64K data at a time, but then
load a value into a special register to access different RAM banks.
Granted, no first hand experience with PDP-11.
DEC also tried their hand in the PC-like business (DEC Rainbow 100).
They did not succeed. Maybe that's the decisive difference from HP:
They did succeed in the PC market.
I guess they could have also tried competing against the Commodore 64
and Apple II, which were also popular around that era.
No idea how their pricing compared with the IBM PC's, but in any case,
those who had success were generally a lot cheaper.
Well, except for the Macintosh apparently, which managed to survive with
its comparatively higher costs.
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:
As for code size, we see significantly smaller code for RISC
instruction sets with 16/32-bit encodings such as ARM T32/A32 and
RV64GC than for all CISCs, including AMD64, i386, and S390x
<2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
so much better in this respect that its code is significantly smaller
than for these CPUs.
VAX's advantage was it executed fewer instructions (VAX only executed
65% of the number of instructions R2000 executed.)
This agrees with my estimate that a CPU with 3 RV32GC MIPS would have
the same performance as a CPU with 2 VAX MIPS (2/3 ≈ 65%).
Bottom line: If you sent, e.g., me and the needed documents back in
time to the start of the VAX project, and gave me a magic wand that
would convince the DEC management and workforce
You would also have to convince the Computer Science department at
CMU; Where a lot of VAX ideas were dreamed up based on the success
of the PDP-11.
Yes, include that in my magic wand.
A pipelined machine in 1978 would have had 50% to 100% more circuit
boards than VAX 11/780, making it a lot more expensive.
What makes you think that a pipelined single-issue RV32GC would take
more circuit boards than VAX11/780?
I have no data about discrete implementations, but if we look at
integrated ones and assume that the number of transistors or the area
corresponds to the number of circuit boards in discrete
implementations, the evidence goes in the opposite direction:
huge portion of the transistor count was ROM
Transistors  area   proc   CPU
125,000      74.82  3um    MicroVAX 78032 (integer-only, some instructions missing)
 68,000      44     3.5um  68000 (integer-only, no MMU)
2/3rds of the transistor count in ROM
 45,000      58.52  2um    ROMP (integer-only, no MMU, three pipeline stages)
Twice the 68K data path transistor count.
 25,000      50     3um    ARM1 (integer-only, no MMU, pipelined)
This gives some credence that it can be done
110,000      ?      1.2um  SPARC MB86900 (integer-only, pipelined)
110,000      80     2um    MIPS R2000 (integer-only, pipelined)
These two counteract that credence, with 40K of those transistors
It seems that the MMU cost a lot of transistors, while the pipelining
did not, as especially the ARM1 shows.
The design point you target for the original VAX would have taken significantly longer to design, debug, and ship.
What makes you think so? A major selling point of RISC especially
compared to the VAX was that the reduced instruction-set complexity
reduces the implementation effort.
And the fact that the students of
Berkeley and Stanford could produce their prototypes in a short time
lends credibility to the claim.
You write that VAX work began in 1973; it was introduced in 1977 (but
when were machines shipped to customers?), which would mean that
development also took 4 years. According to <https://en.wikipedia.org/wiki/VAX-11>, development began in 1976, but
that is hard to believe, especially given the CISC-based problems such
as having to keep many pages in physical memory at the same time.
- anton
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
How many clocks did Alpha take to process each instruction?
For the 21064 see slide 15 of <https://people.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture19.pdf>
I.e., about 1 CPI for Ear, and about 4.3 CPI for TCP-B, with other
benchmarks in between.
Theoretical bottom CPI (peak performance) of the 21064 is 0.5.
- anton
And Macintosh was initially successful as a sort of niche machine
for "creative types", as opposed to "business users" who used PCs.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
MIPS used 64KB caches for the R2000? Because they could, in 1986.
Motorola used 16KB caches for the 88000? Obviously 64KB is not all
that necessary. Acorn used a 4KB shared cache for ARM3? Because it
allowed them to do it on a single chip; it still gives good benefits.
My impression is that Bell was just grasping at straws to justify
their wrong choices.
He looked at other differences (rather than the instruction set) between the MIPS R2000 and the VAX, and if it
represented something that was not available at acceptable cost in
1977 (in particular, 64KB caches), he used it as justification for the
VAX.
- anton
We wasted a lot of time explaining why we weren't going to do random
IBM stuff of which the most memorable was user labels in the inodes
(well, OS DASD has them.)
My impression is that Bell was just grasping at straws to justify their
wrong choices.
But academic efforts do not result in industrial quality products.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
That's not a fair comparison. VAX design started in 1975 and shipped in 1978.
The first ARM design started in 1983 with working silicon in 1985. It was a decade later.
The point is that ARM outperformed VAX without using caches. DRAM
with 800ns cycle time was available in 1971 (the Nova 800 used it).
By 1977, when the VAX 11/780 was released, certainly faster DRAM was available.
How was the code density?
I know ARM was pretty good but VAX
was fantastic since they sacrificed everything else to compact instructions.
I had thought it apparently used a model similar to the 65C816.
Namely, that you could address 64K code + 64K data at a time, but then
load a value into a special register to access different RAM banks.
That was not what customers were interested in. There were various
Unix variants available for the PC, but the customers preferred using
DOS, which was preinstalled and did not cost extra. ...
On Sun, 2 Mar 2025 21:26:34 +0000, MitchAlsup1 wrote:
But academic efforts do not result in industrial quality products.
*Cough* Unix *cough*
On Sun, 2 Mar 2025 21:57:57 +0000, Lawrence D'Oliveiro wrote:
On Sun, 2 Mar 2025 21:26:34 +0000, MitchAlsup1 wrote:
But academic efforts do not result in industrial quality products.
*Cough* Unix *cough*
Not sure you can call Bell Labs academia.
I know ARM was pretty good but VAX
was fantastic since they sacrificed everything else to compact instructions.
I don't think they did. They spent encoding space on instructions
that were very rare, and AFAIK instructions can be encoded that do not
work (e.g., a constant as destination). The major idea seems to have
been orthogonality, not compactness.
Nearly all opcodes were one byte other than the extended format floating point instructions so it's hard to see how they could have made that
much smaller without making it a lot more complicated.
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
That's not a fair comparison. VAX design started in 1975 and shipped in 1978.
The first ARM design started in 1983 with working silicon in 1985. It was a decade later.
The point is that ARM outperformed VAX without using caches. DRAM
with 800ns cycle time was available in 1971 (the Nova 800 used it).
By 1977, when the VAX 11/780 was released, certainly faster DRAM was available.
How was the code density?
I have no data on that. Interestingly, unlike the 68k, which was
outcompeted by RISCs at around the same time, the VAX did not have an afterlife of hobbyists who produced Linux and Debian ports, so I
cannot easily make a comparison.
Nearly all opcodes were one byte other than the extended format floating point instructions so it's hard to see how they could have made that much smaller without making it a lot more complicated.
The VAX is still supported with gcc and binutils, with newlib as
its C library, so building up a tool chain for assembly/disassembly
should be doable with a few (CPU) hours; you can then compare
sizes.
John Levine <johnl@taugh.com> writes:
How was the code density?
I have no data on that. Interestingly, unlike the 68k, which was
outcompeted by RISCs at around the same time, the VAX did not have an afterlife of hobbyists who produced Linux and Debian ports, so I
cannot easily make a comparison.
And looking at my latest code size measurements <2024Jan4.101941@mips.complang.tuwien.ac.at>, both armhf (ARM T32) and riscv64 (RV64GC) result in shorter code than IA-32 and AMD64:
bash grep gzip
595204 107636 46744 armhf
599832 101102 46898 riscv64
796501 144926 57729 amd64
853892 152068 61124 i386
On Sun, 2 Mar 2025 13:19:32 +0000, Anton Ertl wrote:
My impression is that Bell was just grasping at straws to justify
their wrong choices.
Likely, but looking at it from the originating time perspective,
VAX would have lost PDP-11 compatibility if it were more RISC-like.
NetBSD still has a VAX port, so the sizes of pre-built packages from
there might be informative.
Anton Ertl wrote:
They did not succeed. Maybe that's the decisive difference from HP:
They did succeed in the PC market.
For some definition of success, i.e., they were sufficiently worse at PCs
to later merge with Compaq who was the first significant vendor in the
PC Compatible marketplace.
Columbia beat both of them by half a year or
so, but faded away a bit later.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
They did not succeed. Maybe that's the decisive difference from HP:
They did succeed in the PC market.
For some definition of success, i.e they were sufficiently worse at PCs
to later merge with Compaq who was the first significant vendor in the
PC Compatible marketplace.
Pfeiffer got Compaq into trouble by buying DEC and not being able to
digest it. HP then bought Compaq and was able to digest all the
parts, leading to a successful PC business (I have no idea how much
Compaq contributed to that and how much HP did) and a successful HPE;
pretty much all of the stuff coming from/through DEC went away (I
think the Tandem legacy may still be identifiable), but maybe they
managed to keep the customers.
Columbia beat both of them by half a year or
so, but faded away a bit later.
I don't think I ever heard about Columbia. At what did they beat
Compaq and HP?
Note that the “big bang” arrival of RISC in the
latter 1980s is pretty much in agreement with his timeline.
It seems newer gcc is much worse than older versions at generating compact i386 code.
MarkC <usenet@em-ess-see-twenty-seven.me.uk> writes:
NetBSD still has a VAX port, so the sizes of pre-built packages from
there might be informative.
Thanks. The NetBSD pkgsrc is not limited to NetBSD, and there is a
wide variety of prebuilt stuff there. I took those that sound like architecture names (and probably belong to NetBSD): aarch64 alpha
amd64 earmv7hf i386 m68k mips64eb mipsel powerpc sparc sparc64 vax
Unfortunately, they do not seem to port to RISC-V in any form yet, and
their earmv7hf port uses ARM A32, not T32. So the NetBSD competition
is performed without entries for those two instruction set encodings
that showed the smallest code sizes on Debian. Anyway, here are the
results:
          bash    grep     xz
  710838           42236   m68k
  748354  159304   40930   vax
  829077  176836   42840   amd64
  855400  164188           aarch64
  877284  186924   48032   sparc
  882847  187203   49866   i386
  898532  179844           earmv7hf
  962128  205776   54704   powerpc
 1004864  192256   53632   sparc64
 1025136           51160   mips64eb
 1147664  232688   63456   alpha
 1172692                   mipsel
Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
A pipelined machine in 1978 would have had 50% to 100% more circuit
boards than VAX 11/780, making it a lot more expensive.
What makes you think that a pipelined single-issue RV32GC would take
more circuit boards than VAX11/780? I have no data about discrete
implementations, but if we look at integrated ones and assume that the
number of transistors or the area corresponds to the number of circuit
boards in discrete implementations, the evidence goes in the opposite
direction:
The first article in this Mar-1987 HP Journal is about the
HP 9000 MODEL 840 and HP 3000 Series 930 implementing the HP PA ISA.
The cpu is 5 boards, 6 with FPU, built with standard and FAST TTL.
Implementation started in Apr-1983, prototype ready early 1984.
"[3 stage] pipeline fetches and executes an instruction every 125 ns,
a 4096-entry translation lookaside buffer (TLB) for high-speed address translation, and 128K bytes of cache memory."
"The measured MIPS rate for the Model 840 varies from
about 3.5 to 8 MIPS with an average of 4.5 to 5."
which at 125 ns, 8 MHz clock is an IPC of 0.43 to 1.0, avg 0.625.
https://archive.org/download/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard.pdf
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Have TTL chips advanced between the VAX and the first HP-PA
implementation? I don't think so.
Oh yes, they did; there were nine years between the launch of the
VAX and the launch of HP-PA.
According to https://www.openpa.net/pa-risc_processor_pa-early.html#ts-1
the first HP-PA CPU was introduced in 1986, and you can see pictures
at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/
For example, you could buy state machines programmable by FPGA in 1986,
which was not available in 1977. (No idea if HP used them or not).
MarkC <usenet@em-ess-see-twenty-seven.me.uk> writes:
NetBSD still has a VAX port, so the sizes of pre-built packages from
there might be informative.
Thanks. The NetBSD pkgsrc is not limited to NetBSD, and there is a wide variety of prebuilt stuff there. I took those that sound like
architecture names (and probably belong to NetBSD): aarch64 alpha amd64 earmv7hf i386 m68k mips64eb mipsel powerpc sparc sparc64 vax
Unfortunately, they do not seem to port to RISC-V in any form yet, and
their earmv7hf port uses ARM A32, not T32. So the NetBSD competition is performed without entries for those two instruction set encodings that
showed the smallest code sizes on Debian. Anyway, here are the results:
If your aim is small code size, it is better to compare output compiled
with -Os.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Note that the “big bang” arrival of RISC in the latter 1980s is pretty much in agreement with his timeline.
Correlation does not prove causation.
... while the guy who hired me kept his beloved DEC Rainbow which he
felt had the better architecture:
For one thing they did not break Intel's rules about where to place the interrupt vectors. In hindsight this was a bad decision since 100% compatibility with Microsoft Flight Simulator was an absolute
requirement at the time.
Anton Ertl wrote:
They did not succeed. Maybe that's the decisive difference from HP:
They did succeed in the PC market.
... VAX has 16 GPRs ...
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
A pipelined machine in 1978 would have had 50% to 100% more circuit
boards than VAX 11/780, making it a lot more expensive.
What makes you think that a pipelined single-issue RV32GC would take
more circuit boards than VAX11/780? I have no data about discrete
implementations, but if we look at integrated ones and assume that the
number of transistors or the area corresponds to the number of circuit
boards in discrete implementations, the evidence goes in the opposite
direction:
The first article in this Mar-1987 HP Journal is about the
HP 9000 MODEL 840 and HP 3000 Series 930 implementing the HP PA ISA.
The cpu is 5 boards, 6 with FPU, built with standard and FAST TTL.
Implementation started in Apr-1983, prototype ready early 1984.
<https://people.csail.mit.edu/emer/media/papers/1999.06.retrospective.vax.pdf>
says:
|the VAX 11/780 CPU spanned about 20 boards.
Have TTL chips advanced between the VAX and the first HP-PA
implementation? I don't think so. Are the boards of a different
size? If the answers to both questions are "no", this would be counterevidence to Mitch Alsup's claim.
"[3 stage] pipeline fetches and executes an instruction every 125 ns,
a 4096-entry translation lookaside buffer (TLB) for high-speed address
translation, and 128K bytes of cache memory."
"The measured MIPS rate for the Model 840 varies from
about 3.5 to 8 MIPS with an average of 4.5 to 5."
which at 125 ns, 8 MHz clock is an IPC of 0.43 to 1.0, avg 0.625.
It's interesting that this HP machine needed a cache at 8MHz, while
the contemporary ARM2 could run from DRAM at the same speed. But
then, the HP machine supports bigger memories, and includes an MMU,
both of which slow things down.
https://archive.org/download/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard.pdf
- anton
That was not what customers were interested in. There were various
Unix variants available for the PC, but the customers preferred using
DOS, which was preinstalled and did not cost extra. ...
Yup. PC/IX was a really nice Unix port for the IBM PC and nobody was interested.
On Mon, 3 Mar 2025 17:53:35 -0000 (UTC), Thomas Koenig wrote:
If your aim is small code size, it is better to compare output compiled
with -Os.
Then it becomes an artificial benchmark, trying to minimize code size at
the expense of real-world performance.
Remember, VAX was built for real-world use, not for academic benchmarks.
You could compare sizes of applications in the base.tgz tarball for each architecture; this is available for RISC-V as well as all the others.
I found an ARM2 manual and the short answer is that the chip
drives the RAS and CAS signals to the dram directly.
The chip's clock is adjustable from 100 kHz to 10 MHz
and you match the cpu clock to your dram timing.
There is no READY line on the memory bus.
It does have one interesting feature that if the current address
is sequential to the prior one it skips the RAS cycle.
The Motorola Memory Book from 1979 shows MCM4027A 4kb*1 drams with
80 to 165 ns CAS access, 120 to 250 RAS access, 320 to 375 R/W cycle.
Similar numbers for MCM4116A 16kb*1 R/W cycle of 500 ns.
VAX probably used 4kb 500 ns drams.
On 3/2/25 5:27 PM, John Levine wrote:
Yup. PC/IX was a really nice Unix port for the IBM PC and nobody
was interested.
As this (the kernel part) was my project, it was very
disappointing. I think IBM priced such that with DOS being "free",
it had no chance.
On Mon, 03 Mar 2025 17:21:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Digital sold solid PCs in the 1990s. Some under brand DECpc, others
under brand DEC Station.
On Mon, 3 Mar 2025 19:53:12 +0200, Michael S
<already5chosen@yahoo.com> wrote:
On Mon, 03 Mar 2025 17:21:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Digital sold solid PCs in the 1990s. Some under brand DECpc, others
under brand DEC Station.
Was there an Intel based DECstation?
The only ones I ever saw were MIPS based.
<searches>
Ahh! Wikipedia says there were 3 different DECstation lines: one based
on PDP-8, another based on MIPS, and yet another based on Intel.
Naturally one has to scan/read the entire article to find the Intel references.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Mon, 3 Mar 2025 17:53:35 -0000 (UTC), Thomas Koenig wrote:
If your aim is small code size, it is better to compare output
compiled with -Os.
Then it becomes an artificial benchmark, trying to minimize code size
at the expense of real-world performance.
Remember, VAX was built for real-world use, not for academic
benchmarks.
And supposedly the real-world constraints at the time made it necessary
to minimize code size.
In the current discussion we look at how RV32GC might have fared under
this constraint.
On Tue, 04 Mar 2025 16:32:42 -0500 George Neuner <gneuner2@comcast.net> wrote:
Ahh! Wikipedia says there were 3 different DECstation lines: one based
on PDP-8, another based on MIPS, and yet another based on
Naturally one has to scan/read the entire article to find the Intel
references.
The names I dug out of a 1994 Byte issue are DEC Celebris desktop and
HiNote laptop. But those are not the names I had in mind.
According to Wikipedia, PC/IX cost $900 and was released
in 1984. By that time, there was a lot of business software and games available for DOS, but presumably, very little for PC/IX?
In the current discussion we look at how RV32GC might have fared under
this constraint.
Sure. Except you need a much more complicated and resource-hungry compiler than would have been reasonable to run on a VAX back then.
How would you have done games without being able to directly
address screen memory? I'm sure PC/IX, being a Unix-type system,
would have disallowed that.
On Wed, 5 Mar 2025 01:28:18 +0200, Michael S wrote:
On Tue, 04 Mar 2025 16:32:42 -0500 George Neuner
<gneuner2@comcast.net> wrote:
Ahh! Wikipedia says there were 3 different DECstation lines: one
based on PDP-8, another based on MIPS, and yet another based on
Intel. Naturally one has to scan/read the entire article to find
the Intel references.
The names I dug out of a 1994 Byte issue are DEC Celebris desktop and
HiNote laptop. But those are not the names I had in mind.
Entirely different decades.
Robert Swindells <rjs@fdy2.co.uk> writes:
On Sun, 02 Mar 2025 09:34:37 GMT, Anton Ertl wrote:...
mitchalsup@aol.com (MitchAlsup1) writes:
A pipelined machine in 1978 would have had 50% to 100% more circuit boards than VAX 11/780, making it a lot more expensive.
You could look at the MIT Lisp Machine, it used basically the same chips
as a VAX 11/780 but was a pipelined load/store architecture internally.
And what was the effect on the number of circuit boards? What effect
did the load/store architecture have, and what effect did the pipelining have?
It's been a number of years since I read about Lisp Machines and
Symbolics. My impression was that they were both based on CISCy ideas;
it's about closing the semantic gap, no? Load/store would surprise me.
And when the RISC revolution came, they could not compete. The RISCy
way to Lisp implementation was explored in SPUR (and Smalltalk in SOAR)
(one of which counts as RISC-III and the other as RISC-IV, I don't
remember which), and commercialized in SPARC's instructions with support
for tags (not used in the Lisp system that a former comp.arch regular contributed to).
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Have TTL chips advanced between the VAX and the first HP-PA
implementation? I don't think so.
Oh yes, they did; there were nine years between the launch of the
VAX and the launch of HP-PA.
So what?
and you can see pictures
at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/
Nice! The pictures are pretty good. I can read the markings on the
chip. The first chip I looked at was marked 74AS181. TI introduced
the 74xx series of TTL chips starting in 1964, and when I read TTL, I expected to see 74xx chips. The 74181 was introduced in February
1970, and I expected it to be in the HP-PA CPU, as well as in the VAX
11/780. <https://en.wikipedia.org/wiki/74181> confirms my expectation
for the VAX, and the photo confirms my expectation for the first HP-PA
CPU.
The AS family was only introduced in 1980, so there were some advances
between the VAX and this HP-PA CPU indeed. However, as far as the
number of boards is concerned, a 74AS181 takes as much space as a
plain 74181, so that difference is irrelevant for that aspect.
I leave it to you to point out a chip on the HP-PA CPU that did not
have a same-sized variant available in, say, 1975.
Anton Ertl [2025-03-01 11:58:17] wrote:
Bottom line: If you sent, e.g., me and the needed documents back in
time to the start of the VAX project, and gave me a magic wand that
would convince the DEC management and workforce that I know how to
design their next architecture, and how to compile for it, I would
give the implementation team RV32GC as architecture to implement, and
that they should use pipelining for that, and of course also give that
to the software people.
I wonder if an RV32GC would be competitive if implemented in the
technology available back in 1977 (when the VAX-11/780 came out,
according to Wikipedia).
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Have TTL chips advanced between the VAX and the first HP-PA
implementation? I don't think so.
Oh yes, they did; there were nine years between the launch of the
VAX and the launch of HP-PA.
So what?
and you can see pictures
at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/
Nice! The pictures are pretty good. I can read the markings on the
chip. The first chip I looked at was marked 74AS181. TI introduced
the 74xx series of TTL chips starting in 1964, and when I read TTL, I
expected to see 74xx chips. The 74181 was introduced in February
1970, and I expected it to be in the HP-PA CPU, as well as in the VAX
11/780. <https://en.wikipedia.org/wiki/74181> confirms my expectation
for the VAX, and the photo confirms my expectation for the first HP-PA
CPU.
The AS family was only introduced in 1980, so there were some advances
between the VAX and this HP-PA CPU indeed. However, as far as the
number of boards is concerned, a 74AS181 takes as much space as a
plain 74181, so that difference is irrelevant for that aspect.
I leave it to you to point out a chip on the HP-PA CPU that did not
have a same-sized variant available in, say, 1975.
What I found intriguing are the chips that have numbers on paper
on them, like 09740-81710. That chip has a MMI logo still sticking
out. This is the logo of Monolithic Memories, Inc. which developed
the PAL chips of "Soul of a New Machine" and Eclipse MV 8000 fame.
At https://en.wikipedia.org/wiki/Programmable_Array_Logic you can
see the logo of the company.
PALs were not available for the VAX development, and they certainly
made implementing logic far less cumbersome, and they took up far less
space than their equivalent in logic gates (again, as described in
"The Soul of a New Machine", where Tom West gambled the development
on MMI getting its act together).
Given a (very rough) estimate that each PAL replaced four standard
logic chips of similar size, my guess would be that it saved
them the equivalent of two to three circuit boards, not bad.
Another striking thing is how densely the circuit boards are packed,
compared to the VAX boards one finds. I suspect they had access
to more layers of printed circuit board than DEC ten years earlier.
The same word layout of using the "free" lower bits for tags when
you know that objects are aligned to larger boundaries is still used
in most Lisp systems today, just without any hardware support; you
need to generate instructions to shift down an integer value before
using it.
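A minimal C sketch of that layout, assuming 8-byte-aligned objects so
the low 3 bits of a word are free for a tag (all names here are made up
for illustration):

  #include <stdint.h>

  #define TAG_BITS   3
  #define TAG_MASK   ((uintptr_t)((1 << TAG_BITS) - 1))
  #define TAG_FIXNUM 1                  /* arbitrary tag value for small integers */

  typedef uintptr_t lispval;

  static lispval  make_fixnum(intptr_t n) { return ((uintptr_t)n << TAG_BITS) | TAG_FIXNUM; }
  static int      is_fixnum(lispval v)    { return (v & TAG_MASK) == TAG_FIXNUM; }
  /* the shift-down mentioned above, needed before arithmetic on the raw value */
  static intptr_t fixnum_value(lispval v) { return (intptr_t)v >> TAG_BITS; }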
On Wed, 5 Mar 2025 15:01:11 -0000 (UTC), Robert Swindells wrote:
The same word layout of using the "free" lower bits for tags when
you know that objects are aligned to larger boundaries is still used
in most Lisp systems today, just without any hardware support, you
need to generate instructions to shift down an integer value before
using it.
*Lightbulb moment*
How much would it cost in hardware to add support for ignoring some bottommost N bits (N fixed? configurable?) for most accesses?
This ties in with my idea that it would have been useful to reserve the bottom 3 bits for a bit offset, albeit ignored (or even MBZ) by normal load/store instructions.
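In software terms, "ignore the bottommost N bits on most accesses" would
amount to something like the following, here with N = 3; the helper name
is made up for illustration:

  #include <stdint.h>

  static uint64_t load64_ignoring_low_bits(const void *tagged_ptr)
  {
      uintptr_t p = (uintptr_t)tagged_ptr & ~(uintptr_t)7;   /* strip low 3 bits */
      return *(const uint64_t *)p;
  }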
ARM2 had no caches, but was still table-topping in its era.
By contrast, making good use of the complex instructions of VAX in a
compiler consumed significant resources (e.g., Figure 2 of
https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
factor 1.5 more code in the code generator for VAX than for RISC-II).
Compilers at the time did not use the CISCy features much, which is
one reason why the IBM 801 project and later the Berkeley RISC and
Stanford MIPS proposed replacing them with a load/store architecture.
Given a (very rough) estimate that each PAL replaced four standard
logic chips of similar size, my guess would be that it saved
them the equivalent of two to three circuit boards, not bad.
Another striking thing is how densely the circuit boards are packed,
compared to the VAX boards one finds. I suspect they had access
to more layers of printed circuit board than DEC ten years earlier.
MMI's PAL Programmable Array Logic is a subset of a Programmable Logic Array. PAL has programmable AND matrix but a fixed OR matrix.
PLA has both AND and OR matrix programmable.
Mask programmed PLA's were available since 1970, and field programmable FPLA's available in 1976 from a number of suppliers (e.g. Signetics). https://en.wikipedia.org/wiki/Programmable_logic_array
If one was building a RISC style ISA cpu in 1975 they could be used
for decoding and state machines for fetch, load/store, page table walker.
I don't know the price.
I'm not so sure. The IBM Fortran H compiler used a lot of the 360's instruction
set and it is my recollection that even the dmr C compiler would generate memory
to memory instructions when appropriate. The PL.8 compiler generated code for 5
architectures including S/360 and 68K, and I think I read somewhere that its S/360 code was considerably better than the native PL/I compilers.
I get the impression that they found that once you have a reasonable number of
registers, like 16 or more, the benefit of complex instructions drops because you can make good use of the values in the registers.
In article <vq82c8$232tl$7@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:
How would you have done games without being able to directly
address screen memory? I'm sure PC/IX, being a Unix-type system,
would have disallowed that.
How?
There's no memory management hardware in an 8088, and PC/IX ran on a
basic PC/XT.
John
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
By contrast, making good use of the complex instructions of VAX in a compiler consumed significant resources (e.g., Figure 2 of https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
factor 1.5 more code in the code generator for VAX than for
RISC-II). Compilers at the time did not use the CISCy features
much, which is one reason why the IBM 801 project and later the
Berkeley RISC and Stanford MIPS proposed replacing them with a
load/store architecture.
VAX instructions are very complex and much of that complexity is hard
to use in compilers. But even an extremely simple compiler
can generate load-op combinations, decreasing the number of instructions.
A rather simple hack is enough to combine additions in address
arithmetic into an addressing mode. Also, operations with two or three
memory addresses are easy to generate from a compiler. I think
that chains of pointer dereferences in C should not be hard to
convert to indirect addressing modes.
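As a rough illustration of the kind of folding meant here, standard VAX
addressing modes let even a simple compiler map statements like the
following onto single instructions; the mnemonics in the comments are
the usual VAX ones, but the register assignments are hypothetical:

  long a, b, c, v[100];

  void fold_examples(long i, long **p)   /* assume i is in R1 and p is in R2 */
  {
      c = a + b;      /* ADDL3 a, b, c     - one three-operand memory-to-memory add      */
      c = v[i + 4];   /* MOVL  v+16[R1], c - index mode folds both the +4 and the *4     */
      c = **p;        /* MOVL  @0(R2), c   - displacement-deferred absorbs one dereference */
  }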
I think that state of chip technology was more important. For
example 486 has RISC-like pipeline with load-ops, but load-ops
take the same time as two separate instructions. Similarly,
operations on memory take the same time as load-op-store.
So there was no execution-time gain from combined instructions
and clearly some complication compared to load/store
architecture.
Main speed gain of RISC came from having
pipeline on a chip (multichip processors were pipelined,
but expensive; earlier single-chip ones had no pipeline).
So load/store architecture (and no microcode) meant that
early RISC could offer good pipeline earlier.
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC)
antispam@fricas.org (Waldek Hebisch) wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
By contrast, making good use of the complex instructions of VAX in a
compiler consumed significant resources (e.g., Figure 2 of
https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
factor 1.5 more code in the code generator for VAX than for
RISC-II). Compilers at the time did not use the CISCy features
much, which is one reason why the IBM 801 project and later the
Berkeley RISC and Stanford MIPS proposed replacing them with a
load/store architecture.
VAX instructions are very complex and much of that complexity
is hard to use in compilers. But even an extremely simple compiler
can generate load-op combinations, decreasing the number of instructions.
A rather simple hack is enough to fold additions in address
arithmetic into an addressing mode. Also, operations with two or three
memory addresses are easy to generate from a compiler. I think
that chains of pointer dereferences in C should not be hard to
convert to an indirect addressing mode.
I think that the state of chip technology was more important. For
example, the 486 has a RISC-like pipeline with load-ops, but load-ops
take the same time as two separate instructions. Similarly,
operations on memory take the same time as load-op-store.
So there was no execution-time gain from the combined instructions,
and clearly some complication compared to a load/store
architecture.
In the specific case of the i486, with its small (8KB) unified I+D cache,
you will see a good gain from load+op combining, even if going by the cycle
counts in the manual they are the same.
For the Pentium, not necessarily so.
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:
VAX instructions are very complex and much of that complexity is hard
to use in compilers.
A lot of them mapped directly to common high-level operations. E.g.
MOVC3/
MOVC5 for string copying, and of course POLYx for direct evaluation of polynomial functions.
In a way, one could say that, in many ways, VAX machine language was a higher-level language than Fortran.
On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
----------------
So, writing things like:
y[55:48]=x[19:12];
And:
j=x[19:12];
Also a single instruction, or 2 or 3 in the fallback case (encoded as a
shift and mask).
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
Though, looking at the compiler code, it would be subject to the "side effects in lvalue may be applied twice" bug:
(*ct++)[19:12]=(*cs++)[15:8];
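For readers without the bitfield extension, a plain-C sketch of what the y[55:48]=x[19:12] assignment above boils down to in the shift-and-mask fallback case (the helper name is invented for the example):

#include <stdint.h>

/* Insert bits x[19:12] into bits y[55:48] by shifting and masking. */
static inline uint64_t insert_19_12_into_55_48(uint64_t y, uint64_t x)
{
    uint64_t f = (x >> 12) & 0xFFu;    /* extract x[19:12] */
    y &= ~(0xFFull << 48);             /* clear y[55:48]   */
    return y | (f << 48);              /* insert the field */
}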
On Fri, 7 Mar 2025 4:09:13 +0000, Lawrence D'Oliveiro wrote:
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:
VAX instructions are very complex and much of that complexity is hard
to use in compilers.
A lot of them mapped directly to common high-level operations. E.g.
MOVC3/
MOVC5 for string copying, and of course POLYx for direct evaluation of
polynomial functions.
In a way, one could say that, in many ways, VAX machine language was a
higher-level language than Fortran.
One could also say at that point in time that FORTRAN was not that high
of a high level language.
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Given a (very rough) estimate that each PAL replaced four standard
logic chips of similar size, my guess would be that it saved
them the equivalent of two to three circuit boards, not bad.
MMI's PAL (Programmable Array Logic) is a subset of a Programmable Logic Array.
Another striking thing is how densely the circuit boards are packed,
compared to the VAX boards one finds. I suspect they had access
to more layers of printed circuit board than DEC ten years earlier.
PAL has programmable AND matrix but a fixed OR matrix.
PLA has both AND and OR matrix programmable.
Mask-programmed PLAs had been available since 1970, and field-programmable
FPLAs became available in 1976 from a number of suppliers (e.g. Signetics).
https://en.wikipedia.org/wiki/Programmable_logic_array
I read somewhere that these were not used much because, in the
beginning, they were slow, big, expensive and difficult to program.
This is probably why they were not considered as a replacement for
the PAL chips for the MV/8000, had MMI failed - they were not up
to the job.
If one were building a RISC-style ISA CPU in 1975, they could be used
for decoding and for state machines for fetch, load/store, and the page-table walker.
I don't know the price.
They could have been used for the same things on the VAX 11/780. Does anybody know if they were?
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:
VAX instructions are very complex and much of that complexity is hard
to use in compilers.
A lot of them mapped directly to common high-level operations. E.g. MOVC3/ MOVC5 for string copying, and of course POLYx for direct evaluation of polynomial functions.
In a way, one could say that, in many ways, VAX machine language was a higher-level language than Fortran.
On 2025-03-07 12:34 p.m., MitchAlsup1 wrote:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
----------------
So, writing things like:
y[55:48]=x[19:12];
2 instructions in My 66000. One extract, one insert.
Ibid for Q+. The logic for an extract and insert as one operation might
add to the timing. Extract, sign/zero extend and copy back. Fields may
be different sizes.
And:
j=x[19:12];
Also a single instruction, or 2 or 3 in the fallback case (encoded as a
shift and mask).
1 instruction--extract (SLL or SLA)
Q+ has EXT/EXTU which is basically a SRL or SRA with mask applied
afterwards. PowerPC has a rotate-left-and-mask instruction. In my
opinion it makes more sense for extracts to be shifting right.
Lawrence D'Oliveiro wrote:
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:
VAX instructions are very complex and much of that complexity is hard
to use in compilers.
A lot of them mapped directly to common high-level operations. E.g.
MOVC3/
MOVC5 for string copying, and of course POLYx for direct evaluation of
polynomial functions.
How the VAX Lost Its POLY (and EMOD and ACB_floating too), 2011 https://simh.trailing-edge.com/docs/vax_poly.pdf
In a way, one could say that, in many ways, VAX machine language was a
higher-level language than Fortran.
And the decimal instructions for COBOL (also on some PDP-11's).
The only reason to add complex instructions like MOVC3, MOVC5 and
others (SKIPC, SPANC, etc.) is if hardware can do a better job than a
software subroutine. And you only add those instructions when you
know you can afford the hardware, not in anticipation that someday
we might do a better job.
The reason VAX and 8086 benefit from string instructions is because
they are sequential processors. It allows them to do decode once and
sit in a tight loop doing execute. But both still move byte-by-byte
and do not attempt to optimize memory access operations.
Also the sequencer is sequential so the loop counting and branch testing
each take microcycles.
So there is some benefit when comparing a VAX MOVC3 to a VAX subroutine,
but not compared to a pipelined TTL RISC.
If it is a pipelined RISC then decode is overlapped with execute
so there is no advantage to these complex instructions vs a RISC
subroutine doing the same in a loop.
And the RISC subroutine might be
faster because it can overlap the loop count and branch with memory
access.
In both cases the real advantage is when you can afford the HW to
optimize bus accesses as this is where the majority of cycles are spent.
When you can afford the HW optimizer then you add them.
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
Isn't that just 'bswap32' on x86, or REV32 on ARM64?
On 3/7/2025 11:34 AM, MitchAlsup1 wrote:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
----------------
So, writing things like:
y[55:48]=x[19:12];
2 instructions in My 66000. One extract, one insert.
1 instruction in this case...
The 3 sub-fields being, 36, 48, and 56.
The way I defined things does mean adding 1 to the high bit in the
encoding, so 63:56 would be expressed as 64:56, which nominally uses 1
more bit of range. Though, if expressed in 6 bits, the behavior as I
defined it effectively causes it to wrap modulo 64.
----------------------------------
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
In this particular case, there is also a SWAP.L instruction, but I was ignoring it for the sake of this example, and my compiler isn't that clever.
Unlike Verilog, in C mode it will currently require single-bit fetch to
use a notation like x[17:17], but this is more because a person is much
more likely to type "x[17]" by accident (such as by using the wrong
variable, a missing '*', or ...).
On Fri, 7 Mar 2025 4:09:13 +0000, Lawrence D'Oliveiro wrote:
In a way, one could say that, in many ways, VAX machine language was a
higher-level language than Fortran.
One could also say at that point in time that FORTRAN was not that high
of a high level language.
[Fortran] was high enough, right from the start, to abstract away a
_lot_ of the machine, while still being quite efficient.
Like, bitfield helpers were too weird/obscure, but hard-coding parts of
the CRC or stuff related to DES encryption and similar into the ISA is fine...
On Fri, 7 Mar 2025 16:57:31 -0600, BGB wrote:
Like, bitfield helpers were too weird/obscure, but hard-coding parts of
the CRC or stuff related to DES encryption and similar into the ISA is
fine...
I blame C. The fact that C does not have built-in constructs to make convenient use of variable bitfields seems to be the main excuse for not supporting them in hardware instruction sets.
And then in return, the lack of efficient support in hardware becomes an excuse for not having such constructs in the higher-level language.
I guess, while a person could do something like (in C):
_BitInt(1048576) bmp;
_Bool b;
...
b=(bmp>>i)&1; //*blarg* (shift here would be absurdly expensive)
This is likely to be rare vs more traditional strategies, say:
uint64_t *bmp;
int b, i;
...
b=(bmp[i>>6]>>(i&63))&1;
As well as the traditional strategy being a whole lot more efficient in
this case...
I guess the case could be made for a generic dense bit array.
Though, an open question is how one would define it in a way that is consistent with existing semantics rules.
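For comparison with the traditional strategy above, a minimal dense bit-array helper in plain C (the type and function names are invented for the example; allocation and bounds checking are omitted):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t *words;   /* nbits rounded up to a multiple of 64 */
    size_t    nbits;
} bitarray;

static int bitarray_get(const bitarray *ba, size_t i)
{
    return (ba->words[i >> 6] >> (i & 63)) & 1;
}

static void bitarray_set(bitarray *ba, size_t i, int v)
{
    uint64_t m = (uint64_t)1 << (i & 63);
    if (v) ba->words[i >> 6] |=  m;
    else   ba->words[i >> 6] &= ~m;
}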
long ago and far away ... comparing pascal to pascal front-end with
pl.8 back-end (3033 is 370 about 4.5MIPS)
Date: 8 August 1981, 16:47:28 EDT
To: wheeler
the 801 group here has run a program under several different PASCAL "systems". The program was about 350 statements and basically
"solved" SOMA (block puzzle..). Although this is only one test, and
all of the usual caveats apply, I thought the numbers were
interesting... The numbers given in each case are EXECUTION TIME ONLY (Virtual on 3033).
6m 30 secs PERQ (with PERQ's Pascal compiler, of course)
4m 55 secs 68000 with PASCAL/PL.8 compiler at OPT 2
0m 21.5 secs 3033 PASCAL/VS with Optimization
0m 10.5 secs 3033 with PASCAL/PL.8 at OPT 0
0m 5.9 secs 3033 with PASCAL/PL.8 at OPT 3
Looking at the Signetics 82S100: in 1976 it had a max access time of 50 ns
and dissipated 600 mW in a 28-pin DIP.
On Fri, 7 Mar 2025 22:25:12 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
Isn't that just 'bswap32' on x86, or REV32 on ARM64?
A degenerate version is:: but consider::
BITR Rd,Rs1,<1>
performs bit reversion, while::
BITR Rd,Rs1,<2>
reverses pairs of bits, ...
BITR Rs,Rs1,<16>
reverses halfwords.
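A plain-C model of that behaviour, purely for illustration (not an implementation of the actual instruction): reverse the order of g-bit groups in a 32-bit word, where g is a power of two; g=1 is a full bit reverse, g=8 a byte swap, g=16 a halfword swap.

#include <stdint.h>

static uint32_t group_reverse32(uint32_t x, unsigned g)
{
    unsigned n = 32 / g;                                /* group count   */
    uint32_t mask = (g >= 32) ? 0xFFFFFFFFu : ((1u << g) - 1u);
    uint32_t r = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t grp = (x >> (i * g)) & mask;
        r |= grp << ((n - 1 - i) * g);                  /* mirrored slot */
    }
    return r;
}
/* group_reverse32(v, 8) matches bswap32/REV32; (v, 1) is a full bit reverse. */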
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Mar 2025 22:25:12 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
Isn't that just 'bswap32' on x86, or REV32 on ARM64?
A degenerate version is:: but consider::
BITR Rd,Rs1,<1>
performs bit reversion, while::
BITR Rd,Rs1,<2>
reverses pairs of bits, ...
Is there an application for this particular variant?
BITR Rs,Rs1,<16>
reverses halfwords.
Since there generally aren't higher level language
constructs that encapsulate this behavior, how useful
is it in the real world? Does it justify the verif
costs, much less the engineering cost?
Bswap32/64 are genuinely useful in real world applications
(particularly networking) thus the presence in most modern instruction
sets.
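For example, the usual networking case is just host-to-network byte-order conversion, which compilers turn into a single byte-swap instruction on little-endian machines:

#include <arpa/inet.h>
#include <stdint.h>

/* Convert a host-order length field to network (big-endian) order before
   putting it on the wire; on little-endian x86/ARM this is a single
   bswap/REV32. */
uint32_t to_wire(uint32_t host_len)
{
    return htonl(host_len);
}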
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
He reports that the 82S100 generates the #CASRAM signal with a
propagation delay of 35ns in one direction and 25ns in the other, and
the #ROMH signal with a propagation delay of 25ns in both directions
(table 3.4). I guess that the 50ns are the worst case of anything you
can do with the 82S100.
He reports a current consumption of 102mA for the 82S100 (table 3.3),
which at 5V (the regular voltage at the time) is pretty close to the
600mW given in the data sheet. The rest of the board, including
several chips with much more logic (CPU, VIC, SID (sound), 2xCIA
(I/O)) , consumed at most 770mA in his measurements; most of the rest
was NMOS, while the 82S100 was TTL.
- anton
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
It looks like the C64's circuit design was one culprit.
Though I do remember back then hearing about failures over time with
other fuse programmable devices like PROMs.
Something about the sputter from the blown fuses.
He reports that the 82S100 generates the #CASRAM signal with a
propagation delay of 35ns in one direction and 25ns in the other, and
the #ROMH signal with a propagation delay of 25ns in both directions
(table 3.4). I guess that the 50ns are the worst case of anything you
can do with the 82S100.
Yes, and it sounds like the circuit design depends on a race condition between two logic paths to work. Big no-no.
He reports a current consumption of 102mA for the 82S100 (table 3.3),
which at 5V (the regular voltage at the time) is pretty close to the
600mW given in the data sheet. The rest of the board, including
several chips with much more logic (CPU, VIC, SID (sound), 2xCIA
(I/O)) , consumed at most 770mA in his measurements; most of the rest
was NMOS, while the 82S100 was TTL.
- anton
This is not a problem with the 82S100.
Whoever designed that circuit didn't know what they were doing.
One can't use any combinatorial logic circuit and expect exact timing.
The manufacturer specs indicate a range of speeds which depend on
things like variations in power supply voltage, load, temperature.
In the case of the 82S100 it is 35 ns typical, 50 ns max.
Also these are logic chains, so each gate adds its own variations.
The circuit should be designed so it works across all timing variations
which is what synchronization clocks and flip flops are for.
And even then flip flops have their own timing variations.
Notice the common factor here - MT/Commodore was making a lot of
"working but only barely" chips and Commodore used them internally to
save money and also sold them to others which used them because, well,
they were frequently the cheapest.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
The MC68020 had instructions to access bit-fields that cross word
boundaries.
On Sun, 9 Mar 2025 1:27:19 +0000, Torbjorn Lindgren wrote:
Notice the common factor here - MT/Commodore was making a lot of
"working but only barely" chips and Commodore used them internally to
save money and also sold them to others which used them because, well,
they were frequently the cheapest.
Radio Shack would buy for the TRS-80 every Z80 that did not make the 2 MHz
operating frequency. They used something around 1.87 MHz so that the
CPU clock and the TV clock were the same clock.
On Sat, 8 Mar 2025 18:03:38 +0000, EricP wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
It looks like the C64's circuit design was one culprit.
Though I do remember back then hearing about failures over time with
other fuse programmable devices like PROMs.
Something about the sputter from the blown fuses.
A laser blasts a short wire so that there is no longer any connection.
Then, in use, the electrical forces cause the still present aluminum
wires to reconstruct themselves making contact and changing the state.
The blowable wire is still immersed within an oxide layer, preventing
the blown aluminum atoms from "really going anywhere" allowing small
forces to reassemble the wire.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
AFAIK the Signetics 82S100 isn't failure-prone in the C64; the MOS
Technology (i.e. Commodore) *clone* that they switched to for cost
reasons IS known to be failure-prone. Those are the reason there are
lots of PLA replacement projects.
If you have an actual Signetics device in a C64 it'll very likely be
fine unless the power supply failed and fed everything too-high
voltages, which is unfortunately a common failure mode of the C64 power
brick. This PSU failure usually destroys most or all of the memory
chips, the SID, and one or, more likely, multiple of the CPU, PLA and ROMs.
Other things with known high failure rates are the MOS Technology 74xx
clones and the MT memory. These failures also include MT-branded memory
chips of that specific type when used in non-Commodore items like PC
clones, so it's not just the C64.
Notice the common factor here - MT/Commodore was making a lot of
"working but only barely" chips and Commodore used them internally to
save money and also sold them to others which used them because, well,
they were frequently the cheapest.
On 3/7/2025 9:28 PM, MitchAlsup1 wrote:
On Sat, 8 Mar 2025 2:49:50 +0000, BGB wrote:
------------------------
I guess, while a person could do something like (in C):
_BitInt(1048576) bmp;
_Bool b;
...
b=(bmp>>i)&1; //*blarg* (shift here would be absurdly expensive)
This is likely to be rare vs more traditional strategies, say:
uint64_t *bmp;
int b, i;
...
b=(bmp[i>>6]>>(i&63))&1;
Question: How do you handle the case where the bit vector is an odd
number of bits in width ?? Say <3, 5, 7, 17, ...>
It is rare for bitmap bits to not be a power of 2...
I would guess, at least for C, something like (for 3 bits):
uint32_t *bmp;
uint64_t bv;
int i, b, bp;
...
bp=i*3;
bv=*(uint64_t *)(bmp+(bp>>5));
b=(bv>>(bp&31))&7;
Could apply to anything up to 31 bits.
Could do similar with __int128 (or uint128_t), which extends it up to 63 bits.
------------
Mc 68020 had instructions to access bit-fields that cross word
boundaries.
I guess one could argue the use-case for adding a generic funnel shift instruction.
If I added it, it would probably be a 64-bit encoding (generally needed
for 4R).
Architecture is more about what gets left OUT than what gets left IN.
Well, except in this case it was more a question of trying to fit it in
with C semantics (and not consideration for more ISA features).
There are still some limitations, for example:
In my current implementation, CSR's are very limited (may only be used
to load and store CSRs; not do RMW operations on CSRs).
Though, have noted that seemingly some number of actual RISC-V cores
also have this limitation.
A more drastic option might be to try to rework the hardware interfaces
and memory map hopefully enough to try to make it possible to run an OS
like Linux, but there doesn't really seem to be a standardized set of hardware interfaces or memory map defined.
Some SoCs, though, seem to use a map like:
00000000..0000FFFF: ROM goes here.
00010000..0XXXXXXX: RAM goes here.
ZXXXXXXX..FFFFFFFF: Hardware / MMIO
They seem to also be asking for a UEFI based boot process, but this
would require having a bigger BootROM (can't likely fit a UEFI
implementation into 32K). Seems that the idea is to have the UEFI BIOS
boot the kernel directly as an ELF image (traditionally UEFI was always PE/COFF based?...).
There is a probable need to move away from the "BJX2" name, which as
noted, has some unfortunate connotations (turns out it was also used for
a lewd act) and seems to be triggering to Google's automatic content filtering (probably for a similar reason).
On 3/10/2025 7:53 PM, MitchAlsup1 wrote:
-------------------
I guess one could argue the use-case for adding a generic funnel shift
instruction.
My 66000 has CARRY-SL/SR which performs a double wide operand shifted
by a single wide count (0..63) and produces a double wide result {IO}.
OK.
If I added it, it would probably be a 64-bit encoding (generally needed
for 4R).
By placing the width in position {31..37} you can compress this down
to 3-Operand.
It is 3-operand if being used as a 128-bit shift op.
But funnel shift operators imply 3 independent inputs and 1 output.
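For reference, a plain-C model of a 64-bit funnel shift (three inputs: high word, low word, and count; one result; illustrative only, not any particular ISA's encoding): take the 128-bit concatenation hi:lo, shift right by s, and keep the low 64 bits.

#include <stdint.h>

/* Funnel shift right: low 64 bits of (hi:lo) >> s, for 0 <= s <= 63. */
static uint64_t funnel_shr(uint64_t hi, uint64_t lo, unsigned s)
{
    s &= 63;
    return (s == 0) ? lo : ((lo >> s) | (hi << (64 - s)));
}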
----------
Architecture is more about what gets left OUT than what gets left IN.
Well, except in this case it was more a question of trying to fit it in
with C semantics (and not consideration for more ISA features).
Clearly, you want to support C semantics--but you can do this in a way
that also supports languages with real bit-field support.
---------------
Yeah.
Amidst debugging and considering Verilog support...
There are still some limitations, for example:
In my current implementation, CSR's are very limited (may only be used
to load and store CSRs; not do RMW operations on CSRs).
My 66000 only has 16 CPU CRs, and even these are R/W through MMI/O
space. All the other (effective) CRs are auto loaded in line quanta.
This mechanism allows one CPU to figure out what another CPU is up to
simply by meandering through its CRs...
I had enough space for 64 CRs, but only a small subset are actually
used. Some more had space reserved, but were related to non-implemented features.
RISC-V has a 12-bit CSR space, of which:
Some map to existing CRs;
My whole CR space was stuck into an implementation-dependent range.
Some read-only CSRs were mapped over to CPUID.
Of which, all of the CPUID indices were also mapped into CSR space.
Seemingly lacks defined user CSRs for timer or HW-RNG, which do exist in
my case. It is very useful to be able to access a HW timer in userland,
as otherwise it would waste a lot of clock-cycles using system calls for "clock()" and similar.
Though, have noted that seemingly some number of actual RISC-V cores
also have this limitation.
A more drastic option might be to try to rework the hardware interfaces
and memory map hopefully enough to try to make it possible to run an OS
like Linux, but there doesn't really seem to be a standardized set of
hardware interfaces or memory map defined.
Some amount of SOC's though seem to use a map like:
00000000..0000FFFF: ROM goes here.
00010000..0XXXXXXX: RAM goes here.
ZXXXXXXX..FFFFFFFF: Hardware / MMIO
My 66000::
00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
10 0000000000000000..FFFFFFFFFFFFFFFF: config
11 0000000000000000..FFFFFFFFFFFFFFFF: ROM
Whatever you are trying to do, you won't run out of address space until
64 bits becomes insufficient. Note: all HW interfaces are in config
space
and all CRs are in MMI/O space.
There seems to be a lot here defined in terms of 32-bit physical spaces, including on 64-bit targets.
Though, thus far, my existing core also has pretty all of its physical
map in 32-bit space.
The physical ranges from 0001_00000000 .. 7FFF_FFFFFFFF currently
contain a whole lot of nothing.
I once speculated on the possibility of special hardware to memory-map
the whole SDcard into physical space, but nothing has been done yet (and
such a hardware interface would be a lot more complicated than my
existing interface).
An intermediate option being to expand the SPI interface to support 256
bit bursts.
Say:
P_SPI_QDATA0..P_SPI_QDATA3
It appears this has already been partly defined (though not fully
implemented in the 256-bit case).
Where, the supported XMIT sizes are:
8 bit: Single Byte
64 bit: 8 bytes
256 bit: 32 bytes
With larger bursts mostly to reduce the amount of round-trip delay over
the bus.
On 2025-03-10 8:53 p.m., MitchAlsup1 wrote:
On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:
------------------------
My 66000::
00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
10 0000000000000000..FFFFFFFFFFFFFFFF: config
11 0000000000000000..FFFFFFFFFFFFFFFF: ROM
How does one reference DRAM vs MMI/O at the same address using a LD / ST
instruction?
The Q+ CPU just uses a 64-bit address range. The config space is specified
in a CR defaulting to FFFFFFFFDxxxxxxx. The TLB is set up at boot to
access ROM at FFFFFFFFFFFCxxxxx. Otherwise there is no distinction between
addresses. There is a region table in the system that describes up to
eight distinct regions.
Whatever you are trying to do, you won't run out of address space until
64 bits becomes insufficient. Note: all HW interfaces are in config
space
and all CRs are in MMI/O space.
Are there any CRs accessible with any instructions besides LD / ST?
------------
They seem to also be asking for a UEFI based boot process, but this
would require having a bigger BootROM (can't likely fit a UEFI
implementation into 32K). Seems that the idea is to have the UEFI BIOS
boot the kernel directly as an ELF image (traditionally UEFI was always
PE/COFF based?...).
Boot ROM should be big enough that no BOOT ROM will ever exceed its
size.
--------------
There is a probable need to move away from the "BJX2" name, which, as
noted, has some unfortunate connotations (turns out it was also used for
a lewd act) and seems to trigger Google's automatic content
filtering (probably for a similar reason).
Coming up with names is surprisingly difficult. I got into a discussion
with a colleague a while ago about this. They were having difficulty
coding something, and it turned out to be simply a matter of what names to
choose for routines.
Hilarious--and reason enough to change names.
When you do change names, can you spell LD and ST instead of MOV ??
Yes, please, LD / ST; it is so much clearer what is going on. Less trouble getting confused by the placement of operands.
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
A lot of people swear by:
movl %eax, 16(%rdi)
....
On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:
On 2025-03-10 8:53 p.m., MitchAlsup1 wrote:
On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:
------------------------
My 66000::
00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
10 0000000000000000..FFFFFFFFFFFFFFFF: config
11 0000000000000000..FFFFFFFFFFFFFFFF: ROM
How does one reference DRAM vs MMI/O at the same address using a LD / ST
instruction?
The MMU translates the virtual address to a universal address.
The PTE supplies the extra bits.
Q+ CPU just uses a 64-bit address range. The config space is specified
in a CR defaulting to FFFFFFFFDxxxxxxx The TLB is setup at boot to
access ROM at FFFFFFFFFFFCxxxxx Otherwise there is no distinction with
addresses. There is a region table in the system that describes up to
eight distinct regions.
Every major block in my architecture has ports in config space that
smell just like that of a device on PCIe having said control block.
My thought was that adding all these to the config name space might
cramp any fixed (or programmable) partition. So, the easiest thing
is to give it its own big space.
Then every device header gets 1 or more pages of address space for
its own control registers. PCIe is now a 42-bit address space::
segment, bus, device, function, xreg, reg; and likely to grow, as
AHCI can consume a whole PCIe segment by itself.
Whatever you are trying to do, you won't run out of address space until
64 bits becomes insufficient. Note: all HW interfaces are in config
space
and all CRs are in MMI/O space.
Are there any CRs accessible with any instructions besides LD / ST?
CRs accessible via HR instruction theoretically == 40
CRs accessible via HR instruction at a privilege >= 16
Basically, HR provides access to this threads critical CRs
{IP, Root, ASID, CSP, exception ctrl, inst ctrl, interrupts ...}
and has access to the CPU SW stack according to privilege.
------------
They seem to also be asking for a UEFI based boot process, but this
would require having a bigger BootROM (can't likely fit a UEFI
implementation into 32K). Seems that the idea is to have the UEFI BIOS
boot the kernel directly as an ELF image (traditionally UEFI was always
PE/COFF based?...).
Boot ROM should be big enough that no BOOT ROM will ever exceed its
size.
--------------
There is a probable need to move away from the "BJX2" name, which as
noted, has some unfortunate connotations (turns out it was also used
for
a lewd act) and seems to be triggering to Google's automatic content
filtering (probably for a similar reason).
Coming up with names is surprisingly difficult. I got into a discussion
with a colleague a while ago about this. They were having difficulty
coding something an it turned out to be simply what names to choose for
routines.
Hilarious--and reason enough to change names.
When you do change names, can you spell LD and ST instead of MOV ??
Yes, please LD / ST it is so much clearer what is going on. Less trouble
getting confused by the placement of operands.
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
I agree.
On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
When you do change names, can you spell LD and ST instead of MOV ??
Yes, please LD / ST it is so much clearer what is going on. Less trouble >>> getting confused by the placement of operands.
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
I agree.
I wonder if the different preferences are at least partially due to
whether the person has a hardware or a software background?
The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However,
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is interesting.
A somewhat related question, if one wants to copy the contents of R3
into R2, is that a load or a store? :-)
On Tue, 11 Mar 2025 18:15:06 +0000, Stephen Fuld wrote:
On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
When you do change names, can you spell LD and ST instead of MOV ??
Yes, please LD / ST it is so much clearer what is going on. Less
trouble
getting confused by the placement of operands.
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
I agree.
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background?
Even when both LD and ST are written MOV there is a different OpCode
for the inbound MOV versus the outbound MOV, so, in effect, they are
really different instructions requiring different pipeline semantics.
Only (O N L Y) when one has a memory to memory move instruction can
the LDs and STs be MOVs. VAX had this, BJX* does not.
One should argue that different pipeline semantics requires a different
OpCode--and you already have said OpCode having different bit patterns,
different signedness semantics, different translation access rights,
... At the HW level about the only thing LD has in common with ST is
the way the address is generated--although MIPS did something different.
The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
MARY and MARY2 used X = Y to mean the value in X is deposited into Y.
Both were left to right only languages. This should surprise most !! {{Although to be fair, Mary used the =: operator to perform assign.}}
On 11/03/2025 18:15, Stephen Fuld wrote:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea
is that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
No, it is logically a copy.
On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
I agree.
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
On Tue, 11 Mar 2025 6:14:31 +0000, BGB wrote:
A lot of people swear by:
movl %eax, 16(%rdi)
....
More swear at it than for it.
Most likely: those who swear by it have brain damage by x86-ism.
On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:
On 2025-03-10 8:53 p.m., MitchAlsup1 wrote:
On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:
------------------------
My 66000::
00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
10 0000000000000000..FFFFFFFFFFFFFFFF: config
11 0000000000000000..FFFFFFFFFFFFFFFF: ROM
How does one reference DRAM vs MMI/O at the same address using a LD / ST
instruction?
The MMU translates the virtual address to a universal address.
The PTE supplies the extra bits.
Q+ CPU just uses a 64-bit address range. The config space is specified
in a CR defaulting to FFFFFFFFDxxxxxxx The TLB is setup at boot to
access ROM at FFFFFFFFFFFCxxxxx Otherwise there is no distinction with
addresses. There is a region table in the system that describes up to
eight distinct regions.
Every major block in my architecture has ports in config space that
smell just like that of a device on PCIe having said control block.
My thought was that adding all these to the config name space might
cramp any fixed (or programmable) partition. So, the easiest thing
is to give it its own big space.
Then every device header gets 1 or more pages of address space for
its own control registers. PCIe is now a 42-bit address space::
segment, bus; device; function, xreg, reg and likely to grow as
ACHI can consume a whole PCIe segment by itself.
x86 asm, as used in MASM and DEBUG, was the first assembler language I
used; I found it very familiar that
mov ax,bx
or
mov ax,[bx]
or
mov ax,[bx+1234]
all correspond nicely to
a = b
a = *b
a = b[1234]
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 11 Mar 2025 6:14:31 +0000, BGB wrote:
A lot of people swear by:
movl %eax, 16(%rdi)
....
More swear at it than for it.
Most likely: those who swear by it have brain damage by x86-ism.
It's the oververbose, bas-ackwards intel syntax that one does swear at.
The AT&T syntax that BGB noted above is far superior.
YMMVO.
No software guy talks about "pipeline semantics" :-)
I.e. having the target on the left is the only one that makes sense to
me.
A somewhat related question, if one wants to copy the contents of R3
into R2, is that a load or a store? :-)
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
Trouble is that such "common" operations have rather low frequency
compared to simple stuff. They are really library functions.
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
I think it may depend on first experiences with assembler language;
On 3/11/2025 12:07 PM, moi wrote:
On 11/03/2025 18:15, Stephen Fuld wrote:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea
is that when hardware guys see the instruction, they think in terms
of register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
Putting the destination on the right is also fairly common in general in
Unix style command notation:
dosomething args infile outfile
prog1 infile | prog2 > outfile
On 3/11/2025 12:57 PM, MitchAlsup1 wrote:
--------------
My whole space is mapped by BAR registers as if they were on PCIe.
Not a thing yet.
But, PCIe may need to exist for Linux or similar.
But, may still be an issue as Linux could only use known hardware IDs,
and it is a question what IDs it would know about (and if any happen to
map closely enough to my existing interfaces).
Otherwise, it would be necessary to write custom HW drivers, which would
add a lot more pain to all of this.
Some read-only CSRs were mapped over to CPUID.
I don't even have a CPUID--if you want this you go to config space
and read the configuration lists and extended configuration lists.
Errm, so vendor/Hardware ID's for each feature flag...
30 and 31 give the microsecond timer and HW-RNG, which are more relevant
to user-land.
32..63: Currently unused.
There is also a cycle counter (along vaguely similar lines to x86
RDTSC), but for many uses a microsecond counter is more useful (where
the timer-tick count updates at 1.0 MHz, and all cores would have the
same epoch).
On x86, trying to use RDTSC as a timer is rather annoying as it may jump around and goes at a different rate depending on current clock speed.
This scheme will not roll over for around 585k years (for a 64-bit microsecond timer), so "good enough".
Conceptually, this time would be in UTC, likely with time-zones handled
by adding another bias value.
This can in turn be used to derive the output from "clock()" and
similar.
Also, there are relatively few software timing tasks where we have much reason to care about nanoseconds. For many tasks, milliseconds are sufficient, but there are some things where microseconds matter.
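As a sketch of the last point, deriving clock()-style output from such a free-running 64-bit microsecond counter is just a subtraction; read_usec() below is a hypothetical accessor for the user-readable timer, not an existing API, and CLOCKS_PER_SEC of 1000000 is assumed.

#include <stdint.h>

extern uint64_t read_usec(void);    /* hypothetical user-land timer read */

static uint64_t start_usec;         /* captured once at program start    */

/* With CLOCKS_PER_SEC == 1000000, clock() is just elapsed microseconds. */
long my_clock(void)
{
    return (long)(read_usec() - start_usec);
}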
Of which, all of the CPUID indices were also mapped into CSR space.
CPUID is soooooo pre-PCIe.
Dunno.
Mine is different from x86, in that it mostly functions like read-only registers.
RISC-V land seemingly exposes a microsecond timer via MMIO instead, but
this is much less useful as this means needing to use a syscall to fetch
the current time, which is slow.
Doom manages to fetch the current time frequently enough that doing so
via a syscall has a visible effect on performance.
My 66000 does not even have a 32-bit space to map into.
You can synthesize such a space by not using any of the
top 32-address bits in PTEs--but why ??
32-bit space is just the first 4GB of physical space.
But, as-is, there is pretty much nothing outside of the first 4GB.
The actually in use MMIO space is also still 28 bits.
The VRAM maps 128K in MMIO space, but in retrospect probably should have
been more. When I designed it, I didn't figure there would have been
more than 128K. The RAM backed framebuffer can be bigger though, but not
too much bigger, as then screen refresh starts getting too glitchy (as
it competes with the CPU for the L2 cache, but is more timing
sensitive).
My interconnect bus is 1 cache line (512-bits) per cycle plus
address and command.
My bus is 128 bits, but MMIO operations are 64-bits.
Where, for MMIO, every access involves a whole round-trip over the bus (unlike for RAM-like access, where things can be held in the L1 cache).
In theory, MMIO operations could be widened to allow 128-bit access, but haven't done so. This would require widening the data path for MMIO
devices.
Can note that when the request goes onto the MMIO bus, data narrows to
64-bit and address narrows to 28 bits. Non-MMIO range requests (from the ringbus) are not allowed onto the MMIO bus, and the MMIO bus will not
accept any new requests until the prior request has either finished or
timed out.
I still haven't seen any good reason to move to C++.
Some people (at the C people): use C++, it has features...
Others (at the C++ people): Use Rust, it is less of a trash fire.
Next was PDP-11 where MOV R1,R2 copies R1 into R2.
On 11/03/2025 18:15, Stephen Fuld wrote:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
No, it is logically a copy.
On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the
argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
There is a language (C++) which has introduced reference operators that distinguish between “move semantics” versus “copy semantics”.
No, I haven’t got my head around it either.
On Tue, 11 Mar 2025 19:07:08 +0000, moi wrote:
On 11/03/2025 18:15, Stephen Fuld wrote:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is >>> that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
No, it is logically a copy.
But does it copy X into Y or copy Y into X ??
On Tue, 11 Mar 2025 22:15:30 -0000 (UTC), Waldek Hebisch wrote:
Trouble is that such "common" operations have rather low frequency
compared to simple stuff. They are really library functions.
Inline library functions. And they did contribute to keeping the code compact, as Bell said.
One thing, though, I don’t think the POLYx instruction was all that
useful. It is typical, when computing functions approximated by
polynomials, for the polynomial to actually be infinite. And so you have
a loop that computes each term in turn, accumulates it to the result,
works out an estimate of the remaining error, and stops only when this
falls below some threshold.
This cannot be expressed by some fixed-length table of coefficients, as
the POLYx instruction expects.
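To spell out the contrast, here is a plain-C sketch of both styles (names and tolerance are illustrative): a POLYx-style instruction evaluates a fixed-degree polynomial by Horner's rule over a coefficient table, while the run-until-converged style described above needs a data-dependent loop.

#include <math.h>

/* Fixed coefficient table, degree known in advance: the Horner-style
   evaluation a POLYx-like instruction performs. */
static double horner(const double *c, int degree, double x)
{
    double r = c[degree];
    for (int i = degree - 1; i >= 0; i--)
        r = r * x + c[i];            /* one multiply-add per coefficient */
    return r;
}

/* Sum a series term by term until the next term is below a tolerance;
   the trip count depends on x, so no fixed-length table fits. */
static double series_exp(double x, double tol)
{
    double sum = 1.0, term = 1.0;
    for (int n = 1; fabs(term) > tol; n++) {
        term *= x / n;
        sum  += term;
    }
    return sum;
}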
On Tue, 11 Mar 2025 11:15:06 -0700, Stephen Fuld wrote:
A somewhat related question, if one wants to copy the contents of R3
into R2, is that a load or a store? :-)
The true hardware engineer knows that it is neither, it is merely a
register rename. ;)
On 3/11/2025 5:44 PM, Lawrence D'Oliveiro wrote:
On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the
argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
There is a language (C++) which has introduced reference operators that
distinguish between “move semantics” versus “copy semantics”.
No, I haven’t got my head around it either.
I still haven't seen any good reason to move to C++.
On Tue, 11 Mar 2025 18:58:11 -0500, BGB wrote:
I still haven't seen any good reason to move to C++.
No disagreement here. ;)
Some people (at the C people): use C++, it has features...
It appears the GNU C compiler itself is written in C++ now.
Others (at the C++ people): Use Rust, it is less of a trash fire.
Google started a project called “Carbon” a little while back, kind of a C++ done right, with all the accumulated legacy crap removed.
Wonder what happened to it ...
Of course, it is possible that the VAX designers understood
the performance implications of their decisions (or rather
the meager speed gain from complex instructions), but bet
that a "nice" instruction set would tie programs to their
platform.
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is interesting.
A somewhat related question, if one wants to copy the contents of R3
into R2, is that a load or a store? :-)
There is a language (C++) which has introduced reference operators that
distinguish between “move semantics” versus “copy semantics”.
One of the early programming languages I came across was POP-2. This was
fully dynamic and heap-based, like Lisp, but also had an operand stack. So
a simple assignment statement looked like
a -> b;
but this could actually be written as two separate statements:
a;
-> b;
The first one pushed the value of a on the stack, the second one popped it
off and stored it in b.
This made it easy to do things like swap variable values:
a, b -> a -> b;
antispam@fricas.org (Waldek Hebisch) writes:
Of course, it is possible that the VAX designers understood
the performance implications of their decisions (or rather
the meager speed gain from complex instructions), but bet
that a "nice" instruction set would tie programs to their
platform.
I don't think that they fully understood the performance implications,
but I believe that creating an appealing environment for software
developers was a major consideration of the architects: For the assembly-language programmers, provide orthogonality; that also makes
it easy to write compilers (optimality in some form is a different
story). The much-criticized VAX CALL instruction is designed for a
software ecosystem where various languages can call each other, there
exists a common debugger for all of them, etc. I am sure that they
were aware that this call instruction was expensive, but they expected
that it was worth the cost, and also expected that implementors would
reduce the cost to below what a sequence of simpler instructions would
cost (looking at REP MOVSB in many generations of Intel and AMD CPUs,
we see such expectations disappointed; I have not measured recent generations, though).
- anton
On Wed, 12 Mar 2025 00:26:50 -0000 (UTC), John Levine wrote:
Next was PDP-11 where MOV R1,R2 copies R1 into R2.
What about CMP (compare) versus SUB (subtract)? CMP does the subtract
without updating the destination operand, only setting the condition
codes. But are the operands the same way around as SUB (i.e. backwards for
comparison purposes) or are they flipped?
...I am sure that they
were aware that this call instruction was expensive, but they expected
that it was worth the cost, and also expected that implementors would
reduce the cost to below what a sequence of simpler instructions would
cost (looking at REP MOVSB in many generations of Intel and AMD CPUs,
we see such expectations disappointed; I have not measured recent
generations, though).
It depends on what you call "a sequence of simpler instructions".
For R/E/CX above of, say, a dozen, 'rep movsb' is faster than a simple
non-unrolled loop of single-byte loads and stores on pretty much any
Intel or AMD CPU since the dawn of time. If we are talking about this
century, then, at least for Intel, I think that we can claim that the
same is true even relative to a simple loop of 32-bit loads and stores.
If we replace a dozen with a hundred or three then it will become true
for a loop of 64-bit loads/stores as well.
Or, maybe, in your book 5KB of elaborate code that contains unrolled
and non-unrolled loops of YMM, XMM, Rxx, Exx, and byte memory accesses
is still considered 'a sequence of simpler instructions'?
If that is the case then I am not going to argue.
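For reference, the kinds of "simple loops" being compared against rep movsb here are nothing more than the following (illustrative only):

#include <stddef.h>
#include <stdint.h>

/* Non-unrolled single-byte copy loop. */
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Non-unrolled 64-bit copy loop (count given in 8-byte words). */
static void copy_words(uint64_t *dst, const uint64_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}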
Michael S <already5chosen@yahoo.com> writes:
...I am sure that they
were aware that this call instruction was expensive, but they
expected that it was worth the cost, and also expected that
implementors would reduce the cost to below what a sequence of
simpler instructions would cost (looking at REP MOVSB in many
generations of Intel and AMD CPUs, we see such expectations
disappointed; I have not measured recent generations, though).
It depends on what you call "a sequence of simpler instructions".
For R/E/CX above of, say, a dozen, 'rep movsb' is faster than a simple
non-unrolled loop of single-byte loads and stores on pretty much any
Intel or AMD CPU since the dawn of time. If we are talking about this
century, then, at least for Intel, I think that we can claim that the
same is true even relative to a simple loop of 32-bit loads and
stores. If we replace a dozen with a hundred or three then it will
become true for a loop of 64-bit loads/stores as well.
Or, maybe, in your book 5KB of elaborate code that contains unrolled
and non-unrolled loops of YMM, XMM, Rxx, Exx, and byte memory
accesses is still considered 'a sequence of simpler instructions'?
If that is the case then I am not going to argue.
My experiments were with the code in
<https://github.com/AntonErtl/move/>.
I posted performance results in <2017Sep19.082137@mips.complang.tuwien.ac.at> <2017Sep20.184358@mips.complang.tuwien.ac.at> <2017Sep23.174313@mips.complang.tuwien.ac.at>
My routines were generally faster than rep movsb, except for pretty
large blocks (16KB).
The longest of the routines is ssememmove at 275 bytes.
I expect that an avx512memmove would be quite a bit smaller, thanks to predication, but I have not yet written that nor measured how that
performs.
- anton
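As a rough sketch of what the predication buys (assuming AVX-512BW; this is not code from the repository above, just an illustration), a whole small copy or a loop tail can be handled by one masked load and one masked store instead of a ladder of size cases:

#include <immintrin.h>
#include <stddef.h>

/* Copy n < 64 bytes with a single masked load/store pair (AVX-512BW). */
static void copy_small_avx512(void *dst, const void *src, size_t n)
{
    __mmask64 m = ((__mmask64)1 << n) - 1;      /* low n bits set */
    __m512i   v = _mm512_maskz_loadu_epi8(m, src);
    _mm512_mask_storeu_epi8(dst, m, v);
}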
On 3/11/2025 7:51 PM, MitchAlsup1 wrote:
On Tue, 11 Mar 2025 23:58:11 +0000, BGB wrote:
On 3/11/2025 5:44 PM, Lawrence D'Oliveiro wrote:
On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the
argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
There is a language (C++) which has introduced reference operators that
distinguish between “move semantics” versus “copy semantics”.
No, I haven’t got my head around it either.
I still haven't seen any good reason to move to C++.
C++ is for those situations where you want to write a small amount of
code and have it compile into a vast string of instructions.
Yeah, one can use iostream and have a trivial "hello world" type program
have build times and binary size like it was something quite substantial...
BGB <cr88192@gmail.com> writes:
On 3/11/2025 7:51 PM, MitchAlsup1 wrote:
On Tue, 11 Mar 2025 23:58:11 +0000, BGB wrote:
On 3/11/2025 5:44 PM, Lawrence D'Oliveiro wrote:
On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the
argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
There is a language (C++) which has introduced reference operators that
distinguish between “move semantics” versus “copy semantics”.
No, I haven’t got my head around it either.
I still haven't seen any good reason to move to C++.
C++ is for those situations where you want to write a small amount of
code and have it compile into a vast string of instructions.
Yeah, one can use iostream and have a trivial "hello world" type program
have build times and binary size like it was something quite substantial...
You don't have to use iostream. vsnprintf/snprintf/printf all work
fine in C++ code and are far more efficient (and far less verbose).
Use a subset of C++ (C with classes) and the resulting code is
quite compact, but you still get data encapsulation and
inheritance (with a minor perf hit for virtual functions).
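[As a rough illustration of that style (hypothetical code, not from anyone in the thread): printf-family I/O plus plain classes with one virtual function, and nothing from <iostream>.]

#include <cstdio>

class Shape {
public:
    virtual ~Shape() {}
    virtual double area() const = 0;   // the "minor perf hit": one vtable call
};

class Rect : public Shape {
    double w_, h_;                     // data encapsulation: members are private
public:
    Rect(double w, double h) : w_(w), h_(h) {}
    double area() const override { return w_ * h_; }
};

int main()
{
    Rect r(3.0, 4.0);
    Shape *s = &r;
    std::printf("area = %.1f\n", s->area());  // printf instead of std::cout
    return 0;
}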
On Wed, 12 Mar 2025 11:28:36 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My experiments were with the code in
<https://github.com/AntonErtl/move/>.
None of those are the simple loops that I mentioned above.
I posted performance results in
<2017Sep19.082137@mips.complang.tuwien.ac.at>
<2017Sep20.184358@mips.complang.tuwien.ac.at>
<2017Sep23.174313@mips.complang.tuwien.ac.at>
My routines were generally faster than rep movsb, except for pretty
large blocks (16KB).
Idiots from corporate IT blocked http://al.howardknight.net/
So, link to google groups
or, if posts are relatively recent, to https://www.novabbs.com/devel/thread.php?group=comp.arch
would be helpful.
I don't know why gnu memcpy is huge. I don't even know if it is
really *that* huge. But several KB is a number that I had seen
stated by other people.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
One of the early programming languages I came across was POP-2. This
was fully dynamic and heap-based, like Lisp, but also had an operand
stack.
In Forth you can define VALUEs that work like these POP-11 variables.
My first programming was on the TI-58C programmable calculator, which
has RCL (recall) and STO (store).
https://isocpp.org/wiki/faq/value-vs-ref-semantics
I don't know why gnu memcpy is huge.
The much-criticized VAX CALL instruction is designed for a software
ecosystem where various languages can call each other, there exists a
common debugger for all of them, etc.
I am sure that they were aware that this call instruction was
expensive, but they expected that it was worth the cost ...
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
I am a software person. When talking about register-memory copies, I
prefer to talk about load and store operations, whether I talk about
assembly language (even one where the mnemonic for these operations is
MOV) or C; in Forth the spoken names for these operations are "fetch" (written: @) and "store" (written: !).
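[A small, hypothetical illustration of the point: even when the source language writes a "move"-looking assignment, the ISA-level reality is a load followed by a store. The assembly shown in the comment is typical gcc -O1 output for x86-64; register choice is illustrative.]

long X, Y;

void assign(void)
{
    X = Y;
    /* Typical x86-64 code:
         movq  Y(%rip), %rax    # load  -- Forth @ ("fetch")
         movq  %rax, X(%rip)    # store -- Forth ! ("store")
    */
}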
Michael S <already5chosen@yahoo.com> writes:
On Wed, 12 Mar 2025 11:28:36 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My experiments were with the code in
<https://github.com/AntonErtl/move/>.
None of those are the simple loops that I mentioned above.
They are not. If you want short code, rep movsb is unbeatable (for memmove(), you have to do a little more, however).
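[The "little more" for memmove() is direction handling when the regions overlap. A minimal, hypothetical sketch of what that looks like around rep movsb (x86-64, GNU-style inline asm; not code from the thread or from the repository above):]

#include <cstddef>

// memcpy always copies forward; memmove must also handle the case where
// dst overlaps the tail of src, here by setting the direction flag and
// copying from the last byte downwards.
void *repmovsb_memmove(void *dst, const void *src, std::size_t n)
{
    unsigned char *d = static_cast<unsigned char *>(dst);
    const unsigned char *s = static_cast<const unsigned char *>(src);

    if (d <= s || d >= s + n) {
        // No harmful overlap: plain forward copy.
        __asm__ volatile("rep movsb"
                         : "+D"(d), "+S"(s), "+c"(n)
                         :
                         : "memory");
    } else {
        // dst overlaps the end of src: copy backwards.
        d += n - 1;
        s += n - 1;
        __asm__ volatile("std; rep movsb; cld"
                         : "+D"(d), "+S"(s), "+c"(n)
                         :
                         : "memory");
    }
    return dst;
}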
I posted performance results in
<2017Sep19.082137@mips.complang.tuwien.ac.at>
<2017Sep20.184358@mips.complang.tuwien.ac.at>
<2017Sep23.174313@mips.complang.tuwien.ac.at>
My routines were generally faster than rep movsb, except for pretty
large blocks (16KB).
Idiots from corporate IT blocked http://al.howardknight.net/
I feel for you. In my workplace, Usenet is blocked (probably unintentionally). I have to post from home.
So, link to google groups
Sorry, I cannot provide that service. Trying to access
groups.google.com tells me:
|Couldn’t sign you in
|
|The browser you’re using doesn’t support JavaScript, or has JavaScript
|turned off.
|
|To keep your Google Account secure, try signing in on a browser that
|has JavaScript turned on.
I certainly won't turn on JavaScript for Google, and apparently Google
wants me to log in to a Google account to access groups.google.com. I
don't have a Google account and I don't want one.
But all I would do is try whether google groups finds the message-ids.
You can do that yourself.
or, if posts are relatively recent, to https://www.novabbs.com/devel/thread.php?group=comp.arch
would be helpful.
The posts are from 2017; these message-ids are not random-generated.
I don't know why gnu memcpy is huge. I don't even know if it is
really *that* huge. But several KB is a number that I had seen
stated by other people.
I stated in one of these messages that I have seen an 11KB memmove in
glibc. Let's see:
objdump -t /debian8/usr/lib/x86_64-linux-gnu/libc.a|grep .text|grep 'memmove'
00000000000001a0 g    i  .text        0000000000000047 __libc_memmove
0000000000000000 g     F .text        000000000000019f __memmove_sse2
00000000000001a0 g    i  .text        0000000000000047 memmove
0000000000000000 g     F .text.ssse3  0000000000000009 __memmove_chk_ssse3
0000000000000010 g     F .text.ssse3  0000000000002b67 __memmove_ssse3
0000000000000000 g     F .text.ssse3  0000000000000009 __memmove_chk_ssse3_back
0000000000000010 g     F .text.ssse3  0000000000002b06 __memmove_ssse3_back
...
Yes, 11111 bytes for __memmove_ssse3. Debian 8 is one of the systems
I used at the time.
Let's see how it looks in Debian 12:
objdump -t /usr/lib/x86_64-linux-gnu/libc.a|grep .text|grep 'memmove'|grep -v wmemmove
0000000000000000 l     F .text          00000000000000f6 __libc_memmove_ifunc
0000000000000000 g    i  .text          00000000000000f6 __libc_memmove
0000000000000000 g    i  .text          00000000000000f6 memmove
0000000000000010 g     F .text.avx      000000000000002f __memmove_avx_unaligned
0000000000000080 g     F .text.avx      00000000000006de __memmove_avx_unaligned_erms
0000000000000010 g     F .text.avx.rtm  000000000000002d __memmove_avx_unaligned_rtm
0000000000000080 g     F .text.avx.rtm  00000000000006df __memmove_avx_unaligned_erms_rtm
0000000000000020 g     F .text.avx512   0000000000000009 __memmove_chk_avx512_no_vzeroupper
0000000000000030 g     F .text.avx512   000000000000073b __memmove_avx512_no_vzeroupper
0000000000000010 g     F .text.evex512  0000000000000037 __memmove_avx512_unaligned
0000000000000080 g     F .text.evex512  00000000000007a0 __memmove_avx512_unaligned_erms
0000000000000020 g     F .text          0000000000000009 __memmove_chk_erms
0000000000000030 g     F .text          000000000000002d __memmove_erms
0000000000000010 g     F .text.evex     0000000000000034 __memmove_evex_unaligned
0000000000000080 g     F .text.evex     00000000000007bb __memmove_evex_unaligned_erms
0000000000000010 g     F .text          0000000000000028 __memmove_sse2_unaligned
0000000000000080 g     F .text          0000000000000552 __memmove_sse2_unaligned_erms
0000000000000040 g     F .text.ssse3    0000000000000f3d __memmove_ssse3
0000000000000000 g     F .text          000000000000000e __memmove_chk
So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.
- anton
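[Anton's earlier expectation that an avx512memmove would be "quite a bit smaller, thanks to predication" comes down to tail handling: with AVX-512 masks the 0..63-byte remainder needs no byte/word/dword cleanup loops at all. A rough, hypothetical sketch of a forward-only (memcpy-style) copy follows; it is not Anton's planned routine, and a real memmove would still need the overlap/direction handling discussed above. AVX512F/AVX512BW intrinsics assumed.]

#include <cstddef>
#include <immintrin.h>   // compile with -mavx512f -mavx512bw

void avx512_copy(void *dst, const void *src, std::size_t n)
{
    char *d = static_cast<char *>(dst);
    const char *s = static_cast<const char *>(src);

    // Full 64-byte blocks with ordinary unaligned loads/stores.
    while (n >= 64) {
        _mm512_storeu_si512(d, _mm512_loadu_si512(s));
        d += 64; s += 64; n -= 64;
    }
    // Remaining 0..63 bytes: one masked load and one masked store.
    __mmask64 k = (static_cast<__mmask64>(1) << n) - 1;
    _mm512_mask_storeu_epi8(d, k, _mm512_maskz_loadu_epi8(k, s));
}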
On Wed, 12 Mar 2025 16:46:36 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 12 Mar 2025 11:28:36 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My experiments were with the code in
<https://github.com/AntonErtl/move/>.
...I posted performance results in
<2017Sep19.082137@mips.complang.tuwien.ac.at>
<2017Sep20.184358@mips.complang.tuwien.ac.at>
<2017Sep23.174313@mips.complang.tuwien.ac.at>
http://al.howardknight.net helped me to see the start of the message,
but not the full message.
And eternal-september is still struggling with restoration of its
archives after the crash of 9 months ago. More and more it looks like
they will never be restored.
... Trying to access groups.google.com tells me:
|Couldn’t sign you in
|
|The browser you’re using doesn’t support JavaScript, or has JavaScript
|turned off.
|
|To keep your Google Account secure, try signing in on a browser that
|has JavaScript turned on.
I certainly won't turn on JavaScript for Google, and apparently Google
wants me to log in to a Google account to access groups.google.com. I
don't have a Google account and I don't want one.
But all I would do is try whether google groups finds the message-ids.
You can do that yourself.