Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Could the VAX have been designed as a
RISC architecture to begin with? Because not doing so meant that, just
over a decade later, RISC architectures took over the “real computer” market and wiped the floor with DEC’s flagship architecture, performance-wise.
The answer was no, the VAX could not have been done as a RISC
architecture. RISC wasn’t actually price-performance competitive until the latter 1980s:
RISC didn’t cross over CISC until 1985. This occurred with the
availability of large SRAMs that could be used for caches.
Like other USA-based computer architects, Bell ignores ARM, which outperformed the VAX without using caches and was much easier to
design.
As for code size, we see significantly smaller code for RISC
instruction sets with 16/32-bit encodings such as ARM T32/A32 and
RV64GC than for all CISCs, including AMD64, i386, and S390x <2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
so much better in this respect that its code is significantly smaller
than for these CPUs.
Bottom line: If you sent, e.g., me and the needed documents back in
time to the start of the VAX project, and gave me a magic wand that
would convince the DEC management and workforce
that I know how to
design their next architecture, and how to compile for it, I would
give the implementation team RV32GC as architecture to implement, and
that they should use pipelining for that, and of course also give that
to the software people.
As a result, DEC would have had an architecture that would have given
them superior performance, they would not have suffered from the
infighting of VAX9000 vs. PRISM etc. (and not from the wrong decision
to actually build the VAX9000), and might still be going strong to
this day. They would have been able to extend RV32GC to RV64GC
without problems, and produce superscalar and OoO implementations.
OTOH, DEC had great success with the VAX for a while, and their demise
may have been unavoidable given their market position: Their customers (especially the business customers of VAXen) went to them instead of
IBM, because they wanted something less costly, and they continued
onwards to PCs running Linux when they provided something less costly.
So DEC would also have needed to outcompete Intel and the PC market to succeed (and IBM eventually got out of that market).
- anton
On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:
Like other USA-based computer architects, Bell ignores ARM, which
outperformed the VAX without using caches and was much easier to
design.
Was ARM around when VAX was being designed (~1973) ??
Found this paper <https://gordonbell.azurewebsites.net/Digital/Bell_Retrospective_PDP11_paper_c1998.htm>
at Gordon Bell’s website. Talking about the VAX, which was designed as
the ultimate “kitchen-sink” architecture, with every conceivable
feature to make it easy for compilers (and humans) to generate code,
he explains:
The VAX was designed to run programs using the same amount of
memory as they occupied in a PDP-11. The VAX-11/780 memory range
was 256 Kbytes to 2 Mbytes. Thus, the pressure on the design was
to have very efficient encoding of programs. Very efficient
encoding of programs was achieved by having a large number of
instructions, including those for decimal arithmetic, string
handling, queue manipulation, and procedure calls. In essence, any
frequent operation, such as the instruction address calculations,
was put into the instruction-set. VAX became known as the
ultimate, Complex (Complete) Instruction Set Computer. The Intel
x86 architecture followed a similar evolution through various
address sizes and architectural fads.
The VAX project started roughly around the time the first RISC
concepts were being researched. Could the VAX have been designed as a
RISC architecture to begin with? Because not doing so meant that, just
over a decade later, RISC architectures took over the “real computer” market and wiped the floor with DEC’s flagship architecture, performance-wise.
The answer was no, the VAX could not have been done as a RISC
architecture. RISC wasn’t actually price-performance competitive until
the latter 1980s:
RISC didn’t cross over CISC until 1985. This occurred with the
availability of large SRAMs that could be used for caches. It
should be noted at the time the VAX-11/780 was introduced, DRAMs
were 4 Kbits and the 8 Kbyte cache used 1 Kbits SRAMs. Memory
sizes continued to improve following Moore’s Law, but it wasn’t
till 1985, that Reduced Instruction Set Computers could be built
in a cost-effective fashion using SRAM caches. In essence RISC
traded off cache memories built from SRAMs for the considerably
faster, and less expensive Read Only Memories that held the more
complex instructions of VAX (Bell, 1986).
Lawrence D'Oliveiro wrote:
Found this paper
<https://gordonbell.azurewebsites.net/Digital/Bell_Retrospective_PDP11_paper_c1998.htm>
at Gordon Bell’s website. Talking about the VAX, which was designed as
the ultimate “kitchen-sink” architecture, with every conceivable
feature to make it easy for compilers (and humans) to generate code,
he explains:
The VAX was designed to run programs using the same amount of
memory as they occupied in a PDP-11. The VAX-11/780 memory range
was 256 Kbytes to 2 Mbytes. Thus, the pressure on the design was
to have very efficient encoding of programs. Very efficient
encoding of programs was achieved by having a large number of
instructions, including those for decimal arithmetic, string
handling, queue manipulation, and procedure calls. In essence, any
frequent operation, such as the instruction address calculations,
was put into the instruction-set. VAX became known as the
ultimate, Complex (Complete) Instruction Set Computer. The Intel
x86 architecture followed a similar evolution through various
address sizes and architectural fads.
The VAX project started roughly around the time the first RISC
concepts were being researched. Could the VAX have been designed as a
RISC architecture to begin with? Because not doing so meant that, just
over a decade later, RISC architectures took over the “real computer”
market and wiped the floor with DEC’s flagship architecture,
performance-wise.
The answer was no, the VAX could not have been done as a RISC
architecture. RISC wasn’t actually price-performance competitive until
the latter 1980s:
RISC didn’t cross over CISC until 1985. This occurred with the
availability of large SRAMs that could be used for caches. It
should be noted at the time the VAX-11/780 was introduced, DRAMs
were 4 Kbits and the 8 Kbyte cache used 1 Kbits SRAMs. Memory
sizes continued to improve following Moore’s Law, but it wasn’t
till 1985, that Reduced Instruction Set Computers could be built
in a cost-effective fashion using SRAM caches. In essence RISC
traded off cache memories built from SRAMs for the considerably
faster, and less expensive Read Only Memories that held the more
complex instructions of VAX (Bell, 1986).
If you look at the VAX 8800 or NVAX uArch you see that even in 1990 it
was still taking multiple clocks to serially decode each instruction and
that basically stalls away any benefits a pipeline might have given.
If they had just put in *the things they actually use*
(as shown by DEC's own instruction usage stats from 1982),
and left out all the things that they rarely or never use,
it would have had 50 or so opcodes instead of 305,
at most one operand that addressed memory on arithmetic and logic
opcodes
with 3 address modes (register, register address, register offset
address)
instead of 0 to 5 variable length operands with 13 address modes each
(most combinations of which are either silly, redundant, or illegal).
Then they would have been able to parse instructions in one clock,
which makes pipelining a possible consideration,
and simplifies the uArch so now it can all fit on one chip,
which allows it to compete with RISC.
The reason it was designed the way it was, was because DEC had
microcode and microprogramming on the brain.
In this 1975 paper Bell and Strecker say it over and over and over.
They were looking at the cpu design as one large parsing machine
and not as a set of parallel hardware tasks.
This was their mental mindset just before they started the VAX design:
What Have We Learned From PDP11, Bell Strecker, 1975 https://gordonbell.azurewebsites.net/Digital/Bell_Strecker_What_we%20_learned_fm_PDP-11c%207511.pdf
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
The answer was no, the VAX could not have been done as a RISC architecture. RISC wasn’t actually price-performance competitive until the latter 1980s:
RISC didn’t cross over CISC until 1985. This occurred with the
availability of large SRAMs that could be used for caches.
Like other USA-based computer architects, Bell ignores ARM, which >>outperformed the VAX without using caches and was much easier to
design.
That's not a fair comparison. VAX design started in 1975 and shipped in 1978. The first ARM design started in 1983 with working silicon in 1985. It was a decade later.
On 3/1/2025 5:58 AM, Anton Ertl wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Would likely need some new internal operators to deal with bit-array operations and similar, with bit-ranges allowed as a pseudo-value type
(may exist in constant expressions but will not necessarily exist as an actual value type at runtime).
Say:
val[63:32]
Has the (63:32) as a BitRange type, which then has special semantics
when used as an array index on an integer type, ...
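For reference, here is a plain-C rendering of what the val[63:32] case
above would have to do, assuming val is a uint64_t; the helper names are
made up for illustration:

  #include <stdint.h>

  /* read val[63:32] */
  static uint32_t get_63_32(uint64_t val)
  {
      return (uint32_t)(val >> 32);
  }

  /* write val[63:32] = x, leaving the low half untouched */
  static uint64_t set_63_32(uint64_t val, uint32_t x)
  {
      return (val & 0x00000000FFFFFFFFull) | ((uint64_t)x << 32);
  }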
The previous idea for bitfield extract/insert had turned into a
composite BITMOV instruction that could potentially do both operations
in a single instruction (along with moving a bitfield directly between
two instructions).
The idea here is that it does, essentially, a combination of a shift and a masked bit-select, say:
Low 8 bits of immediate encode a shift in the usual format:
Signed 8-bit shift amount, negative is right shift.
High bits give a pair of bit-offsets used to compose a bit-mask.
These will MUX between the shifted value and another input value.
I am still not sure whether this would make sense in hardware, but it is
not entirely implausible to implement in the Verilog (a rough C model of
the operation is sketched below, after the pipeline outline).
Would likely be a 2 or 3 cycle operation, say:
EX1: Do a Shift and Mask Generation;
May reuse the normal SHAD unit for the shift;
Mask-Gen will be specialized logic;
EX2:
Do the MUX.
EX3:
Present MUX result as output (passed over from EX2).
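Here is a rough C model of the BITMOV operation as described above; the
immediate field layout is only my guess from the description, not a
definitive encoding:

  #include <stdint.h>

  /* Shift one source, build a mask from two bit offsets, then MUX
     between the shifted value and a second input (the insert target). */
  static uint64_t bitmov(uint64_t src, uint64_t other, uint32_t imm)
  {
      int8_t   sh = (int8_t)(imm & 0xFF);        /* signed shift, negative = right */
      unsigned lo = (imm >>  8) & 0x3F;          /* low  end of the selected field */
      unsigned hi = (imm >> 16) & 0x3F;          /* high end of the selected field */

      uint64_t shifted = (sh >= 0) ? (src << sh) : (src >> -sh);

      /* mask covering bits hi..lo inclusive (assumes hi >= lo) */
      uint64_t mask = (hi - lo >= 63) ? ~0ull
                    : ((1ull << (hi - lo + 1)) - 1) << lo;

      return (shifted & mask) | (other & ~mask); /* the MUX step */
  }

  /* e.g. y[55:48]=x[19:12] would be bitmov(x, y, imm) with sh=+36, lo=48, hi=55 */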
The other thing is that the VAX 11/780 (released 1977) had a 2KB cache,
so Bell's argument that caches were only available around 1985 does not
hold water on that end, either.
IBM tried to commercialize it in the ROMP in the IBM RT PC; Wikipedia
says: "The architectural work on the ROMP began in late spring of
1977, as a spin-off of IBM Research's 801 RISC processor ... The first examples became available in 1981, and it was first used commercially
in the IBM RT PC announced in January 1986. ... The delay between the completion of the ROMP design, and introduction of the RT PC was
caused by overly ambitious software plans for the RT PC and its
operating system (OS)." And IBM then designed a new RISC, the
RS/6000, which was released in 1990.
It almost seems like they could have tried making a PDP-11 based PC.
DEC could have maybe had a marketing advantage in, say, "Hey, our crap
can run UNIX" and "UNIX is better than DOS".
How many clocks did Alpha take to process each instruction?
On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:
As for code size, we see significantly smaller code for RISC
instruction sets with 16/32-bit encodings such as ARM T32/A32 and
RV64GC than for all CISCs, including AMD64, i386, and S390x
<2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
so much better in this respect that its code is significantly smaller
than for these CPUs.
VAX's advantage was it executed fewer instructions (VAX only executed
65% of the number of instructions R2000 executed.)
Bottom line: If you sent, e.g., me and the needed documents back in
time to the start of the VAX project, and gave me a magic wand that
would convince the DEC management and workforce
You would also have to convince the Computer Science department at
CMU; Where a lot of VAX ideas were dreamed up based on the success
of the PDP-11.
A pipelined machine in 1978 would have had 50% to 100% more circuit
boards than VAX 11/780, making it a lot more expensive.
The design point you target for the original VAX would have taken significantly longer to design, debug, and ship.
On Sat, 01 Mar 2025 11:58:17 GMT, Anton Ertl wrote:
Like other USA-based computer architects, Bell ignores ARM, which
outperformed the VAX without using caches and was much easier to design.
While those ARM chips were legendary for their low power consumption (and
low transistor count), those Archimedes machines were not exactly low-
cost, as I recall.
Without caches, did they have to use faster (and therefore more expensive) memory?
Or did they fall back on the classic “wait states”?
On Sat, 01 Mar 2025 22:25:26 GMT, Anton Ertl wrote:
The other thing is that the VAX 11/780 (released 1977) had a 2KB cache,
so Bell's argument that caches were only available around 1985 does not
hold water on that end, either.
It was about the sizes of the caches and hence their contribution to the cost.
Not sure about what instruction scheduling was like on the Alpha,
On Sun, 02 Mar 2025 09:34:37 GMT, Anton Ertl wrote:...
mitchalsup@aol.com (MitchAlsup1) writes:
A pipelined machine in 1978 would have had 50% to 100% more circuit boards than VAX 11/780, making it a lot more expensive.
You could look at the MIT Lisp Machine, it used basically the same chips
as a VAX 11/780 but was a pipelined load/store architecture internally.
That's not a fair comparison. VAX design started in 1975 and shipped in 1978. The first ARM design started in 1983 with working silicon in 1985. It was a decade later.
The point is that ARM outperformed VAX without using caches. DRAM
with 800ns cycle time was available in 1971 (the Nova 800 used it).
By 1977, when the VAX 11/780 was released, certainly faster DRAM was available.
On 3/2/2025 5:46 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
It almost seems like they could have tried making a PDP-11 based PC.
I dimly remember that there were efforts in that direction. But the
PDP-11 does not even have the cumbersome support for more than 64KB
that the 8086 has (there were PDP-11s with more, but that was even
more cumbersome to use).
I had thought it apparently used a model similar to the 65C816.
Namely, that you could address 64K code + 64K data at a time, but then
load a value into a special register to access different RAM banks.
Granted, no first hand experience with PDP-11.
DEC also tried their hand in the PC-like business (DEC Rainbow 100).
They did not succeed. Maybe that's the decisive difference from HP:
They did succeed in the PC market.
I guess they could have also tried competing against the Commodore 64
and Apple II, which were also popular around that era.
No idea how their pricing compared with the IBM PC's, but in any case,
those who had success were generally a lot cheaper.
Well, except for the Macintosh apparently, which managed to survive with
its comparatively higher costs.
mitchalsup@aol.com (MitchAlsup1) writes:
On Sat, 1 Mar 2025 11:58:17 +0000, Anton Ertl wrote:
As for code size, we see significantly smaller code for RISC
instruction sets with 16/32-bit encodings such as ARM T32/A32 and
RV64GC than for all CISCs, including AMD64, i386, and S390x
<2024Jan4.101941@mips.complang.tuwien.ac.at>. I doubt that VAX fares
so much better in this respect that its code is significantly smaller
than for these CPUs.
VAX's advantage was it executed fewer instructions (VAX only executed
65% of the number of instructions R2000 executed.)
This agrees with my estimate that a CPU with 3 RV32GC MIPS would have
the same performance as a CPU with 2 VAX MIPS (2/3 ≈ 65%).
Bottom line: If you sent, e.g., me and the needed documents back in
time to the start of the VAX project, and gave me a magic wand that
would convince the DEC management and workforce
You would also have to convince the Computer Science department at
CMU; Where a lot of VAX ideas were dreamed up based on the success
of the PDP-11.
Yes, include that in my magic wand.
A pipelined machine in 1978 would have had 50% to 100% more circuit
boards than VAX 11/780, making it a lot more expensive.
What makes you think that a pipelined single-issue RV32GC would take
more circuit boards than VAX11/780?
I have no data about discrete implementations, but if we look at
integrated ones and assume that the number of transistors or the area
corresponds to the number of circuit boards in discrete
implementations, the evidence goes in the opposite direction:
huge portion of the transistor count was ROM
Transistors  area   proc   CPU
125,000      74.82  3um    MicroVAX 78032 (integer-only, some instructions missing)
 68,000      44     3.5um  68000 (integer-only, no MMU)
2/3rds of the transistor count in ROM
 45,000      58.52  2um    ROMP (integer-only, no MMU, three pipeline stages)
Twice the 68K data path transistor count.
 25,000      50     3um    ARM1 (integer-only, no MMU, pipelined)
This gives some credence that it can be done
110,000      ?      1.2um  SPARC MB86900 (integer-only, pipelined)
110,000      80     2um    MIPS R2000 (integer-only, pipelined)
These two counteract that credence, with 40K of those transistors
It seems that the MMU cost a lot of transistors, while the pipelining
did not, as especially the ARM1 shows.
The design point you target for the original VAX would have taken significantly longer to design, debug, and ship.
What makes you think so? A major selling point of RISC especially
compared to the VAX was that the reduced instruction-set complexity
reduces the implementation effort.
And the fact that the students of
Berkeley and Stanford could produce their prototypes in a short time
lends credibility to the claim.
You write that VAX work began in 1973; it was introduced in 1977 (but
when were machines shipped to customers?), which would mean that
development also took 4 years. According to <https://en.wikipedia.org/wiki/VAX-11>, development began in 1976, but
that is hard to believe, especially given the CISC-based problems such
as having to keep many pages in physical memory at the same time.
- anton
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
How many clocks did Alpha take to process each instruction?
For the 21064 see slide 15 of <https://people.eecs.berkeley.edu/~randy/Courses/CS252.S96/Lecture19.pdf>
I.e., about 1 CPI for Ear, and about 4.3 CPI for TCP-B, with other
benchmarks in between.
Theoretical bottom CPI (peak performance) of the 21064 is 0.5.
- anton
And Macintosh was initially successful as a sort of niche machine
for "creative types", as opposed to "business users" who used PCs.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
MIPS used 64KB caches for the R2000? Because they could, in 1986.
Motorola used 16KB caches for the 88000? Obviously 64KB is not all
that necessary. Acorn used a 4KB shared cache for ARM3? Because it
allowed them to do it on a single chip; it still gives good benefits.
My impression is that Bell was just grasping at straws to justify
their wrong choices.
He looked at other differences (rather than the instruction set) between the MIPS R2000 and the VAX, and if it
represented something that was not available at acceptable cost in
1977 (in particular, 64KB caches), he used it as justification for the
VAX.
- anton
We wasted a lot of time explaining why we weren't going to do random
IBM stuff of which the most memorable was user labels in the inodes
(well, OS DASD has them.)
My impression is that Bell was just grasping at straws to justify their
wrong choices.
But academic efforts do not result in industrial quality products.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
That's not a fair comparison. VAX design started in 1975 and shipped in 1978.
The first ARM design started in 1983 with working silicon in 1985. It was a decade later.
The point is that ARM outperformed VAX without using caches. DRAM
with 800ns cycle time was available in 1971 (the Nova 800 used it).
By 1977, when the VAX 11/780 was released, certainly faster DRAM was available.
How was the code density?
I know ARM was pretty good but VAX
was fantastic since they sacrificed everything else to compact instructions.
I had thought it apparently used a model similar to the 65C816.
Namely, that you could address 64K code + 64K data at a time, but then
load a value into a special register to access different RAM banks.
That was not what customers were interested in. There were various
Unix variants available for the PC, but the customers preferred using
DOS, which was preinstalled and did not cost extra. ...
On Sun, 2 Mar 2025 21:26:34 +0000, MitchAlsup1 wrote:
But academic efforts do not result in industrial quality products.
*Cough* Unix *cough*
On Sun, 2 Mar 2025 21:57:57 +0000, Lawrence D'Oliveiro wrote:
On Sun, 2 Mar 2025 21:26:34 +0000, MitchAlsup1 wrote:
But academic efforts do not result in industrial quality products.
*Cough* Unix *cough*
Not sure you can call Bell Labs academia.
I know ARM was pretty good but VAX
was fantastic since they sacrificed everything else to compact instructions.
I don't think they did. They spent encoding space on instructions
that were very rare, and AFAIK instructions can be encoded that do not
work (e.g., a constant as destination). The major idea seems to have
been orthogonality, not compactness.
Nearly all opcodes were one byte other than the extended format floating point instructions so it's hard to see how they could have made that
much smaller without making it a lot more complicated.
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
That's not a fair comparison. VAX design started in 1975 and shipped in 1978.
The first ARM design started in 1983 with working silicon in 1985. It was a decade later.
The point is that ARM outperformed VAX without using caches. DRAM
with 800ns cycle time was available in 1971 (the Nova 800 used it).
By 1977, when the VAX 11/780 was released, certainly faster DRAM was available.
How was the code density?
I have no data on that. Interestingly, unlike the 68k, which was
outcompeted by RISCs at around the same time, the VAX did not have an afterlife of hobbyists who produced Linux and Debian ports, so I
cannot easily make a comparison.
Nearly all opcodes were one byte other than the extended format floating point instructions so it's hard to see how they could have made that much smaller without making it a lot more complicated.
The VAX is still supported with gcc and binutils, with newlib as
its C library, so building up a tool chain for assembly/disassembly
should be doable with a few (CPU) hours; you can then compare
sizes.
John Levine <johnl@taugh.com> writes:
How was the code density?
I have no data on that. Interestingly, unlike the 68k, which was
outcompeted by RISCs at around the same time, the VAX did not have an afterlife of hobbyists who produced Linux and Debian ports, so I
cannot easily make a comparison.
And looking at my latest code size measurements <2024Jan4.101941@mips.complang.tuwien.ac.at>, both armhf (ARM T32) and riscv64 (RV64GC) result in shorter code than IA-32 and AMD64:
bash grep gzip
595204 107636 46744 armhf
599832 101102 46898 riscv64
796501 144926 57729 amd64
853892 152068 61124 i386
On Sun, 2 Mar 2025 13:19:32 +0000, Anton Ertl wrote:
My impression is that Bell was just grasping at straws to justify
their wrong choices.
Likely, but looking at it from the originating time perspective,
VAX would have lost PDP-11 compatibility if it were more RISC-like.
NetBSD still has a VAX port, so the sizes of pre-built packages from
there might be informative.
Anton Ertl wrote:
They did not succeed. Maybe that's the decisive difference from HP:
They did succeed in the PC market.
For some definition of success, i.e., they were sufficiently worse at PCs
to later merge with Compaq who was the first significant vendor in the
PC Compatible marketplace.
Columbia beat both of them by half a year or
so, but faded away a bit later.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
They did not succeed. Maybe that's the decisive difference from HP:
They did succeed in the PC market.
For some definition of success, i.e they were sufficiently worse at PCs
to later merge with Compaq who was the first significant vendor in the
PC Compatible marketplace.
Pfeiffer got Compaq into trouble by buying DEC and not being able to
digest it. HP then bought Compaq and was able to digest all the
parts, leading to a successful PC business (I have no idea how much
Compaq contributed to that and how much HP did) and a successful HPE;
pretty much all of the stuff coming from/through DEC went away (I
think the Tandem legacy may still be identifiable), but maybe they
managed to keep the customers.
Columbia beat both of them by half a year or
so, but faded away a bit later.
I don't think I ever heard about Columbia. At what did they beat
Compaq and HP?
Note that the “big bang” arrival of RISC in the
latter 1980s is pretty much in agreement with his timeline.
It seems newer gcc is much worse than older versions at generating compact i386 code.
MarkC <usenet@em-ess-see-twenty-seven.me.uk> writes:
NetBSD still has a VAX port, so the sizes of pre-built packages from
there might be informative.
Thanks. The NetBSD pkgsrc is not limited to NetBSD, and there is a
wide variety of prebuilt stuff there. I took those that sound like architecture names (and probably belong to NetBSD): aarch64 alpha
amd64 earmv7hf i386 m68k mips64eb mipsel powerpc sparc sparc64 vax
Unfortunately, they do not seem to port to RISC-V in any form yet, and
their earmv7hf port uses ARM A32, not T32. So the NetBSD competition
is performed without entries for those two instruction set encodings
that showed the smallest code sizes on Debian. Anyway, here are the
results:
          bash    grep     xz
  710838           42236   m68k
  748354  159304   40930   vax
  829077  176836   42840   amd64
  855400  164188           aarch64
  877284  186924   48032   sparc
  882847  187203   49866   i386
  898532  179844           earmv7hf
  962128  205776   54704   powerpc
 1004864  192256   53632   sparc64
 1025136           51160   mips64eb
 1147664  232688   63456   alpha
 1172692                   mipsel
Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
A pipelined machine in 1978 would have had 50% to 100% more circuit
boards than VAX 11/780, making it a lot more expensive.
What makes you think that a pipelined single-issue RV32GC would take
more circuit boards than VAX11/780? I have no data about discrete
implementations, but if we look at integrated ones and assume that the
number of transistors or the area corresponds to the number of circuit
boards in discrete implementations, the evidence goes in the opposite
direction:
The first article in this Mar-1987 HP Journal is about the
HP 9000 MODEL 840 and HP 3000 Series 930 implementing the HP PA ISA.
The cpu is 5 boards, 6 with FPU, built with standard and FAST TTL.
Implementation started in Apr-1983, prototype ready early 1984.
"[3 stage] pipeline fetches and executes an instruction every 125 ns,
a 4096-entry translation lookaside buffer (TLB) for high-speed address translation, and 128K bytes of cache memory."
"The measured MIPS rate for the Model 840 varies from
about 3.5 to 8 MIPS with an average of 4.5 to 5."
which at 125 ns, 8 MHz clock is an IPC of 0.43 to 1.0, avg 0.625.
https://archive.org/download/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard.pdf
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Have TTL chips advanced between the VAX and the first HP-PA
implementation? I don't think so.
Oh yes, they did; there were nine years between the launch of the
VAX and the launch of HP-PA.
According to https://www.openpa.net/pa-risc_processor_pa-early.html#ts-1
the first HP-PA CPU was introduced in 1986, and you can see pictures
at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/
For example, you could buy state machines programmable by FPGA in 1986,
which was not available in 1977. (No idea if HP used them or not).
MarkC <usenet@em-ess-see-twenty-seven.me.uk> writes:
NetBSD still has a VAX port, so the sizes of pre-built packages from
there might be informative.
Thanks. The NetBSD pkgsrc is not limited to NetBSD, and there is a wide variety of prebuilt stuff there. I took those that sound like
architecture names (and probably belong to NetBSD): aarch64 alpha amd64 earmv7hf i386 m68k mips64eb mipsel powerpc sparc sparc64 vax
Unfortunately, they do not seem to port to RISC-V in any form yet, and
their earmv7hf port uses ARM A32, not T32. So the NetBSD competition is performed without entries for those two instruction set encodings that
showed the smallest code sizes on Debian. Anyway, here are the results:
If your aim is small code size, it is better to compare output compiled
with -Os.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Note that the “big bang” arrival of RISC in the latter 1980s is pretty much in agreement with his timeline.
Correlation does not prove causation.
... while the guy who hired me kept his beloved DEC Rainbow which he
felt had the better architecture:
For one thing they did not break Intel's rules about where to place the interrupt vectors. In hindsight this was a bad decision since 100% compatibility with Microsoft Flight Simulator was an absolute
requirement at the time.
Anton Ertl wrote:
They did not succeed. Maybe that's the decisive difference from HP:
They did succeed in the PC market.
... VAX has 16 GPRs ...
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
A pipelined machine in 1978 would have had 50% to 100% more circuit
boards than VAX 11/780, making it a lot more expensive.
What makes you think that a pipelined single-issue RV32GC would take
more circuit boards than VAX11/780? I have no data about discrete
implementations, but if we look at integrated ones and assume that the
number of transistors or the area corresponds to the number of circuit
boards in discrete implementations, the evidence goes in the opposite
direction:
The first article in this Mar-1987 HP Journal is about the
HP 9000 MODEL 840 and HP 3000 Series 930 implementing the HP PA ISA.
The cpu is 5 boards, 6 with FPU, built with standard and FAST TTL.
Implementation started in Apr-1983, prototype ready early 1984.
<https://people.csail.mit.edu/emer/media/papers/1999.06.retrospective.vax.pdf>
says:
|the VAX 11/780 CPU spanned about 20 boards.
Have TTL chips advanced between the VAX and the first HP-PA
implementation? I don't think so. Are the boards of a different
size? If the answers to both questions are "no", this would be counterevidence to Mitch Alsup's claim.
"[3 stage] pipeline fetches and executes an instruction every 125 ns,
a 4096-entry translation lookaside buffer (TLB) for high-speed address
translation, and 128K bytes of cache memory."
"The measured MIPS rate for the Model 840 varies from
about 3.5 to 8 MIPS with an average of 4.5 to 5."
which at 125 ns, 8 MHz clock is an IPC of 0.43 to 1.0, avg 0.625.
It's interesting that this HP machine needed a cache at 8MHz, while
the contemporary ARM2 could run from DRAM at the same speed. But
then, the HP machine supports bigger memories, and includes an MMU,
both of which slow things down.
https://archive.org/download/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard/Hewlett-Packard_Journal_Vol._38_No._3_1987-03_Hewlett-Packard.pdf
- anton
That was not what customers were interested in. There were various
Unix variants available for the PC, but the customers preferred using
DOS, which was preinstalled and did not cost extra. ...
Yup. PC/IX was a really nice Unix port for the IBM PC and nobody was interested.
On Mon, 3 Mar 2025 17:53:35 -0000 (UTC), Thomas Koenig wrote:
If your aim is small code size, it is better to compare output compiled
with -Os.
Then it becomes an artificial benchmark, trying to minimize code size at
the expense of real-world performance.
Remember, VAX was built for real-world use, not for academic benchmarks.
You could compare sizes of applications in the base.tgz tarball for each architecture; this is available for RISC-V as well as all the others.
I found an ARM2 manual and the short answer is that the chip
drives the RAS and CAS signals to the dram directly.
The chip's clock is adjustable from 100 kHz to 10 MHz
and you match the cpu clock to your dram timing.
There is no READY line on the memory bus.
It does have one interesting feature that if the current address
is sequential to the prior one it skips the RAS cycle.
The Motorola Memory Book from 1979 shows MCM4027A 4kb*1 drams with
80 to 165 ns CAS access, 120 to 250 RAS access, 320 to 375 R/W cycle.
Similar numbers for MCM4116A 16kb*1 R/W cycle of 500 ns.
VAX probably used 4kb 500 ns drams.
On 3/2/25 5:27 PM, John Levine wrote:
Yup. PC/IX was a really nice Unix port for the IBM PC and nobody
was interested.
As this (the kernel part) was my project, it was very
disappointing. I think IBM priced such that with DOS being "free",
it had no chance.
On Mon, 03 Mar 2025 17:21:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Digital sold solid PCs in the 1990s. Some under brand DECpc, others
under brand DEC Station.
On Mon, 3 Mar 2025 19:53:12 +0200, Michael S
<already5chosen@yahoo.com> wrote:
On Mon, 03 Mar 2025 17:21:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Digital sold solid PCs in the 1990s. Some under brand DECpc, others
under brand DEC Station.
Was there an Intel based DECstation?
The only ones I ever saw were MIPS based.
<searches>
Ahh! Wikipedia says there were 3 different DECstation lines: one based
on PDP-8, another based on MIPS, and yet another based on Intel.
Naturally one has to scan/read the entire article to find the Intel references.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Mon, 3 Mar 2025 17:53:35 -0000 (UTC), Thomas Koenig wrote:
If your aim is small code size, it is better to compare output
compiled with -Os.
Then it becomes an artificial benchmark, trying to minimize code size
at the expense of real-world performance.
Remember, VAX was built for real-world use, not for academic
benchmarks.
And supposedly the real-world constraints at the time made it necessary
to minimize code size.
In the current discussion we look at how RV32GC might have fared under
this constraint.
On Tue, 04 Mar 2025 16:32:42 -0500 George Neuner <gneuner2@comcast.net> wrote:
Ahh! Wikipedia says there were 3 different DECstation lines: one based
on PDP-8, another based on MIPS, and yet another based on
Naturally one has to scan/read the entire article to find the Intel
references.
The names I dug out of a 1994 Byte issue are DEC Celebris desktop and
HiNote laptop. But those are not the names I had in mind.
According to Wikipedia, PC/IX cost $900 and was released
in 1984. By that time, there was a lot of business software and games available for DOS, but presumably, very little for PC/IX?
In the current discussion we look at how RV32GC might have fared under
this constraint.
Sure. Except you need a much more complicated and resource-hungry compiler than would have been reasonable to run on a VAX back then.
How would you have done games without being able to directly
address screen memory? I'm sure PC/IX, being a Unix-type system,
would have disallowed that.
On Wed, 5 Mar 2025 01:28:18 +0200, Michael S wrote:
On Tue, 04 Mar 2025 16:32:42 -0500 George Neuner
<gneuner2@comcast.net> wrote:
Ahh! Wikipedia says there were 3 different DECstation lines: one
based on PDP-8, another based on MIPS, and yet another based on
Intel. Naturally one has to scan/read the entire article to find
the Intel references.
The names I dug out of a 1994 Byte issue are DEC Celebris desktop and
HiNote laptop. But those are not the names I had in mind.
Entirely different decades.
Robert Swindells <rjs@fdy2.co.uk> writes:
On Sun, 02 Mar 2025 09:34:37 GMT, Anton Ertl wrote:...
mitchalsup@aol.com (MitchAlsup1) writes:
A pipelined machine in 1978 would have had 50% to 100% more circuit boards than VAX 11/780, making it a lot more expensive.
You could look at the MIT Lisp Machine, it used basically the same chips
as a VAX 11/780 but was a pipelined load/store architecture internally.
And what was the effect on the number of circuit boards? What effect
did the load/store architecture have, and what effect did the pipelining have?
It's been a number of years since I read about Lisp Machines and
Symbolics. My impression was that they were both based on CISCy ideas;
it's about closing the semantic gap, no? Load/store would surprise me.
And when the RISC revolution came, they could not compete. The RISCy
way to Lisp implementation was explored in SPUR (and Smalltalk in SOAR)
(one of which counts as RISC-III and the other as RISC-IV, I don't
remember which), and commercialized in SPARC's instructions with support
for tags (not used in the Lisp system that a former comp.arch regular contributed to).
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Have TTL chips advanced between the VAX and the first HP-PA
implementation? I don't think so.
Oh yes, they did; there were nine years between the launch of the
VAX and the launch of HP-PA.
So what?
and you can see pictures
at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/
Nice! The pictures are pretty good. I can read the markings on the
chip. The first chip I looked at was marked 74AS181. TI introduced
the 74xx series of TTL chips starting in 1964, and when I read TTL, I expected to see 74xx chips. The 74181 was introduced in February
1970, and I expected it to be in the HP-PA CPU, as well as in the VAX
11/780. <https://en.wikipedia.org/wiki/74181> confirms my expectation
for the VAX, and the photo confirms my expectation for the first HP-PA
CPU.
The AS family was only introduced in 1980, so there were some advances
between the VAX and this HP-PA CPU indeed. However, as far as the
number of boards is concerned, a 74AS181 takes as much space as a
plain 74181, so that difference is irrelevant for that aspect.
I leave it to you to point out a chip on the HP-PA CPU that did not
have a same-sized variant available in, say, 1975.
Anton Ertl [2025-03-01 11:58:17] wrote:
Bottom line: If you sent, e.g., me and the needed documents back in
time to the start of the VAX project, and gave me a magic wand that
would convince the DEC management and workforce that I know how to
design their next architecture, and how to compile for it, I would
give the implementation team RV32GC as architecture to implement, and
that they should use pipelining for that, and of course also give that
to the software people.
I wonder if an RV32GC would be competitive if implemented in the
technology available back in 1977 (when the VAX-11/780 came out,
according to Wikipedia).
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Have TTL chips advanced between the VAX and the first HP-PA
implementation? I don't think so.
Oh yes, they did; there were nine years between the launch of the
VAX and the launch of HP-PA.
So what?
and you can see pictures
at https://computermuseum.informatik.uni-stuttgart.de/dev_en/hp9000_840/
Nice! The pictures are pretty good. I can read the markings on the
chip. The first chip I looked at was marked 74AS181. TI introduced
the 74xx series of TTL chips starting in 1964, and when I read TTL, I
expected to see 74xx chips. The 74181 was introduced in February
1970, and I expected it to be in the HP-PA CPU, as well as in the VAX
11/780. <https://en.wikipedia.org/wiki/74181> confirms my expectation
for the VAX, and the photo confirms my expectation for the first HP-PA
CPU.
The AS family was only introduced in 1980, so there were some advances
between the VAX and this HP-PA CPU indeed. However, as far as the
number of boards is concerned, a 74AS181 takes as much space as a
plain 74181, so that difference is irrelevant for that aspect.
I leave it to you to point out a chip on the HP-PA CPU that did not
have a same-sized variant available in, say, 1975.
What I found intriguing are the chips that have numbers on paper
on them, like 09740-81710. That chip has a MMI logo still sticking
out. This is the logo of Monolithic Memories, Inc. which developed
the PAL chips of "Soul of a New Machine" and Eclipse MV 8000 fame.
At https://en.wikipedia.org/wiki/Programmable_Array_Logic you can
see the logo of the company.
PALs were not available for the VAX development, and they certainly
made implementing logic far less cumbersome, and they took up far less
space than their equivalent in logic gates (again, as described in
"The Soul of a New Machine", where Tom West gambled the development
on MMI getting its act together).
Given a (very rough) estimate that each PAL replaced four standard
logic chips of similar size, my guess would be that it saved
them the equivalent of two to three circuit boards, not bad.
Another striking thing is how densely the circuit boards are packed,
compared to the VAX boards one finds. I suspect they had access
to more layers of printed circuit board than DEC ten years earlier.
The same word layout of using the "free" lower bits for tags when
you know that objects are aligned to larger boundaries is still used
in most Lisp systems today, just without any hardware support; you
need to generate instructions to shift down an integer value before
using it.
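A minimal C sketch of that layout, assuming 8-byte-aligned objects so
the low 3 bits of a word are free for a tag (all names here are made up
for illustration):

  #include <stdint.h>

  #define TAG_BITS   3
  #define TAG_MASK   ((uintptr_t)((1 << TAG_BITS) - 1))
  #define TAG_FIXNUM 1                  /* arbitrary tag value for small integers */

  typedef uintptr_t lispval;

  static lispval  make_fixnum(intptr_t n) { return ((uintptr_t)n << TAG_BITS) | TAG_FIXNUM; }
  static int      is_fixnum(lispval v)    { return (v & TAG_MASK) == TAG_FIXNUM; }
  /* the shift-down mentioned above, needed before arithmetic on the raw value */
  static intptr_t fixnum_value(lispval v) { return (intptr_t)v >> TAG_BITS; }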
On Wed, 5 Mar 2025 15:01:11 -0000 (UTC), Robert Swindells wrote:
The same word layout of using the "free" lower bits for tags when
you know that objects are aligned to larger boundaries is still used
in most Lisp systems today, just without any hardware support, you
need to generate instructions to shift down an integer value before
using it.
*Lightbulb moment*
How much would it cost in hardware to add support for ignoring some bottommost N bits (N fixed? configurable?) for most accesses?
This ties in with my idea that it would have been useful to reserve the bottom 3 bits for a bit offset, albeit ignored (or even MBZ) by normal load/store instructions.
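In software terms, "ignore the bottommost N bits on most accesses" would
amount to something like the following, here with N = 3; the helper name
is made up for illustration:

  #include <stdint.h>

  static uint64_t load64_ignoring_low_bits(const void *tagged_ptr)
  {
      uintptr_t p = (uintptr_t)tagged_ptr & ~(uintptr_t)7;   /* strip low 3 bits */
      return *(const uint64_t *)p;
  }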
ARM2 had no caches, but was still table-topping in its era.
By contrast, making good use of the complex instructions of VAX in a
compiler consumed significant resources (e.g., Figure 2 of
https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
factor 1.5 more code in the code generator for VAX than for RISC-II).
Compilers at the time did not use the CISCy features much, which is
one reason why the IBM 801 project and later the Berkeley RISC and
Stanford MIPS proposed replacing them with a load/store architecture.
Given a (very rough) estimate that each PAL replaced four standard
logic chips of similar size, my guess would be that it saved
them the equivalent of two to three circuit boards, not bad.
Another striking thing is how densely the circuit boards are packed,
compared to the VAX boards one finds. I suspect they had access
to more layers of printed circuit board than DEC ten years earlier.
MMI's PAL Programmable Array Logic is a subset of a Programmable Logic Array. PAL has programmable AND matrix but a fixed OR matrix.
PLA has both AND and OR matrix programmable.
Mask programmed PLA's were available since 1970, and field programmable FPLA's available in 1976 from a number of suppliers (e.g. Signetics). https://en.wikipedia.org/wiki/Programmable_logic_array
If one was building a RISC style ISA cpu in 1975 they could be used
for decoding and state machines for fetch, load/store, page table walker.
I don't know the price.
I'm not so sure. The IBM Fortran H compiler used a lot of the 360's instruction
set and it is my recollection that even the dmr C compiler would generate memory
to memory instructions when appropriate. The PL.8 compiler generated code for 5
architectures including S/360 and 68K, and I think I read somewhere that its S/360 code was considerably better than the native PL/I compilers.
I get the impression that they found that once you have a reasonable number of
registers, like 16 or more, the benefit of complex instructions drops because you can make good use of the values in the registers.
In article <vq82c8$232tl$7@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:
How would you have done games without being able to directly
address screen memory? I'm sure PC/IX, being a Unix-type system,
would have disallowed that.
How?
There's no memory management hardware in an 8088, and PC/IX ran on a
basic PC/XT.
John
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
By contrast, making good use of the complex instructions of VAX in a compiler consumed significant resources (e.g., Figure 2 of https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
factor 1.5 more code in the code generator for VAX than for
RISC-II). Compilers at the time did not use the CISCy features
much, which is one reason why the IBM 801 project and later the
Berkeley RISC and Stanford MIPS proposed replacing them with a
load/store architecture.
VAX instructions are very complex and much of that complexity is hard
to use in compilers. But even an extremely simple compiler
can generate load-op combinations, decreasing the number of instructions.
A rather simple hack is enough to combine additions in address
arithmetic into an addressing mode. Also, operations with two or three
memory addresses are easy to generate from a compiler. I think
that chains of pointer dereferences in C should not be hard to
convert to indirect addressing modes.
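As a rough illustration of the kind of folding meant here, standard VAX
addressing modes let even a simple compiler map statements like the
following onto single instructions; the mnemonics in the comments are
the usual VAX ones, but the register assignments are hypothetical:

  long a, b, c, v[100];

  void fold_examples(long i, long **p)   /* assume i is in R1 and p is in R2 */
  {
      c = a + b;      /* ADDL3 a, b, c     - one three-operand memory-to-memory add      */
      c = v[i + 4];   /* MOVL  v+16[R1], c - index mode folds both the +4 and the *4     */
      c = **p;        /* MOVL  @0(R2), c   - displacement-deferred absorbs one dereference */
  }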
I think that state of chip technology was more important. For
example 486 has RISC-like pipeline with load-ops, but load-ops
take the same time as two separate instructions. Similarly,
operations on memory take the same time as load-op-store.
So there was no execution-time gain from combined instructions
and clearly some complication compared to load/store
architecture.
Main speed gain of RISC came from having
pipeline on a chip (multichip processors were pipelined,
but expensive; earlier single-chip ones had no pipeline).
So load/store architecture (and no microcode) meant that
early RISC could offer good pipeline earlier.
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC)
antispam@fricas.org (Waldek Hebisch) wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
By contrast, making good use of the complex instructions of VAX in a
compiler consumed significant resources (e.g., Figure 2 of
https://dl.acm.org/doi/pdf/10.1145/502874.502876 reports about a
factor 1.5 more code in the code generator for VAX than for
RISC-II). Compilers at the time did not use the CISCy features
much, which is one reason why the IBM 801 project and later the
Berkeley RISC and Stanford MIPS proposed replacing them with a
load/store architecture.
VAX instructions are very complex and much of that complexity
is hard to use in compilers. But even an extremely simple compiler
can generate load-op combinations, decreasing the number of instructions.
A rather simple hack is enough to fold additions in address
arithmetic into an addressing mode. Also, operations with two or three
memory addresses are easy to generate from a compiler. I think
that chains of pointer dereferences in C should not be hard to
convert to an indirect addressing mode.
I think that the state of chip technology was more important. For
example, the 486 has a RISC-like pipeline with load-ops, but load-ops
take the same time as two separate instructions. Similarly,
operations on memory take the same time as load-op-store.
So there was no execution-time gain from the combined instructions,
and clearly some complication compared to a load/store
architecture.
In the specific case of the i486, with its small (8KB) unified I+D cache,
you will see a good gain from load+op combining, even if going by the cycle
counts in the manual they are the same.
For the Pentium, not necessarily so.
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:
VAX instructions are very complex and much of that complexity is hard
to use in compilers.
A lot of them mapped directly to common high-level operations. E.g.
MOVC3/
MOVC5 for string copying, and of course POLYx for direct evaluation of polynomial functions.
In a way, one could say that, in many ways, VAX machine language was a higher-level language than Fortran.
On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
----------------
So, writing things like:
y[55:48]=x[19:12];
And:
j=x[19:12];
Also a single instruction, or 2 or 3 in the fallback case (encoded as a
shift and mask).
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
Though, looking at the compiler code, it would be subject to the "side effects in lvalue may be applied twice" bug:
(*ct++)[19:12]=(*cs++)[15:8];
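For readers without the bitfield extension, a plain-C sketch of what the y[55:48]=x[19:12] assignment above boils down to in the shift-and-mask fallback case (the helper name is invented for the example):

#include <stdint.h>

/* Insert bits x[19:12] into bits y[55:48] by shifting and masking. */
static inline uint64_t insert_19_12_into_55_48(uint64_t y, uint64_t x)
{
    uint64_t f = (x >> 12) & 0xFFu;    /* extract x[19:12] */
    y &= ~(0xFFull << 48);             /* clear y[55:48]   */
    return y | (f << 48);              /* insert the field */
}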
On Fri, 7 Mar 2025 4:09:13 +0000, Lawrence D'Oliveiro wrote:
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:
VAX instructions are very complex and much of that complexity is hard
to use in compilers.
A lot of them mapped directly to common high-level operations. E.g.
MOVC3/
MOVC5 for string copying, and of course POLYx for direct evaluation of
polynomial functions.
In a way, one could say that, in many ways, VAX machine language was a
higher-level language than Fortran.
One could also say at that point in time that FORTRAN was not that high
of a high level language.
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Given a (very rough) estimate that each PAL replaced four standard
logic chips of similar size, my guess would be that it saved
them the equivalent of two to three circuit boards, not bad.
MMI's PAL (Programmable Array Logic) is a subset of a Programmable Logic Array.
Another striking thing is how densely the circuit boards are packed,
compared to the VAX boards one finds. I suspect they had access
to more layers of printed circuit board than DEC ten years earlier.
PAL has programmable AND matrix but a fixed OR matrix.
PLA has both AND and OR matrix programmable.
Mask-programmed PLAs had been available since 1970, and field-programmable
FPLAs became available in 1976 from a number of suppliers (e.g. Signetics).
https://en.wikipedia.org/wiki/Programmable_logic_array
I read somewhere that these were not used much because, in the
beginning, they were slow, big, expensive and difficult to program.
This is probably why they were not considered as a replacement for
the PAL chips for the MV/8000, had MMI failed - they were not up
to the job.
If one were building a RISC-style ISA CPU in 1975, they could be used
for decoding and for state machines for fetch, load/store, and the page-table walker.
I don't know the price.
They could have been used for the same things on the VAX 11/780. Does anybody know if they were?
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:
VAX instructions are very complex and much of that complexity is hard
to use in compilers.
A lot of them mapped directly to common high-level operations. E.g. MOVC3/ MOVC5 for string copying, and of course POLYx for direct evaluation of polynomial functions.
In a way, one could say that, in many ways, VAX machine language was a higher-level language than Fortran.
On 2025-03-07 12:34 p.m., MitchAlsup1 wrote:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
----------------
So, writing things like:
y[55:48]=x[19:12];
2 instructions in My 66000. One extract, one insert.
Ibid for Q+. The logic for an extract and insert as one operation might
add to the timing. Extract, sign/zero extend and copy back. Fields may
be different sizes.
And:
j=x[19:12];
Also a single instruction, or 2 or 3 in the fallback case (encoded as a
shift and mask).
1 instruction--extract (SLL or SLA)
Q+ has EXT/EXTU which is basically a SRL or SRA with mask applied
afterwards. PowerPC has a rotate-left-and-mask instruction. In my
opinion it makes more sense for extracts to be shifting right.
Lawrence D'Oliveiro wrote:
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:
VAX instructions are very complex and much of that complexity is hard
to use in compilers.
A lot of them mapped directly to common high-level operations. E.g.
MOVC3/
MOVC5 for string copying, and of course POLYx for direct evaluation of
polynomial functions.
How the VAX Lost Its POLY (and EMOD and ACB_floating too), 2011 https://simh.trailing-edge.com/docs/vax_poly.pdf
In a way, one could say that, in many ways, VAX machine language was a
higher-level language than Fortran.
And the decimal instructions for COBOL (also on some PDP-11's).
The only reason to add complex instructions like MOVC3, MOVC5 and
others (SKIPC, SPANC, etc.) is if hardware can do a better job than a
software subroutine. And you only add those instructions when you
know you can afford the hardware, not in anticipation that someday
we might do a better job.
The reason VAX and 8086 benefit from string instructions is because
they are sequential processors. It allows them to do decode once and
sit in a tight loop doing execute. But both still move byte-by-byte
and do not attempt to optimize memory access operations.
Also the sequencer is sequential so the loop counting and branch testing
each take microcycles.
So there is some benefit when comparing a VAX MOVC3 to a VAX subroutine,
but not compared to a pipelined TTL RISC.
If it is a pipelined RISC then decode is overlapped with execute
so there is no advantage to these complex instructions vs a RISC
subroutine doing the same in a loop.
And the RISC subroutine might be
faster because it can overlap the loop count and branch with memory
access.
In both cases the real advantage is when you can afford the HW to
optimize bus accesses as this is where the majority of cycles are spent.
When you can afford the HW optimizer then you add them.
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
Isn't that just 'bswap32' on x86, or REV32 on ARM64?
On 3/7/2025 11:34 AM, MitchAlsup1 wrote:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
----------------
So, writing things like:
y[55:48]=x[19:12];
2 instructions in My 66000. One extract, one insert.
1 instruction in this case...
The 3 sub-fields being, 36, 48, and 56.
The way I defined things does mean adding 1 to the high bit in the
encoding, so 63:56 would be expressed as 64:56, which nominally uses 1
more bit of range. Though, if expressed in 6 bits, the behavior as I
defined it effectively causes it to wrap modulo 64.
----------------------------------
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
In this particular case, there is also a SWAP.L instruction, but I was ignoring it for the sake of this example, and my compiler isn't that clever.
Unlike Verilog, in C mode it will currently require single-bit fetch to
use a notation like x[17:17], but this is more because a person is much
more likely to type "x[17]" by accident (such as by using the wrong
variable, a missing '*', or ...).
On Fri, 7 Mar 2025 4:09:13 +0000, Lawrence D'Oliveiro wrote:
In a way, one could say that, in many ways, VAX machine language was a
higher-level language than Fortran.
One could also say at that point in time that FORTRAN was not that high
of a high level language.
[Fortran] was high enough, right from the start, to abstract away a
_lot_ of the machine, while still being quite efficient.
Like, bitfield helpers were too weird/obscure, but hard-coding parts of
the CRC or stuff related to DES encryption and similar into the ISA is fine...
On Fri, 7 Mar 2025 16:57:31 -0600, BGB wrote:
Like, bitfield helpers were too weird/obscure, but hard-coding parts of
the CRC or stuff related to DES encryption and similar into the ISA is
fine...
I blame C. The fact that C does not have built-in constructs to make convenient use of variable bitfields seems to be the main excuse for not supporting them in hardware instruction sets.
And then in return, the lack of efficient support in hardware becomes an excuse for not having such constructs in the higher-level language.
I guess, while a person could do something like (in C):
_BitInt(1048576) bmp;
_Bool b;
...
b=(bmp>>i)&1; //*blarg* (shift here would be absurdly expensive)
This is likely to be rare vs more traditional strategies, say:
uint64_t *bmp;
int b, i;
...
b=(bmp[i>>6]>>(i&63))&1;
As well as the traditional strategy being a whole lot more efficient in
this case...
I guess the case could be made for a generic dense bit array.
Though, an open question is how one would define it in a way that is consistent with existing semantics rules.
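For comparison with the traditional strategy above, a minimal dense bit-array helper in plain C (the type and function names are invented for the example; allocation and bounds checking are omitted):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t *words;   /* nbits rounded up to a multiple of 64 */
    size_t    nbits;
} bitarray;

static int bitarray_get(const bitarray *ba, size_t i)
{
    return (ba->words[i >> 6] >> (i & 63)) & 1;
}

static void bitarray_set(bitarray *ba, size_t i, int v)
{
    uint64_t m = (uint64_t)1 << (i & 63);
    if (v) ba->words[i >> 6] |=  m;
    else   ba->words[i >> 6] &= ~m;
}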
long ago and far away ... comparing pascal to pascal front-end with
pl.8 back-end (3033 is 370 about 4.5MIPS)
Date: 8 August 1981, 16:47:28 EDT
To: wheeler
the 801 group here has run a program under several different PASCAL "systems". The program was about 350 statements and basically
"solved" SOMA (block puzzle..). Although this is only one test, and
all of the usual caveats apply, I thought the numbers were
interesting... The numbers given in each case are EXECUTION TIME ONLY (Virtual on 3033).
6m 30 secs PERQ (with PERQ's Pascal compiler, of course)
4m 55 secs 68000 with PASCAL/PL.8 compiler at OPT 2
0m 21.5 secs 3033 PASCAL/VS with Optimization
0m 10.5 secs 3033 with PASCAL/PL.8 at OPT 0
0m 5.9 secs 3033 with PASCAL/PL.8 at OPT 3
Looking at the Signetics 82S100: in 1976 it had a max access time of 50 ns
and dissipated 600 mW in a 28-pin DIP.
On Fri, 7 Mar 2025 22:25:12 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
Isn't that just 'bswap32' on x86, or REV32 on ARM64?
A degenerate version is:: but consider::
BITR Rd,Rs1,<1>
performs bit reversion, while::
BITR Rd,Rs1,<2>
reverses pairs of bits, ...
BITR Rs,Rs1,<16>
reverses halfwords.
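A plain-C model of that behaviour, purely for illustration (not an implementation of the actual instruction): reverse the order of g-bit groups in a 32-bit word, where g is a power of two; g=1 is a full bit reverse, g=8 a byte swap, g=16 a halfword swap.

#include <stdint.h>

static uint32_t group_reverse32(uint32_t x, unsigned g)
{
    unsigned n = 32 / g;                                /* group count   */
    uint32_t mask = (g >= 32) ? 0xFFFFFFFFu : ((1u << g) - 1u);
    uint32_t r = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t grp = (x >> (i * g)) & mask;
        r |= grp << ((n - 1 - i) * g);                  /* mirrored slot */
    }
    return r;
}
/* group_reverse32(v, 8) matches bswap32/REV32; (v, 1) is a full bit reverse. */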
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Mar 2025 22:25:12 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 7 Mar 2025 11:08:56 +0000, BGB wrote:
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
1 instruction:: BITR rd,rs1,<8>
Isn't that just 'bswap32' on x86, or REV32 on ARM64?
A degenerate version is:: but consider::
BITR Rd,Rs1,<1>
performs bit reversion, while::
BITR Rd,Rs1,<2>
reverses pairs of bits, ...
Is there an application for this particular variant?
BITR Rs,Rs1,<16>
reverses halfwords.
Since there generally aren't higher level language
constructs that encapsulate this behavior, how useful
is it in the real world? Does it justify the verif
costs, much less the engineering cost?
Bswap32/64 are genuinely useful in real world applications
(particularly networking) thus the presence in most modern instruction
sets.
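For example, the usual networking case is just host-to-network byte-order conversion, which compilers turn into a single byte-swap instruction on little-endian machines:

#include <arpa/inet.h>
#include <stdint.h>

/* Convert a host-order length field to network (big-endian) order before
   putting it on the wire; on little-endian x86/ARM this is a single
   bswap/REV32. */
uint32_t to_wire(uint32_t host_len)
{
    return htonl(host_len);
}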
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
He reports that the 82S100 generates the #CASRAM signal with a
propagation delay of 35ns in one direction and 25ns in the other, and
the #ROMH signal with a propagation delay of 25ns in both directions
(table 3.4). I guess that the 50ns are the worst case of anything you
can do with the 82S100.
He reports a current consumption of 102mA for the 82S100 (table 3.3),
which at 5V (the regular voltage at the time) is pretty close to the
600mW given in the data sheet. The rest of the board, including
several chips with much more logic (CPU, VIC, SID (sound), 2xCIA
(I/O)) , consumed at most 770mA in his measurements; most of the rest
was NMOS, while the 82S100 was TTL.
- anton
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
It looks like the C64's circuit design was one culprit.
Though I do remember back then hearing about failures over time with
other fuse programmable devices like PROMs.
Something about the sputter from the blown fuses.
He reports that the 82S100 generates the #CASRAM signal with a
propagation delay of 35ns in one direction and 25ns in the other, and
the #ROMH signal with a propagation delay of 25ns in both directions
(table 3.4). I guess that the 50ns are the worst case of anything you
can do with the 82S100.
Yes, and it sounds like the circuit design depends on a race condition between two logic paths to work. Big no-no.
He reports a current consumption of 102mA for the 82S100 (table 3.3),
which at 5V (the regular voltage at the time) is pretty close to the
600mW given in the data sheet. The rest of the board, including
several chips with much more logic (CPU, VIC, SID (sound), 2xCIA
(I/O)) , consumed at most 770mA in his measurements; most of the rest
was NMOS, while the 82S100 was TTL.
- anton
This is not a problem with the 82S100.
Whoever designed that circuit didn't know what they were doing.
One can't use any combinatorial logic circuit and expect exact timing.
The manufacturer specs indicate a range of speeds which depend on
things like variations in power supply voltage, load, temperature.
In the case of the 82S100 it is 35 ns typical, 50 ns max.
Also these are logic chains, so each gate adds its own variations.
The circuit should be designed so it works across all timing variations
which is what synchronization clocks and flip flops are for.
And even then flip flops have their own timing variations.
Notice the common factor here - MT/Commodore was making a lot of
"working but only barely" chips and Commodore used them internally to
save money and also sold them to others which used them because, well,
they were frequently the cheapest.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
The MC68020 had instructions to access bit-fields that cross word
boundaries.
On Sun, 9 Mar 2025 1:27:19 +0000, Torbjorn Lindgren wrote:
Notice the common factor here - MT/Commodore was making a lot of
"working but only barely" chips and Commodore used them internally to
save money and also sold them to others which used them because, well,
they were frequently the cheapest.
Radio Shack would buy for the TRS-80 every Z80 that did not make the 2 MHz
operating frequency. They used something around 1.87 MHz so that the
CPU clock and the TV clock were the same clock.
On Sat, 8 Mar 2025 18:03:38 +0000, EricP wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
It looks like the C64's circuit design was one culprit.
Though I do remember back then hearing about failures over time with
other fuse programmable devices like PROMs.
Something about the sputter from the blown fuses.
A laser blasts a short wire so that there is no longer any connection.
Then, in use, the electrical forces cause the still present aluminum
wires to reconstruct themselves making contact and changing the state.
The blowable wire is still immersed within an oxide layer, preventing
the blown aluminum atoms from "really going anywhere" allowing small
forces to reassemble the wire.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Looking at the Signetics 82S100 in 1976 has max access of 50 ns, 600 mw
in a 28 pins dip.
The Commodore 64 used a 82S100 or compatible for various purposes,
especially for producing various chip select and RAM control signals
from the addresses produced by the CPU or the VIC (graphics chip).
Thomas Giesel wrote a very detailed report
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf> on
the original PLAs and their behaviour, in order to replace it
(apparently it's a chip that was failure-prone).
AFAIK the Signetics 82S100 isn't failure-prone in the C64; the MOS
Technology (i.e. Commodore) *clone* that they switched to for cost
reasons IS known to be failure-prone. Those are the reason there are
lots of PLA replacement projects.
If you have an actual Signetics device in a C64 it'll very likely be
fine unless the power supply failed and fed everything too-high
voltages, which is unfortunately a common failure mode of the C64 power
brick. This PSU failure usually destroys most or all of the memory
chips, the SID, and one or, more likely, multiple of the CPU, PLA and ROMs.
Other things with known high failure rates are the MOS Technology 74xx
clones and the MT memory. These failures also include MT-branded memory
chips of that specific type when used in non-Commodore items like PC
clones, so it's not just the C64.
Notice the common factor here - MT/Commodore was making a lot of
"working but only barely" chips and Commodore used them internally to
save money and also sold them to others which used them because, well,
they were frequently the cheapest.
On 3/7/2025 9:28 PM, MitchAlsup1 wrote:
On Sat, 8 Mar 2025 2:49:50 +0000, BGB wrote:
------------------------
I guess, while a person could do something like (in C):
_BitInt(1048576) bmp;
_Bool b;
...
b=(bmp>>i)&1; //*blarg* (shift here would be absurdly expensive)
This is likely to be rare vs more traditional strategies, say:
uint64_t *bmp;
int b, i;
...
b=(bmp[i>>6]>>(i&63))&1;
Question: How do you handle the case where the bit vector is an odd
number of bits in width ?? Say <3, 5, 7, 17, ...>
It is rare for bitmap bits to not be a power of 2...
I would guess, at least for C, something like (for 3 bits):
uint32_t *bmp;
uint64_t bv;
int i, b, bp;
...
bp=i*3;
bv=*(uint64_t *)(bmp+(bp>>5));
b=(bv>>(bp&31))&7;
Could apply to anything up to 31 bits.
Could do similar with __int128 (or uint128_t), which extends it up to 63 bits.
------------
Mc 68020 had instructions to access bit-fields that cross word
boundaries.
I guess one could argue the use-case for adding a generic funnel shift instruction.
If I added it, it would probably be a 64-bit encoding (generally needed
for 4R).
Architecture is more about what gets left OUT than what gets left IN.
Well, except in this case it was more a question of trying to fit it in
with C semantics (and not consideration for more ISA features).
There are still some limitations, for example:
In my current implementation, CSR's are very limited (may only be used
to load and store CSRs; not do RMW operations on CSRs).
Though, have noted that seemingly some number of actual RISC-V cores
also have this limitation.
A more drastic option might be to try to rework the hardware interfaces
and memory map hopefully enough to try to make it possible to run an OS
like Linux, but there doesn't really seem to be a standardized set of hardware interfaces or memory map defined.
Some SoCs, though, seem to use a map like:
00000000..0000FFFF: ROM goes here.
00010000..0XXXXXXX: RAM goes here.
ZXXXXXXX..FFFFFFFF: Hardware / MMIO
They seem to also be asking for a UEFI based boot process, but this
would require having a bigger BootROM (can't likely fit a UEFI
implementation into 32K). Seems that the idea is to have the UEFI BIOS
boot the kernel directly as an ELF image (traditionally UEFI was always PE/COFF based?...).
There is a probable need to move away from the "BJX2" name, which as
noted, has some unfortunate connotations (turns out it was also used for
a lewd act) and seems to be triggering to Google's automatic content filtering (probably for a similar reason).
On 3/10/2025 7:53 PM, MitchAlsup1 wrote:
-------------------
I guess one could argue the use-case for adding a generic funnel shift
instruction.
My 66000 has CARRY-SL/SR which performs a double wide operand shifted
by a single wide count (0..63) and produces a double wide result {IO}.
OK.
If I added it, it would probably be a 64-bit encoding (generally needed
for 4R).
By placing the width in position {31..37} you can compress this down
to 3-Operand.
It is 3-operand if being used as a 128-bit shift op.
But funnel shift operators imply 3 independent inputs and 1 output.
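For reference, a plain-C model of a 64-bit funnel shift (three inputs: high word, low word, and count; one result; illustrative only, not any particular ISA's encoding): take the 128-bit concatenation hi:lo, shift right by s, and keep the low 64 bits.

#include <stdint.h>

/* Funnel shift right: low 64 bits of (hi:lo) >> s, for 0 <= s <= 63. */
static uint64_t funnel_shr(uint64_t hi, uint64_t lo, unsigned s)
{
    s &= 63;
    return (s == 0) ? lo : ((lo >> s) | (hi << (64 - s)));
}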
----------
Architecture is more about what gets left OUT than what gets left IN.
Well, except in this case it was more a question of trying to fit it in
with C semantics (and not consideration for more ISA features).
Clearly, you want to support C semantics--but you can do this in a way
that also supports languages with real bit-field support.
---------------
Yeah.
Amidst debugging and considering Verilog support...
There are still some limitations, for example:
In my current implementation, CSR's are very limited (may only be used
to load and store CSRs; not do RMW operations on CSRs).
My 66000 only has 16 CPU CRs, and even these are R/W through MMI/O
space. All the other (effective) CRs are auto loaded in line quanta.
This mechanism allows one CPU to figure out what another CPU is up to
simply by meandering through its CRs...
I had enough space for 64 CRs, but only a small subset are actually
used. Some more had space reserved, but were related to non-implemented features.
RISC-V has a 12-bit CSR space, of which:
Some map to existing CRs;
My whole CR space was stuck into an implementation-dependent range.
Some read-only CSRs were mapped over to CPUID.
Of which, all of the CPUID indices were also mapped into CSR space.
Seemingly lacks defined user CSRs for timer or HW-RNG, which do exist in
my case. It is very useful to be able to access a HW timer in userland,
as otherwise it would waste a lot of clock-cycles using system calls for "clock()" and similar.
Though, have noted that seemingly some number of actual RISC-V cores
also have this limitation.
A more drastic option might be to try to rework the hardware interfaces
and memory map hopefully enough to try to make it possible to run an OS
like Linux, but there doesn't really seem to be a standardized set of
hardware interfaces or memory map defined.
Some amount of SOC's though seem to use a map like:
00000000..0000FFFF: ROM goes here.
00010000..0XXXXXXX: RAM goes here.
ZXXXXXXX..FFFFFFFF: Hardware / MMIO
My 66000::
00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
10 0000000000000000..FFFFFFFFFFFFFFFF: config
11 0000000000000000..FFFFFFFFFFFFFFFF: ROM
Whatever you are trying to do, you won't run out of address space until
64 bits becomes insufficient. Note: all HW interfaces are in config
space
and all CRs are in MMI/O space.
There seems to be a lot here defined in terms of 32-bit physical spaces, including on 64-bit targets.
Though, thus far, my existing core also has pretty all of its physical
map in 32-bit space.
The physical ranges from 0001_00000000 .. 7FFF_FFFFFFFF currently
contain a whole lot of nothing.
I once speculated on the possibility of special hardware to memory-map
the whole SDcard into physical space, but nothing has been done yet (and
such a hardware interface would be a lot more complicated than my
existing interface).
An intermediate option being to expand the SPI interface to support 256
bit bursts.
Say:
P_SPI_QDATA0..P_SPI_QDATA3
It appears this has already been partly defined (though not fully
implemented in the 256-bit case).
Where, the supported XMIT sizes are:
8 bit: Single Byte
64 bit: 8 bytes
256 bit: 32 bytes
With larger bursts mostly to reduce the amount of round-trip delay over
the bus.
On 2025-03-10 8:53 p.m., MitchAlsup1 wrote:
On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:
------------------------
My 66000::
00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
10 0000000000000000..FFFFFFFFFFFFFFFF: config
11 0000000000000000..FFFFFFFFFFFFFFFF: ROM
How does one reference DRAM vs MMI/O at the same address using a LD / ST
instruction?
The Q+ CPU just uses a 64-bit address range. The config space is specified
in a CR defaulting to FFFFFFFFDxxxxxxx. The TLB is set up at boot to
access ROM at FFFFFFFFFFFCxxxxx. Otherwise there is no distinction between
addresses. There is a region table in the system that describes up to
eight distinct regions.
Whatever you are trying to do, you won't run out of address space until
64 bits becomes insufficient. Note: all HW interfaces are in config
space
and all CRs are in MMI/O space.
Are there any CRs accessible with any instructions besides LD / ST?
------------
They seem to also be asking for a UEFI based boot process, but this
would require having a bigger BootROM (can't likely fit a UEFI
implementation into 32K). Seems that the idea is to have the UEFI BIOS
boot the kernel directly as an ELF image (traditionally UEFI was always
PE/COFF based?...).
Boot ROM should be big enough that no BOOT ROM will ever exceed its
size.
--------------
There is a probable need to move away from the "BJX2" name, which, as
noted, has some unfortunate connotations (turns out it was also used for
a lewd act) and seems to trigger Google's automatic content
filtering (probably for a similar reason).
Coming up with names is surprisingly difficult. I got into a discussion
with a colleague a while ago about this. They were having difficulty
coding something, and it turned out to be simply a matter of what names to
choose for routines.
Hilarious--and reason enough to change names.
When you do change names, can you spell LD and ST instead of MOV ??
Yes, please, LD / ST; it is so much clearer what is going on. Less trouble getting confused by the placement of operands.
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
A lot of people swear by:
movl %eax, 16(%rdi)
....
On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:
On 2025-03-10 8:53 p.m., MitchAlsup1 wrote:
On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:
------------------------
My 66000::
00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
10 0000000000000000..FFFFFFFFFFFFFFFF: config
11 0000000000000000..FFFFFFFFFFFFFFFF: ROM
How does one reference DRAM vs MMI/O at the same address using a LD / ST
instruction?
The MMU translates the virtual address to a universal address.
The PTE supplies the extra bits.
Q+ CPU just uses a 64-bit address range. The config space is specified
in a CR defaulting to FFFFFFFFDxxxxxxx The TLB is setup at boot to
access ROM at FFFFFFFFFFFCxxxxx Otherwise there is no distinction with
addresses. There is a region table in the system that describes up to
eight distinct regions.
Every major block in my architecture has ports in config space that
smell just like that of a device on PCIe having said control block.
My thought was that adding all these to the config name space might
cramp any fixed (or programmable) partition. So, the easiest thing
is to give it its own big space.
Then every device header gets 1 or more pages of address space for
its own control registers. PCIe is now a 42-bit address space::
segment, bus, device, function, xreg, reg; and likely to grow, as
AHCI can consume a whole PCIe segment by itself.
Whatever you are trying to do, you won't run out of address space until
64 bits becomes insufficient. Note: all HW interfaces are in config
space
and all CRs are in MMI/O space.
Are there any CRs accessible with any instructions besides LD / ST?
CRs accessible via HR instruction theoretically == 40
CRs accessible via HR instruction at a privilege >= 16
Basically, HR provides access to this threads critical CRs
{IP, Root, ASID, CSP, exception ctrl, inst ctrl, interrupts ...}
and has access to the CPU SW stack according to privilege.
------------
They seem to also be asking for a UEFI based boot process, but this
would require having a bigger BootROM (can't likely fit a UEFI
implementation into 32K). Seems that the idea is to have the UEFI BIOS
boot the kernel directly as an ELF image (traditionally UEFI was always
PE/COFF based?...).
Boot ROM should be big enough that no BOOT ROM will ever exceed its
size.
--------------
There is a probable need to move away from the "BJX2" name, which as
noted, has some unfortunate connotations (turns out it was also used
for
a lewd act) and seems to be triggering to Google's automatic content
filtering (probably for a similar reason).
Coming up with names is surprisingly difficult. I got into a discussion
with a colleague a while ago about this. They were having difficulty
coding something an it turned out to be simply what names to choose for
routines.
Hilarious--and reason enough to change names.
When you do change names, can you spell LD and ST instead of MOV ??
Yes, please LD / ST it is so much clearer what is going on. Less trouble
getting confused by the placement of operands.
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
I agree.
On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
When you do change names, can you spell LD and ST instead of MOV ??
Yes, please LD / ST it is so much clearer what is going on. Less trouble >>> getting confused by the placement of operands.
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
I agree.
I wonder if the different preferences are at least partially due to
whether the person has a hardware or a software background?
The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However,
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is interesting.
A somewhat related question, if one wants to copy the contents of R3
into R2, is that a load or a store? :-)
On Tue, 11 Mar 2025 18:15:06 +0000, Stephen Fuld wrote:
On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
When you do change names, can you spell LD and ST instead of MOV ??
Yes, please LD / ST it is so much clearer what is going on. Less
trouble
getting confused by the placement of operands.
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
I agree.
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background?
Even when both LD and ST are written MOV there is a different OpCode
for the inbound MOV versus the outbound MOV, so, in effect, they are
really different instructions requiring different pipeline semantics.
Only (O N L Y) when one has a memory to memory move instruction can
the LDs and STs be MOVs. VAX had this, BJX* does not.
One should argue that different pipeline semantics requires a different
OpCode--and you already have said OpCode having different bit patterns,
different signedness semantics, different translation access rights,
... At the HW level about the only thing LD has in common with ST is
the way the address is generated--although MIPS did something different.
The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
MARY and MARY2 used X = Y to mean the value in X is deposited into Y.
Both were left to right only languages. This should surprise most !! {{Although to be fair, Mary used the =: operator to perform assign.}}
On 11/03/2025 18:15, Stephen Fuld wrote:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea
is that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
No, it is logically a copy.
On 3/11/2025 10:44 AM, MitchAlsup1 wrote:
On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:
I always put the memory operand second, which breaks the pattern of
having the destination operand first. Otherwise the destination is
first.
I go cross-eyed reading code that is a whole lot of moves.
I agree.
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
On Tue, 11 Mar 2025 6:14:31 +0000, BGB wrote:
A lot of people swear by:
movl %eax, 16(%rdi)
....
More swear at it than for it.
Most likely: those who swear by it have brain damage by x86-ism.
On Tue, 11 Mar 2025 2:49:00 +0000, Robert Finch wrote:
On 2025-03-10 8:53 p.m., MitchAlsup1 wrote:
On Mon, 10 Mar 2025 22:40:55 +0000, BGB wrote:
------------------------
My 66000::
00 0000000000000000..FFFFFFFFFFFFFFFF: DRAM
01 0000000000000000..FFFFFFFFFFFFFFFF: MMI/O
10 0000000000000000..FFFFFFFFFFFFFFFF: config
11 0000000000000000..FFFFFFFFFFFFFFFF: ROM
How does one reference DRAM vs MMI/O at the same address using a LD / ST
instruction?
The MMU translates the virtual address to a universal address.
The PTE supplies the extra bits.
Q+ CPU just uses a 64-bit address range. The config space is specified
in a CR defaulting to FFFFFFFFDxxxxxxx The TLB is setup at boot to
access ROM at FFFFFFFFFFFCxxxxx Otherwise there is no distinction with
addresses. There is a region table in the system that describes up to
eight distinct regions.
Every major block in my architecture has ports in config space that
smell just like that of a device on PCIe having said control block.
My thought was that adding all these to the config name space might
cramp any fixed (or programmable) partition. So, the easiest thing
is to give it its own big space.
Then every device header gets 1 or more pages of address space for
its own control registers. PCIe is now a 42-bit address space::
segment, bus; device; function, xreg, reg and likely to grow as
ACHI can consume a whole PCIe segment by itself.
x86 asm, as used in MASM and DEBUG, was the first assembler language I
used; I found it very familiar that
mov ax,bx
or
mov ax,[bx]
or
mov ax,[bx+1234]
all correspond nicely to
a = b
a = *b
a = b[1234]
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 11 Mar 2025 6:14:31 +0000, BGB wrote:
A lot of people swear by:
movl %eax, 16(%rdi)
....
More swear at it than for it.
Most likely: those who swear by it have brain damage by x86-ism.
It's the oververbose, bas-ackwards intel syntax that one does swear at.
The AT&T syntax that BGB noted above is far superior.
YMMVO.
No software guy talks about "pipeline semantics" :-)
I.e. having the target on the left is the only one that makes sense to
me.
A somewhat related question, if one wants to copy the contents of R3
into R2, is that a load or a store? :-)
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
Trouble is that such "common" operations have rather low frequency
compared to simple stuff. They are really library functions.
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
I think it may depend on first experiences with assembler language;
On 3/11/2025 12:07 PM, moi wrote:
On 11/03/2025 18:15, Stephen Fuld wrote:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea
is that when hardware guys see the instruction, they think in terms
of register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
Putting the destination on the right is also fairly common in general in
Unix style command notation:
dosomething args infile outfile
prog1 infile | prog2 > outfile
On 3/11/2025 12:57 PM, MitchAlsup1 wrote:
--------------
My whole space is mapped by BAR registers as if they were on PCIe.
Not a thing yet.
But, PCIe may need to exist for Linux or similar.
But, may still be an issue as Linux could only use known hardware IDs,
and it is a question what IDs it would know about (and if any happen to
map closely enough to my existing interfaces).
Otherwise, it would be necessary to write custom HW drivers, which would
add a lot more pain to all of this.
Some read-only CSRs were mapped over to CPUID.
I don't even have a CPUID--if you want this you go to config space
and read the configuration lists and extended configuration lists.
Errm, so vendor/Hardware ID's for each feature flag...
30 and 31 give the microsecond timer and HW-RNG, which are more relevant
to user-land.
32..63: Currently unused.
There is also a cycle counter (along vaguely similar lines to x86
RDTSC), but for many uses a microsecond counter is more useful (where
the timer-tick count updates at 1.0 MHz, and all cores would have the
same epoch).
On x86, trying to use RDTSC as a timer is rather annoying as it may jump around and goes at a different rate depending on current clock speed.
This scheme will not roll over for around 585k years (for a 64-bit microsecond timer), so "good enough".
Conceptually, this time would be in UTC, likely with time-zones handled
by adding another bias value.
This can in turn be used to derive the output from "clock()" and
similar.
Also, there are relatively few software timing tasks where we have much reason to care about nanoseconds. For many tasks, milliseconds are sufficient, but there are some things where microseconds matter.
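As a sketch of the last point, deriving clock()-style output from such a free-running 64-bit microsecond counter is just a subtraction; read_usec() below is a hypothetical accessor for the user-readable timer, not an existing API, and CLOCKS_PER_SEC of 1000000 is assumed.

#include <stdint.h>

extern uint64_t read_usec(void);    /* hypothetical user-land timer read */

static uint64_t start_usec;         /* captured once at program start    */

/* With CLOCKS_PER_SEC == 1000000, clock() is just elapsed microseconds. */
long my_clock(void)
{
    return (long)(read_usec() - start_usec);
}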
Of which, all of the CPUID indices were also mapped into CSR space.
CPUID is soooooo pre-PCIe.
Dunno.
Mine is different from x86, in that it mostly functions like read-only registers.
RISC-V land seemingly exposes a microsecond timer via MMIO instead, but
this is much less useful as this means needing to use a syscall to fetch
the current time, which is slow.
Doom manages to fetch the current time frequently enough that doing so
via a syscall has a visible effect on performance.
My 66000 does not even have a 32-bit space to map into.
You can synthesize such a space by not using any of the
top 32-address bits in PTEs--but why ??
32-bit space is just the first 4GB of physical space.
But, as-is, there is pretty much nothing outside of the first 4GB.
The actually in use MMIO space is also still 28 bits.
The VRAM maps 128K in MMIO space, but in retrospect probably should have
been more. When I designed it, I didn't figure there would have been
more than 128K. The RAM backed framebuffer can be bigger though, but not
too much bigger, as then screen refresh starts getting too glitchy (as
it competes with the CPU for the L2 cache, but is more timing
sensitive).
My interconnect bus is 1 cache line (512-bits) per cycle plus
address and command.
My bus is 128 bits, but MMIO operations are 64-bits.
Where, for MMIO, every access involves a whole round-trip over the bus (unlike for RAM-like access, where things can be held in the L1 cache).
In theory, MMIO operations could be widened to allow 128-bit access, but haven't done so. This would require widening the data path for MMIO
devices.
Can note that when the request goes onto the MMIO bus, data narrows to
64-bit and address narrows to 28 bits. Non-MMIO range requests (from the ringbus) are not allowed onto the MMIO bus, and the MMIO bus will not
accept any new requests until the prior request has either finished or
timed out.
I still haven't seen any good reason to move to C++.
Some people (at the C people): use C++, it has features...
Others (at the C++ people): Use Rust, it is less of a trash fire.
Next was PDP-11 where MOV R1,R2 copies R1 into R2.
On 11/03/2025 18:15, Stephen Fuld wrote:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
No, it is logically a copy.
On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the
argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
There is a language (C++) which has introduced reference operators that distinguish between “move semantics” versus “copy semantics”.
No, I haven’t got my head around it either.
On Tue, 11 Mar 2025 19:07:08 +0000, moi wrote:
On 11/03/2025 18:15, Stephen Fuld wrote:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is >>> that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
No, it is logically a copy.
But does it copy X into Y or copy Y into X ??
On Tue, 11 Mar 2025 22:15:30 -0000 (UTC), Waldek Hebisch wrote:
Trouble is that such "common" operations have rather low frequency
compared to simple stuff. They are really library functions.
Inline library functions. And they did contribute to keeping the code compact, as Bell said.
One thing, though, I don’t think the POLYx instruction was all that
useful. It is typical, when computing functions approximated by
polynomials, for the polynomial to actually be infinite. And so you have
a loop that computes each term in turn, accumulates it to the result,
works out an estimate of the remaining error, and stops only when this
falls below some threshold.
This cannot be expressed by some fixed-length table of coefficients, as
the POLYx instruction expects.
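To spell out the contrast, here is a plain-C sketch of both styles (names and tolerance are illustrative): a POLYx-style instruction evaluates a fixed-degree polynomial by Horner's rule over a coefficient table, while the run-until-converged style described above needs a data-dependent loop.

#include <math.h>

/* Fixed coefficient table, degree known in advance: the Horner-style
   evaluation a POLYx-like instruction performs. */
static double horner(const double *c, int degree, double x)
{
    double r = c[degree];
    for (int i = degree - 1; i >= 0; i--)
        r = r * x + c[i];            /* one multiply-add per coefficient */
    return r;
}

/* Sum a series term by term until the next term is below a tolerance;
   the trip count depends on x, so no fixed-length table fits. */
static double series_exp(double x, double tol)
{
    double sum = 1.0, term = 1.0;
    for (int n = 1; fabs(term) > tol; n++) {
        term *= x / n;
        sum  += term;
    }
    return sum;
}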
On Tue, 11 Mar 2025 11:15:06 -0700, Stephen Fuld wrote:
A somewhat related question, if one wants to copy the contents of R3
into R2, is that a load or a store? :-)
The true hardware engineer knows that it is neither, it is merely a
register rename. ;)
On 3/11/2025 5:44 PM, Lawrence D'Oliveiro wrote:
On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the
argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
There is a language (C++) which has introduced reference operators that
distinguish between “move semantics” versus “copy semantics”.
No, I haven’t got my head around it either.
I still haven't seen any good reason to move to C++.
On Tue, 11 Mar 2025 18:58:11 -0500, BGB wrote:
I still haven't seen any good reason to move to C++.
No disagreement here. ;)
Some people (at the C people): use C++, it has features...
It appears the GNU C compiler itself is written in C++ now.
Others (at the C++ people): Use Rust, it is less of a trash fire.
Google started a project called “Carbon” a little while back, kind of a C++ done right, with all the accumulated legacy crap removed.
Wonder what happened to it ...
Of course, it is possible that the VAX designers understood
the performance implications of their decisions (or rather
the meager speed gain from complex instructions), but bet
that a "nice" instruction set would tie programs to their
platform.
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is interesting.
A somewhat related question, if one wants to copy the contents of R3
into R2, is that a load or a store? :-)
There is a language (C++) which has introduced reference operators that
distinguish between “move semantics” versus “copy semantics”.
One of the early programming languages I came across was POP-2. This was
fully dynamic and heap-based, like Lisp, but also had an operand stack. So
a simple assignment statement looked like
a -> b;
but this could actually be written as two separate statements:
a;
-> b;
The first one pushed the value of a on the stack, the second one popped it
off and stored it in b.
This made it easy to do things like swap variable values:
a, b -> a -> b;
antispam@fricas.org (Waldek Hebisch) writes:
Of course, it is possible that the VAX designers understood
the performance implications of their decisions (or rather
the meager speed gain from complex instructions), but bet
that a "nice" instruction set would tie programs to their
platform.
I don't think that they fully understood the performance implications,
but I believe that creating an appealing environment for software
developers was a major consideration of the architects: For the assembly-language programmers, provide orthogonality; that also makes
it easy to write compilers (optimality in some form is a different
story). The much-criticized VAX CALL instruction is designed for a
software ecosystem where various languages can call each other, there
exists a common debugger for all of them, etc. I am sure that they
were aware that this call instruction was expensive, but they expected
that it was worth the cost, and also expected that implementors would
reduce the cost to below what a sequence of simpler instructions would
cost (looking at REP MOVSB in many generations of Intel and AMD CPUs,
we see such expectations disappointed; I have not measured recent generations, though).
- anton
On Wed, 12 Mar 2025 00:26:50 -0000 (UTC), John Levine wrote:
Next was PDP-11 where MOV R1,R2 copies R1 into R2.
What about CMP (compare) versus SUB (subtract)? CMP does the subtract
without updating the destination operand, only setting the condition
codes. But are the operands the same way around as SUB (i.e. backwards for
comparison purposes) or are they flipped?
...I am sure that they
were aware that this call instruction was expensive, but they expected
that it was worth the cost, and also expected that implementors would
reduce the cost to below what a sequence of simpler instructions would
cost (looking at REP MOVSB in many generations of Intel and AMD CPUs,
we see such expectations disappointed; I have not measured recent
generations, though).
It depends on what you call "a sequence of simpler instructions".
For R/E/CX above of, say, a dozen, 'rep movsb' is faster than a simple
non-unrolled loop of single-byte loads and stores on pretty much any
Intel or AMD CPU since the dawn of time. If we are talking about this
century, then, at least for Intel, I think that we can claim that the
same is true even relative to a simple loop of 32-bit loads and stores.
If we replace a dozen with a hundred or three then it will become true
for a loop of 64-bit loads/stores as well.
Or, maybe, in your book 5KB of elaborate code that contains unrolled
and non-unrolled loops of YMM, XMM, Rxx, Exx, and byte memory accesses
is still considered 'a sequence of simpler instructions'?
If that is the case then I am not going to argue.
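For reference, the kinds of "simple loops" being compared against rep movsb here are nothing more than the following (illustrative only):

#include <stddef.h>
#include <stdint.h>

/* Non-unrolled single-byte copy loop. */
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Non-unrolled 64-bit copy loop (count given in 8-byte words). */
static void copy_words(uint64_t *dst, const uint64_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}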
Michael S <already5chosen@yahoo.com> writes:
...I am sure that they
were aware that this call instruction was expensive, but they
expected that it was worth the cost, and also expected that
implementors would reduce the cost to below what a sequence of
simpler instructions would cost (looking at REP MOVSB in many
generations of Intel and AMD CPUs, we see such expectations
disappointed; I have not measured recent generations, though).
It depends on what you call "a sequence of simpler instructions".
For R/E/CX above of, say, a dozen, 'rep movsb' is faster than a simple
non-unrolled loop of single-byte loads and stores on pretty much any
Intel or AMD CPU since the dawn of time. If we are talking about this
century, then, at least for Intel, I think that we can claim that the
same is true even relative to a simple loop of 32-bit loads and
stores. If we replace a dozen with a hundred or three then it will
become true for a loop of 64-bit loads/stores as well.
Or, maybe, in your book 5KB of elaborate code that contains unrolled
and non-unrolled loops of YMM, XMM, Rxx, Exx, and byte memory
accesses is still considered 'a sequence of simpler instructions'?
If that is the case then I am not going to argue.
My experiments were with the code in
<https://github.com/AntonErtl/move/>.
I posted performance results in <2017Sep19.082137@mips.complang.tuwien.ac.at> <2017Sep20.184358@mips.complang.tuwien.ac.at> <2017Sep23.174313@mips.complang.tuwien.ac.at>
My routines were generally faster than rep movsb, except for pretty
large blocks (16KB).
The longest of the routines is ssememmove at 275 bytes.
I expect that an avx512memmove would be quite a bit smaller, thanks to predication, but I have not yet written that nor measured how that
performs.
- anton
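As a rough sketch of what the predication buys (assuming AVX-512BW; this is not code from the repository above, just an illustration), a whole small copy or a loop tail can be handled by one masked load and one masked store instead of a ladder of size cases:

#include <immintrin.h>
#include <stddef.h>

/* Copy n < 64 bytes with a single masked load/store pair (AVX-512BW). */
static void copy_small_avx512(void *dst, const void *src, size_t n)
{
    __mmask64 m = ((__mmask64)1 << n) - 1;      /* low n bits set */
    __m512i   v = _mm512_maskz_loadu_epi8(m, src);
    _mm512_mask_storeu_epi8(dst, m, v);
}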
On 3/11/2025 7:51 PM, MitchAlsup1 wrote:
On Tue, 11 Mar 2025 23:58:11 +0000, BGB wrote:
On 3/11/2025 5:44 PM, Lawrence D'Oliveiro wrote:
On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the
argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
There is a language (C++) which has introduced reference operators that
distinguish between “move semantics” versus “copy semantics”.
No, I haven’t got my head around it either.
I still haven't seen any good reason to move to C++.
C++ is for those situations where you want to write a small amount of
code and have it compile into a vast string of instructions.
Yeah, one can use iostream and have a trivial "hello world" type program
have build times and binary size like it was something quite substantial...
BGB <cr88192@gmail.com> writes:
On 3/11/2025 7:51 PM, MitchAlsup1 wrote:
On Tue, 11 Mar 2025 23:58:11 +0000, BGB wrote:
On 3/11/2025 5:44 PM, Lawrence D'Oliveiro wrote:
On Tue, 11 Mar 2025 13:39:55 -0700, Stephen Fuld wrote:
On 3/11/2025 12:07 PM, moi wrote:
No, it is logically a copy.
While that is true, I don't think anyone is talking about a "copy" op
code. :-) I had thought about mentioning in the software part of the
argument that COBOL actually has a "move" verb to accomplish that, i.e.
"Move A to B." even though you are technically right that it is a copy.
There is a language (C++) which has introduced reference operators that
distinguish between “move semantics” versus “copy semantics”.
No, I haven’t got my head around it either.
I still haven't seen any good reason to move to C++.
C++ is for those situations where you want to write a small amount of
code and have it compile into a vast string of instructions.
Yeah, one can use iostream and have a trivial "hello world" type program
have build times and binary size like it was something quite substantial...
You don't have to use iostream. vsnprintf/snprintf/printf all work
fine in C++ code and are far more efficient (and far less verbose).
Use a subset of C++ (C with classes) and the resulting code is
quite compact, but you still get data encapsulation and
inheritance (with a minor perf hit for virtual functions).
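[As a rough illustration of that style (hypothetical code, not from anyone in the thread): printf-family I/O plus plain classes with one virtual function, and nothing from <iostream>.]

#include <cstdio>

class Shape {
public:
    virtual ~Shape() {}
    virtual double area() const = 0;   // the "minor perf hit": one vtable call
};

class Rect : public Shape {
    double w_, h_;                     // data encapsulation: members are private
public:
    Rect(double w, double h) : w_(w), h_(h) {}
    double area() const override { return w_ * h_; }
};

int main()
{
    Rect r(3.0, 4.0);
    Shape *s = &r;
    std::printf("area = %.1f\n", s->area());  // printf instead of std::cout
    return 0;
}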
On Wed, 12 Mar 2025 11:28:36 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My experiments were with the code in
<https://github.com/AntonErtl/move/>.
None of those are the simple loops that I mentioned above.
I posted performance results in
<2017Sep19.082137@mips.complang.tuwien.ac.at>
<2017Sep20.184358@mips.complang.tuwien.ac.at>
<2017Sep23.174313@mips.complang.tuwien.ac.at>
My routines were generally faster than rep movsb, except for pretty
large blocks (16KB).
Idiots from corporate IT blocked http://al.howardknight.net/
So, link to google groups
or, if posts are relatively recent, to https://www.novabbs.com/devel/thread.php?group=comp.arch
would be helpful.
I don't know why gnu memcpy is huge. I don't even know if it is
really *that* huge. But several KB is a number that I had seen
stated by other people.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
One of the early programming languages I came across was POP-2. This
was fully dynamic and heap-based, like Lisp, but also had an operand
stack.
In Forth you can define VALUEs that work like these POP-11 variables.
My first programming was on the TI-58C programmable calculator, which
has RCL (recall) and STO (store).
https://isocpp.org/wiki/faq/value-vs-ref-semantics
I don't know why gnu memcpy is huge.
The much-criticized VAX CALL instruction is designed for a software
ecosystem where various languages can call each other, there exists a
common debugger for all of them, etc.
I am sure that they were aware that this call instruction was
expensive, but they expected that it was worth the cost ...
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
I wonder if the different preferences is at least partially due to
whether the person has a hardware or a software background? The idea is
that when hardware guys see the instruction, they think in terms of
register ports (read versus write), what is required of the memory
system (somewhat different for loads versus stores), etc. However
software guys think of a language construct, e.g. X = Y, which is
logically a move. I don't know if this is right, but I think it is
interesting.
I am a software person. When talking about register-memory copies, I
prefer to talk about load and store operations, whether I talk about
assembly language (even one where the mnemonic for these operations is
MOV) or C; in Forth the spoken names for these operations are "fetch" (written: @) and "store" (written: !).
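[A small, hypothetical illustration of the point: even when the source language writes a "move"-looking assignment, the ISA-level reality is a load followed by a store. The assembly shown in the comment is typical gcc -O1 output for x86-64; register choice is illustrative.]

long X, Y;

void assign(void)
{
    X = Y;
    /* Typical x86-64 code:
         movq  Y(%rip), %rax    # load  -- Forth @ ("fetch")
         movq  %rax, X(%rip)    # store -- Forth ! ("store")
    */
}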
Michael S <already5chosen@yahoo.com> writes:
On Wed, 12 Mar 2025 11:28:36 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My experiments were with the code in
<https://github.com/AntonErtl/move/>.
None of those are the simple loops that I mentioned above.
They are not. If you want short code, rep movsb is unbeatable (for memmove(), you have to do a little more, however).
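[The "little more" for memmove() is direction handling when the regions overlap. A minimal, hypothetical sketch of what that looks like around rep movsb (x86-64, GNU-style inline asm; not code from the thread or from the repository above):]

#include <cstddef>

// memcpy always copies forward; memmove must also handle the case where
// dst overlaps the tail of src, here by setting the direction flag and
// copying from the last byte downwards.
void *repmovsb_memmove(void *dst, const void *src, std::size_t n)
{
    unsigned char *d = static_cast<unsigned char *>(dst);
    const unsigned char *s = static_cast<const unsigned char *>(src);

    if (d <= s || d >= s + n) {
        // No harmful overlap: plain forward copy.
        __asm__ volatile("rep movsb"
                         : "+D"(d), "+S"(s), "+c"(n)
                         :
                         : "memory");
    } else {
        // dst overlaps the end of src: copy backwards.
        d += n - 1;
        s += n - 1;
        __asm__ volatile("std; rep movsb; cld"
                         : "+D"(d), "+S"(s), "+c"(n)
                         :
                         : "memory");
    }
    return dst;
}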
I posted performance results in
<2017Sep19.082137@mips.complang.tuwien.ac.at>
<2017Sep20.184358@mips.complang.tuwien.ac.at>
<2017Sep23.174313@mips.complang.tuwien.ac.at>
My routines were generally faster than rep movsb, except for pretty
large blocks (16KB).
Idiots from corporate IT blocked http://al.howardknight.net/
I feel for you. In my workplace, Usenet is blocked (probably unintentionally). I have to post from home.
So, link to google groups
Sorry, I cannot provide that service. Trying to access
groups.google.com tells me:
|Couldn’t sign you in
|
|The browser you’re using doesn’t support JavaScript, or has JavaScript
|turned off.
|
|To keep your Google Account secure, try signing in on a browser that
|has JavaScript turned on.
I certainly won't turn on JavaScript for Google, and apparently Google
wants me to log in to a Google account to access groups.google.com. I
don't have a Google account and I don't want one.
But all I would do is try whether google groups finds the message-ids.
You can do that yourself.
or, if posts are relatively recent, to https://www.novabbs.com/devel/thread.php?group=comp.arch
would be helpful.
The posts are from 2017; these message-ids are not random-generated.
I don't know why gnu memcpy is huge. I don't even know if it is
really *that* huge. But several KB is a number that I had seen
stated by other people.
I stated in one of these messages that I have seen an 11KB memmove in
glibc. Let's see:
objdump -t /debian8/usr/lib/x86_64-linux-gnu/libc.a|grep .text|grep 'memmove'
00000000000001a0 g    i  .text        0000000000000047 __libc_memmove
0000000000000000 g     F .text        000000000000019f __memmove_sse2
00000000000001a0 g    i  .text        0000000000000047 memmove
0000000000000000 g     F .text.ssse3  0000000000000009 __memmove_chk_ssse3
0000000000000010 g     F .text.ssse3  0000000000002b67 __memmove_ssse3
0000000000000000 g     F .text.ssse3  0000000000000009 __memmove_chk_ssse3_back
0000000000000010 g     F .text.ssse3  0000000000002b06 __memmove_ssse3_back
...
Yes, 11111 bytes for __memmove_ssse3. Debian 8 is one of the systems
I used at the time.
Let's see how it looks in Debian 12:
objdump -t /usr/lib/x86_64-linux-gnu/libc.a|grep .text|grep 'memmove'|grep -v wmemmove
0000000000000000 l     F .text          00000000000000f6 __libc_memmove_ifunc
0000000000000000 g    i  .text          00000000000000f6 __libc_memmove
0000000000000000 g    i  .text          00000000000000f6 memmove
0000000000000010 g     F .text.avx      000000000000002f __memmove_avx_unaligned
0000000000000080 g     F .text.avx      00000000000006de __memmove_avx_unaligned_erms
0000000000000010 g     F .text.avx.rtm  000000000000002d __memmove_avx_unaligned_rtm
0000000000000080 g     F .text.avx.rtm  00000000000006df __memmove_avx_unaligned_erms_rtm
0000000000000020 g     F .text.avx512   0000000000000009 __memmove_chk_avx512_no_vzeroupper
0000000000000030 g     F .text.avx512   000000000000073b __memmove_avx512_no_vzeroupper
0000000000000010 g     F .text.evex512  0000000000000037 __memmove_avx512_unaligned
0000000000000080 g     F .text.evex512  00000000000007a0 __memmove_avx512_unaligned_erms
0000000000000020 g     F .text          0000000000000009 __memmove_chk_erms
0000000000000030 g     F .text          000000000000002d __memmove_erms
0000000000000010 g     F .text.evex     0000000000000034 __memmove_evex_unaligned
0000000000000080 g     F .text.evex     00000000000007bb __memmove_evex_unaligned_erms
0000000000000010 g     F .text          0000000000000028 __memmove_sse2_unaligned
0000000000000080 g     F .text          0000000000000552 __memmove_sse2_unaligned_erms
0000000000000040 g     F .text.ssse3    0000000000000f3d __memmove_ssse3
0000000000000000 g     F .text          000000000000000e __memmove_chk
So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.
- anton
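[Anton's earlier expectation that an avx512memmove would be "quite a bit smaller, thanks to predication" comes down to tail handling: with AVX-512 masks the 0..63-byte remainder needs no byte/word/dword cleanup loops at all. A rough, hypothetical sketch of a forward-only (memcpy-style) copy follows; it is not Anton's planned routine, and a real memmove would still need the overlap/direction handling discussed above. AVX512F/AVX512BW intrinsics assumed.]

#include <cstddef>
#include <immintrin.h>   // compile with -mavx512f -mavx512bw

void avx512_copy(void *dst, const void *src, std::size_t n)
{
    char *d = static_cast<char *>(dst);
    const char *s = static_cast<const char *>(src);

    // Full 64-byte blocks with ordinary unaligned loads/stores.
    while (n >= 64) {
        _mm512_storeu_si512(d, _mm512_loadu_si512(s));
        d += 64; s += 64; n -= 64;
    }
    // Remaining 0..63 bytes: one masked load and one masked store.
    __mmask64 k = (static_cast<__mmask64>(1) << n) - 1;
    _mm512_mask_storeu_epi8(d, k, _mm512_maskz_loadu_epi8(k, s));
}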
On Wed, 12 Mar 2025 16:46:36 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 12 Mar 2025 11:28:36 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My experiments were with the code in
<https://github.com/AntonErtl/move/>.
...I posted performance results in
<2017Sep19.082137@mips.complang.tuwien.ac.at>
<2017Sep20.184358@mips.complang.tuwien.ac.at>
<2017Sep23.174313@mips.complang.tuwien.ac.at>
http://al.howardknight.net helped me to see the start of the message,
but not the full message.
And eternal-september is still struggling with restoration of its
archives after the crash of 9 months ago. More and more it looks like
they will never be restored.
... Trying to access groups.google.com tells me:
|Couldn’t sign you in
|
|The browser you’re using doesn’t support JavaScript, or has JavaScript
|turned off.
|
|To keep your Google Account secure, try signing in on a browser that
|has JavaScript turned on.
I certainly won't turn on JavaScript for Google, and apparently Google
wants me to log in to a Google account to access groups.google.com. I
don't have a Google account and I don't want one.
But all I would do is try whether google groups finds the message-ids.
You can do that yourself.