• Re: Reduced compared to what, Concertina II Progress

    From John Levine@21:1/5 to All on Fri Dec 8 19:35:43 2023
    According to Scott Lurndal <slp53@pacbell.net>:
    David Brown <david.brown@hesbynett.no> writes:
    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions. The idea was that instructions should, on the whole, be single-cycle and implemented directly in the hardware, rather than multi-cycle using sequencers or microcode. ...

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple (sequential) operations.

    The PDP-8 was a PDP-5 reimplemented with newer transistors. The 12-bit
    PDP-5 was a cut-down version of the 18-bit PDP-4, which was later reimplemented as the PDP-7, -9, and somewhat fancier -15. The PDP-4 was
    a redesign of the PDP-1 to make it a lot simpler and cheaper and
    modestly slower.

    All of them were single-accumulator, single-address machines,
    single-cycle everything, where the cycles were based on the memory speed. The
    PDP-8 did one cycle to fetch and decode an instruction, a second cycle
    to fetch the indirect address if it was a memory reference and the
    indirect bit was set, and a third cycle for memory refs to fetch or
    store the operand. I think that I/O instructions might sometimes have
    been a little slower to allow for the time it took signals to
    propagate on the I/O bus. All of the others had the same general
    design. The -15, the last in the line, was tarted up with an index
    register but by that time it was clear that the PDP-11 was the winner.
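
    [To make that cycle accounting concrete, here is a rough sketch of my own
    (an illustration, not DEC documentation or a real PDP-8 model) of how the
    per-instruction memory-cycle count falls out of the description above:]

    /* Rough illustration of the cycle counts described above; not a PDP-8 model.
     * One memory cycle to fetch and decode, one more to chase an indirect
     * address, and one more to fetch or store the operand of a memory reference. */
    #include <stdio.h>

    struct insn { int memref; int indirect; };

    static int memory_cycles(struct insn i)
    {
        int cycles = 1;                         /* fetch and decode */
        if (i.memref && i.indirect) cycles++;   /* fetch the indirect address */
        if (i.memref) cycles++;                 /* fetch or store the operand */
        return cycles;
    }

    int main(void)
    {
        printf("operate (OPR):        %d cycle(s)\n", memory_cycles((struct insn){0, 0}));
        printf("memory ref, direct:   %d cycle(s)\n", memory_cycles((struct insn){1, 0}));
        printf("memory ref, indirect: %d cycle(s)\n", memory_cycles((struct insn){1, 1}));
        return 0;
    }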

    I suppose you could say RISC but they weren't really reduced from
    anything, they were born simple. The PDP-8 did a fantastic job of
    hitting a sweet spot that could be implemented cheaply using late
    1960s and 1970s technology while still being capable enough to do
    significant work.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Fri Dec 8 19:44:43 2023
    According to David Brown <david.brown@hesbynett.no>:
    By my logic (such as it is - I don't claim it is in any sense
    "correct"), the PDP-8 would definitely be /CISC/. It only has a few >instructions, but that is irrelevant (that was my point) - the
    instructions are complex, and therefore it is CISC.

    Having actually programmed a PDP-8 I find this assertion hard to
    understand. It's true, it had instructions that both fetched an
    operand and did something with it, but with only one register what
    else were they going to do?

    It was very RISC-like in that you used a sequence of simple
    instructions to do what would be one instruction on more complex machines. For example, you got the effect of a load by clearing the
    register (CLA) and then adding the memory word. To do subtraction,
    clear, add the second operand, negate, add the first operand, maybe
    negate again depending on whether you wanted A-B or B-A. We all knew a
    long list of these idioms.
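
    [For readers who never touched a PDP-8, a small illustrative sketch of
    the two idioms just described - my own toy 12-bit accumulator in C, using
    the usual mnemonics as helper names, not period code:]

    /* Toy model of the idioms above: "load" as clear-then-add, and A-B as
     * clear, add B, negate, add A.  A 12-bit accumulator, nothing more. */
    #include <stdio.h>

    static unsigned ac;                                              /* the accumulator */

    static void cla(void)       { ac = 0; }                          /* clear AC */
    static void tad(unsigned m) { ac = (ac + m) & 07777; }           /* two's complement add */
    static void cia(void)       { ac = (07777 - ac + 1) & 07777; }   /* negate (complement, then increment) */

    int main(void)
    {
        unsigned A = 00123, B = 00045;

        cla(); tad(A);                         /* effect of "load A" */
        printf("load A gives %04o\n", ac);

        cla(); tad(B); cia(); tad(A);          /* A - B */
        printf("A - B  gives %04o\n", ac);     /* 0123 - 0045 = 0056 */
        return 0;
    }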

    There was a microcontroller that we once considered for a project, which
    had only a single instruction - "move". We ended up with a different
    chip, so I never got to play with it in practice.

    I saw some of those, and the one-instruction thing was a conceit. They
    all had plenty of instructions, just with the details in the operand
    specifiers rather than the op code.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From David Brown@21:1/5 to John Levine on Fri Dec 8 21:14:05 2023
    On 08/12/2023 20:44, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    By my logic (such as it is - I don't claim it is in any sense
    "correct"), the PDP-8 would definitely be /CISC/. It only has a few
    instructions, but that is irrelevant (that was my point) - the
    instructions are complex, and therefore it is CISC.

    Having actually programmed a PDP-8 I find this assertion hard to
    understand. It's true, it had instructions that both fetched an
    operand and did something with it, but with only one register what
    else were they going to do?

    I have never used one, and don't know about the PDP-8 in any kind of
    detail. I am just responding to Scott's post. I wrote that my
    understanding of "RISC" was it meant simpler instructions, not fewer instructions, and then Scott suggested that meant the PDP-8 was "RISC"
    because it had few instructions. I have no idea if the PDP-8 is/was
    generally considered "RISC", but I /do/ know that Scott appeared to have
    got my post completely backwards.

    Based solely on the information Scott gave, however, I would suggest
    that the "OPR" instruction - "microcoded operation" - and the "IOT"
    operation would mean it certainly was not RISC. (This is true even if
    it has other attributes commonly associated with RISC architectures.)


    It was very RISC-like in that you used a sequence of simple
    instructions to do what would be one instruction on more complex machines. For example, you got the effect of a load by clearing the
    register (CLA) and then adding the memory word. To do subtraction,
    clear, add the second operand, negate, add the first operand, maybe
    negate again depending on whether you wanted A-B or B-A. We all knew a
    long list of these idioms.

    There was a microcontroller that we once considered for a project, which
    had only a single instruction - "move". We ended up with a different
    chip, so I never got to play with it in practice.

    I saw some of those, and the one-instruction thing was a conceit. They
    all had plenty of instructions, just with the details in the operand specifiers rather than the op code.


    The one I am thinking of was the MAXQ. No, it is/was not a conceit - it
    was a real transfer-triggered architecture.

  • From John Levine@21:1/5 to david.brown@hesbynett.no on Fri Dec 8 22:16:26 2023
    It appears that David Brown <david.brown@hesbynett.no> said:
    Based solely on the information Scott gave, however, I would suggest
    that the "OPR" instruction - "microcoded operation" - and the "IOT"
    operation would mean it certainly was not RISC. (This is true even if
    it has other attributes commonly associated with RISC architectures.)

    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The OPR opcode had a bunch of bits that did various things to the
    accumulator and the link (approximately the carry bit.) It wasn't
    microcoded in the modern sense, it was that you could set more than
    one bit to get more than one operation: e.g., octal 7040 complemented
    the accumulator and 7001 incremented it, so 7041 negated it.
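
    [A minimal sketch of that bit-combining, purely for illustration; the
    defines below are the standard group-1 OPR microinstruction bits, with
    the octal 7000 opcode bits left out:]

    /* Group-1 OPR as a bundle of independently settable bits: 7041 is just
     * 7040 (complement) plus 7001 (increment), i.e. negate.  Sketch only. */
    #include <stdio.h>

    #define CLA 00200   /* clear the accumulator */
    #define CMA 00040   /* complement the accumulator */
    #define IAC 00001   /* increment the accumulator */

    static unsigned opr_group1(unsigned ac, unsigned bits)
    {
        if (bits & CLA) ac = 0;
        if (bits & CMA) ac = ~ac & 07777;        /* ones' complement, 12 bits */
        if (bits & IAC) ac = (ac + 1) & 07777;   /* then increment */
        return ac;
    }

    int main(void)
    {
        /* CMA | IAC is the microbit part of octal 7041: two operations, one word */
        printf("negate 1234: %04o\n", opr_group1(01234, CMA | IAC));
        return 0;
    }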

    The IOT instructions were extremely simple: a six-bit device address
    field that it sent out on the I/O bus, and three low bits that sent
    pulses out on three control lines. The devices did whatever they did:
    read the contents of the accumulator, send back a value to put in it,
    or tell the CPU to skip the next instruction, which was how you tested
    a flag.
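
    [Again as an illustration only, splitting the fields exactly as described
    above; 06031 is, if memory serves, the classic "skip on keyboard flag":]

    /* Splitting a 12-bit IOT word into device address and pulse bits,
     * per the description above.  Illustration, not a device model. */
    #include <stdio.h>

    int main(void)
    {
        unsigned iot = 06031;                  /* opcode 6, device 03, pulse 1 */
        unsigned device = (iot >> 3) & 077;    /* six-bit device address on the I/O bus */
        unsigned pulses = iot & 07;            /* three control-line pulses */

        printf("device %02o, pulses %o\n", device, pulses);
        return 0;
    }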

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Sat Dec 9 02:42:43 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    The PDP-8 certainly is simple, and it does not have many instructions,
    but it certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    That's a reasonable set of criteria but I think the last one is the
    most important. The IBM 801 ruthlessly took everything out of the
    hardware that could be done as fast in software, and the rest of the
    design followed from that.

    Since then RISC-y designs have incorporated virtual memory and
    floating point even though the 801 didn't, because they aren't things
    they tried to make the 801 do, and they turn out to be a lot faster
    with hardware support.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Anton Ertl@21:1/5 to David Brown on Sat Dec 9 08:33:14 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you wrote.)


    Is Coldfire a load/store architecture? If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store architecture" and a "non-load/store architecture". And I agree that it
    is usually a more important distinction than the number of instructions,
    or the complexity of the instructions, or any other distinctions.

    Not sure what you mean by "the complexity of the instructions".

    But does that mean LSA vs. NLSA should be used to /define/ RISC vs CISC?

    Skimming through Patterson and Ditzel's 1980 paper "The Case for the
    Reduced Instruction Set Computer", I fail to see a definition of RISC
    (which may be one of the reasons for it being particularly easy to
    claim RISCness for everything under the sky). In his 1985 paper
    "Reduced instruction set computers" <https://dl.acm.org/doi/pdf/10.1145/2465.214917>, Patterson mentioned
    the 801, Berkeley RISC-I and RISC-II, and Stanford MIPS as actual RISC machines, and saw the following common characteristics among them:

    |1. Operations are register-to-register, with only LOAD and STORE
    | accessing memory. [...]
    |
    |2. The operations and addressing modes are reduced. Operations
    | between registers complete in one cycle, permitting a simpler,
    | hardwired control for each RISC, instead of
    | microcode. Multiple-cycle instructions such as floating-point
    | arithmetic are either executed in software or in a special-purpose
    | coprocessor. (Without a coprocessor, RISCs have mediocre
    | floating-point performance.) Only two simple addressing modes,
    | indexed and PC-relative, are provided. More complicated addressing
    | modes can be synthesized from the simple ones.
    |
    |3. Instruction formats are simple and do not cross word boundaries. [...]
    |
    |4. RISC branches avoid pipeline penalties. [... delayed branches ...]

    One commonality that Patterson fails to mention is that these are all
    register machines (machines where most instructions working with
    addresses or scalar integers work on general-purpose registers) with
    16 or more general-purpose registers. He probably did not mention
    this because the VAX and the S/360 are also register machines. Let's
    call this commonality "5.".

    Looking at these criteria:

    2. is quite implementation-oriented. The commercial MIPS R2000 and
    R3000 (descendant of Stanford MIPS) have the FPU in a coprocessor,
    but the R4000 already has it on the same chip as the integer unit;
    is the R4000 not a RISC? Likewise for SPARC (descendant of
    Berkeley RISC) and the RSC implementation of Power (the descendant
    of the 801). And commercial MIPS already has multiply and divide
    instructions that do not complete in one cycle (and they involve
    special-purpose registers).

    RISC-V (with involvement from Patterson) has the M extension which
    includes integer multiply and divide instructions which typically
    take more than one cycle. Is RISC-V with the M extension not a
    RISC?

    Where 2. is ISA-oriented (in the addressing modes), ARM (A32), HPPA,
    and the 88000 provided more addressing modes. Are they not RISC? I
    think they have enough in common with these three RISCs that they
    can be considered RISC, and that this commonality between the three
    research RISCs is not a relevant criterion.

    3. The ROMP (the first commercial offspring of the 801) has mixed
    16-bit and 32-bit instructions, and the 32-bit instructions can
    cross word boundaries. This has later been adopted by the ARM
    Thumb2 (T32) instruction set (after experiments with the
    16-bit-only Thumb ISA), microMIPS (after the 16-bit only MIPS16e)
    and in the RISC-V C extension. Are these all not RISCs? I think
    they are RISCs, so that criterion has to be relaxed to include
    instruction sets with two instruction sizes with a factor of two
    between them. Interestingly, ARM A64 satisfies 3 in unrelaxed
    form.

    4. The delayed branches are an example of an implementation-oriented
    instruction set feature. The ARM architects had the wisdom to
    avoid it from the start, and early RISC architectures that have it
    found it to be a burden after a few years. The 88000 (1988)
    already had both delayed and nondelayed branches, Power (1990) and
    Alpha (1992) do not have delayed branches. Are they not RISCs?

    So the only commonalities that stood the test of time are 1. and 5.

    And looking at ARM A64 and RISC-V (say, RV64GC), we see two recent architectures for general-purpose computers that mostly satisfy these
    criteria. Interestingly, some RISC-V advocates (not sure if it is
    Patterson) now use ARM A64 as a counterexample to the RISC idea like
    Patterson used the VAX in the 1980s, but I have not heard hard
    criteria for that from them.

    And while AMD64 includes load-and-op and RMW instructions, it mostly
    just uses one memory location. The number of GPRs has been raised
    with AMD64 to 16, and APX will raise it to 32, so they took lessons
    from RISC principles.

    Things have changed a lot since the term "RISC" was first coined, and
    maybe architectural and ISA features are so mixed that the terms "RISC"
    and "CISC" have lost any real meaning.

    Architecture is ISA. Architectural features and ISA features are the
    same thing.

    Sure, many people have tried to put the "RISC" label on many things,
    and if you accept that, it really has no meaning; and actually, it did
    not have a meaning in the 1980 paper, and was only given meaning by
    looking at the commonalities of the 3-4 prototypes in 1985; and with
    hindsight, we see that only commonality 1 (and the unmentioned
    commonality 5) has stood the test of time.

    But we see that ARM A64 and RISC-V actually satisfy 1 and 5, more than
    30 years after the early research RISCs, so these criteria provide
    some benefits even in the very different implementation world of the
    2010s.

    ARM A64 also satisfies criterion 3, so that apparently has a benefit,
    too, and RISC-V C satisfies the relaxed version of 3, while AMD64 (and
    VAX and 68000) do not.

    So I claim that an architecture that satisfies criteria 1, 5, and the
    relaxed 3 is a RISC.

    As for "CISC", that term really has no meaning other than "not RISC".
    I am not aware of a proper definition, nor an enumeration of
    architectures that are labeled as CISCs vs. non-CISCs. Basically, we
    know that Patterson considered the VAX to be a CISC. Otherwise, the
    term has been often used as "non-RISC".

    If that's the case, then we
    should simply talk about LSA and NLSA architectures, and stop using
    "RISC" and "CISC".

    As we see above, RISC is still a little more than just load/store
    architecture.

    I don't think trying to redefine "RISC" to mean
    something different from its original purpose helps.

    That would mean that "RISC" has an original definition; what is it?

    As for the purpose, the purpose is still an architecture that is a
    good and stable interface to software, and can be implemented
    efficiently for a wide range of performance targets over a long time.
    ARM A64 seems to do quite well in that respect (although they leave
    the real low end to A32/T32, where they do very well), RISC-V
    currently only covers the low end of the market while the
    not-quite-RISC AMD64 only covers the high end. RISC criteria 1, 5,
    and relaxed 3 seem to work quite well after four decades, while
    criteria 2 and 4 went by the wayside quite soon.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Quadibloc@21:1/5 to Anton Ertl on Sat Dec 9 12:35:46 2023
    On Sat, 09 Dec 2023 08:33:14 +0000, Anton Ertl wrote:

    That would mean that "RISC" has an original definition; what is it?

    It certainly does make sense to say that the _description_ of
    characteristics of existing RISC implementations in the Patterson
    and Ditzel paper which you cited doesn't constitute an _original_
    definition of RISC.

    However, Patterson also wrote a popular article for _Scientific
    American_ about one of the early RISC processors with which he was
    connected, and he mentioned most or all of those characteristics
    there, including not having hardware floating-point, because it
    took more than one cycle to execute, and, although my memory may
    be wrong, it seems to me that in that article he did make the leap
    to treating that as a definition rather than just an observation.

    Whether or not that is true, it is indeed something like the list
    from Patterson and Ditzel that is being taken as the "definition"
    of RISC by those who say that what passes for RISC these days is
    such as to deprive the term of meaning.

    Obviously, not having hardware floating-point for the sake of
    RISC purism is such a stupid idea that basically no one does that
    any more. Since, however, there are still many current designs
    that incorporate most of the _other_ characteristics of RISC, it
    would be inappropriate to draw the conclusion from this that RISC
    is now a dead and obsolete concept.

    On the other hand, a lot of RISC architectures - all the instructions
    are 32 bits long, the register banks have at least 32 registers in
    them, the architecture is load-store - currently have OoO
    implementations. Like having hardware floating-point, this is done
    to get the best possible speed given the much larger number of transistors
    we can put on a die these days.

    Unlike allowing hardware floating-point, though, I think this
    change strikes directly at the _raison d'être_ of RISC itself.

    If RISC exists because it's designed around making pipelining
    fast and efficient, once you've got an OoO implementation, of
    what benefit is RISC? Maybe the OoO circuitry doesn't have to
    work so hard, or OoO plus 32 registers can delay register
    hazards even longer than OoO plus 8 registers. I don't find this
    so implausible as to dismiss RISC as being now nothing more than
    a marketing gimmick.

    Of course, there's CISC-that's-almost-as-good-as-RISC (IBM's
    z/Architecture, ColdFire, maybe even Mitch's MY 66000) and
    there's CISC than which _anything_ else would be better
    (such as the world's most popular ISA, devised by a company
    that made MOS memories, and then branched out into making
    some of the world's first single-chip microprocessors)...

    Given that situation, where "good CISC" is relatively
    minor in its market presence compared to bad, bad, very
    bad CISC, some architecture designers have chosen to
    incorporate as much of Patterson's original description,
    if not definition, of RISC into their designs as is
    practical in order to distance themselves more convincingly
    from x86 and x86-64.

    In designing Concertina II, which might well be described
    as a half-breed architecture from Hell that hasn't made
    up its mind whether to be RISC, CISC, or VLIW, even I have
    been affected by that concern.

    John Savard

  • From John Dallman@21:1/5 to Quadibloc on Sat Dec 9 13:03:00 2023
    In article <ul1mv2$26n3i$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    On the other hand, a lot of RISC architectures - all the
    instructions are 32 bits long, the register banks have at
    least 32 registers in them, the architecture is load-store
    - currently have OoO implementations. Like having hardware
    floating-point, this is done to get the best possible speed
    given the much larger number of transistors we can put on a
    die these days.

    Unlike allowing hardware floating-point, though, I think this
    change strikes directly at the _raison d'être_ of RISC itself.

    If RISC exists because it's designed around making pipelining
    fast and efficient . . .

    Simple pipelining can't deliver leading-edge performance these days (and
    hasn't for quite a while). And once you start having multiple pipelines, you
    start encountering interactions between them, and OoO becomes a more
    attractive method.

    "RISC" as a principle of /implementation/ has become obsolete.

    Load-store architectures and large register sets are still a method of
    coping with the slowness of accessing memory, even with good caches.

    Instructions with simple semantics and few side-effects can make OoO
    systems easier to implement. Making it easy to decode lots of
    instructions in parallel makes it possible to keep a larger number of instructions in the pool and thus find more re-ordering opportunities.

    "RISC" as an ideology was a product of its time. All the architectures
    designed that way have died out of commercial use.

    It's an interesting question if one should even try to design an
    architecture for a very long life, given that one can't anticipate how
    the implementation technologies will change over decades. An architecture
    that will obviously become obsolete within a decade probably isn't worth
    the start-up costs, but after that, who knows?

    John

  • From MitchAlsup@21:1/5 to Anton Ertl on Sat Dec 9 18:11:05 2023
    Anton Ertl wrote:



    I don't think trying to redefine "RISC" to mean
    something different from its original purpose helps.

    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    RISC was defined before CISC was coined as its contrapositive.

    - anton

  • From MitchAlsup@21:1/5 to Quadibloc on Sat Dec 9 18:20:22 2023
    Quadibloc wrote:



    Of course, there's CISC-that's-almost-as-good-as-RISC (IBM's
    z/Architecture, ColdFire, maybe even Mitch's MY 66000) and

    My 66000 is RISC in a purer form than ARM or RISC-V.

    there's CISC than which _anything_ else would be better
    (such as the world's most popular ISA, devised by a company
    that made MOS memories, and then branched out into making
    some of the world's first single-chip microprocessors)...

    Given that situation, where "good CISC" is relatively
    minor in its market presence compared to bad, bad, very
    bad CISC, some architecture designers have chosen to
    incorporate as much of Patterson's original description,
    if not definition, of RISC into their designs as is
    practical in order to distance themselves more convincingly
    from x86 and x86-64.

    Even x86 designers use RISC to distance themselves from CISC.

    In designing Concertina II, which might well be described
    as a half-breed architecture from Hell that hasn't made
    up its mind whether to be RISC, CISC, or VLIW, even I have
    been affected by that concern.

    The only thing in My 66000 that is not among the 7 Tenets of RISC
    is the attachment of constants as replacements for registers.

    John Savard

  • From Anton Ertl@21:1/5 to Quadibloc on Sat Dec 9 17:27:02 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    However, Patterson also wrote a popular article for _Scientific
    American_ about one of the early RISC processors with which he was
    connected,

    A short web search on that came up empty. Do you have a reference?

    On the other hand, a lot of RISC architectures - all the instructions
    are 32 bits long, the register banks have at least 32 registers in
    them, the architecture is load-store - currently have OoO
    implementations. Like having hardware floating-point, this is done
    to get the best possible speed given the much larger number of transistors
    we can put on a die these days.

    Unlike allowing hardware floating-point, though, I think this
    change strikes directly at the _raison d'être_ of RISC itself.

    That's the interesting thing. It doesn't.

    If RISC exists because it's designed around making pipelining
    fast and efficient, once you've got an OoO implementation, of
    what benefit is RISC?

    It's still simpler. With a load-and-op instruction, you have to split
    the instruction into the load part and the op part, they find their
    way through the OoO engine, and eventually you have to combine them
    again in the ROB. With RMW, things become even more interesting; AMD
    had (maybe still has) an R_W uop (ROP in AMD parlance) which works
    before and after the ALU part. Obviously all doable, but it adds to
    the complexity. Which was first, the PowerPC 604 or the Pentium Pro?
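
    [To make the cracking step concrete, here is a purely illustrative sketch
    - a made-up decoder in C, not any real microarchitecture's - of a
    load-and-op instruction becoming two dependent micro-ops where a
    register-register op maps to just one:]

    /* Toy decoder: a load-and-op instruction cracks into a load micro-op plus
     * an ALU micro-op; a register-register op is a single micro-op.  The two
     * pieces of the cracked instruction must still retire together. */
    #include <stdio.h>

    enum uop_kind { UOP_LOAD, UOP_ALU };

    struct uop { enum uop_kind kind; const char *text; };

    /* "add r1, [r2]" style load-and-op */
    static int crack_load_op(struct uop out[])
    {
        out[0] = (struct uop){ UOP_LOAD, "tmp <- mem[r2]" };
        out[1] = (struct uop){ UOP_ALU,  "r1  <- r1 + tmp" };
        return 2;
    }

    /* "add r1, r2" register-register op */
    static int crack_reg_op(struct uop out[])
    {
        out[0] = (struct uop){ UOP_ALU, "r1 <- r1 + r2" };
        return 1;
    }

    int main(void)
    {
        struct uop u[2];
        printf("load-and-op cracks into %d micro-ops\n", crack_load_op(u));
        printf("reg-reg op  cracks into %d micro-op\n",  crack_reg_op(u));
        return 0;
    }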

    Maybe the OoO circuitry doesn't have to
    work so hard,

    Or the microarchitects and validators don't have to work so hard.

    or OoO plus 32 registers can delay register
    hazards even longer than OoO plus 8 registers.

    Doubtful; with a given number of physical registers, you then have 24
    fewer registers for reordering. The more relevant reason for 32
    registers is for code that has to deal with more than 16 values at the
    same time; they reduce the need for loads and stores for spilling, and
    for hardware-optimizing store-load dependencies. It's no surprise
    that both Tiger Lake (Intel) and Zen 3 (AMD) perform store-to-load
    forwarding at 0 cycle latency (which probably did cost quite a bit of
    design work and who knows how much silicon), while it takes 5 cycles
    or so on Firestorm (Apple); Firestorm does not need that optimization
    as dearly.

    Of course, there's CISC-that's-almost-as-good-as-RISC (IBM's
    z/Architecture, ColdFire,

    What makes you think so?

    there's CISC than which _anything_ else would be better
    (such as the world's most popular ISA, devised by a company
    that made MOS memories, and then branched out into making
    some of the world's first single-chip microprocessors)...

    AMD made MOS memories and some of the world's first single-chip microprocessors?

    Anyway, you seem to be referring to AMD64, but the rest is unclear.
    Intel and AMD have avoided the complexities of VAX in designing IA-32
    and AMD64; in particular, every instruction (but MOVS and the
    newfangled gather and scatter instructions, shame on Intel, and they
    were rewarded with Downfall) only refers to one memory location.

    some architecture designers have chosen to
    incorporate as much of Patterson's original description,
    if not definition, of RISC into their designs as is
    practical in order to distance themselves more convincingly
    from x86 and x86-64.

    I doubt that "distancing themselves from x86 and x86-64" was any
    consideration in the ARM A64 and RISC-V designs. Of course they would
    not design in architectural ideas where the patent has not expired,
    but for load-and-op, RMW, or three-memory-address-with-autoincrement instructions, any patents that may have existed have long expired.

    The ARM A64 architects seem to have had no qualms about introducing features like
    load-pair and store-pair that raise eyebrows among purists, so if they
    thought they would gain enough by deviating from A64 being a
    load-store architecture, or from sticking to fixed-width instructions,
    or from it having 32 registers, they would have gone there.
    Apparently they did not think so, and the IPC and performance per Watt
    of Firestorm indicates that they have designed well.

    The surviving RISC properties are no longer as important as they were
    in the late 1980s, but they still result in fewer problems for the microarchitects to solve, and all that goes with lower complexity.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Levine@21:1/5 to All on Sat Dec 9 19:41:47 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    Do you mean this paper by Patterson and Ditzel or something else?

    https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf

    Also compare this paper on the 801 which starts in a very different
    place but ends up with many of the same conclusions.

    https://dl.acm.org/doi/pdf/10.1145/800050.801824

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Robert Swindells@21:1/5 to John Levine on Sat Dec 9 20:06:54 2023
    On Sat, 9 Dec 2023 19:41:47 -0000 (UTC), John Levine wrote:

    According to MitchAlsup <mitchalsup@aol.com>:
    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    Do you mean this paper by Patterson and Ditzel or something else?

    Mitch may also be thinking of Manolis Katevenis' thesis:

    <http://users.ics.forth.gr/~kateveni/cv/katevenis_cv_full_v21.html#B1>

  • From MitchAlsup@21:1/5 to John Levine on Sat Dec 9 21:47:09 2023
    John Levine wrote:

    According to MitchAlsup <mitchalsup@aol.com>:
    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    Do you mean this paper by Patterson and Ditzel or something else?

    I actually meant "Reduced Instruction Set Computers for VLSI" K.

    https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf

    My memory ain't what it used to be........

    Several quotes from the Ditzel paper::


    By a judicious choice of the proper instruction set and the design of a corresponding
    architecture, we feel that it should be possible to have a very simple instruction set
    that can be very fast. This may lead to a substantial net gain in overall program
    execution speed. This is the concept of the Reduced Instruction Set Computer.


    Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
    the results to propose a better machine, and then repeating the cycle over a dozen times.
    Though the initial intent was not specifically to come up with a simple design, the result
    was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
    VAX [Johnson79].


    The paper also makes the case against seldom-used instructions in two ways:
    first, it argues against microcoded implementations that use microcode as
    a gas, filling up every square nanoAcre with seldom-used functionality; then it points out that if a simple sequence of instructions is faster than the equivalent µCode
    instruction, choosing µCode was a poor choice by the architect.

    In my case (My 66000)::
    I see enough use of indexed addressing that this feature makes the cut
    I see enough use of LD Reg,[GOT[k]] that big displacements make the cut
    I see enough use of LD IP, [GOT[k]] that simplifies cross module calling to make the cut
    I see immediates being used all sorts of places so immediates make the cut
    I see large displacements being used all sorts of places so displacements make the cut
    I see a few uses of the operand sign control but this reduces the name space the programmer
    .....needs to understand
    I see enough VEC-LOOP pairs that these make the cut
    Practically every non-leaf subroutine uses ENTER and EXIT
    .....But note I must remain vigilant that these don't end up slower than a series of insts
    The compiler happily produces transcendental instructions for those names which are in the
    .....LLVM intrinsic function list. When you can calculate practically any transcendental
    .....in fewer than 20 cycles (FDIV equivalent) it is time to do with these what happened
    .....to FP instructions around the time of the first commercial RISCs.

    Although not done yet:: I am not averse to adding encryption instructions {once I can
    figure out what they should actually be doing and how few I can get away with}

    Also compare this paper on the 801 which starts in a very different
    place but ends up with many of the same conclusions.

    https://dl.acm.org/doi/pdf/10.1145/800050.801824

  • From Tim Rentsch@21:1/5 to John Levine on Sat Dec 9 18:00:33 2023
    John Levine <johnl@taugh.com> writes:

    It appears that David Brown <david.brown@hesbynett.no> said:

    Based solely on the information Scott gave, however, I would suggest
    that the "OPR" instruction - "microcoded operation" - and the "IOT"
    operation would mean it certainly was not RISC. (This is true even if
    it has other attributes commonly associated with RISC architectures.)

    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The underlying ideas were looked at in the late 1970s, but I
    believe the name RISC goes back only to the early 1980s.

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    The LGP-30, first delivered in 1956, had 113 tubes and 1450
    diodes. I think it's fair to say the LGP-30 has a good
    claim to being the world's first minicomputer.

  • From John Levine@21:1/5 to All on Sun Dec 10 16:34:12 2023
    According to Tim Rentsch <tr.17687@z991.linuxsc.com>:
    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The underlying ideas were looked at in the late 1970s, but I
    believe the name RISC goes back only to the early 1980s.

    I checked, the 801 project started in 1975. The RISC-I paper was
    published in 1981 and I think they came up with the name in 1980, so
    close enough.

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    The LGP-30, first delivered in 1956, had 113 tubes and 1450
    diodes. I think it's fair to say the LGP-30 has a good
    claim to being the world's first minicomputer.

    I poked around one that was old and dead but just looking at it, you
    could see what a very elegant design it was to get useful work out of
    so little logic.

    The Bendix G-15 had 450 tubes and 3000 diodes so it's the other
    contender for the title. Both machines were introduced in 1956,
    cost about the same, and were about the same size, 800lb for the LGP-30,
    956lb for the G-15.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Dec 10 16:56:10 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    The ARM A64 architects seem to have had no qualms about introducing features like
    load-pair and store-pair that raise eyebrows among purists, so if they thought they would gain enough by deviating from A64 being a
    load-store architecture, or from sticking to fixed-width instructions,
    or from it having 32 registers, they would have gone there.
    Apparently they did not think so, and the IPC and performance per Watt
    of Firestorm indicates that they have designed well.

    Actually the 'RISC purity' of the A64 Architecture was not
    likely to have ever been a consideration when choosing which
    features to add to the architecture. They're in the money-making
    business, not some idealistic RISC business.

  • From John Levine@21:1/5 to All on Sun Dec 10 16:45:32 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    I actually meant "Reduced Instruction Set Computers for VLSI" K.

    https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf

    My memory ain't what it used to be........

    Several quotes from the Ditzel paper::

    By a judicious choice of the proper instruction set and the design of a corresponding
    architecture, we feel that it should be possible to have a very simple instruction set
    that can be very fast. This may lead to a substantial net gain in overall program
    execution speed. This is the concept of the Reduced Instruction Set Computer.

    Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
    the results to propose a better machine, and then repeating the cycle over a dozen times.
    Though the initial intent was not specifically to come up with a simple design, the result
    was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
    VAX [Johnson79].

    Yup. If anyone can find that Johnson tech report I'd like to read it.
    Some googlage only found references to it.

    It seems to me there were two threads to the RISC work. IBM designed
    the hardware and compiler together, a more sophisticated version of
    what Johnson did, so they were constantly trading off what they could
    do in hardware and what they could do in software, usually finding
    that software could do it better, e.g., splitting instruction and data
    caches because they knew their compiler's code never modified
    instructions.

    Berkeley used the old PCC compiler which wasn't terrible but did not
    do very sophisticated register allocation, so they invented hardware
    register windows. In retrospect, the 801 project was right, and windows,
    albeit clever, were a bad idea. Better to use that chip area for a
    bigger cache.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Scott Lurndal@21:1/5 to Tim Rentsch on Sun Dec 10 16:59:56 2023
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    John Levine <johnl@taugh.com> writes:

    It appears that David Brown <david.brown@hesbynett.no> said:

    Based solely on the information Scott gave, however, I would suggest
    that the "OPR" instruction - "microcoded operation" - and the "IOT"
    operation would mean it certainly was not RISC. (This is true even if
    it has other attributes commonly associated with RISC architectures.)

    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The underlying ideas were looked at in the late 1970s, but I
    believe the name RISC goes back only to the early 1980s.

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    The LGP-30, first delivered in 1956, had 113 tubes and 1450
    diodes. I think it's fair to say the LGP-30 has a good
    claim to being the world's first minicomputer.

    I'd suggest the ABC machine, but it was restricted to solving
    linear equations :-) I did have the last remaining
    component in my possession for a few months in 1981 (the
    memory drum).

    There's a modern replica at the CHM.

  • From Scott Lurndal@21:1/5 to John Levine on Sun Dec 10 17:02:02 2023
    John Levine <johnl@taugh.com> writes:
    According to Tim Rentsch <tr.17687@z991.linuxsc.com>:
    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The underlying ideas were looked at in the late 1970s, but I
    believe the name RISC goes back only to the early 1980s.

    I checked, the 801 project started in 1975. The RISC-I paper was
    published in 1981 and I think they came up with the name in 1980, so
    close enough.

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    The LGP-30, first delivered in 1956, had 113 tubes and 1450
    diodes. I think it's fair to say the LGP-30 has a good
    claim to being the world's first minicomputer.

    I poked around one that was old and dead but just looking at it, you
    could see what a very elegant design it was to get useful work out of
    so little logic.

    The Bendix G-15 had 450 tubes and 3000 diodes so it's the other
    contender for the title. Both machines were introduced in 1956,
    cost about the same, and were about the same size, 800lb for the LGP-30, 956lb for the G-15.

    The ElectroData Datatron system shipped in 1954. I used to work in
    the plant where it was designed and built (albeit decades later
    when the B4800 was the high-end machine).

  • From John Dallman@21:1/5 to All on Sun Dec 10 17:56:00 2023
    In article <ul4pvc$2r22$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:

    Yup. If anyone can find that Johnson tech report I'd like to read
    it. Some googlage only found references to it.

    The Computer History Museum has hardcopy:

    <https://www.computerhistory.org/collections/catalog/102773566>

    The UK's Centre for Computing History also appears to have a copy: <https://www.computinghistory.org.uk/det/12205/Bell-Computing-Science-Technical-Report-80-A-32-Bit-Processor-Design/> Since they're local to me,
    I've asked them if they can make me a copy.

    If anyone else wants to hunt, the reference is in: <https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf>

    The author was Stephen C Johnson, who worked at Bell Labs in the 1970s,
    largely on languages; he was responsible for YACC. <https://research.google.com/pubs/archive/94.pdf>

    John

  • From John Dallman@21:1/5 to Lurndal on Sun Dec 10 17:56:00 2023
    In article <K4mdN.7691$83n7.220@fx18.iad>, scott@slp53.sl.home (Scott
    Lurndal) wrote:

    Actually the 'RISC purity' of the A64 Architecture was not
    likely to have ever been a consideration when choosing which
    features to add to the architecture. They're in the money
    making business, not some idealistic RISC business.

    Yup. A32 was never a pure RISC: they had understood the idea, but did not
    feel constrained by it.

    John

  • From MitchAlsup@21:1/5 to John Levine on Sun Dec 10 18:27:40 2023
    John Levine wrote:

    According to MitchAlsup <mitchalsup@aol.com>:
    I actually meant "Reduced Instruction Set Computers for VLSI" K.

    https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf

    My memory ain't what it used to be........

    Several quotes from the Ditzel paper::

    By a judicious choice of the proper instruction set and the design of a corresponding
    architecture, we feel that it should be possible to have a very simple instruction set
    that can be very fast. This may lead to a substantial net gain in overall program
    execution speed. This is the concept of the Reduced Instruction Set Computer.
    Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
    the results to propose a better machine, and then repeating the cycle over a dozen times.
    Though the initial intent was not specifically to come up with a simple design, the result
    was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
    VAX [Johnson79].

    Yup. If anyone can find that Johnson tech report I'd like to read it.
    Some googlage only found references to it.

    It seems to me there were two threads to the RISC work. IBM designed
    the hardware and compiler together,

    Something I keep pushing Quadrablock to do.

    a more sophisticated version of
    what Johnson did, so they were constantly trading off what they could
    do in hardware and what they could do in software, usually finding
    that software could do it better, e.g., splitting instruction and data
    caches because they knew their compiler's code never modified
    instructions.

    Apparently they had no notion of JiT compilation and JiT caches.

    Berkeley used the old PCC compiler which wasn't terrible but did not
    do very sophisticated register allocation, so they invented hardware
    register windows. In retrospect, the 801 project was right and windows
    albeit clever were a bad idea. Better to use that chip area for a
    bigger cache.

  • From Scott Lurndal@21:1/5 to John Dallman on Sun Dec 10 19:50:10 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <ul4pvc$2r22$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:

    Yup. If anyone can find that Johnson tech report I'd like to read
    it. Some googlage only found references to it.

    The Computer History Museum has hardcopy:

    <https://www.computerhistory.org/collections/catalog/102773566>

    The UK's Centre for Computing History also appears to have a copy: <https://www.computinghistory.org.uk/det/12205/Bell-Computing-Science-Technical-Report-80-A-32-Bit-Processor-Design/> Since they're local to me,
    I've asked them if they can make me a copy.

    If anyone else wants to hunt, the reference is in: <https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf>

    The author was Stephen C Johnson, who worked at Bell Labs in the 1970s, largely on languages; he was responsible for YACC.

    He was also responsible for PCC, if I recall correctly.

  • From John Levine@21:1/5 to All on Sun Dec 10 21:17:58 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    It seems to me there were two threads to the RISC work. IBM designed
    the hardware and compiler together, ...
    that software could do it better, e.g., splitting instruction and data
    caches because they knew their compiler's code never modified
    instructions.

    Apparently they had no notion of JiT compilation and JiT caches.

    They certainly knew about JIT code since IBM sort programs have been
    generating comparison code since the 1960s if not longer. Back in the
    olden days, particularly on machines without index registers, you
    pretty much had to write code where one instruction would modify
    another to do address or length calculations.

    By the 1970s nobody did that, programs were loaded and didn't change
    once they were loaded. If you want to do JIT, write out the JIT code,
    then poke the cache to invalidate the area where you put the JIT code.
    It's the same thing it did when loading a program in the first place.
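
    [A minimal modern sketch of exactly that sequence, assuming x86-64 Linux
    and a GCC/Clang toolchain; the emitted bytes are just "mov eax, 42; ret".
    My illustration, not anything from the 801 papers:]

    /* Write out the generated code, then explicitly invalidate the cache range.
     * On x86 the invalidation is effectively a no-op; on machines with split
     * I and D caches it is the step being described above. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* mov eax, 42 ; ret */
        static const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        memcpy(buf, code, sizeof code);                              /* write the JIT code */
        __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);  /* poke the cache */

        int (*fn)(void) = (int (*)(void))buf;
        printf("%d\n", fn());
        return 0;
    }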

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup@21:1/5 to John Levine on Sun Dec 10 21:59:37 2023
    John Levine wrote:

    According to MitchAlsup <mitchalsup@aol.com>:
    It seems to me there were two threads to the RISC work. IBM designed
    the hardware and compiler together, ...
    that software could do it better, e.g., splitting instruction and data
    caches because they knew their compiler's code never modified
    instructions.

    Apparently they had no notion of JiT compilation and JiT caches.

    They certainly knew about JIT code since IBM sort programs have been generating comparison code since the 1960s if not longer. Back in the
    olden days, particularly on machines without index registers, you
    pretty much had to write code where one instruction would modify
    another to do address or length calculations.

    By the 1970s nobody did that, programs were loaded and didn't change
    once they were loaded. If you want to do JIT, write out the JIT code,
    then poke the cache to invalidate the area where you put the JIT code.
    It's the same thing it did when loading a program in the first place.


    My point was that the original statement was: they knew their compiler's
    code never modified instructions.

    Yet a JiT compiler HAS to modify instructions.

    I am not poking fun at 801 {for which I have great admiration.}
    I am poking fun at the inspecificity of the statement.

  • From Quadibloc@21:1/5 to Anton Ertl on Sun Dec 10 22:04:50 2023
    On Sat, 09 Dec 2023 17:27:02 +0000, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    However, Patterson also wrote a popular article about one of the early
    RISC processors with which he was connected for _Scientific American_,

    A short web search on that came up empty. Do you have a reference?

    I was thinking of

    D. A. Patterson, "Microprogramming," Scientific American, vol. 248, no. 3,
    pp. 36-43, March 1983.

    despite its unlikely title.

    there's CISC than which _anything_ else would be better (such as the world's most popular ISA, devised by a company that made MOS memories,
    and then branched out into making some of the world's first single-chip microprocessors)...

    AMD made MOS memories and some of the world's first single-chip microprocessors?

    No, but Intel did.

    I think of x86 as the architecture, and x86-64 as a feature, like MMX or AVX-512. Although the transition from the 8086 to the 80386 might well
    be considered as moving to a new architecture, x86-64 is, to me, just
    a feature added to the 386 architecture.

    John Savard

  • From John Levine@21:1/5 to All on Mon Dec 11 00:30:18 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    My point was that the original statement was: they knew their compiler's
    code never modified instructions.

    Yet a JiT compiler HAS to modify instructions.

    I am not poking fun at 801 {for which I have great admiration.}
    I am poking fun at the inspecificity of the statement.

    I was summarizing. You might want to read the paper.

    The S/360 architecture says that one instruction can modify the one
    that immediately follows it. This sort of thing was very common before
    there were index registers and still somewhat common in the 1960s. On
    S/360 the most likely example was a string move or compare, where the
    string length was in the instruction. It turned out that in practice
    nobody did that; if you wanted a variable-length move, the EXecute
    instruction let you fake it by or'ing a register into the length byte
    in the executed instruction, but even now zSeries lets you store into
    the instruction stream so there is some little-used logic to detect
    that.
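
    [A loose model of that EXecute trick, for readers who have never seen
    S/360 code - my own sketch in plain C, not real assembler. The only
    S/360 facts relied on are that the second byte of an SS instruction such
    as MVC holds the length, and that EX applies the OR to the instruction
    as fetched for execution, not to the copy in storage:]

    /* Sketch of the EX idiom: the length byte of an MVC is filled in at
     * execution time from a register, without the program modifying itself. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* MVC assembled with a length byte of zero; base/displacement bytes elided. */
        unsigned char target[6] = { 0xD2, 0x00, 0x00, 0x00, 0x00, 0x00 };
        unsigned int  reg       = 0x1F;          /* desired length-1 in the low byte */
        unsigned char subject[6];

        memcpy(subject, target, sizeof subject); /* EX executes a modified copy... */
        subject[1] |= reg & 0xFF;                /* ...with the register OR'ed into byte 1 */

        printf("length byte actually executed: 0x%02X\n", subject[1]);
        printf("length byte still in storage:  0x%02X\n", target[1]);
        return 0;
    }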

    The 801 people said that's silly, it's so rare to change instructions
    that we'll require the program to explicitly invalidate the cache when
    they do.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Anton Ertl@21:1/5 to Tim Rentsch on Mon Dec 11 08:54:24 2023
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    The PDP-8 certainly is simple, and it does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.

    What is that design philosophy supposed to be?

    The PDP-8 inspired the PDP-X and eventually the Nova. I have seen a
    claim that Nova led to RISC and PDP-11 to CISC, but when you look at
    Nova and its successors, including single-chip implementations, it's
    an accumulator architecture which consumed many cycles for each
    instruction, but it invested hardware in fast multiplication and
    division. The design seems to have further evolved in the direction
    of CISC, and we can read in "The Soul of a New Machine" about the
    headaches that the microarchitects of the Eclipse MV/8000 had dealing
    with virtual memory etc. in that architecture, and this architecture
    was replaced by the RISC 88000 architecture a decade later.

    Does anybody design architectures with properties like the PDP-8 or
    the Nova these days, or even in 1990? Not that I know of. By
    contrast, people are still designing architectures with some of the
    same properties that early RISCs have. Maybe the PDP-8 was designed
    with the same philosophy as the early RISCs, maybe not, but why would
    we call that philosophy RISC, rather than identifying the common
    properties of the architectures that were called RISCs by Patterson, the
    inventor of the term, and calling that RISC?

    Moreover, the major "philosophy" behind the PDP-8 is probably to make
    it as cheap as possible. For 20 years we have had the b16-small
    architecture that is cheaper to produce than any RISC; it takes
    0.16mm^2 in the XC035 process (a 350nm process <https://www.eetimes.com/x-fab-expands-mixed-signal-foundry-portfolio-with-0-35-micron-process/>),
    while an 8051 in the same process takes 1mm^2. According to ARM the
    Cortex-M0 takes 0.11mm^2 in 180ULL (presumably a 180nm process), so
    probably around 0.44mm^2 in a 350nm process.

    Does that make the b16-small a RISC and ARM a non-RISC? Not as far as
    I am concerned. It does not share enough characteristics with the architectures that have been called RISC.

    One interesting aspect is that b16-small can run at ~150MHz (and
    b16-dsp at ~200MHz) when manufactured in that process and connected to
    a memory subsystem that actually supports this speed, without being
    pipelined. Compare this to other CPUs manufactured in 350nm
    processes: the 5-stage P54CS (Pentium), which was sold at up to
    200MHz, and the 10-stage Klamath (Pentium II), which runs at up to
    300MHz, and the EV56 (Alpha 21164a), which runs at up to 666MHz.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to John Levine on Mon Dec 11 18:37:40 2023
    On Mon, 11 Dec 2023 00:30:18 +0000, John Levine wrote:

    The S/360 architecture says that one instruction can modify the one that immediately follows it. This sort of thing was very common before there
    were index registers and still somewhat common in the 1960s.

    The lower-end models of the System/360, of course, didn't have any microarchitectural features that would make this a problem. For that
    matter, the original top end of the series, the Model 75, didn't
    either.

    So, since it was so common back then for computers to allow this -
    even if they had registers, and so didn't need it - it's entirely
    possible that the architecture just had this property by default,
    and its possible usefulness with string instructions was not
    the actual reason.

    People might have just felt this was the normal, expected behavior
    of a computer that wasn't staggeringly difficult to understand
    because it had bizarre restrictions due to being designed for
    some niche application, or high speed beyond what the technology
    was really ready for.

    Today, now that pipelining is standard, the rules have changed.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Dec 11 19:31:36 2023
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    What is that design philosophy supposed to be?

    We don't have to guess, because DEC published an entire book about the
    way they built their computers. The PDP-5 was originally a front end
    system for a nuclear reactor in Canada, 12 bits both because the
    analog values it was handling needed that much precision, and also
    because they used ideas from the LINC lab computer. The PDP-5 is
    recognizable as a cut down PDP-4 which was in turn a cheaper redesign
    of the PDP-1 which was largely based on the MIT TX-0 computer built to
    test core memories in the 1950s. They all had word addresses and a
    single register, not surprising since that's what all scientific
    computers of the era had.

    The PDP-8 reimplemented the PDP-5 using newer components and packaging
    so was a lot smaller and cheaper. The book says it was important that
    it was the first computer small enough to sit on a lab bench, or in a
    rack leaving room for other equipment.

    The PDP-8 inspired the PDP-X and eventually the Nova. I have seen a
    claim that Nova led to RISC and PDP-11 to CISC, ...

    The PDP-11 certainly led to VAX, and the Berkeley RISC was explicitly
    a response to the largely unused complexity of the VAX, much as the
    801 was to S/360.

    when you look at
    Nova and its successors, including single-chip implementations, it's
    an accumulator architecture which consumed many cycles for each
    instruction, but it invested hardware in fast multiplication and
    division. The design seems to have further evolved in the direction
    of CISC, and we can read in "The Soul of a New Machine" about the
    headaches that the microarchitects of the Eclipse MV/8000 had dealing
    with virtual memory etc. in that architecture, and this architecture
    was replaced by the RISC 88000 architecture a decade later.

    Right. The Nova made excellent use of the hardware available at the
    time it was designed, e.g., some of the instruction bits went straight
    into the new ALU chips it used. But as usually happens, some of those
    decisions caused a great deal of pain later, particularly when they
    made the decision to shoehorn the Eclipse into the holes in the Nova's instruction set rather than adding a separate mode for larger addresses, as on
    the 386, zSeries, and ARM.

    Moreover, the major "philosophy" behind the PDP-8 is probably to make
    it as cheap as possible.

    Actually, it was to build the best computer they could for the target
    price, although those often end up around the same place.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Mon Dec 11 12:23:51 2023
    On 12/9/2023 10:11 AM, MitchAlsup wrote:
    Anton Ertl wrote:



    I don't think trying to redefine "RISC" to mean something different
    from its original purpose helps.

    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    RISC was defined before CISC was coined as its contrapositive.

    I agree, but this illustrates some of the semantic confusion we are
    having regarding defining RISC. In normal speech, the opposite of
    "reduced" is "increased", and the opposite of "complex" is "simple". So
    RISC and CISC are not really on the opposite end of a scale, but are on different scales!

    If we substitute "simple" for "reduced", a lot of nice things sort of
    fall out. Some examples

    Requiring a single instruction length simplifies decoding, as does no "dependent" code where you can't decode a later part of the instruction
    till you decode something in an earlier part.

    Requiring all instructions to be single cycle simplifies the pipeline
    design. I think this is related to having no mem-op instructions.

    Requiring no more than one memory reference (and relatedly prohibiting non-aligned memory accesses) simplifies some of the internal address-generation (agen) logic.

    etc.

    Of course, as time went on, both the number of gates on a chip and our understanding of how to do things more simply increased. So we were
    able to add more "complexity" to the design while keeping with the
    "spirit" of simplicity. So we got multi-cycle instructions in the CPU,
    not a co-processor, and non-aligned memory accesses, etc.

    Under this view, the number of instructions is not the key defining
    factor, but sort of a side effect of making the design "simple".

    So if they had used "simple" instead of "reduced" a lot of confusion
    would have been prevented. :-)



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Stephen Fuld on Tue Dec 12 00:38:53 2023
    On Mon, 11 Dec 2023 12:23:51 -0800
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 12/9/2023 10:11 AM, MitchAlsup wrote:
    Anton Ertl wrote:



    I don't think trying to redefine "RISC" to mean something
    different from its original purpose helps.

    That would mean that "RISC" has an original definition; what is
    it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    RISC was defined before CISC was coined as its contrapositive.

    I agree, but this illustrates some of the semantic confusion we are
    having regarding defining RISC. In normal speech, the opposite of
    "reduced" is "increased",

    In the context of a Reduced Instruction Set the opposite of "reduced" is
    "full" or "complete". IMHO.

    and the opposite of "complex" is "simple".
    So RISC and CISC are not really on the opposite end of a scale, but
    are on different scales!

    If we substitute "simple" for "reduced", a lot of nice things sort of
    fall out. Some examples

    Requiring a single instruction length simplifies decoding, as does no "dependent" code where you can't decode a later part of the
    instruction till you decode something in an earlier part.

    Requiring all instructions to be single cycle simplifies the pipeline
    design. I think this is related to having no mem-op instructions.

    Requiring no more than one memory reference (and relatedly
    prohibiting non-aligned memory accesses) simplifies some of the internal
    address-generation (agen) logic.

    etc.

    Of course, as time went on, both the number of gates on a chip and
    our understanding of how to do things more simply increased. So we
    were able to add more "complexity" to the design while keeping with
    the "spirit" of simplicity. So we got multi-cycle instructions in
    the CPU, not a co-processor, and non-aligned memory accesses, etc.

    Under this view, the number of instructions is not the key defining
    factor, but sort of a side effect of making the design "simple".

    So if they had used "simple" instead of "reduced" a lot of confusion
    would have been prevented. :-)




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Michael S on Mon Dec 11 16:12:24 2023
    On 12/11/2023 2:38 PM, Michael S wrote:
    On Mon, 11 Dec 2023 12:23:51 -0800
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 12/9/2023 10:11 AM, MitchAlsup wrote:
    Anton Ertl wrote:



    I don't think trying to redefine "RISC" to mean something
    different from its original purpose helps.

    That would mean that "RISC" has an original definition; what is
    it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    RISC was defined before CISC was coined as its contrapositive.

    I agree, but this illustrates some of the semantic confusion we are
    having regarding defining RISC. In normal speech, the opposite of
    "reduced" is "increased",

    In the context of a Reduced Instruction Set the opposite of "reduced" is
    "full" or "complete". IMHO.

    OK, but what does "full" or "complete" mean? There are always instructions/functionality that could be added, so in that sense, no instruction set is full or complete.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Tue Dec 12 06:57:48 2023
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/11/2023 2:38 PM, Michael S wrote:
    In the context of a Reduced Instruction Set the opposite of "reduced" is
    "full" or "complete". IMHO.

    OK, but what does "full" or "complete" mean?

    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    There are always
    instructions/functionality that could be added, so in that sense, no instruction set is full or complete.

    If you start with a certain instruction set, and then leave things
    away, the result is reduced, while the starting point is complete.

    But of course RISC became a standalone term, and the Acorn RISC
    Machine was not designed as a reduced version of some instruction set,
    and looking at the 32 registers and register windows of Berkeley RISC,
    it was not a reduced VAX, either.

    So maybe SISC (simple-instruction SC) might have been more accurate,
    but RISC was probably a more marketable acronym.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to John Levine on Tue Dec 12 11:00:56 2023
    John Levine wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which had
    been a long running design philosophy in the hardware industry.

    https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond "not a Vax or S/360".

    R's,
    John

    The first paper on MIPS is

    MIPS: A VLSI Processor Architecture
    J Hennessy, N Jouppi, F Baskett, J Gill, 1981
    http://ai.eecs.umich.edu/people/conway/VLSI/ClassicDesigns/MIPS/MIPS.CMU81.pdf

    says at the start that "The basic philosophy of MIPS is to present an instruction set that is a compile-driven encoding of the microengine.
    Thus little or no decoding is needed and the instructions correspond
    closely to microcode instructions".

    The original Stanford MIPS had only 16 32-bit registers and had
    no byte or word instructions. The first commercial version, the R2000,
    had 32 registers and added byte and word instructions because
    of all the software porting problems they encountered.

    An interesting thing for both MIPS and RISC-I was the amount of design
    time they took. If you subtract out all the time they spent developing
    their own software tools, it looks like the chip design would have been
    about just 6 months for 2 persons. Compared to the many hundreds of person-years for a VAX, that got a lot of people's attention.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Dec 12 15:33:03 2023
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Looking at the genesis of the RISCs, full means the S/360 and S/370 >instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they started
    from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which had
    been a long running design philosophy in the hardware industry.

    https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond "not a Vax or S/360".

    R's,
    John
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Tue Dec 12 11:20:59 2023
    In case anyone is interested there are some
    DEC internal company confidential memos that
    show they were thinking about this internally too.

    VOR: VAX on a RISC, 1984
    http://bwlampson.site/35a-VOR/35a-VORAbstract.html
    http://bwlampson.site/35a-VOR/35a-VOR.pdf

    Ideas for a simple fast VAX, 1985
    http://bwlampson.site/35b-IdeasFastVAX/35b-IdeasFastVAXAbstract.html
    http://bwlampson.site/35b-IdeasFastVAX/35b-IdeasFastVAX.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Tue Dec 12 16:51:52 2023
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    What is that design philosophy supposed to be?

    We don't have to guess, because DEC published an entire book about the
    way they built their computers. The PDP-5 was originally a front end
    system for a nuclear reactor in Canada, 12 bits both because the
    analog values it was handling needed that much precision, and also
    because they used ideas from the LINC lab computer. The PDP-5 is
    recognizable as a cut down PDP-4 which was in turn a cheaper redesign
    of the PDP-1 which was largely based on the MIT TX-0 computer built to
    test core memories in the 1950s. They all had word addresses and a
    single register, not surprising since that's what all scientific
    computers of the era had.

    The PDP-8 reimplemented the PDP-5 using newer components and packaging
    so was a lot smaller and cheaper. The book says it was important that
    it was the first computer small enough to sit on a lab bench, or in a
    rack leaving room for other equipment.

    The PDP-8 inspired the PDP-X and eventually the Nova. I have seen a
    claim that Nova led to RISC and PDP-11 to CISC, ...

    So we have a long line of ancestry:

    TX-0, PDP-4, PDP-5, PDP-8, PDP-X, Nova, Eclipse (16-bit), Eclipse
    MV/8000 (32-bit)

    So no, Nova did not lead to RISC, it led to the Eclipse MV/8000, a
    CISC. And if the PDP-8 had the same design philosophy as RISCs, Ed de
    Castro (chief engineer of the PDP-8) and all the people that went with
    him and founded Data General forgot about it when they were working at
    Data General.

    Moreover, the major "philosophy" behind the PDP-8 is probably to make
    it as cheap as possible.

    Actually, it was to build the best computer they could for the target
    price, although those often end up around the same place.

    Either one (1. Satisfy the requirements at the lowest cost; 2. Build
    the best thing for a given price point) is a general engineering
    principle. The VAX architects were certainly convinced that they were
    designing the best architecture for the target price, too. And VAX is
    obviously not a RISC, so there is more to RISC than that philosophy.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Tue Dec 12 20:48:32 2023
    Anton Ertl wrote:


    Either one (1. Satisfy the requirements at the lowest cost; 2. Build
    the best thing for a given price point) is a general engineering
    principle. The VAX architects were certainly convinced that they were
    designing the best architecture for the target price, too. And VAX is
    obviously not a RISC, so there is more to RISC than that philosophy.

    In the days when store was more expensive than gates, VAX makes a lot
    of sense--this era corresponds to the 10-cycles per instruction going
    down towards 4-cycles per instruction. This era could not be extended
    into the 1-instruction-per-cycle realm with a VAX-like ISA.


    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Dec 12 21:12:15 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    Anton Ertl wrote:

    Either one (1. Satisfy the requirements at the lowest cost; 2. Build
    the best thing for a given price point) is a general engineering
    principle. The VAX architects were certainly convinced that they were
    designing the best architecture for the target price, too. And VAX is
    obviously not a RISC, so there is more to RISC than that philosophy.

    In the days when store was more expensive than gates, VAX makes a lot
    of sense--this era corresponds to the 10-cycles per instruction going
    down towards 4-cycles per instruction. This era could not be extended
    into the 1-instruction-per-cycle realm with a VAX-like ISA.

    It also made sense in the era when microcode ROM was faster than main
    memory RAM. Unfortunately, by the time the Vax came out, that era was
    over.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to paaronclayton@gmail.com on Thu Dec 14 01:52:30 2023
    On Tue, 12 Dec 2023 18:27:50 -0500, "Paul A. Clayton"
    <paaronclayton@gmail.com> wrote:

    On 12/10/23 11:45 AM, John Levine wrote:
    According to MitchAlsup <mitchalsup@aol.com>:
    :
    Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
    the results to propose a better machine, and then repeating the cycle over a dozen times.
    Though the initial intent was not specifically to come up with a simple design, the result
    was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
    VAX [Johnson79].

    Yup. If anyone can find that Johnson tech report I'd like to read it.
    Some googlage only found references to it.

    It looks like one can get a PDF from Semantic Scholar:
    https://www.semanticscholar.org/paper/A-32-bit-processor-design-Johnson/5ef2b3e8a755a2c29833eba8ab61117c296d95ac

    I have a PDF on my computer that I can email to anyone interested (paaronclayton is my gmail address).

    Seems the server that hosted that paper is no longer operating.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Mon Jan 1 12:15:28 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    mitchalsup@aol.com (MitchAlsup) writes:

    PDP-8 certainly is simple and does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.

    What is that design philosophy supposed to be?

    Mitch quoted it in an earlier posting, and it may be summarized as
    "simple instructions, simple architecture." The PDP-8 is
    consistent with that description, which is all that matters.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to John Levine on Mon Jan 1 12:20:23 2024
    John Levine <johnl@taugh.com> writes:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they
    started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which
    had been a long running design philosophy in the hardware
    industry.

    https://cs.stanford.edu/people/eroberts/
    courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond
    "not a Vax or S/360".

    Surely people don't view the Itanium as being a RISC. And what
    about the Mill? Is that a RISC or not?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Jan 2 01:05:38 2024
    According to Tim Rentsch <tr.17687@z991.linuxsc.com>:
    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    That it was designed before is irrelevant. All that matters is
    that the end result is consistent with that philosophy.

    I dunno, indirect addressing and those auto-index locations 10
    to 17 don't seem so RISCful. Nor does having only one register you
    have to use for everything.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Tim Rentsch on Tue Jan 2 10:42:32 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    Surely people don't view the Itanium as being a RISC.

    What makes you think so?

    It has a lot of RISC characteristics, in particular, it's a load/store architecture with a large number of general-purpose registers (and for
    the other registers, there are also many of them, avoiding the
    register allocation problems that compilers tend to have with unique registers). In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    And what
    about the Mill?

    The Mill is not even a paper design (I have yet to see a paper about
    it), so how would I know?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to MitchAlsup on Tue Jan 2 16:13:21 2024
    On Mon, 1 Jan 2024 20:28:29 +0000
    mitchalsup@aol.com (MitchAlsup) wrote:

    Tim Rentsch wrote:

    John Levine <johnl@taugh.com> writes:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    Looking at the genesis of the RISCs, full means the S/360 and
    S/370 instruction sets for the 801 project, and VAX for the
    Berkeley RISC project. Not sure what full means for Stanford
    MIPS.

    This web page suggests it was more from the other direction, they
    started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which
    had been a long running design philosophy in the hardware
    industry.

    https://cs.stanford.edu/people/eroberts/
    courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond
    "not a Vax or S/360".

    Surely people don't view the Itanium as being a RISC. And what
    about the Mill? Is that a RISC or not?

    Itanium is VLIW


    No, it isn't.
    Itanium is RISC with a few VLIW-inspired additions.
    Semantics are fully defined on the level of individual verbs rather
    than at the level of bundles or groups.

    Mill is Belted
    Both are dependent on compiler to perform code scheduling.

    In Itanium you can add ';' between verbs and, for a "defined" program, the
    result would be the same as without. Scheduling is needed for
    performance, but not for correctness, the same as on any wide in-order RISC.
    It is [Berkeley-style] RISC with a funny instruction encoding.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Tue Jan 2 14:19:25 2024
    On Tue, 02 Jan 2024 10:42:32 +0000, Anton Ertl wrote:
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Surely people don't view the Itanium as being a RISC.

    What makes you think so?

    It has a lot of RISC characteristics, in particular, it's a load/store architecture with a large number of general-purpose registers (and for
    the other registers, there are also many of them, avoiding the register allocation problems that compilers tend to have with unique registers).
    In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    The TMS320C6000 has individual instructions that are indeed very similar
    to those of a RISC machine. But because they're grouped in blocks of
    eight instructions, with a bit in each instruction to indicate whether
    or not a given instruction can execute in parallel with those that precede
    it, it is classed as a VLIW architecture.

    Intel didn't use the term VLIW in referring to the Itanium. I guess they
    didn't think that 128 bits (unlike 256 bits) was "very" long.

    But that's basically what the Itanium was, even if it shared a lot of characteristics with RISC. Three instructions were grouped into a 128-bit block; possible parallelism between them was indicated explicitly, and
    each of the three instructions even had a different format from the two
    others.
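    For reference, here is a minimal C sketch of pulling such a block apart.
    It assumes the published IA-64 bundle layout (a 5-bit template field in
    the low-order bits followed by three 41-bit instruction slots); the
    struct and helper names are mine, and the per-template meaning of the
    slots is ignored.

    #include <stdint.h>

    typedef struct {
        uint64_t lo, hi;             /* 128-bit bundle as two 64-bit halves */
    } bundle_t;

    /* 5-bit template field in the low-order bits of the bundle. */
    static uint8_t bundle_template(bundle_t b)
    {
        return (uint8_t)(b.lo & 0x1F);
    }

    /* Slot n (n = 0, 1, 2) occupies bits [5 + 41*n, 5 + 41*n + 40]. */
    static uint64_t bundle_slot(bundle_t b, int n)
    {
        const uint64_t mask = (1ULL << 41) - 1;
        int start = 5 + 41 * n;
        if (start + 41 <= 64)
            return (b.lo >> start) & mask;           /* slot 0: entirely in lo */
        if (start >= 64)
            return (b.hi >> (start - 64)) & mask;    /* slot 2: entirely in hi */
        /* slot 1 straddles the 64-bit boundary */
        return ((b.lo >> start) | (b.hi << (64 - start))) & mask;
    }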

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Quadibloc on Tue Jan 2 16:43:45 2024
    On Tue, 2 Jan 2024 14:19:25 -0000 (UTC)
    Quadibloc <quadibloc@servername.invalid> wrote:

    On Tue, 02 Jan 2024 10:42:32 +0000, Anton Ertl wrote:
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Surely people don't view the Itanium as being a RISC.

    What makes you think so?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers tend
    to have with unique registers). In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    The TMS320C6000 has individual instructions that are indeed very
    similar to those of a RISC machine. But because they're grouped in
    blocks of eight instructions, with a bit in each instruction to
    indicate whether or not a given instruction can execute in parallel
    with those that precede it, it is classed as a VLIW architecture.

    Intel didn't use the term VLIW in referring to the Itanium. I guess
    they didn't think that 128 bits (unlike 256 bits) was "very" long.

    But that's basically what the Itanium was, even if it shared a lot of characteristics with RISC. Three instructions were grouped into a
    128-bit block; possible parallelism between them was indicated
    explicitly, and each of the three instructions even had a different
    format from the two others.

    John Savard

    On the TI C6000 (or on the ADI TigerSharc, another VLIW that people actually
    used to do real work and not just to write papers about and to con VCs
    into investments) the pipeline is exposed to the programmer, i.e. visible
    through the results of execution and not just through the timing of execution.
    In Itanium, at least for legal well-formed programs, the pipeline is not
    exposed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Tim Rentsch on Tue Jan 2 16:59:53 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    John Levine <johnl@taugh.com> writes:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they
    started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which
    had been a long running design philosophy in the hardware
    industry.

    https://cs.stanford.edu/people/eroberts/
    courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond
    "not a Vax or S/360".

    Surely people don't view the Itanium as being a RISC.

    It was an Epic Risk.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Scott Lurndal on Tue Jan 2 11:29:13 2024
    scott@slp53.sl.home (Scott Lurndal) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    John Levine <johnl@taugh.com> writes:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they
    started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which
    had been a long running design philosophy in the hardware
    industry.

    https://cs.stanford.edu/people/eroberts/
    courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond
    "not a Vax or S/360".

    Surely people don't view the Itanium as being a RISC.

    It was an Epic Risk.

    Okay that made me laugh. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Anton Ertl on Tue Jan 2 19:55:51 2024
    On Tue, 02 Jan 2024 10:42:32 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    The Mill is not even a paper design (I have yet to see a paper about
    it), so how would I know?

    - anton

    List of patents:
    https://millcomputing.com/patents/


    The Mill has a fixed block level design, but details of the blocks are
    variable and/or customizable and may be different across instances.
    They don't really have model lines as we normally think of them -
    instead they have instances which can be one-offs or reproduced in
    bulk, as desired.

    They have settled on three demonstration instances - which they call
    "Gold", "Silver", and "Bronze" - whose purpose is to show how
    performance varies with the details.

    Ivan has said that inside information might be had with an NDA. If
    you really are interested you could ask them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to George Neuner on Wed Jan 3 07:20:23 2024
    George Neuner <gneuner2@comcast.net> writes:
    On Tue, 02 Jan 2024 10:42:32 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    The Mill is not even a paper design (I have yet to see a paper about
    it), so how would I know?

    - anton

    List of patents:
    https://millcomputing.com/patents/

    Patents are not written for comprehension, and this page acknowledges
    that with sentences like

    |Split-stream encoding is described here in a way that is more
    |accessible than the patent text.

    This particular link actually points to a white paper, the few other
    such links only point to videos.

    What I had in mind is an overview paper of the architecture, or maybe
    an architecture manual (if it is a short one like the RISC-V one, not
    a reference-only monstrosity like the ones produced by Intel and ARM).

    Ivan has said that inside information might be had with an NDA. If
    you really are interested you could ask them.

    Not that interested, but when somebody asks "What about the Mill?",
    the answer I give is what you cited above.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to John Levine on Thu Jan 4 04:01:56 2024
    John Levine <johnl@taugh.com> writes:

    According to Tim Rentsch <tr.17687@z991.linuxsc.com>:

    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    That it was designed before is irrelevant. All that matters is
    that the end result is consistent with that philosophy.

    I dunno, indirect addressing and those auto-index locations 10
    to 17 don't seem so RISCful. Nor does having only one register you
    have to use for everything.

    You bring up some relevant points. Let me try to explain my
    perspective.

    First, I don't think having only one accumulator automatically
    disqualifies an architecture from being a RISC. Even though
    the architectures originally designed under the name "RISC"
    tended to share certain properties, the RISC concept was never
    meant to be tied to a specific technology; it's just that the
    technology of the time tended to favor certain properties, and
    people tended to glom on to those properties as defining RISC.
    A RISC of today would never have been designed in the 1980s.
    Similarly a RISC of the 1980s would not have been designed in
    the 1960s, when the PDP-8 was. Having X number of registers
    is RISC dogma, not RISC principle.

    Furthermore I view the PDP-8 not as having one register but as
    having 128 + 1 registers, with only one of those registers being
    fully capable of arithmetic. The non-accumulator registers offer
    limited arithmetic (increment and conditional skip) and a limited
    form of indexing (which if I am not mistaken was not available
    via the accumulator). The question of indexing brings us to
    indirect addressing.

    The PDP-8 does not have register-based indexing. Instead a
    rudimentary indexing capability is available by using indirect
    addressing: compute an address, store it in one of the page-0
    registers, and indirect through the address so formed. (In case
    this wasn't clear, page-0 memory may be thought of as "registers"
    partly because they are available from any instruction no matter
    where it is located.) Providing indirect addressing rather than
    general indexing simplifies and lightens the architecture.

    Indirect addressing provides another capability unrelated to
    indexing. Addresses in PDP-8 instructions don't allow access to
    the entire memory space. To be able to access any word in
    memory we need more bits than a single instruction has. Rather
    than a complicated scheme for two-word instructions, indirect
    addressing can be used to put those extra bits "somewhere else",
    conveniently and strategically located on the same page as the
    instruction. That makes for a simpler way of providing full
    memory access while maintaining a single fixed-length (and short)
    instruction format. Furthermore using indirect addressing to
    provide full memory access means that those addresses can be reused
    by several instructions on the same page, without having to pay
    the full price for additional uses.

    On the one remaining item - auto-increment for locations 10 to 17
    when used via indirect addressing, do I have that right? - I admit
    it is something of an architectural oddity. On the other hand it
    looks like it's pretty cheap to implement, and pragmatically very
    useful. That brings us to a key aspect of my comments. Most of
    my understanding of RISC comes almost entirely from reading the
    early writing that came out of the RISC group (which I did about
    the same time it came out, so 40 years ago give or take). My
    main takeaway was this: put in only those functionalities that
    carry their weight. Or, put more simply, No frills. (A good
    example of a frill is the evaluate polynomial instruction on
    the VAX.) This view explains why I would call the PDP-8 a
    RISC. The items you mention all carry their weight; none of
    them is a frill. Conversely, adding one or more additional
    accumulators, or providing more general indexing, would have
    added substantially to the architectural cost (and of course also
    the monetary cost). So the decision to provide only a single
    accumulator - in conjunction with other parts of the architecture -
    is a good fit to what I see as the essence of RISC, as explained
    by the people who first described and defined the term (not the
    concept necessarily, but the name RISC to refer to it).
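    Concretely, the addressing scheme described above can be modelled in a
    few lines of C. This is a sketch only, assuming the conventional PDP-8
    memory-reference encoding (3-bit opcode, indirect bit, current-page bit,
    7-bit offset) and the auto-index behaviour of locations 010-017; it is
    not a reference implementation.

    #include <stdint.h>

    #define MEM_WORDS 4096u                  /* 12-bit word-addressed memory */
    static uint16_t mem[MEM_WORDS];          /* each word holds 12 bits */

    static uint16_t effective_address(uint16_t pc, uint16_t instr)
    {
        uint16_t offset   = instr & 0177;        /* 7-bit page offset */
        int      cur_page = (instr >> 7) & 1;    /* 0 = page zero, 1 = current page */
        int      indirect = (instr >> 8) & 1;
        uint16_t addr     = (uint16_t)((cur_page ? (pc & 07600) : 0) | offset);

        if (indirect) {
            /* Auto-index: indirecting through 010..017 bumps the pointer first. */
            if (addr >= 010 && addr <= 017)
                mem[addr] = (uint16_t)((mem[addr] + 1) & 07777);
            addr = (uint16_t)(mem[addr] & 07777);
        }
        return addr;              /* where the operand is fetched or stored */
    }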

    As always, my aim is just to explain my point of view, not to
    convince anyone. I don't mind if people are persuaded, but
    my intention here has not been to persuade, just to explain.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Thu Jan 4 04:18:50 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Surely people don't view the Itanium as being a RISC.

    What makes you think so?

    Perhaps I should have written that with a question mark:
    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Tim Rentsch on Thu Jan 4 13:04:13 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT
    instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, and indeed, it's implementations implemented
    the instructions without microcode (AFAIK) and typically with a
    single-cycle issue rate per functional unit. What makes you think it
    is not a RISC?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to BGB on Fri Jan 5 18:25:00 2024
    In article <un9alb$74e8$1@dont-email.me>, cr88192@gmail.com (BGB) wrote:

    IA64 had 3 instructions per 128-bit block, with some bits
    indicating how to process the other instructions. Typically, the
    instructions in the block would execute in parallel rather than
    serially (so it would take a big hit in code density if the code lacked sufficient ILP, as many of these spots would hold NOPs).

    Another code density hit was the common use of two instructions for
    functions where more conventional ISAs would use one. Advance load and
    check load, square root in steps rather than single instructions, things
    like that.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Sat Jan 6 09:30:30 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, [...]

    Begging the question.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Tim Rentsch on Sat Jan 6 18:01:10 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT
    instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, [...]

    Begging the question.

    Ok, let's make this more precise:

    |And I don't think so, either. That's because among the features that
    |the IBM 801 left away are instructions that access memory, but are not
    |just a load or a store, such as the EDIT instruction. OTOH, none of
    |the special features of IA-64 add instructions that access memory, but
    |are not just a load or a store, [...]

    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Sat Jan 6 14:13:45 2024
    Anton Ertl wrote:
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:
    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.
    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT
    instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, [...]
    Begging the question.

    Ok, let's make this more precise:

    |And I don't think so, either. That's because among the features that
    |the IBM 801 left away are instructions that access memory, but are not
    |just a load or a store, such as the EDIT instruction. OTOH, none of
    |the special features of IA-64 add instructions that access memory, but
    |are not just a load or a store, [...]

    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton

    I recall reading that the original HPPA left off MUL as it would take
    multiple clocks, violating their principle of 1 clock per instruction
    (and multi-cycle floating point was handled by a coprocessor).

    Hewlett-Packard Precision Architecture, Aug 1986 https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf

    "Single-Cycle Execution
    A primary design goal was that all functional computations in the basic instruction set could execute in one machine cycle in a pipelined implementation of the processor architecture.
    ...
    Complex operations that are necessary to support required software
    functions but cannot be implemented in a single execution cycle are
    broken down into primitive operations, each of which can be executed
    in a single cycle."

    They believed that, since most MULs were by smallish values,
    shift-add would be just as good. They found out they were wrong and
    added MUL back in.
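    The kind of replacement they had in mind looks roughly like the
    following C sketch (illustrative only): a multiply by a small constant
    becomes a couple of shifts and adds, and a general multiply falls back
    to a shift-and-add loop in a software multiply routine.

    #include <stdint.h>

    /* x * 10 == (x << 3) + (x << 1): the sort of thing a compiler emits
       inline for a constant multiplier. */
    static uint32_t mul10(uint32_t x)
    {
        return (x << 3) + (x << 1);
    }

    /* Generic shift-and-add multiply: one add per set bit of b. */
    static uint32_t mul_shift_add(uint32_t a, uint32_t b)
    {
        uint32_t result = 0;
        while (b) {
            if (b & 1)
                result += a;
            a <<= 1;
            b >>= 1;
        }
        return result;
    }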

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sun Jan 7 17:29:55 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton

    I recall reading that the original HPPA left off MUL as it would take multiple clocks, violating their principle of 1 clock per instruction
    (and multi-cycle floating point was handled by a coprocessor).

    Hewlett-Packard Precision Architecture, Aug 1986 https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf
    ...
    They believed that, since most MULs were by smallish values,
    shift-add would be just as good. They found out they were wrong and
    added MUL back in.

    On page 18 I find what I meant:

    |The architected assist instruction extensions include integer
    |multiply and divide functions for applications requiring higher
    |frequencies of multiplication and division.

    IIRC early HPPA implementations implemented these instructions by
    transferring the integer data to the FPU, using the multiplier or
    divider there, and transferring the result back to an integer
    register. At least I remember reading one paper that described it
    this way.

    It seems to me that the integer instruction set was first developed
    without considering the existence of an FPU, and then once they
    considered the FPU, they added the assist instruction extensions
    mentioned above. I wonder if there were any HPPA implementations that
    did not have these assist instruction extensions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Sun Jan 7 13:14:54 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton
    I recall reading that the original HPPA left off MUL as it would take
    multiple clocks, violating their principle of 1 clock per instruction
    (and multi-cycle floating point was handled by a coprocessor).

    Hewlett-Packard Precision Architecture, Aug 1986
    https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf
    ....
    They believed that, since most MULs were by smallish values,
    shift-add would be just as good. They found out they were wrong and
    added MUL back in.

    On page 18 I find what I meant:

    |The architected assist instruction extensions include integer
    |multiply and divide functions for applications requiring higher
    |frequencies of multiplication and division.

    IIRC early HPPA implementations implemented these instructions by
    transferring the integer data to the FPU, using the multiplier or
    divider there, and transferring the result back to an integer
    register. At least I remember reading one paper that described it
    this way.

    It seems to me that the integer instruction set was first developed
    without considering the existence of an FPU, and then once they
    considered the FPU, they added the assist instruction extensions
    mentioned above. I wonder if there were any HPPA implementations that
    did not have these assist instruction extensions.

    - anton

    It seems the first HW version had the assist instruction.
    The cover story is for the first implementation of HP-PA.
    The processor is used in both the HP 9000 Model 840 and HP 3000 Series 930.

    Hardware Design of the First HP Precision Architecture Computers, Mar-1987 https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1987-03.pdf

    "The HP 9000 Model 840 and the HP 3000 Series 930 are both based on the
    same processor, memory system, and I/O system. The processor consists of
    five printed circuit boards, each 8.4 by 11.3 inches, containing
    off-the-shelf TTL logic. It uses FAST TTL, 25-ns and 35-ns static RAMs,
    and 25-ns and 35-ns PALs. These five boards include the processor pipeline, which fetches and executes an instruction every 125 ns, a 4096-entry translation lookaside buffer (TLB) for high-speed address translation,
    and 128K bytes of cache memory. An additional (sixth) board contains the hardware floating-point coprocessor. Each board contains about 150 ICs."
    ...
    Execution Unit
    The E-unit (execution unit) performs arithmetic calculations on the
    operands. It executes the arithmetic instructions and creates the
    addresses for load and store instructions. It contains a 32-bit ALU (arithmetic logic unit) for arithmetic and logical calculations,
    a barrel shifter for shift instructions, and complex mask/merge circuitry
    for extracting and depositing bit strings. It also contains a preshifter
    on one input to the ALU. This is used in address calculations and for
    special instructions used in software multiply routines
    (the Model 840/Series 930 does not execute multiply instructions
    directly in hardware.)"
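
    As an aside, here is a minimal C sketch (illustrative only, not code from
    the article) of what such a software multiply routine boils down to for a
    small constant multiplier; the preshifter on the ALU input exists to make
    exactly this kind of step cheap:

    #include <stdint.h>

    /* Multiply by the constant 10 using only shifts and adds -- the style of
       sequence a software multiply routine would reduce to for small constant
       multipliers (illustrative sketch, not code from the article). */
    static uint32_t mul10(uint32_t x)
    {
        return (x << 3) + (x << 1);     /* 8*x + 2*x == 10*x */
    }

    A variable multiplier needs a loop of such conditional shift-and-add
    steps, which is the case where dropping MUL turned out to hurt.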

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Jan 8 11:49:29 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    IA-64 followed early RISC practice in leaving out integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton

    I recall reading that the original HPPA left off MUL as it would take
    multiple clocks, violating their principle of 1 clock per instruction
    (and multi-cycle floating point was handled by a coprocessor).

    Hewlett-Packard Precision Architecture, Aug 1986
    https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf
    ...
    They believed that since most MULs were by smallish values that
    shift-add would be just as good. They found out they were wrong and
    added MUL back in.

    On page 18 I find what I meant:

    |The architected assist instruction extensions include integer
    |multiply and divide functions for applications requiring higher
    |frequencies of multiplication and division.

    IIRC early HPPA implementations implemented these instructions by
    transferring the integer data to the FPU, using the multiplier or
    divider there, and transferring the result back to an integer
    register. At least I remember reading one paper that described it
    this way.

    It seems to me that the integer instruction set was first developed
    without considering the existence of an FPU, and then once they
    considered the FPU, they added the assist instruction extensions
    mentioned above. I wonder if there were any HPPA implementations that
    did not have these assist instruction extensions.

    - anton

    The Pentium (or the P6?) implemented MUL using the x87 multiplier, so
    with the added transport to and from the FPU part, it took about 10
    clock cycles. :-(

    DIV was (thankfully!) separate from FDIV, so it was slow (about 40 clock
    cycles for a 32-bit divisor running a bit/iteration sequencer) but correct.
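
    For reference, such a bit/iteration sequencer is essentially restoring
    division, one quotient bit per clock. A rough C sketch of the idea (my
    illustration, not Intel's actual hardware), using the x86 convention of a
    64-bit EDX:EAX dividend and a 32-bit divisor:

    #include <stdint.h>

    /* Restoring division, one quotient bit per iteration; 32 iterations for
       a 32-bit divisor, which is roughly where the ~40-cycle latency comes
       from. */
    static int div64by32(uint64_t dividend, uint32_t divisor,
                         uint32_t *quot, uint32_t *rem)
    {
        if (divisor == 0 || (dividend >> 32) >= divisor)
            return -1;                       /* x86 DIV would raise #DE here */

        uint64_t r = dividend >> 32;         /* partial remainder (EDX half) */
        uint32_t q = 0;
        for (int i = 31; i >= 0; i--) {
            r = (r << 1) | ((dividend >> i) & 1);  /* shift in next dividend bit */
            if (r >= divisor) {              /* trial subtract succeeds: bit = 1 */
                r -= divisor;
                q |= 1u << i;
            }
        }
        *quot = q;
        *rem  = (uint32_t)r;
        return 0;
    }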

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Tue Jan 9 06:36:42 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [Pentium]
    DIV was (thankfully!) separate from FDIV, so it was slow (about 40 clock
    cycles for a 32-bit divisor running a bit/iteration sequencer) but correct.

    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Tue Jan 9 10:45:24 2024
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [Pentium]
    DIV was (thankfully!) separate from FDIV, so it was slow (about 40 clock
    cycles for a 32-bit divisor running a bit/iteration sequencer) but correct.

    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.


    That is of course possible, but maybe not very likely?

    The integer part had 32-bit registers, so an integer DIV would
    concatenate EDX:EAX into a 64-bit source before dividing by a 32-bit
    source (that had to be larger than the EDX part) and still trigger one
    of the 5 missing SRT table entries. When the bug was originally found,
    it only generated wrong results in the last 4-5 bits, so most of these
    would still have given the same 32-bit result.

    OTOH, Tim Coe's magic test pair only needed two 7-digit integer divisor/dividend values to trigger a 1/256 final error!

    The fact that Tim was able to come up with this test pair purely on
    paper, with no PC to check it on, is really impressive IMHO.
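
    (If I recall the widely circulated pair correctly, it was 4195835 /
    3145727; a quick check in that spirit, as a C sketch -- the operand values
    are my recollection, not taken from this thread:)

    #include <stdio.h>

    int main(void)
    {
        /* Recalled published test pair; on a correct FPU the residual is
           essentially 0, on an affected Pentium it comes out around 256. */
        volatile double x = 4195835.0, y = 3145727.0;
        double r = x - (x / y) * y;
        printf("x - (x/y)*y = %g\n", r);
        return 0;
    }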

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Tue Jan 9 16:27:47 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.


    That is of course possible, but maybe not very likely?

    The integer part had 32-bit registers, so an integer DIV would
    concatenate EDX:EAX into a 64-bit source before dividing by a 32-bit
    source (that had to be larger than the EDX part) and still trigger one
    of the 5 missing SRT table entries. When the bug was originally found,
    it only generated wrong results in the last 4-5 bits, so most of these
    would still have given the same 32-bit result.

    The reason why I consider it more likely is because programs tend to
    crash or give very wrong results if an integer computation is wrong,
    because integer computations are used in addressing and for directing
    program flow. By contrast, if an FP computation is wrong, very few
    people notice (and the late discovery of the Pentium FDIV bug shows
    this); Seymour Cray decided to make his machines FP-divide quickly
    rather than precisely, because he knew his customers, and they indeed
    bought his machines. I don't think he did so with integer division.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Wed Jan 10 09:02:50 2024
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.


    That is of course possible, but maybe not very likely?

    The integer part had 32-bit registers, so an integer DIV would
    concatenate EDX:EAX into a 64-bit source before dividing by a 32-bit
    source (that had to be larger than the EDX part) and still trigger one
    of the 5 missing SRT table entries. When the bug was originally found,
    it only generated wrong results in the last 4-5 bits, so most of these
    would still have given the same 32-bit result.

    The reason why I consider it more likely is because programs tend to
    crash or give very wrong results if an integer computation is wrong,
    because integer computations are used in addressing and for directing
    program flow. By contrast, if an FP computation is wrong, very few
    people notice (and the late discovery of the Pentium FDIV bug shows
    this); Seymour Cray decided to make his machines FP-divide quickly
    rather than precisely, because he knew his customers, and they indeed
    bought his machines. I don't think he did so with integer division.

    OTOH, where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium timeline all compilers were already converting division by constant to reciprocal MUL, so it was only the few remaining variable divisor DIVs
    which remained, and those could only fail if you had one of the very few patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.
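
    As a concrete sketch of the reciprocal-MUL idea (illustrative, not any
    particular compiler's output): unsigned division by the constant 3 becomes
    a widening multiply and a shift, with no DIV executed at all.

    #include <stdint.h>

    /* 0xAAAAAAAB == (2^33 + 1)/3, so (x * 0xAAAAAAAB) >> 33 equals x/3
       exactly for every 32-bit unsigned x. */
    static uint32_t div3(uint32_t x)
    {
        return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
    }

    The same trick works for other constants (and for signed division) with a
    different magic constant and shift, which is why only the variable-divisor
    cases are left to reach the hardware divider.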

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Wed Jan 10 18:22:06 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.
    ...
    The reason why I consider it more likely is because programs tend to
    crash or give very wrong results if an integer computation is wrong,
    because integer computations are used in addressing and for directing
    program flow. By contrast, if an FP computation is wrong, very few
    people notice (and the late discovery of the Pentium FDIV bug shows
    this); Seymour Cray decided to make his machines FP-divide quickly
    rather than precisely, because he knew his customers, and they indeed
    bought his machines. I don't think he did so with integer division.

    OTOH, where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations?

    Conversion to strings (and BASE is, regrettably, a variable).

    I also see several uses of a division by the screen width (also not a constant).

    I also occasionally run a benchmark that spends most of its cycles in
    division IIRC.
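
    For instance, number-to-string conversion with a run-time base looks like
    this C sketch (illustrative, not Gforth's actual code); because BASE is a
    variable, the compiler cannot strength-reduce the division, so a real DIV
    runs for every digit:

    #include <stdio.h>

    /* Convert u to a string in the given base (2..36); with a variable base
       the / and % cannot be turned into reciprocal multiplies. */
    static char *to_string(unsigned long u, unsigned base, char *end)
    {
        *--end = '\0';
        do {
            unsigned digit = u % base;       /* variable divisor: stays a DIV */
            u /= base;
            *--end = "0123456789abcdefghijklmnopqrstuvwxyz"[digit];
        } while (u != 0);
        return end;
    }

    int main(void)
    {
        char buf[72];
        unsigned base = 16;                  /* read at run time in real code */
        puts(to_string(255, base, buf + sizeof buf));   /* prints "ff" */
        return 0;
    }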

    By the Pentium
    timeline all compilers were already converting division by constant to
    reciprocal MUL,

    Such a general claim is easy to disprove:

    [c8:~:533] vfx64
    VFX Forth 64 5.11 RC2 [build 0112] 2021-05-02 for Linux x64
    © MicroProcessor Engineering Ltd, 1998-2021

    : foo 3 / ; ok
    see foo
    FOO
    ( 004E3E60 B903000000 ) MOV ECX, # 00000003
    ( 004E3E65 488BC3 ) MOV RAX, RBX
    ( 004E3E68 4899 ) CQO
    ( 004E3E6A 48F7F9 ) IDIV RCX
    ( 004E3E6D 488BD8 ) MOV RBX, RAX
    ( 004E3E70 C3 ) RET/NEXT
    ( 17 bytes, 6 instructions )

    And no, they did not do it in 1993, either.

    But anyway, yes, you or I don't divide integers much, but among the
    millions of users of the Pentium, there were people who use division
    more frequently and for other things, and I expect that one of them
    would have noticed pretty early.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jan 10 16:07:02 2024
    FWIW, I think these arguments remind me of the "No true Scotsman" kind
    of argument. It's even worse because everyone has their own (and
    often changing) definition of what is and what isn't "RISC".

    Luckily, it doesn't matter, because in practice being RISC or not is not
    a quality that affects anything practical, unlike performance,
    design costs, power consumption, compatibility, ...


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to terje.mathisen@tmsw.no on Wed Jan 10 20:03:28 2024
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium
    timeline all compilers were already converting division by constant to
    reciprocal MUL, so it was only the few remaining variable divisor DIVs
    which remained, and those could only fail if you had one of the very few
    patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by
    constant to reciprocal MUL?

    It was fairly well known that compiling integer-heavy code for i486
    would make it run faster on P5 than if compiled for P5. The speedup was
    necessarily code-specific, but on average was 3-4 percent.

    Somehow compiling for i486 allowed more use of the (simple) V
    pipeline. This trick worked on the original P5 through at least the
    (100Mhz) P54C. [Don't know if it worked on later P5 chips.]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Jan 11 01:59:45 2024
    According to Stefan Monnier <monnier@iro.umontreal.ca>:
    FWIW, I think these arguments remind me of the "No true Scotsman" kind
    of argument.

    Yeah. It occurs to me that part of the problem is that RISC is a process,
    not just a checklist of what's in an architecture.

    The R is key. Each of the projects we think of as RISC (Berkeley RISC,
    Stanford MIPS, IBM 801) were familiar with existing architectures and
    then started making tradeoffs to try to get better performance at
    lower design cost. They had less complexity in the hardware, usually
    balanced by more complexity in the compiler, with the less complex
    hardware allowing global performance increases like bigger cache,
    deeper pipeline, or faster clock rate.

    The PDP-8 wasn't like that. They started with the PDP-4, then asked
    themselves how much of this 18 bit machine can we cram into a 12 bit
    machine that we can afford to build and still be good enough to do
    useful work, ending up with the PDP-5. There were tradeoffs but of a
    very different kind.

    From the PDP-5 to -8 it was a reimplementation in faster cheaper
    logic, and put the program counter in flip flops rather than in core
    location zero.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Thu Jan 11 07:02:58 2024
    John Levine <johnl@taugh.com> writes:
    According to Stefan Monnier <monnier@iro.umontreal.ca>:
    FWIW, I think these arguments remind me of the "No true Scotsman" kind
    of argument.

    Yeah. It occurs to me that part of the problem is that RISC is a process,
    not just a checklist of what's in an architecture.

    The R is key. Each of the projects we think of as RISC (Berkeley RISC,
    Stanford MIPS, IBM 801) were familiar with existing architectures and
    then started making tradeoffs to try to get better performance at
    lower design cost. They had less complexity in the hardware, usually
    balanced by more complexity in the compiler, with the less complex
    hardware allowing global performance increases like bigger cache,
    deeper pipeline, or faster clock rate.

    Yes, but the results of these three projects had commonalities, and
    the other architectures that are commonly identified as RISCs shared
    many of these commonalities, while earlier architectures didn't.

    In 1980 Patterson and Ditzel ("The Case for the Reduced Instruction
    Set Computer") indeed did not give any criteria for what constitutes a
    RISC, which supports the process view.

    In 1985 Patterson wrote "Reduced instruction set computers" <https://dl.acm.org/doi/pdf/10.1145/2465.214917>, and there he
    identified 4 criteria:

    |1. Operations are register-to-register, with only LOAD and STORE
    | accessing memory. [...]
    |
    |2. The operations and addressing modes are reduced. Operations
    | between registers complete in one cycle, permitting a simpler,
    | hardwired control for each RISC, instead of
    | microcode. Multiple-cycle instructions such as floating-point
    | arithmetic are either executed in software or in a special-purpose
    | coprocessor. (Without a coprocessor, RISCs have mediocre
    | floating-point performance.) Only two simple addressing modes,
    | indexed and PC-relative, are provided. More complicated addressing
    | modes can be synthesized from the simple ones.
    |
    |3. Instruction formats are simple and do not cross word boundaries. [...]
    |
    |4. RISC branches avoid pipeline penalties. [... delayed branches ...]

    which I discussed in <2023Dec9.093314@mips.complang.tuwien.ac.at>,
    leaving mainly 1 and a relaxed version of 3. I also added

    5. register machine

    John Mashey has taken the criteria-based approach quite a bit further
    in his great postings on the question <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. Unfortunately,
    John Mashey ignored ARM.

    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    1. No (AFAIK), in particular not the indirect addressing

    2. No, but that criterion has not stood the test of time.

    3. Don't know, but that criterion has only partially stood the test of
    time.

    4. No, but that criterion has not stood the test of time.

    5. No.

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    The PDP-8 wasn't like that. They started with the PDP-4, then asked
    themselves how much of this 18 bit machine can we cram into a 12 bit
    machine that we can afford to build and still be good enough to do
    useful work, ending up with the PDP-5. There were tradeoffs but of a
    very different kind.

    Yes, a very good point in the process view. And looking at the
    descendants of the PDP-8 (Nova, 16-bit Eclipse, 32-bit Eclipse), you
    also see there that the process was not one that led to RISCs.

    Still, I think that a criteria-based way of classifying something as a
    RISC (or not) is more practical, because criteria are generally easier
    to determine than the process, and because the properties considered
    by the criteria are what the implementors of the architecture have to
    deal with and what the programmers play with.

    It would be interesting to take John Mashey's criteria and evaluate a
    few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
    I'll put that on my ToDo list.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to George Neuner on Thu Jan 11 10:31:11 2024
    George Neuner wrote:
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium
    timeline all compilers were already converting division by constant to
    reciprocal MUL, so it was only the few remaining variable divisor DIVs
    which remained, and those could only fail if you had one of the very few
    patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by constant to reciprocal MUL?

    It was fairly well known that compiling integer-heavy code for i486
    would make it run faster on P5 than if compiled for P5. The speedup was
    necessarily code-specific, but on average was 3-4 percent.

    Somehow compiling for i486 allowed more use of the (simple) V
    pipeline. This trick worked on the original P5 through at least the
    (100Mhz) P54C. [Don't know if it worked on later P5 chips.]

    That wasn't known to me!

    OTOH, while writing far too much hand-optimized Pentium code, I did
    almost completely limit myself to the instructions that a 486 could run
    in a single cycle, and then I would manually pair them up so that any of
    the harder opcodes (like shifts!) would be in the first (u) pipe and the simpler opcode would be in the second/less capable (v) pipe.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Fri Jan 12 00:48:47 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Yes, but the results of these three projects had commonalities, and
    the other architectures that are commonly identified as RISCs shared
    many of these commonalities, while earlier architectures didn't.

    True. But I think that was as much because they all started around the
    same time and were familiar with the same kinds of computers as because
    of any deep reason.

    |1. Operations are register-to-register, with only LOAD and STORE
    | accessing memory. [...]
    |
    |2. The operations and addressing modes are reduced. ...
    |
    |3. Instruction formats are simple and do not cross word boundaries. [...]
    |
    |4. RISC branches avoid pipeline penalties. [... delayed branches ...]

    5. register machine

    John Mashey has taken the criteria-based approach quite a bit further
    in his great postings on the question
    <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. Unfortunately,
    John Mashey ignored ARM.

    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    I'd say that it makes no sense to try to evaluate a 1960s single register machine to see whether it's a RISC. The IBM 650 had only one addressing
    mode, but it also used a spinning drum as main memory and the word size
    was 10 decimal digits. Was that a RISC? The question makes no sense.

    1. No (AFAIK), in particular not the indirect addressing

    Given that there's only one register, load/store architecture doesn't
    work. If you don't have an ADD instruction that references memory, how
    are you going to do any arithmetic?

    3. Don't know, but that criterion has only partially stood the test of
    time.

    The PDP-8 had one instruction format for all of the memory reference instructions, one for the operate and skip group, and one for I/O.
    It was pretty simple but it had to be.

    All of the standard instructions were a single word but the amazing
    680 TTY multiplexer, which could handle 64 TTY lines, scanning
    characters in and out a bit at a time, modified the CPU to add a
    three-word instruction that did one bit of line scanning. You could do
    that kind of stuff when each card in the machine had maybe one flip
    flop, and the backplane was wire wrapped.

    4. No, but that criterion has not stood the test of time.

    Pipeline? What's that?

    5. No.

    Well, it had one register.

    Still, I think that a criteria-based way of classifying something as a
    RISC (or not) is more practical, because criteria are generally easier
    to determine than the process, and because the properties considered
    by the criteria are what the implementors of the architecture have to
    deal with and what the programmers play with.

    The criteria are fine so long as you limit the scope to machines designed
    when the criteria make sense.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Fri Jan 12 01:33:45 2024
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    The PDP-8 had one instruction format for all of the memory reference
    instructions, one for the operate and skip group, and one for I/O.
    It was pretty simple but it had to be.

    All of the standard instructions were a single word but the amazing
    680 TTY multiplexer, which could handle 64 TTY lines, scanning
    characters in and out a bit at a time, modified the CPU to add a
    three-word instruction that did one bit of line scanning. You could do
    that kind of stuff when each card in the machine had maybe one flip
    flop, and the backplane was wire wrapped.

    As I understand it, the PDP-8 just put the entire instruction word on
    the bus so all cards saw it, and the card that decoded the instruction
    would respond. That's how the IOT instructions worked, anyway.

    Not sure how that could be used to support three word instructions,
    unless they were in the IOT space and the card could increment the
    PC over the backplane somehow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Fri Jan 12 06:47:40 2024
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    I'd say that it makes no sense to try to evaluate a 1960s single register
    machine to see whether it's a RISC.

    Why not? If the result is what we would arrive at with other methods,
    it certainly makes sense. If the result is different, one may wonder
    whether the criteria are wrong or the other methods are wrong, and, as
    a result, may gain additional insights.

    The IBM 650 had only one addressing
    mode, but it also used a spinning drum as main memory and the word size
    was 10 decimal digits. Was that a RISC? The question makes no sense.

    Again, why not. Maybe, like I added the register-machine criterion
    that Patterson had not considered in 1985 because the machines that he
    compared were all register machines, one might want to add criteria
    about random-access memory (I expect that the drum memory resulted in
    each instruction having a next-instruction field, right?) and binary
    data.

    1. No (AFAIK), in particular not the indirect addressing

    Given that there's only one register, load/store architecture doesn't
    work. If you don't have an ADD instruction that references memory, how
    are you going to do any arithmetic?

    So definitely "No".

    3. Don't know, but that criterion has only partially stood the test of
    time.

    The PDP-8 had one instruction format for all of the memory reference
    instructions, one for the operate and skip group, and one for I/O.
    It was pretty simple but it had to be.

    All of the standard instructions were a single word but the amazing
    680 TTY multiplexer, which could handle 64 TTY lines, scanning
    characters in and out a bit at a time, modified the CPU to add a
    three-word instruction that did one bit of line scanning. You could do
    that kind of stuff when each card in the machine had maybe one flip
    flop, and the backplane was wire wrapped.

    So: Yes for the base instruction set, no with the 680 TTY multiplexer.
    In a way like RISC-V, which is "yes" for the base instruction set,
    "no" with the C extension, and it has provisions for longer
    instruction encodings.

    4. No, but that criterion has not stood the test of time.

    Pipeline? What's that?

    A technique that was used in the ILLIAC II (1962), in the IBM Stretch (1961)
    and the CDC 6600 (1964). But that's not an architectural criterion,
    except the existence of branch-delay slots.

    5. No.

    Well, it had one register.

    Does that make it a register machine? Ok, the one register has all
    the purposes that registers have in that architecture, so one can
    argue that it is a general-purpose register. However, as far as the
    way to use it is concerned, the point of a register machine is that
    you have multiple GPRs so that the programmer or compiler can just use
    another one when one is occupied, and if you run out, you spill and
    refill. So one register is definitely too few to make it a register
    machine.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Fri Jan 12 09:21:48 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    It would be interesting to take John Mashey's criteria and evaluate a
    few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
    I'll put that on my ToDo list.

    Ok, let's see. Taking his criteria from <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. It is worth
    reading the posting in full (in particular, he explains why certain
    features are not in RISC architectures), but here I cite the parts
    that are the criteria he listed.

    |MOST RISCs:
    |
    | 3.
    | a. Have 1 size of instruction in an instruction stream
    | b. And that size is 4 bytes
    | c. Have a handful (1-4) addressing modes) (* it is VERY hard to count
    | these things; will discuss later).
    | d. Have NO indirect addressing in any form (i.e., where you need one
    | memory access to get the address of another operand in memory)
    | 4.
    | a. Have NO operations that combine load/store with arithmetic, i.e.,
    | like add from memory, or add to memory. (note: this means
    | especially avoiding operations that use the value of a load as
    | input to an ALU operation, especially when that operation can
    | cause an exception. Loads/stores with address modification can
    | often be OK as they don't have some of the bad effects)
    | b. Have no more than 1 memory-addressed operand per instruction
    | 5.
    | a. Do NOT support arbitrary alignment of data for loads/stores
    | b. Use an MMU for a data address no more than once per instruction
    | c. Have >=5 bits per integer register specifier
    | d. Have >= 4 bits per FP register specifier
    [...]
    |So, here's a table:
    |
    | * C: number of years since first implementation sold in this family (or
    | first thing which with this is binary compatible). Note: this table was
    | first done in 1991, so year = 1991-(age in table).
    | * 3a: # instruction sizes
    | * 3b: maximum instruction size in bytes
    | * 3c: number of distinct addressing modes for accessing data (not jumps)
    | I didn't count register or literal, but only ones that referenced
    | memory, and I counted different formats with different offset sizes
    | separately. This was hard work ... Also, even when a machine had
    | different modes for register-relative and PC_relative addressing, I
    | counted them only once.
    | * 3d: indirect addressing (0 - no, 1 - yes)
    | * 4a: load/store combined with arithmetic (0 - no, 1 - yes)
    | * 4b: maximum number of memory operands
    | * 5a: unaligned addressing of memory references allowed in load/store,
    | without specific instructions ( 0 - no never [MIPS, SPARC, etc], 1 -
    | sometimes [as in RS/6000], 2 - just about any time)
    | * 5b: maximum number of MMU uses for data operands in an instruction
    | * 6a: number of bits for integer register specifier
    | * 6b:number of bits for 64-bit or more FP register specifier, distinct
    | from integer registers
    [...]
    |So, here's a table of 12 implementations of various architectures, one per
    |architecture, with the attributes above. [...] I'm going to draw a line
    |between H1 and L4 (obviously, the RISC-CISC Line), and also, at the head of
    |each column, I'm going to put a rule, which, in that column, most of the
    |RISCs obey. Any RISC that does not obey it is marked with a +; any CISC
    |that DOES obey it is marked with a *. So ...

    ... here's the table, with the entries for Clipper, i960KB and CDC6600
    inserted and those for the additional instruction sets appended
    (starting with ARM1):

    CPU   Age     3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   #ODD
          (1991)
    RULE  <6      =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
    A1    4       1    4    1    0    0    1    0    1    8    3+   1     AMD 29K
    B1    5       1    4    1    0    0    1    0    1    5    4    -     MIPS R2000
    C1    2       1    4    2    0    0    1    0    1    5    4    -     SPARC V7
    D1    2       1    4    3    0    0    1    0    1    5    0+   1     MC88000
                                                                                RISC
    E1    5       1    4    10+  0    0    1    0    1    5    4    1     HP PA
    F1    5       2+   4    1    0    0    1    0    1    4+   3+   3     IBM RT/PC
    G1    1       1    4    4    0    0    1    1    1    5    5    -     IBM RS/6000
    H1    2       1    4    4    0    0    1    0    1    5    4    -     Intel i860

    J1    5       4+   8+   9+   0    0    1    0    2    4+   3+   5     Clipper
    K1    3       2+   8+   9+   0    0    1    2+   -    5    3+   5     Intel 960KB
    Q1    27+     2+   4    1    0    0    1    0    1    3+   3+   4     CDC 6600

    L4    26      4    8    2*   0*   1    2    2    4    4    2    2     IBM 3090
    M2    12      12   12   15   0*   1    2    2    4    3    3    1     Intel i486
    N1    10      21   21   23   1    1    2    2    4    3    3    -     NSC 32016
                                                                                CISC
    O3    11      11   22   44   1    1    2    2    8    4    3    -     MC 68040
    P3    13      56   56   22   1    1    6    2    24   4    0    -     VAX

          6+      1    4    7+   0    0    1    0    1    4+   -    3/8   ARM1
          -10     1    5+   1    0    0    1    0    1    7    7    1/10  IA-64
          -12     2+   4    7+   0    0    1    1    2+   4+   5    4/7   ARMv7 T32
          -12     15+  15+  7+   0    1+   2+   1    4+   4+   4    7/4   AMD64
          -22     1    4    15+  0    0    1    1    2+   5    5    2/9   ARM A64
          -28     2+   4    1    0    0    1    1    2+   5    5    2/9   RV64GC

    Notes:

    I did not want to pre-classify the architectures, but used + for all
    the criteria that Mashey considered non-RISC (and consequently nothing
    rather than * for those that he considered RISC). The ODD column
    contains two values: first the number of criteria according to which
    the architecture is non-RISC, then the number according to which it is RISC.

    CDC6600: there are some mistakes in the entry that I fixed, and I also
    added +s to those cases that did not satisfy Mashey's RISC criteria.
    I classified 30-bit instructions as 4-byte instructions. I used
    Mashey's entry for 5b.

    Apparently you could buy some ARM1 in 1985 as an add-on board, but the
    wide release came with ARM2 in 1987. I used the 1985 date; Anyway,
    the Age criterion is just ageism.

    IA-64: Specifying instruction size is interesting, but in any case, it
    does not satisfy 3b

    In ARMv7 T32 is required, and the M profile only includes T32.
    Alignment always works for some loads, and never for others, resulting
    in 2 MMU accesses being required for some memory accesses. But the
    same has been true for PowerPC long before ARMv6 required unaligned
    accesses to work. There is VFP[345]-D16 with only 16 FP registers on
    some implementations, but that would also satisfy 6b; most use 32
    64-bit FP registers.

    AMD64 can have instructions up to 15 bytes (with prefixes and all).
    All addressing modes with displacement are counted twice according to
    Mashey's rule, RIP-relative is not counted extra. The MOVS
    instruction has two memory operands; I did not count REP MOVS/STOS
    etc. as separate, although they arguably are; anyway, AMD64 is
    non-RISC according to 4b. SSE2 instructions require alignment; that
    blunder was fixed in AVX, but base AMD64 only has SSE2.

    ARM A64: I used the availability date of the iPhone 5s in 2013 for
    Age. A64 has not only interesting offset options, but also addressing
    modes that sign/zero-extend the index operand; how to count them? In
    any case, A64 does not satisfy 3c. I counted the one memory operand
    of a load/store-pair instruction as one memory operand. There are a
    very few cases where unaligned accesses produce an alignment fault, so
    I gave 1 for 5a. As usual, unaligned accesses may cause two MMU uses.

    It's hard to get a date for RISC-V, with things like the Rocket in
    2016, but the document with the ratified architecture parts of RV64GC
    only became available in 2019, so I used the latter year for the age. 5a:
    atomics are required to be aligned; everything else may be unaligned.

    So, looking at the table, ARM1, ARMv7 T32, ARM A64, and RV64GC are
    more RISC than non-RISC according to these criteria, and AMD64 is more
    non-RISC than RISC, which is exactly what I would have said without
    this table. So Mashey's criteria still seem to be mostly valid 32
    years later.

    However, I would not classify the CDC6600 as RISC, because it does not
    have general-purpose registers. It can be seen as a precursor,
    though, and the fact that it shares many of the RISC criteria shows
    that despite changing technology, when you want to design an
    architecture for performance, the architectural features you design in
    and especially those that you don't design in have stayed similar
    across 59 years.

    Some features, though, have been designed into relatively recent
    architectures despite making implementation difficult and despite being
    classified as non-RISC by Mashey; in particular, allowing unaligned
    accesses has won, and consequently they all require up to 2 MMU uses
    per instruction for data operands.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Fri Jan 12 13:55:02 2024
    This is a reposting of <2024Jan12.102148@mips.complang.tuwien.ac.at>
    with some corrections: I added Alpha, and made IA-64 corrections.

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    It would be interesting to take John Mashey's criteria and evaluate a
    few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
    I'll put that on my ToDo list.

    Ok, let's see. Taking his criteria from <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. It is worth
    reading the posting in full (in particular, he explains why certain
    features are not in RISC architectures), but here I cite the parts
    that are the criteria he listed.

    |MOST RISCs:
    |
    | 3.
    | a. Have 1 size of instruction in an instruction stream
    | b. And that size is 4 bytes
    | c. Have a handful (1-4) addressing modes) (* it is VERY hard to count
    | these things; will discuss later).
    | d. Have NO indirect addressing in any form (i.e., where you need one
    | memory access to get the address of another operand in memory)
    | 4.
    | a. Have NO operations that combine load/store with arithmetic, i.e.,
    | like add from memory, or add to memory. (note: this means
    | especially avoiding operations that use the value of a load as
    | input to an ALU operation, especially when that operation can
    | cause an exception. Loads/stores with address modification can
    | often be OK as they don't have some of the bad effects)
    | b. Have no more than 1 memory-addressed operand per instruction
    | 5.
    | a. Do NOT support arbitrary alignment of data for loads/stores
    | b. Use an MMU for a data address no more than once per instruction
    | c. Have >=5 bits per integer register specifier
    | d. Have >= 4 bits per FP register specifier
    [...]
    |So, here's a table:
    |
    | * C: number of years since first implementation sold in this family (or
    | first thing which with this is binary compatible). Note: this table was
    | first done in 1991, so year = 1991-(age in table).
    | * 3a: # instruction sizes
    | * 3b: maximum instruction size in bytes
    | * 3c: number of distinct addressing modes for accessing data (not jumps)
    | I didn't count register or literal, but only ones that referenced
    | memory, and I counted different formats with different offset sizes
    | separately. This was hard work ... Also, even when a machine had
    | different modes for register-relative and PC_relative addressing, I
    | counted them only once.
    | * 3d: indirect addressing (0 - no, 1 - yes)
    | * 4a: load/store combined with arithmetic (0 - no, 1 - yes)
    | * 4b: maximum number of memory operands
    | * 5a: unaligned addressing of memory references allowed in load/store,
    | without specific instructions ( 0 - no never [MIPS, SPARC, etc], 1 -
    | sometimes [as in RS/6000], 2 - just about any time)
    | * 5b: maximum number of MMU uses for data operands in an instruction
    | * 6a: number of bits for integer register specifier
    | * 6b:number of bits for 64-bit or more FP register specifier, distinct
    | from integer registers
    [...]
    |So, here's a table of 12 implementations of various architectures, one per
    |architecture, with the attributes above. [...] I'm going to draw a line
    |between H1 and L4 (obviously, the RISC-CISC Line), and also, at the head of
    |each column, I'm going to put a rule, which, in that column, most of the
    |RISCs obey. Any RISC that does not obey it is marked with a +; any CISC
    |that DOES obey it is marked with a *. So ...

    ... here's the table, with the entries for Clipper, i960KB and CDC6600
    inserted and those for the additional instruction sets appended
    (starting with ARM1):

    CPU   Age     3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   #ODD
          (1991)
    RULE  <6      =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
    A1    4       1    4    1    0    0    1    0    1    8    3+   1     AMD 29K
    B1    5       1    4    1    0    0    1    0    1    5    4    -     MIPS R2000
    C1    2       1    4    2    0    0    1    0    1    5    4    -     SPARC V7
    D1    2       1    4    3    0    0    1    0    1    5    0+   1     MC88000
                                                                                RISC
    E1    5       1    4    10+  0    0    1    0    1    5    4    1     HP PA
    F1    5       2+   4    1    0    0    1    0    1    4+   3+   3     IBM RT/PC
    G1    1       1    4    4    0    0    1    1    1    5    5    -     IBM RS/6000
    H1    2       1    4    4    0    0    1    0    1    5    4    -     Intel i860

    J1    5       4+   8+   9+   0    0    1    0    2    4+   3+   5     Clipper
    K1    3       2+   8+   9+   0    0    1    2+   -    5    3+   5     Intel 960KB
    Q1    27+     2+   4    1    0    0    1    0    1    3+   3+   4     CDC 6600

    L4    26      4    8    2*   0*   1    2    2    4    4    2    2     IBM 3090
    M2    12      12   12   15   0*   1    2    2    4    3    3    1     Intel i486
    N1    10      21   21   23   1    1    2    2    4    3    3    -     NSC 32016
                                                                                CISC
    O3    11      11   22   44   1    1    2    2    8    4    3    -     MC 68040
    P3    13      56   56   22   1    1    6    2    24   4    0    -     VAX

          6+      1    4    7+   0    0    1    0    1    4+   -    3/8   ARM1
          -1      1    4    1    0    0    1    0    1    5    5    0/11  Alpha
          -10     2+   10+  1    0    0    1    0    1    7    7    2/9   IA-64
          -12     2+   4    7+   0    0    1    1    2+   4+   5    4/7   ARMv7 T32
          -12     15+  15+  7+   0    1+   2+   1    4+   4+   4    7/4   AMD64
          -22     1    4    15+  0    0    1    1    2+   5    5    2/9   ARM A64
          -28     2+   4    1    0    0    1    1    2+   5    5    2/9   RV64GC

    Notes:

    I did not want to pre-classify the architectures, but used + for all
    the criteria that Mashey considered non-RISC (and consequently nothing
    rather than * for those that he considered RISC). The ODD column
    contains two values: first the number of criteria according to which
    the architecture is non-RISC, then the number according to which it is RISC.

    CDC6600: there are some mistakes in the entry that I fixed, and I also
    added +s to those cases that did not satisfy Mashey's RISC criteria.
    I classified 30-bit instructions as 4-byte instructions. I used
    Mashey's entry for 5b.

    Apparently you could buy some ARM1 in 1985 as an add-on board, but the
    wide release came with ARM2 in 1987. I used the 1985 date; Anyway,
    the Age criterion is just ageism.

    IA-64: There are instructions that occupy 2/3 of a bundle, while most
    occupy 1/3, so I counted two instruction sizes. Specifying the
    maximum instruction size is interesting (how do you count the 5 extra
    bits in a bundle?), but in any case, IA-64 does not satisfy 3b.

    In ARMv7 T32 is required, and the M profile only includes T32.
    Alignment always works for some loads, and never for others, resulting
    in 2 MMU accesses being required for some memory accesses. But the
    same has been true for PowerPC long before ARMv6 required unaligned
    accesses to work. There is VFP[345]-D16 with only 16 FP registers on
    some implementations, but that would also satisfy 6b; most use 32
    64-bit FP registers.

    AMD64 can have instructions up to 15 bytes (with prefixes and all).
    All addressing modes with displacement are counted twice according to
    Mashey's rule, RIP-relative is not counted extra. The MOVS
    instruction has two memory operands; I did not count REP MOVS/STOS
    etc. as separate, although they arguably are; anyway, AMD64 is
    non-RISC according to 4b. SSE2 instructions require alignment; that
    blunder was fixed in AVX, but base AMD64 only has SSE2.

    ARM A64: I used the availability date of the iPhone 5s in 2013 for
    Age. A64 has not only interesting offset options, but also addressing
    modes that sign/zero-extend the index operand; how to count them? In
    any case, A64 does not satisfy 3c. I counted the one memory operand
    of a load/store-pair instruction as one memory operand. There are a
    very few cases where unaligned accesses produce an alignment fault, so
    I gave 1 for 5a. As usual, unaligned accesses may cause two MMU uses.

    It's hard to get a date for RISC-V, with things like the Rocket in
    2016, but the document with the ratified architecture parts of RV64GC
    only became available in 2019, so I used the latter year for the age. 5a:
    atomics are required to be aligned; everything else may be unaligned.

    So, looking at the table, ARM1, ARMv7 T32, ARM A64, and RV64GC are
    more RISC than non-RISC according to these criteria, and AMD64 is more
    non-RISC than RISC, which is exactly what I would have said without
    this table. So Mashey's criteria still seem to be mostly valid 32
    years later.

    However, I would not classify the CDC6600 as RISC, because it does not
    have general-purpose registers. It can be seen as a precursor,
    though, and the fact that it shares many of the RISC criteria shows
    that despite changing technology, when you want to design an
    architecture for performance, the architectural features you design in
    and especially those that you don't design in have stayed similar
    across 59 years.

    Some features, though, have been designed into relatively recent
    architectures despite making implementation difficult and despite being
    classified as non-RISC by Mashey; in particular, allowing unaligned
    accesses has won, and consequently they all require up to 2 MMU uses
    per instruction for data operands.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Fri Jan 12 12:27:45 2024
    Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    According to Stefan Monnier <monnier@iro.umontreal.ca>:
    FWIW, I think these arguments remind me of the "No true Scotsman" kind
    of argument.
    Yeah. It occurs to me that part of the problem is that RISC is a process,
    not just a checklist of what's in an architecture.

    The R is key. Each of the projects we think of as RISC (Berkeley RISC,
    Stanford MIPS, IBM 801) were familiar with existing architectures and
    then started making tradeoffs to try to get better performance at
    lower design cost. They had less complexity in the hardware, usually
    balanced by more complexity in the compiler, with the less complex
    hardware allowing global performance increases like bigger cache,
    deeper pipeline, or faster clock rate.

    Yes, but the results of these three projects had commonalities, and
    the other architectures that are commonly identified as RISCs shared
    many of these commonalities, while earlier architectures didn't.

    In 1980 Patterson and Ditzel ("The Case for the Reduced Instruction
    Set Computer") indeed did not give any criteria for what constitutes a
    RISC, which supports the process view.

    In 1985 Patterson wrote "Reduced instruction set computers" <https://dl.acm.org/doi/pdf/10.1145/2465.214917>, and there he
    identified 4 criteria:

    |1. Operations are register-to-register, with only LOAD and STORE
    | accessing memory. [...]
    |
    |2. The operations and addressing modes are reduced. Operations
    | between registers complete in one cycle, permitting a simpler,
    | hardwired control for each RISC, instead of
    | microcode. Multiple-cycle instructions such as floating-point
    | arithmetic are either executed in software or in a special-purpose
    | coprocessor. (Without a coprocessor, RISCs have mediocre
    | floating-point performance.) Only two simple addressing modes,
    | indexed and PC-relative, are provided. More complicated addressing
    | modes can be synthesized from the simple ones.
    |
    |3. Instruction formats are simple and do not cross word boundaries. [...]
    |
    |4. RISC branches avoid pipeline penalties. [... delayed branches ...]

    which I discussed in <2023Dec9.093314@mips.complang.tuwien.ac.at>,
    leaving mainly 1 and a relaxed version of 3. I also added

    5. register machine

    John Mashey has taken the criteria-based approach quite a bit further
    in his great postings on the question <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. Unfortunately,
    John Mashey ignored ARM.

    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    1. No (AFAIK), in particular not the indirect addressing

    2. No, but that criterion has not stood the test of time.

    3. Don't know, but that criterion has only partially stood the test of
    time.

    4. No, but that criterion has not stood the test of time.

    5. No.

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    The PDP-8 wasn't like that. They started with the PDP-4, then asked
    themselves how much of this 18 bit machine can we cram into a 12 bit
    machine that we can afford to build and still be good enough to do
    useful work, ending up with the PDP-5. There were tradeoffs but of a
    very different kind.

    Yes, a very good point in the process view. And looking at the
    descendants of the PDP-8 (Nova, 16-bit Eclipse, 32-bit Eclipse), you
    also see there that the process was not one that led to RISCs.

    Still, I think that a criteria-based way of classifying something as a
    RISC (or not) is more practical, because criteria are generally easier
    to determine than the process, and because the properties considered
    by the criteria are what the implementors of the architecture have to
    deal with and what the programmers play with.

    It would be interesting to take John Mashey's criteria and evaluate a
    few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
    I'll put that on my ToDo list.

    - anton

    The basic difference between RISC and CISC is that, with some exceptions,
    CISC cores are each a single monolithic state machine serially executing
    multiple states for each instruction, whereas RISC is a bunch of concurrent
    state machines with handshakes between them, and the ISAs for each of
    these are designed from those different points of view. Now, after the fact,
    some CISCs added limited HW concurrency, but the ISAs were designed from
    the monolithic point of view.

    Although it doesn't say this, the Patterson and Ditzel paper gives criteria
    to guide HW designers used to looking at the problem from the monolithic
    view into how to decompose the problem from a concurrent view.

    If you look at MIPS-1, RISC-1 and ARM-1, the instructions and their
    formats are all chosen to fit nicely into the concurrent HW view.

    If one looks at it this way, then one can assess how well an ISA would fit
    into that RISC concurrent HW model. It's not whether it has a single
    accumulator register, but what effect a single accumulator has
    on HW concurrency - it creates RAW, WAW and WAR dependencies that don't
    need to exist and are costly to eliminate later in HW.
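
    A small illustration of that point in C (the instruction schedules in the
    comments are my sketches, not any particular machine's code): evaluating
    (a+b)+(c+d) through a single accumulator serializes every step, while
    three registers leave the two inner adds independent.

    /* accumulator style:                three-register style:
         ACC  = a                          r1 = a + b      <- independent
         ACC += b                          r2 = c + d      <- independent
         t1   = ACC   (spill)              r3 = r1 + r2    <- the only join
         ACC  = c
         ACC += d
         ACC += t1
       Every accumulator step has a RAW dependence on the previous one, and
       reusing ACC for the second sum adds WAR/WAW hazards if reordered; the
       three-register form has none of that.                               */
    int sum4(int a, int b, int c, int d)
    {
        int r1 = a + b;
        int r2 = c + d;
        return r1 + r2;
    }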

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Fri Jan 12 18:13:44 2024
    Anton Ertl wrote:

    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    I'd say that it makes no sense to try to evaluate a 1960s single register
    machine to see whether it's a RISC.

    Why not? If the result is what we would arrive at with other methods,
    it certainly makes sense. If the result is different, one may wonder
    whether the criteria are wrong or the other methods are wrong, and, as
    a result, may gain additional insights.

    The IBM 650 had only one addressing
    mode, but it also used a spinning drum as main memory and the word size
    was 10 decimal digits. Was that a RISC? The question makes no sense.

    Again, why not. Maybe, like I added the register-machine criterion
    that Patterson had not considered in 1985 because the machines that he compared were all register machines, one might want to add criteria
    about random-access memory (I expect that the drum memory resulted in
    each instruction having a next-instruction field, right?) and binary
    data.

    1. No (AFAIK), in particular not the indirect addressing

    Given that there's only one register, load/store architecture doesn't
    work. If you don't have an ADD instruction that references memory, how
    are you going to do any arithmetic?

    So definitely "No".

    3. Don't know, but that criterion has only partially stood the test of
    time.

    The PDP-8 had one instruction format for all of the memory reference
    instructions, one for the operate and skip group, and one for I/O.
    It was pretty simple but it had to be.

    All of the standard instructions were a single word but the amazing
    680 TTY multiplexer, which could handle 64 TTY lines, scanning
    characters in and out a bit at a time, modified the CPU to add a
    three-word instruction that did one bit of line scanning. You could do
    that kind of stuff when each card in the machine had maybe one flip
    flop, and the backplane was wire wrapped.

    So: Yes for the base instruction set, no with the 680 TTY multiplexer.
    In a way like RISC-V, which is "yes" for the base instruction set,
    "no" with the C extension, and it has provisions for longer
    instruction encodings.

    4. No, but that criterion has not stood the test of time.

    Pipeline? What's that?

    A technique that was used in the ILLIAC II (1962), in the IBM Stretch (1961)
    and the CDC 6600 (1964). But that's not an architectural criterion,
    except the existence of branch-delay slots.


    The CDC 6600 was concurrent but not pipelined;*
    the CDC 7600 was pipelined.

    (*) Instruction fetch was pipelined but calculations were not.

    5. No.

    Well, it had one register.

    Does that make it a register machine? Ok, the one register has all
    the purposes that registers have in that architecture, so one can
    argue that it is a general-purpose register. However, as far as the
    way to use it is concerned, the point of a register machine is that
    you have multiple GPRs so that the programmer or compiler can just use another one when one is occupied, and if you run out, you spill and
    refill. So one register is definitely too few to make it a register
    machine.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Fri Jan 12 18:41:50 2024
    From:: https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC

    1 2 3 4 5 6 7 8 9 1011 12 13 14 15 16 1718 1920 21 22
    r r
    r r r +d1 +d1
    r r r r r r r+ +d+d1 I +s
    r r r +d+x +s s+ s+ s+ +d +d r+ +dI I I +s I
    r +d +x+s >r>r >r r+ -r a a r+ -r +x +s I I +s +s+d2 +d2 +d2
    AMD 29K 1 . . . . . . . . . . . . . . . . . . . . .
    Rxxx . 1 . . . . . . . . . . . . . . . . . . . .
    SPARC . 1 1. . . . . . . . . . . . . . . . . . .
    88K . 1 1 1 . . . . . . . . . . . . . . . . . .
    HP PA . 2 1 1 4 1 1 . . . . . . . . . . . . . . .
    ROMP 1 2 . . . . . . . . . . . . . . . . . . . .
    POWER . 1 1. 1 1 . . . . . . . . . . . . . . . .
    i860 . 1 1. 1 1 . . . . . . . . . . . . . . . .
    Swrdfish1 1 1. . . . . . 1. . . . . . . . . . . .
    ARM 2 2 . 2 1. 1 1 1 . . . . . . . . . . . . .
    Clipper 1 3 1. . . . 1 1 2. . . . . . . . . . . .
    i960KB 1 1 1 1 . . . . . 2 2 . . 1 . . . . . . . .
    . . . . . . . . . . . . . . . . . . . . . .
    S/360 . 1 . . . . . . . . . . . 1 . . . . . . . .
    i486 1 3 1 1 . . . 1 1 2. . . 2 3 . . . . . . .
    NSC32K . 3 . . . . . 1 1 3 3 . . . 3 . . . . 9 . .
    MC68000 1 1 . . . . . 1 1 2. . . 2 . . . . . . . .
    MC68020 1 1 . . . . . 1 1 2. . . 2 4 . . . . . 16 16
    VAX 1 3 . 1 . . . 1 1 1 1 1 1 . 3 1 3 1 3 . . .

    My 66000 . 3 . 1 1 . . . . . . . . . 2 . . . . . . .

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Fri Jan 12 18:50:44 2024
    Anton Ertl wrote:

    Just adding My 66000 data::


    CPU   Age  3a   3b   3c   3d  4a  4b  5a  5b  6a   6b  #ODD
         (1991)
    RULE   <6  =1   =4   <5   =0  =0  =1  <2  =1  >4   >3
    A1      4   1    4    1    0   0   1   0   1   8   3+     1   AMD 29K
    B1      5   1    4    1    0   0   1   0   1   5    4     -   MIPS R2000
    C1      2   1    4    2    0   0   1   0   1   5    4     -   SPARC V7
    D1      2   1    4    3    0   0   1   0   1   5   0+     1   MC88000

    ??      ?   5   20    4    0   0   1   2   1   5    0     ?   My 66000
                                                                   RISC
    E1      5   1    4  10+    0   0   1   0   1   5    4     1   HP PA
    F1      5  2+    4    1    0   0   1   0   1  4+   3+     3   IBM RT/PC
    G1      1   1    4    4    0   0   1   1   1   5    5     -   IBM RS/6000
    H1      2   1    4    4    0   0   1   0   1   5    4     -   Intel i860

    J1      5  4+   8+   9+    0   0   1   0   2  4+   3+     5   Clipper
    K1      3  2+   8+   9+    0   0   1  2+   -   5   3+     5   Intel 960KB
    Q1    27+  2+    4    1    0   0   1   0   1  3+   3+     4   CDC 6600

    L4     26   4    8   2*   0*   1   2   2   4   4    2     2   IBM 3090
    M2     12  12   12   15   0*   1   2   2   4   3    3     1   Intel i486
    N1     10  21   21   23    1   1   2   2   4   3    3     -   NSC 32016  CISC
    O3     11  11   22   44    1   1   2   2   8   4    3     -   MC 68040
    P3     13  56   56   22    1   1   6   2  24   4    0     -   VAX

           6+   1    4   7+    0   0   1   0   1  4+    -   3/8   ARM1
          -10   1   5+    1    0   0   1   0   1   7    7  1/10   IA-64
          -12  2+    4   7+    0   0   1   1  2+  4+    5   4/7   ARMv7 T32
          -12 15+  15+   7+    0  1+  2+   1  4+  4+    4   7/4   AMD64
          -22   1    4  15+    0   0   1   1  2+   5    5   2/9   ARM A64
          -28  2+    4    1    0   0   1   1  2+   5    5   2/9   RV64GC

    ??      ?   5   20    4    0   0   1   2   1   5    0     ?   My 66000

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to George Neuner on Sat Jan 13 10:44:10 2024
    George Neuner <gneuner2@comcast.net> schrieb:
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium >>timeline all compilers were already converting division by constant to >>reciprocal MUL, so it was only the few remaining variable divisor DIVs >>which remained, and those could only fail if you had one of the very few >>patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by constant to reciprocal MUL?

    I've had a look at the gcc 2.4.5 sources (around when the Pentium came
    out), and it seems it didn't do it.
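    [For concreteness, a minimal sketch, mine and not gcc's, of the
    reciprocal-MUL transformation under discussion: an unsigned divide by
    the constant 10 replaced by a multiply with a precomputed fixed-point
    reciprocal and a shift. The function name is mine.]

    /* Hedged sketch of division-by-constant via reciprocal multiply.
     * 0xCCCCCCCD is ceil(2^35 / 10); for every 32-bit unsigned x,
     * (x * 0xCCCCCCCD) >> 35 equals x / 10 exactly (cf. Hacker's Delight,
     * the chapter on integer division by constants). */
    #include <assert.h>
    #include <stdint.h>

    static uint32_t div10(uint32_t x)
    {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    }

    int main(void)
    {
        for (uint64_t x = 0; x <= 0xFFFFFFFFu; x += 97)   /* spot check */
            assert(div10((uint32_t)x) == (uint32_t)x / 10);
        return 0;
    }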

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Sat Jan 13 18:33:57 2024
    Thomas Koenig wrote:
    George Neuner <gneuner2@comcast.net> schrieb:
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium
    timeline all compilers were already converting division by constant to
    reciprocal MUL, so it was only the few remaining variable divisor DIVs
    which remained, and those could only fail if you had one of the very few >>> patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by
    constant to reciprocal MUL?

    I've had a look at the gcc 2.4.5 sources (around when the Pentium came
    out), and it seems it didn't do it.

    I guess I'm mixing up my own (re-)discovery of the technique which I
    then promptly used in a couple(*) of my most favorite asm algorithms and
    the timing of when it became standard for compiled code.

    *) Those were the julian day number to Y-M-D and the unsigned binary to
    ascii conversions.
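    [A minimal sketch, my reconstruction rather than Terje's asm, of the
    second idiom: unsigned binary to decimal ASCII with the per-digit divide
    by 10 replaced by a reciprocal multiply. The function name is mine.]

    /* Hedged sketch: 32-bit unsigned to decimal ASCII; 0xCCCCCCCD is
     * ceil(2^35 / 10), so the multiply+shift is an exact divide by 10. */
    #include <stdint.h>
    #include <stdio.h>

    static char *u32_to_ascii(uint32_t x, char buf[11])
    {
        char *p = buf + 10;
        *p = '\0';
        do {
            uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
            *--p = (char)('0' + (x - q * 10));   /* remainder from the quotient */
            x = q;
        } while (x != 0);
        return p;
    }

    int main(void)
    {
        char buf[11];
        puts(u32_to_ascii(4294967295u, buf));    /* prints 4294967295 */
        return 0;
    }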

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Jan 13 21:31:12 2024
    According to EricP <ThatWouldBeTelling@thevillage.com>:
    The basic difference between RISC and CISC is that, with some exceptions, >CISC cores are all a single monolithic state machines serially executing >multiple states for each instruction, whereas RISC is a bunch of concurrent >state machines with handshakes between them, and the ISA's for each of
    these is designed from those different points of view. Now after the fact >some CISC's added limited HW concurrency but the ISA's were designed from
    the monolithic point of view. ...

    That's a great insight. It's easy to see how stuff like multiple registers, and load/store memory references follow from that.

    You can also see why a single register machine would be hopeless even
    if the instruction set is as simple as a PDP-8, because everything
    interlocks on that register.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sat Jan 13 22:40:33 2024
    EricP wrote:


    The basic difference between RISC and CISC is that, with some exceptions, CISC cores are all a single monolithic state machines serially executing multiple states for each instruction,

    I can tell you that 68000, 68010, 68020, 68030 all used 3 microcode pointers simultaneously, 1 running the address section, 1 running the Data section, and 1 running the Fetch-Decode section. The 3 pointers could access different lines in µcode and have the 3 reads ORed together as they exit µstore.

    whereas RISC is a bunch of concurrent state machines with handshakes between them, and the ISA's for each of
    these is designed from those different points of view. Now after the fact some CISC's added limited HW concurrency but the ISA's were designed from
    the monolithic point of view.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Sun Jan 14 11:14:03 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    George Neuner <gneuner2@comcast.net> schrieb:
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium >>>> timeline all compilers were already converting division by constant to >>>> reciprocal MUL, so it was only the few remaining variable divisor DIVs >>>> which remained, and those could only fail if you had one of the very few >>>> patterns (leading bits) in the divisor that we had to check for in the >>>> FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by
    constant to reciprocal MUL?

    I've had a look at the gcc 2.4.5 sources (around when the Pentium came
    out), and it seems it didn't do it.

    I guess I'm mixing up my own (re-)discovery of the technique which I
    then promptly used in a couple(*) of my most favorite asm algorithms and
    the timing of when it became standard for compiled code.

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bill Findlay@21:1/5 to John Levine on Sun Jan 14 11:47:33 2024
    On 13 Jan 2024, John Levine wrote
    (in article <unuvf0$qko$1@gal.iecc.com>):

    According to EricP<ThatWouldBeTelling@thevillage.com>:
    The basic difference between RISC and CISC is that, with some exceptions, CISC cores are all a single monolithic state machines serially executing multiple states for each instruction, whereas RISC is a bunch of concurrent state machines with handshakes between them, and the ISA's for each of these is designed from those different points of view. Now after the fact some CISC's added limited HW concurrency but the ISA's were designed from the monolithic point of view. ...

    That's a great insight. It's easy to see how stuff like multiple registers, and load/store memory references follow from that.

    You can also see why a single register machine would be hopeless even
    if the instruction set is as simple as a PDP-8, because everything
    interlocks on that register.

    While respecting the caveat "with some exceptions",
    KDF9 (designed 1960-62) was made of as many as 24
    concurrently running state machines.

    --
    Bill Findlay

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Sun Jan 14 14:30:51 2024
    MitchAlsup wrote:
    EricP wrote:


    The basic difference between RISC and CISC is that, with some exceptions,
    CISC cores are all a single monolithic state machines serially executing
    multiple states for each instruction,

    I can tell you that 68000, 68010, 68020, 68030 all used 3 microcode
    pointers
    simultaneously, 1 running the address section, 1 running the Data
    section, and
    1 running the Fetch-Decode section. The 3 pointers could access
    different lines
    in µcode and have the 3 reads ORed together as they exit µstore.

    Well... I don't know about the others but for the 68000 and its 68008
    brother any such concurrency is extremely limited. It had I think an 8 byte instruction prefetch queue, and Decode could look up the next microcode
    start address in parallel with the end of a prior macro instruction.
    But it has *only one micro sequencer* and can only execute one macro instruction at once.

    And while it does have some parallel resources like separate address and
    data registers and separate address and data buses, their use is sequenced
    by microcode so any concurrent use is *hard coded* into a micro sequence.

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared, so it takes multiple sequential cycles
    to perform those overlapped address and data movements anyway.
    (Move the high data and low address at once, then low data and high address,
    so it takes 2 cycles to move an address and data pair instead of 4).

    But the single data bus means that each data operand register needs
    separate sequential access, so 4 cycles for 2 operands,
    then 2 cycles to write the result.

    68000 used a 2 level control store of microcode and nanocode as a means
    of compressing the microword. Basically they took a flat, wide uWord,
    and moved the control bits common to multiple uWords out to a single
    nano word, then put the address of the nWord in the uWord.
    The sequential micro sequencer addresses the uCode ROM which addresses
    the nCode ROM which drives the execute function unit.
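    [A rough sketch, with entirely hypothetical field names and widths rather
    than Motorola's actual layout, of the two-level control store just
    described: each uWord handles sequencing and points at a shared nWord
    holding the control bits common to many uWords.]

    /* Hedged sketch of a two-level (micro + nano) control store. */
    #include <stdint.h>

    typedef struct {
        uint16_t next_uaddr;    /* sequencing: next microword or branch target */
        uint16_t naddr;         /* index into the nanocode ROM */
        uint8_t  branch_cond;   /* condition select for micro-branches */
    } uword_t;

    typedef struct {
        uint32_t control_bits;  /* datapath controls shared by many uWords */
    } nword_t;

    /* One control-store step: the uWord selects sequencing and a nWord;
     * the nWord's control bits drive the execution unit.  Pipeline
     * registers at the ROM outputs would delay micro-branch effects. */
    uint32_t control_step(const uword_t *urom, const nword_t *nrom,
                          uint16_t uaddr, uint16_t *next_uaddr)
    {
        const uword_t *u = &urom[uaddr];
        *next_uaddr = u->next_uaddr;
        return nrom[u->naddr].control_bits;
    }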

    It is possible to add pipeline stage registers in such designs,
    at the uCode ROM and nCode ROM outputs to access ROMs concurrently,
    and apparently the 68000 has such ROM pipeline registers.
    But the inherently serial nature of uCode means you wind up with uCode
    branch delay slots for each new pipeline stage, *two* in this case,
    so a conditional uCode branch performed at time T1 takes effect at T4.
    Any uCode words that might be folded into the branch delay slots would be
    from the same macro instruction sequence, otherwise they are NOPs.

    So yes there are opportunities for overlapping some actions.
    But it is limited and hard coded into the microcode sequence.
    And it is one microcode sequence for one macro instruction at once.

    whereas RISC is a bunch of
    concurrent
    state machines with handshakes between them, and the ISA's for each of
    these is designed from those different points of view. Now after the fact
    some CISC's added limited HW concurrency but the ISA's were designed from
    the monolithic point of view.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jan 14 19:38:19 2024
    According to Bill Findlay <findlaybill@blueyonder.co.uk>:
    On 13 Jan 2024, John Levine wrote
    (in article <unuvf0$qko$1@gal.iecc.com>):

    According to EricP<ThatWouldBeTelling@thevillage.com>:
    That's a great insight. It's easy to see how stuff like multiple registers, >> and load/store memory references follow from that.

    You can also see why a single register machine would be hopeless even
    if the instruction set is as simple as a PDP-8, because everything
    interlocks on that register.

    While respecting the caveat "with some exceptions",
    KDF9 (designed 1960-62) was made of as many as 24
    concurrently running state machines.

    I believe you but from what I can see it had hardware stacks and 16
    index registers, so it was hardly a single register machine.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Thomas Koenig on Sun Jan 14 19:50:05 2024
    On Sun, 14 Jan 2024 11:14:03 +0000, Thomas Koenig wrote:

    Division code is often scary,
    Long division's downright scary.

    Shouldn't one of the occurrences of "scary", most
    probably the first, be replaced by "hairy"?

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sun Jan 14 19:54:59 2024
    On Sun, 14 Jan 2024 19:50:05 +0000, Quadibloc wrote:

    On Sun, 14 Jan 2024 11:14:03 +0000, Thomas Koenig wrote:

    Division code is often scary,
    Long division's downright scary.

    Shouldn't one of the occurrences of "scary", most
    probably the first, be replaced by "hairy"?

    I finally did locate the original source, and indeed
    my guess was correct.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Sun Jan 14 20:45:03 2024
    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Sun, 14 Jan 2024 11:14:03 +0000, Thomas Koenig wrote:

    Division code is often scary,
    Long division's downright scary.

    Shouldn't one of the occurrences of "scary", most
    probably the first, be replaced by "hairy"?

    Yes, I mistyped that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bill Findlay@21:1/5 to John Levine on Sun Jan 14 23:48:16 2024
    On 14 Jan 2024, John Levine wrote
    (in article <uo1d7b$284u$1@gal.iecc.com>):

    According to Bill Findlay<findlaybill@blueyonder.co.uk>:
    On 13 Jan 2024, John Levine wrote
    (in article <unuvf0$qko$1@gal.iecc.com>):

    According to EricP<ThatWouldBeTelling@thevillage.com>:
    That's a great insight. It's easy to see how stuff like multiple registers,
    and load/store memory references follow from that.

    You can also see why a single register machine would be hopeless even
    if the instruction set is as simple as a PDP-8, because everything interlocks on that register.

    While respecting the caveat "with some exceptions",
    KDF9 (designed 1960-62) was made of as many as 24
    concurrently running state machines.

    I believe you but from what I can see it had hardware stacks and 16
    index registers, so it was hardly a single register machine.

    More to the point, it was not a RISC.

    --
    Bill Findlay

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Mon Feb 12 20:19:26 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Stefan Monnier on Mon Feb 12 20:44:29 2024
    Stefan Monnier <monnier@iro.umontreal.ca> writes:

    FWIW, I think these arguments remind me of the "No true Scottsman"
    kind of argument. [...]

    They shouldn't. Different people here have expressed different
    ideas, but each person has expressed more or less definite ideas.
    The essential element of "No true Scotsman" is that whatever the
    distinguishing property or quality is supposed to be is never
    identified, and cannot be, because it is chosen after the fact to
    make the "prediction" be correct. That's not what's happening in
    the RISC discussions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Mon Feb 12 20:25:30 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT
    instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, [...]

    Begging the question.

    Ok, let's make this more precise:

    |And I don't think so, either. That's because among the features that
    |the IBM 801 left away are instructions that access memory, but are not
    |just a load or a store, such as the EDIT instruction. OTOH, none of
    |the special features of IA-64 add instructions that access memory, but
    |are not just a load or a store, [...]

    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    It appears we have different sources for our respective ideas
    of what qualities or properties are the essential elements of
    RISC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Feb 13 15:22:39 2024
    Tim Rentsch wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    Even in DFP you can keep the mantissa in binary, in which case the
    problem is exactly the same (modulo some minor differences in how to round).

    Assuming DPD (BCD/base 1000 more or less) you could still do division
    with an approximate reciprocal and a loop, or via a sufficiently precise reciprocal.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Tim Rentsch on Tue Feb 13 07:52:30 2024
    On 2/12/2024 8:19 PM, Tim Rentsch wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    Shouldn't one of those two "scary"s be a "hairy"?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Tue Feb 13 17:51:13 2024
    Robert Finch wrote:

    On 2024-02-13 2:35 a.m., BGB wrote:
    On 2/12/2024 10:19 PM, Tim Rentsch wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

     From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    One thing I have noted is that for floating-point division via
    Newton-Raphson, there is often a need for a first stage that converges
    less aggressively.


    Say, for most stages in finding the reciprocal, one can do:
      y=y*(2.0-x*y);
    But, this may not always converge, so one might need to do a first stage
    of, say:
      y=y*((2.0-x*y)*0.375+0.625);

    Where, one may generate the initial guess as, say:
      *(uint64_t *)(&y)=0x7FE0000000000000ULL-(*(uint64_t *)(&x));

    ...

    Without something to nudge the value closer than the initial guess, the
    iteration might sometimes become unstable and "fly off into space"
    instead of converging towards the answer.

    ...

    Could an initial guess come from estimating the reciprocal, then doing a multiply, then doing the NR iterations?

    Goldschmidt division generally starts with 9 bits from the HoBs of the
    denominator and gets 11 bits from a table indexed by those HoBs. Then
    the first multiply is known to have 8 bits of precision. We use 11 bits
    here so that the first multiplication drives the denominator into
    [8'B01111111..8'B100000000]. A 9-bit-in, 9-bit-out table generates the
    range [8'B01111111..8'B10000001], which has oscillatory convergence,
    whereas the 9-bit-in, 11-bit-out table always converges from the same side.
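    [Pulling the quoted pieces together, a minimal sketch, mine and not the
    posters', of the Newton-Raphson reciprocal under discussion: the bit-trick
    initial guess and the damped first step are taken from the post above;
    the function name and iteration count are my choices.]

    /* Hedged sketch of a Newton-Raphson reciprocal: bit-trick initial guess,
     * one damped first iteration, then standard y = y*(2 - x*y) steps. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    double nr_recip(double x)
    {
        uint64_t xb, yb;
        double y;

        memcpy(&xb, &x, sizeof xb);          /* read the bits of x */
        yb = 0x7FE0000000000000ULL - xb;     /* crude exponent-negating guess */
        memcpy(&y, &yb, sizeof y);

        y = y * ((2.0 - x * y) * 0.375 + 0.625);   /* damped first stage */
        for (int i = 0; i < 4; i++)                /* full-rate NR stages */
            y = y * (2.0 - x * y);
        return y;
    }

    int main(void)
    {
        printf("%.17g\n", nr_recip(3.0));    /* approximately 1/3 */
        return 0;
    }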

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Tue Feb 13 19:21:40 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    ... which Hacker's Delight also describes (especially the
    signed version, which is not quite as straightforward
    as the unsigned version).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Tim Rentsch on Tue Feb 13 10:45:56 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    They shouldn't. Different people here have expressed different
    ideas, but each person has expressed more or less definite ideas.
    The essential element of "No true Scotsman" is that whatever the distinguishing property or quality is supposed to be is never
    identified, and cannot be, because it is chosen after the fact to
    make the "prediction" be correct. That's not what's happening in
    the RISC discussions.

    I had impression from John
    https://en.wikipedia.org/wiki/John_Cocke
    that 801/risc was to do the opposite of the failed Future System effort http://www.jfsowa.com/computer/memo125.htm https://people.computing.clemson.edu/~mark/fs.html
    but there is also account of some risc work overlapping with FS https://www.ibm.com/history/risc
    In the early 1970s, telephone calls didn't instantly bounce between
    handheld devices and cell towers. Back then, the connection process
    required human operators to laboriously plug cords into the holes of a switchboard. Come 1974, a team of IBM researchers led by John Cocke set
    out in search of ways to automate the process. They envisioned a
    telephone exchange controller that would connect 300 calls per second (1 million per hour). Hitting that mark would require tripling or even
    quadrupling the performance of the company's fastest mainframe at the
    time -- which would require fundamentally reimagining high-performance computing.

    ... snip ...

    End of 70s, 801/risc Iliad chip was going to be microprocessor for
    running 370 (for low&mid range 370 computers) & other architecture
    emulators ... effort floundered and even had some 801/risc engineers
    leaving IBM for other vendor risc efforts (like AMD 29k).

    801/risc ROMP chip was going to be for the displaywriter followon
    ... but when that was killed, they decided to pivot to unix workstation
    market ... and got the company that did PC/IX for the IBM/PC to do port
    for ROMP ... announced as AIX for PC/RT.

    Then there was six chip RIOS for RS/6000 ... we were doing HA/6000
    originally for NYTimes to move their newspaper system (ATEX) off
    VAXcluster to RS/6000. I renamed it HA/CMP when we started doing
    technical/scientific cluster scaleup with national labs and commercial
    cluster scaleup with RDBMS vendors (oracle, sybase, informix,
    ingres). At the time 801/risc didn't have cache coherency for
    multiprocessor scaleup.

    The executive we were reporting to, then went over to head up Somerset
    ... single chip for AIM
    https://en.wikipedia.org/wiki/AIM_alliance https://en.wikipedia.org/wiki/IBM_Power_microprocessors https://en.wikipedia.org/wiki/Motorola_88000
    In the early 1990s Motorola joined the AIM effort to create a new RISC architecture based on the IBM POWER architecture. They worked a few
    features of the 88000 (such as a compatible bus interface[10]) into the
    new PowerPC architecture to offer their customer base some sort of
    upgrade path. At that point the 88000 was dumped as soon as possible

    ... snip ...

    https://en.wikipedia.org/wiki/PowerPC https://en.wikipedia.org/wiki/IBM_Power_microprocessors#PowerPC
    After two years of development, the resulting PowerPC ISA was introduced
    in 1993. A modified version of the RSC architecture, PowerPC added single-precision floating point instructions and general
    register-to-register multiply and divide instructions, and removed some
    POWER features. It also added a 64-bit version of the ISA and support
    for SMP.

    trivia, telco work postdates ACS/360 ... folklore is IBM killed the
    effort because they were afraid that it would advance the
    state-of-the-art too fast and they would lose control of the market
    ... also references features that would show up more than two decades
    later in the 90s with ES/9000 https://people.computing.clemson.edu/~mark/acs_end.html

    trivia2: We had early Jan92 meeting with Oracle CEO Ellison and
    AWD/Hester where Hester tells Ellison HA/CMP would have 16-way clusters
    by mid92 and 128-way clusters by ye92. Then end of Jan92, the official
    Kingston supercomputer group pivots and HA/CMP cluster scaleup is
    transferred to Kingston (for announce as IBM supercomputer for technical/scientific *ONLY*) and we are told we can't work on anything
    with more than four processors. Then Computerworld news 17feb1992 (from
    wayback machine) ... IBM establishes laboratory to develop parallel
    systems (pg8)
    https://archive.org/details/sim_computerworld_1992-02-17_26_7

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to BGB on Wed Feb 14 07:49:57 2024
    BGB <cr88192@gmail.com> writes:

    On 2/12/2024 10:19 PM, Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    One thing I have noted is that for floating-point division via Newton-Raphson, there is often a need for a first stage that
    converges less aggressively.

    Right. It's important to be inside the radius of convergence
    before using a more accelerating form that is also less stable.

    Say, for most stages in finding the reciprocal, one can do:
    y=y*(2.0-x*y);
    But, this may not always converge, so one might need to do a
    first stage of, say:
    y=y*((2.0-x*y)*0.375+0.625);

    Where, one may generate the initial guess as, say:
    *(uint64_t *)(&y)=0x7FE0000000000000ULL-(*(uint64_t *)(&x));

    The left hand side of this assignment has undefined behavior,
    by virtue of violating effective type rules.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Wed Feb 14 07:53:17 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    Even in DFP you can keep the mantissa in binary, in which case the
    problem is exactly the same (modulo som minor differences in how
    to round).

    Assuming DPD (BCD/base 1000 more or less) you could still do
    division with an approximate reciprocal and a loop, or via a
    sufficiently precise reciprocal.

    In some sense this is part of the point I'm making, namely,
    division isn't hard if we don't mind using a simple algorithm.
    It's only if we insist on it being fast that division becomes
    complicated.
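    [For the record, a minimal sketch, mine, of the straightforward base-2
    algorithm being alluded to: a restoring shift-and-subtract divider that
    produces one quotient bit per step, with no guessing and no reciprocal.
    The function name is mine.]

    /* Hedged sketch: restoring (shift-and-subtract) unsigned division.
     * One quotient bit per iteration; the caller must ensure d != 0. */
    #include <assert.h>
    #include <stdint.h>

    static void udivmod64(uint64_t n, uint64_t d, uint64_t *q, uint64_t *r)
    {
        uint64_t quo = 0, rem = 0;
        for (int i = 63; i >= 0; i--) {
            rem = (rem << 1) | ((n >> i) & 1);  /* bring down the next bit */
            if (rem >= d) {                     /* trial subtraction succeeds */
                rem -= d;
                quo |= 1ULL << i;
            }
        }
        *q = quo;
        *r = rem;
    }

    int main(void)
    {
        uint64_t q, r;
        udivmod64(1000000007ULL, 97ULL, &q, &r);
        assert(q == 1000000007ULL / 97 && r == 1000000007ULL % 97);
        return 0;
    }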

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Lynn Wheeler on Wed Feb 14 08:17:54 2024
    Lynn Wheeler <lynn@garlic.com> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    They shouldn't. Different people here have expressed different
    ideas, but each person has expressed more or less definite ideas.
    The essential element of "No true Scotsman" is that whatever the
    distinguishing property or quality is supposed to be is never
    identified, and cannot be, because it is chosen after the fact to
    make the "prediction" be correct. That's not what's happening in
    the RISC discussions.

    I had impression from John
    https://en.wikipedia.org/wiki/John_Cocke
    that 801/risc was to do the opposite of the failed Future System
    effort
    http://www.jfsowa.com/computer/memo125.htm https://people.computing.clemson.edu/~mark/fs.html
    but there is also account of some risc work overlapping with FS https://www.ibm.com/history/risc [...]

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragraphs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Wed Feb 14 17:37:13 2024
    Tim Rentsch wrote:

    Lynn Wheeler <lynn@garlic.com> writes:



    I had impression from John
    https://en.wikipedia.org/wiki/John_Cocke
    that 801/risc was to do the opposite of the failed Future System
    effort
    http://www.jfsowa.com/computer/memo125.htm
    https://people.computing.clemson.edu/~mark/fs.html
    but there is also account of some risc work overlapping with FS
    https://www.ibm.com/history/risc [...]

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Architecture is as much about what you leave out as what you put in.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer architecture).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to BGB on Wed Feb 14 10:24:01 2024
    BGB <cr88192@gmail.com> writes:

    On 2/14/2024 9:49 AM, Tim Rentsch wrote:

    BGB <cr88192@gmail.com> writes:

    On 2/12/2024 10:19 PM, Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    One thing I have noted is that for floating-point division via
    Newton-Raphson, there is often a need for a first stage that
    converges less aggressively.

    Right. It's important to be inside the radius of convergence
    before using a more accelerating form that is also less stable.

    Not sure how big the radius is, only that the first-stage
    approximation may fall outside of it...


    Say, for most stages in finding the reciprocal, one can do:
    y=y*(2.0-x*y);
    But, this may not always converge, so one might need to do a
    first stage of, say:
    y=y*((2.0-x*y)*0.375+0.625);

    Where, one may generate the initial guess as, say:
    *(uint64_t *)(&y)=0x7FE0000000000000ULL-(*(uint64_t *)(&x));

    The left hand side of this assignment has undefined behavior,
    by virtue of violating effective type rules.

    Yeah, but:
    Relying on the underlying representation of 'double' is UB;

    No, it isn't. The representation of double, along with every
    other type, is implementation-defined. Furthermore the bit-level representation of double can be verified at startup with code
    that is portable and 100% safe. These two properties mean
    relying on the representation of double is a whole other kettle
    of fish than violating effective type rules.
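    [As a concrete illustration of both points, a minimal sketch, mine: a
    startup check of the expected IEEE-754 binary64 bit patterns, and the
    upthread bit trick done through memcpy so that no effective type rule
    is violated. Function names are mine.]

    /* Hedged sketch: verify the bit-level representation of double at
     * startup, then manipulate the bits via memcpy rather than a pointer
     * cast (memcpy copies the object representation, which is defined). */
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    static uint64_t double_bits(double d)
    {
        uint64_t u;
        memcpy(&u, &d, sizeof u);
        return u;
    }

    static double bits_double(uint64_t u)
    {
        double d;
        memcpy(&d, &u, sizeof d);
        return d;
    }

    int main(void)
    {
        /* startup verification of the representation */
        assert(sizeof(double) == sizeof(uint64_t));
        assert(double_bits(1.0)  == 0x3FF0000000000000ULL);
        assert(double_bits(-2.0) == 0xC000000000000000ULL);

        /* the initial-guess trick from upthread, without type punning */
        double x = 3.0;
        double y = bits_double(0x7FE0000000000000ULL - double_bits(x));
        (void)y;
        return 0;
    }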

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Tim Rentsch on Wed Feb 14 13:26:05 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer architecture).

    one of the final nails in the FS coffin was a study by the IBM Houston
    Science Center: if 370/195 apps were redone for an FS machine made out of
    the fastest available technology, they would have the throughput of a
    370/145 (about a factor of 30 drop in throughput).

    during the FS period, which was completely different than 370 and was
    going to completely replace it, internal politics was killing off 370
    efforts ... the lack of new 370 during the period is credited with giving
    clone 370 makers their market foothold. When FS finally imploded there
    was a mad rush getting stuff back into the 370 product pipelines.

    trivia: I continued to work on 360&370 stuff all through FS period, even periodically ridiculing what they were doing (drawing analogy with a
    long running cult film playing at theater down the street in central
    sq), which wasn't exactly career enhancing activity ... it was as if
    there was nobody that bothered to think about how all the "wonderful"
    stuff might actually be implemented (or even if it was possible).

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lynn Wheeler on Thu Feb 15 00:59:08 2024
    Lynn Wheeler wrote:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    one of the final nails in the FS coffin was study by the IBM Houston
    Science Center if 370/195 apps were redone for FS machine made out of
    the fastest available technology, they would have throughput of 370/145 (about fractor of 30 times drop in throughput).

    during the FS period, which was completely different than 370 and was
    going to completely replace it, internal politics was killing off 370
    efforts ... the lack of new 370 during the period is credited with giving clone 370 makers their market foothold. when FS finally implodes there
    as mad rush getting stuff back into the 370 product pipelines.

    trivia: I continued to work on 360&370 stuff all through FS period, even periodically ridiculing what they were doing (drawing analogy with a
    long running cult film playing at theater down the street in central
    sq), which wasn't exactly career enhancing activity ... it was as if
    there was nobody that bothered to think about how all the "wonderful"
    stuff might actually be implemented (or even if was possible).

    Sounds similar to Intel 432 ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Thu Feb 15 21:26:00 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    Lynn Wheeler wrote:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    one of the final nails in the FS coffin was study by the IBM Houston
    Science Center if 370/195 apps were redone for FS machine made out of
    the fastest available technology, they would have throughput of 370/145
    (about fractor of 30 times drop in throughput).

    during the FS period, which was completely different than 370 and was
    going to completely replace it, internal politics was killing off 370
    efforts ... the lack of new 370 during the period is credited with giving
    clone 370 makers their market foothold. when FS finally implodes there
    as mad rush getting stuff back into the 370 product pipelines.

    trivia: I continued to work on 360&370 stuff all through FS period, even
    periodically ridiculing what they were doing (drawing analogy with a
    long running cult film playing at theater down the street in central
    sq), which wasn't exactly career enhancing activity ... it was as if
    there was nobody that bothered to think about how all the "wonderful"
    stuff might actually be implemented (or even if was possible).

    Sounds similar to Intel 432 ...

    A page with a bunch of links on IBM future systems:

    https://people.computing.clemson.edu/~mark/fs.html#:~:text=The%20IBM%20Future%20System%20(FS,store%20with%20automatic%20data%20management.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Brett on Fri Feb 16 09:36:05 2024
    Brett <ggtgp@yahoo.com> writes:
    A page with a bunch of links on IBM future systems:

    https://people.computing.clemson.edu/~mark/fs.html#:~:text=The%20IBM%20Future%20System%20(FS,store%20with%20automatic%20data%20management.

    trivia: upthread post I also mention web page https://people.computing.clemson.edu/~mark/fs.html
    and Smotherman references archive of my posts that mention future system http://www.garlic.com/~lynn/subtopic.html#futuresys
    but around two decades ago, I split subtopic.html web page into several,
    now
    http://www.garlic.com/~lynn/submain.html#futuresys

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Lynn Wheeler on Sat Feb 17 15:34:28 2024
    Lynn Wheeler <lynn@garlic.com> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    one of the final nails in the FS coffin was study by the IBM
    Houston Science Center if 370/195 apps were redone for FS machine
    made out of the fastest available technology, they would have
    throughput of 370/145 (about fractor of 30 times drop in
    throughput). [...]

    Looks like they should have called it Back to the Future Systems. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Feb 18 17:51:22 2024
    Tim Rentsch wrote:

    Lynn Wheeler <lynn@garlic.com> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    one of the final nails in the FS coffin was study by the IBM
    Houston Science Center if 370/195 apps were redone for FS machine
    made out of the fastest available technology, they would have
    throughput of 370/145 (about fractor of 30 times drop in
    throughput). [...]

    Looks like they should have called it Back to the Future Systems. :)

    And power it with 2.2 GW ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Rentsch on Thu Feb 29 11:21:00 2024
    In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com (Tim Rentsch) wrote:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Page 73 of the Addison Wesley paperback. I don't know much about Future Systems, but it seems to have had a problem that I first encountered with IA-64: complexity presented as /completeness/, reassuring many people
    that it must be good because it has everything you could want. My doubts started when I was skimming the IA-64 instruction set reference and ran
    into instructions that did not seem to make any sense. I went back to
    them a few times, but could not figure them out.

    In contrast, the most recent weird instructions I ran into were Aarch64's Branch Target Indicator family. They are not well described in the ISA reference, but after a couple of readings, they made sense. AArch64 has annoying complexity in its more obscure corners, but that's better than
    the seductive complexity of IA-64.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to John Dallman on Thu Feb 29 08:27:09 2024
    John Dallman wrote:
    In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com (Tim Rentsch) wrote:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Page 73 of the Addison Wesley paperback. I don't know much about Future Systems, but it seems to have had a problem that I first encountered with IA-64: complexity presented as /completeness/, reassuring many people
    that it must be good because it has everything you could want. My doubts started when I was skimming the IA-64 instruction set reference and ran
    into instructions that did not seem to make any sense. I went back to
    them a few times, but could not figure them out.

    In contrast, the most recent weird instructions I ran into were Aarch64's Branch Target Indicator family. They are not well described in the ISA reference, but after a couple of readings, they made sense. AArch64 has annoying complexity in its more obscure corners, but that's better than
    the seductive complexity of IA-64.

    John

    I believe you mean the Branch Target Identification BTI instruction.
    Looks like a landing pad for branch/calls to catch
    Return Oriented Programming exploits.

    I nominate A64 logical immediate instructions for a WTF award.
    They split a 12-bit immediate into two 6-bit fields that encode:

    "C3.4.2 Logical (immediate)
    The Logical (immediate) instructions accept a bitmask immediate value that
    is a 32-bit pattern or a 64-bit pattern viewed as a vector of identical elements of size e = 2, 4, 8, 16, 32 or, 64 bits. Each element contains
    the same sub-pattern, that is a single run of 1 to (e - 1) nonzero bits
    from bit 0 followed by zero bits, then rotated by 0 to (e - 1) bits.
    This mechanism can generate 5334 unique 64-bit patterns as 2667 pairs of pattern and their bitwise inverse.

    Note
    Values that consist of only zeros or only ones cannot be described
    in this way."
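    [To make the quoted rule concrete, a small sketch, mine rather than
    ARM's DecodeBitMasks pseudocode: it enumerates the patterns exactly as
    the text describes them and confirms the 5334 count. Function names
    are mine.]

    /* Hedged sketch: enumerate A64 logical-immediate patterns as described:
     * an element of e bits holding a run of 1..e-1 ones starting at bit 0,
     * rotated within the element by 0..e-1, then replicated to 64 bits.
     * 2*1 + 4*3 + 8*7 + 16*15 + 32*31 + 64*63 = 5334 combinations. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t rotr_within(uint64_t v, unsigned r, unsigned e)
    {
        uint64_t mask = (e == 64) ? ~0ULL : ((1ULL << e) - 1);
        return r ? (((v >> r) | (v << (e - r))) & mask) : (v & mask);
    }

    static uint64_t replicate(uint64_t elem, unsigned e)
    {
        uint64_t out = 0;
        for (unsigned i = 0; i < 64; i += e)
            out |= elem << i;
        return out;
    }

    int main(void)
    {
        unsigned long count = 0;
        for (unsigned e = 2; e <= 64; e <<= 1)
            for (unsigned ones = 1; ones < e; ones++)
                for (unsigned rot = 0; rot < e; rot++) {
                    uint64_t elem = (1ULL << ones) - 1;
                    uint64_t imm  = replicate(rotr_within(elem, rot, e), e);
                    (void)imm;
                    count++;
                }
        printf("%lu patterns\n", count);   /* prints "5334 patterns" */
        return 0;
    }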

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Thu Feb 29 15:45:42 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    John Dallman wrote:
    In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com (Tim
    Rentsch) wrote:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Page 73 of the Addison Wesley paperback. I don't know much about Future
    Systems, but it seems to have had a problem that I first encountered with
    IA-64: complexity presented as /completeness/, reassuring many people
    that it must be good because it has everything you could want. My doubts
    started when I was skimming the IA-64 instruction set reference and ran
    into instructions that did not seem to make any sense. I went back to
    them a few times, but could not figure them out.

    In contrast, the most recent weird instructions I ran into were Aarch64's
    Branch Target Indicator family. They are not well described in the ISA
    reference, but after a couple of readings, they made sense. AArch64 has
    annoying complexity in its more obscure corners, but that's better than
    the seductive complexity of IA-64.

    John

    I believe you mean the Branch Target Identification BTI instruction.
    Looks like a landing pad for branch/calls to catch
    Return Oriented Programming exploits.

    I nominate A64 logical immediate instructions for a WTF award.
    They split a 12-bit immediate into two 6-bit fields that encode:

    "C3.4.2 Logical (immediate)
    The Logical (immediate) instructions accept a bitmask immediate value that
    is a 32-bit pattern or a 64-bit pattern viewed as a vector of identical >elements of size e = 2, 4, 8, 16, 32 or, 64 bits. Each element contains
    the same sub-pattern, that is a single run of 1 to (e - 1) nonzero bits
    from bit 0 followed by zero bits, then rotated by 0 to (e - 1) bits.
    This mechanism can generate 5334 unique 64-bit patterns as 2667 pairs of >pattern and their bitwise inverse.

    That was a fun one to implement in our ARM64 simulator. AArch32 has
    some odd logical-operand instruction encodings as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to EricP on Thu Feb 29 17:49:00 2024
    In article <wD%DN.356177$vFZa.109267@fx13.iad>, ThatWouldBeTelling@thevillage.com (EricP) wrote:
    John Dallman wrote:
    In contrast, the most recent weird instructions I ran into were
    Aarch64's Branch Target Indicator family.

    I believe you mean the Branch Target Identification BTI instruction.

    I do, I keep getting that name wrong.

    Looks like a landing pad for branch/calls to catch
    Return Oriented Programming exploits.

    That is what they are for, but the different forms, the way they only
    apply to branches that take addresses from registers, and the necessity
    for support in the object file format was quite confusing at first.

    I nominate A64 logical immediate instructions for a WTF award.
    They split a 12-bit immediate into two 6-bit fields that encode:

    "C3.4.2 Logical (immediate)
    The Logical (immediate) instructions accept a bitmask immediate
    value that is a 32-bit pattern or a 64-bit pattern viewed as a
    vector of identical elements of size e = 2, 4, 8, 16, 32 or,
    64 bits. Each element contains the same sub-pattern, that is a
    single run of 1 to (e - 1) nonzero bits from bit 0 followed by
    zero bits, then rotated by 0 to (e - 1) bits. This mechanism
    can generate 5334 unique 64-bit patterns as 2667 pairs of
    pattern and their bitwise inverse.

    OK, that is twisty. I'm glad I don't have to write an emulator for it.
    I'm only likely to run into it in debugging compiler-generated code,
    where I can hope the disassembler and debugger will take care of it.

    Note
    Values that consist of only zeros or only ones cannot be described
    in this way."

    Since that's confined to AND/OR/XOR/Test instructions, are values of all
    zeroes or all ones particularly useful?

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Feb 29 19:32:19 2024
    EricP wrote:

    John Dallman wrote:

    I nominate A64 logical immediate instructions for a WTF award.
    They split a 12-bit immediate into two 6-bit fields that encode:

    "C3.4.2 Logical (immediate)
    The Logical (immediate) instructions accept a bitmask immediate value that
    is a 32-bit pattern or a 64-bit pattern viewed as a vector of identical
    elements of size e = 2, 4, 8, 16, 32, or 64 bits. Each element contains
    the same sub-pattern, that is a single run of 1 to (e - 1) nonzero bits
    from bit 0 followed by zero bits, then rotated by 0 to (e - 1) bits.
    This mechanism can generate 5334 unique 64-bit patterns as 2667 pairs of pattern and their bitwise inverse.

    My 66000 takes a 12-bit immediate and uses it to specify two 6-bit
    constants: the latter is the shift count, range [0:63]; the former is the
    result width, also [0:63], but 0 means 64 bits in width.
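
    Read purely as an encoding illustration, the pairing might look like the
    C sketch below. Which 6 bits hold the width versus the shift, and what
    the instruction then does with the resulting field, are assumptions here,
    not something the post spells out.

    /* Hedged sketch: treat the 12-bit immediate as a (width, shift)
     * descriptor for a contiguous bit field, width 0 meaning 64 bits. */
    #include <stdint.h>

    static uint64_t field_mask(unsigned imm12)
    {
        unsigned width = (imm12 >> 6) & 0x3f;   /* assumed upper 6 bits */
        unsigned shift = imm12 & 0x3f;          /* assumed lower 6 bits */
        uint64_t ones  = width ? ((1ull << width) - 1) : ~0ull;
        return ones << shift;                   /* bits above 63 simply fall off */
    }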

    Note
    Values that consist of only zeros or only ones cannot be described
    in this way."

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to John Dallman on Sat Mar 2 20:00:49 2024
    On Thu, 29 Feb 2024 11:21 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com
    (Tim Rentsch) wrote:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragraphs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Page 73 of the Addison Wesley paperback. I don't know much about
    Future Systems, but it seems to have had a problem that I first
    encountered with IA-64: complexity presented as /completeness/,
    reassuring many people that it must be good because it has everything
    you could want. My doubts started when I was skimming the IA-64
    instruction set reference and ran into instructions that did not seem
    to make any sense. I went back to them a few times, but could not
    figure them out.

    In contrast, the most recent weird instructions I ran into were
    Aarch64's Branch Target Indicator family. They are not well described
    in the ISA reference, but after a couple of readings, they made
    sense. AArch64 has annoying complexity in its more obscure corners,
    but that's better than the seductive complexity of IA-64.

    John

    Obviously, I know nothing specific about Future Systems, but my
    impression is that it was both more complicated than IPF and, more
    importantly, the sort of complexity was different. If we want to compare
    FS to Intel products, then I'd expect Future Systems complexity to be
    similar to that of the i432, or of the 80286 additions to the x86
    architecture, or of those parts of BiiN that did not become the i960.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Michael S on Sat Mar 2 19:35:00 2024
    In article <20240302200049.00000d7c@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    If we want to compare FS to Intel products then I'd expect Future
    Systems complexity to be similar to i432 complexity or to complexity
    of 80286 additions to x86 architecture or to those parts of BiiN
    that did not become i960.

    i432 is more like the impressions I have of FS. An idealised architecture,
    with little thought for the practicalities of implementation.

    My own view for a couple of decades has been that architecture is the
    art of compromise between "acceptably fast with the initial
    implementation technology" and "capable of maintaining software
    compatibility and acceptable performance in unpredictable future
    implementation technologies."

    Since about 1980 that has amounted to "making use of increasing
    transistor counts to increase speed while being useful with existing
    software, at least at source level."


    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to EricP on Tue Apr 16 00:35:06 2024
    On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared ...

    No, in the 68000 family the A- and D- registers are 32 bits.

    If you compare the earlier members with the 68020 and later, it becomes
    clear that the architecture was designed as full 32-bit from the
    beginning, and then implemented in a cut-down form for the initial 16-bit products. Going full 32-bit was just a matter of filling in the gaps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Tue Apr 16 08:23:47 2024
    On 16/04/2024 02:35, Lawrence D'Oliveiro wrote:
    On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared ...

    No, in the 68000 family the A- and D- registers are 32 bits.

    If you compare the earlier members with the 68020 and later, it becomes
    clear that the architecture was designed as full 32-bit from the
    beginning, and then implemented in a cut-down form for the initial 16-bit products. Going full 32-bit was just a matter of filling in the gaps.

    Yes, the 68000 was designed to have full support for 32-bit types and a
    32-bit future, but (primarily for cost reasons) used a 16-bit ALU and
    16-bit buses internally and externally. Some 68000 compilers had 16-bit
    int, some had 32-bit int, and some let you choose either, since 16-bit
    types could be significantly faster on the 68000 even though the general-purpose registers were 32-bit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to David Brown on Tue Apr 16 07:30:32 2024
    David Brown wrote:
    On 16/04/2024 02:35, Lawrence D'Oliveiro wrote:
    On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared ...

    No, in the 68000 family the A- and D- registers are 32 bits.

    If you compare the earlier members with the 68020 and later, it becomes
    clear that the architecture was designed as full 32-bit from the
    beginning, and then implemented in a cut-down form for the initial 16-bit
    products. Going full 32-bit was just a matter of filling in the gaps.

    Yes, the 68000 was designed to have full support for 32-bit types and a 32-bit future, but (primarily for cost reasons) used a 16-bit ALU and
    16-bit buses internally and externally. Some 68000 compilers had 16-bit
    int, some had 32-bit int, and some let you choose either, since 16-bit
    types could be significantly faster on the 68000 even though the general-purpose registers were 32-bit.

    Yes, I was referring to its 16-bit internal bus structure.
    This M68000 patent from 1978 shows it in Fig 2:

    Patent US4296469 Execution unit for data processor using
    segmented bus structure, 1978
    https://patents.google.com/patent/US4296469A/en

    Other M68000 patents (the last one US4514803 appears to be for
    when it was reworked into the IBM XT/370 PC in 1983):

    Patent US4307445 Microprogrammed control apparatus having
    a two-level control store for data processor, 1978

    Patent US4312034 ALU and Condition code control unit for
    data processor, 1979

    Patent US4325121 Two-level control store for microprogrammed
    data processor, 1979

    Patent US4514803 Methods for partitioning mainframe instruction sets, 1982

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Tue Apr 16 20:26:09 2024
    EricP wrote:


    Yes, I was referring to its 16-bit internal bus structure.
    This M68000 patent from 1978 shows it in Fig 2:

    Patent US4296469 Execution unit for data processor using
    segmented bus structure, 1978
    https://patents.google.com/patent/US4296469A/en

    There are a number of interesting things about those segmented busses::
    a) the busses were true-complement
    b) the busses were precharged
    c) the busses were coupled with 2 pass gates on either side of a
    3 transistor sense amplifier
    d) to copy data from one bus to the next bus, one
    1) opened up the pass gates on the active bus
    2) fired the sense amplifier
    3) opened up the pass gate to the next bus

    So, in 7 transistors, one got::
    a) bus to bus isolation
    b) bus to bus data copy in either direction
    c) and a bus flip-flop (fired sense amplifier)

    This would take at least 30 transistors with today's technology
    to do what they did in 7.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Sat May 25 22:02:13 2024
    EricP wrote:
    David Brown wrote:

    Yes, the 68000 was designed to have full support for 32-bit types and
    a 32-bit future, but (primarily for cost reasons) used a 16-bit ALU
    and 16-bit buses internally and externally. Some 68000 compilers had
    16-bit int, some had 32-bit int, and some let you choose either, since
    16-bit types could be significantly faster on the 68000 even though
    the general-purpose registers were 32-bit.

    Yes, I was referring to its 16-bit internal bus structure.
    This M68000 patent from 1978 shows it in Fig 2:

    Patent US4296469 Execution unit for data processor using
    segmented bus structure, 1978
    https://patents.google.com/patent/US4296469A/en

    I found a book on the uArch design of the 68000 and Micro/370
    written by their senior designer Nick Tredennick.

    Microprocessor Logic Design, Tredennick, 1987 https://archive.org/download/tredennick-microprocessor-logic-design/Tredennick-Microprocessor-logic-Design_text.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun May 26 03:16:12 2024
    EricP wrote:

    I found a book on the uArch design of the 68000 and Micro/370
    written by their senior designer Nick Tredennick.

    Microprocessor Logic Design, Tredennick, 1987 https://archive.org/download/tredennick-microprocessor-logic-design/Tredennick-Microprocessor-logic-Design_text.pdf


    Reading the text you can almost hear Nick talking--he had a very
    peculiar and very recognizable talking style, even after not hearing
    it for 3 decades...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Tue Oct 1 19:02:23 2024
    On Tue, 16 Apr 2024 6:23:47 +0000, David Brown wrote:

    On 16/04/2024 02:35, Lawrence D'Oliveiro wrote:
    On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared ...

    No, in the 68000 family the A- and D- registers are 32 bits.

    If you compare the earlier members with the 68020 and later, it becomes
    clear that the architecture was designed as full 32-bit from the
    beginning, and then implemented in a cut-down form for the initial
    16-bit
    products. Going full 32-bit was just a matter of filling in the gaps.

    Yes, the 68000 was designed to have full support for 32-bit types and a 32-bit future, but (primarily for cost reasons) used a 16-bit ALU and
    16-bit buses internally and externally.

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Some 68000 compilers had 16-bit
    int, some had 32-bit int, and some let you choose either, since 16-bit
    types could be significantly faster on the 68000 even though the general-purpose registers were 32-bit.

    Just count the bus cycles.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Oct 1 20:00:28 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Oct 1 21:04:55 2024
    On Tue, 1 Oct 2024 20:00:28 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    It was pin limited and internal area limited; and close to being
    power limited (NMOS).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Tue Oct 1 23:38:10 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 1 Oct 2024 20:00:28 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    It was pin limited and internal area limited; and close to being
    power limited (NMOS).


    My wish list for the 68k was a barrel roller for fast shifts, would have
    made a HUGE difference for the first Macintosh.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Schultz@21:1/5 to Thomas Koenig on Wed Oct 2 10:07:25 2024
    On 10/1/24 3:00 PM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    Saving an extra pass through the 16 bit ALU for a 32 bit operation would
    be faster. Assuming that you didn't have to wait for another bus cycle
    to get the other half of an operand.

    Making it faster for register to register operations and not much else.

    --
    http://davesrocketworks.com
    David Schultz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to David Schultz on Wed Oct 2 16:08:43 2024
    David Schultz <david.schultz@earthlink.net> wrote:
    On 10/1/24 3:00 PM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    Saving an extra pass through the 16 bit ALU for a 32 bit operation would
    be faster. Assuming that you didn't have to wait for another bus cycle
    to get the other half of an operand.

    Making it faster for register to register operations and not much else.

    A 16 bit barrel roller does not make sense, and Motorola had no idea that shifts would be so important.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Schultz@21:1/5 to Brett on Wed Oct 2 13:51:24 2024
    On 10/2/24 11:08 AM, Brett wrote:
    A 16 bit barrel roller does not make sense, and Motorola had no idea that shifts would be so important.

    They might have guessed. The Xerox Alto had been around for a while.

    --
    http://davesrocketworks.com
    David Schultz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Oct 2 20:23:53 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 1 Oct 2024 20:00:28 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    It was pin limited and internal area limited; and close to being
    power limited (NMOS).

    Somebody did a good job of optimizing it, then, at the limit of
    several constraints. Not necessarily a global optimum, though;
    that could have been closer to the (much later) ARM.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Wed Oct 2 21:34:33 2024
    On Wed, 2 Oct 2024 16:08:43 +0000, Brett wrote:

    David Schultz <david.schultz@earthlink.net> wrote:
    On 10/1/24 3:00 PM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    Saving an extra pass through the 16 bit ALU for a 32 bit operation would
    be faster. Assuming that you didn't have to wait for another bus cycle
    to get the other half of an operand.

    Making it faster for register to register operations and not much else.

    A 16 bit barrel roller does not make sense, and Motorola had no idea
    that shifts would be so important.

    In the original 68000, a barrel shifter would have blown the area
    budget--it would have been about equal to the d-section; even in
    16-bit form. Remember this was a 1 layer metal design before poly
    silicon was in the process.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Schultz@21:1/5 to All on Wed Oct 2 18:55:14 2024
    On 10/2/24 4:34 PM, MitchAlsup1 wrote:
    In the original 68000, a barrel shifter would have blown the area
    budget--it would have been about equal to the d-section; even in
    16-bit form. Remember this was a 1 layer metal design before poly
    silicon was in the process.

    Before polysilicon? I find that hard to believe. Looking at the die shot
    at 6502.org I see either two layers of metal or metal plus polysilicon.

    http://www.visual6502.org/images/pages/Motorola_68000.html

    --
    http://davesrocketworks.com
    David Schultz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Thu Oct 3 00:30:21 2024
    On Tue, 1 Oct 2024 20:00:28 -0000 (UTC), Thomas Koenig wrote:

    Would have an external 16-bit bus and an internal 32-bit bus have been advantageous, or would this have blown a likely transistor budget for
    little gain?

    I thought that’s exactly how the original 68000 chip worked.

    I recall being told that the main factor in determining the cost of a CPU
    chip was the number of pins it had.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Brett on Thu Oct 3 00:31:59 2024
    On Tue, 1 Oct 2024 23:38:10 -0000 (UTC), Brett wrote:

    My wish list for the 68k was a barrel roller for fast shifts, would have
    made a HUGE difference for the first Macintosh.

    The 68020 certainly had that. It also added bit-field instructions, on top
    of the single-bit instructions of the original 68000.

    And in typical big-endian fashion, they added yet another inconsistency to
    the way the bits were numbered ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 01:26:44 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Tue, 1 Oct 2024 23:38:10 -0000 (UTC), Brett wrote:

    My wish list for the 68k was a barrel roller for fast shifts, would have
    made a HUGE difference for the first Macintosh.

    The 68020 certainly had that. It also added bit-field instructions, on top
    of the single-bit instructions of the original 68000.

    And in typical big-endian fashion, they added yet another inconsistency to the way the bits were numbered ...

    I noticed that; there is a tiny chance that Apple wanted that to make
    QuickDraw one cycle faster. But other changes hint that the clue train was
    not stopping at Motorola.

    Everyone was muddling through back in that era, and being quick to market was
    the rule, so it's hard to blame them. I can see myself making the very same
    choices, or much more likely seeing my coworkers make bad choices and not
    being able to do anything about it. You do what you can with what you have.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 06:28:21 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    The 68020 certainly had that. It also added bit-field instructions, on top
    of the single-bit instructions of the original 68000.

    And in typical big-endian fashion, they added yet another inconsistency to >the way the bits were numbered ...

    How typical is that?

    Certainly the Power(PC) numbers its bits in big-endian fashion. But
    does PowerPC have instructions where that matters? If so, I expect
    that's not so great with the switch to little-endian in Linux-PPC
    (IIRC AIX is still big-endian).

    I expect that the s390x uses the same bit numbering as Power. Does it
    have instructions where that matters?

    The 88000 has instructions where that matters and has little-endian bit-ordering (like the 68000, so Motorola is stubborn in its mistakes;
    OTOH, with Apple as a prospective customer, they probably did not want
    to require software changes, even if those changes would make the code shorter). It supports both byte orders, but AFAIK all 88K systems are big-endian.

    MIPS-I has no instructions where the bit numbering plays a role. It
    was available in big- and little-endian systems.

    SPARC uses little-endian bit ordering, but AFAICS has no instructions
    where that plays a role. AFAIK there is no little-endian SPARC
    system.

    So, yes, using little-endian bit ordering with big-endian byte
    ordering is frequent, but OTOH instructions that actually use bit
    numbers exist only in few instruction sets.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 09:21:13 2024
    On 03/10/2024 02:31, Lawrence D'Oliveiro wrote:
    On Tue, 1 Oct 2024 23:38:10 -0000 (UTC), Brett wrote:

    My wish list for the 68k was a barrel roller for fast shifts, would have
    made a HUGE difference for the first Macintosh.

    The 68020 certainly had that. It also added bit-field instructions, on top
    of the single-bit instructions of the original 68000.

    And in typical big-endian fashion, they added yet another inconsistency to the way the bits were numbered ...

    Since you mentioned POWER and PowerPC elsewhere, the bit numbering
    challenges of the m68k world are nothing compared to the PowerPC world. Originally, the PowerPC was 32-bit and numbered bits from 0 as the MSB
    down to 31 as the LSB. So your 32-bit address bus had lines from A0
    down to A31. Then it got extended to 64-bit (some devices had only
    partial 64-bit extensions), and the chips got a wider address bus (you
    never need all 64-bit lines physically) - the pins for the higher
    address lines were numbered A-1, A-2, and so on. For the internal
    registers, some are 64-bit numbered bit 0 (MSB) down to bit 63. Some
    were original 32-bit and got extended to 64-bit, and so are numbered bit
    -32 down to bit 31 for consistency. Others are 32-bit but numbered from
    bit 32 down to bit 63.

    I am not really complaining - one of our customers hired us precisely
    because they wanted to use a particular PowerPC microcontroller but
    after one look at the ten thousand page reference manual full of this
    kind of thing, they paid us to write a library and drivers for the
    peripherals they wanted so that they never had to think about that mess.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Thu Oct 3 09:39:03 2024
    David Brown <david.brown@hesbynett.no> writes:
    Since you mentioned POWER and PowerPC elsewhere, the bit numbering
    challenges of the m68k world are nothing compared to the PowerPC world. >Originally, the PowerPC was 32-bit and numbered bits from 0 as the MSB
    down to 31 as the LSB. So your 32-bit address bus had lines from A0
    down to A31. Then it got extended to 64-bit (some devices had only
    partial 64-bit extensions), and the chips got a wider address bus (you
    never need all 64-bit lines physically) - the pins for the higher
    address lines were numbered A-1, A-2, and so on. For the internal
    registers, some are 64-bit numbered bit 0 (MSB) down to bit 63. Some
    were original 32-bit and got extended to 64-bit, and so are numbered bit
    -32 down to bit 31 for consistency. Others are 32-bit but numbered from
    bit 32 down to bit 63.

    Maybe they should have started with the MSB as bit -31 or -63, which
    would have allowed them to always use bit 0 for the LSB while having
    big-endian bit ordering.

    For bit ordering big-endian (as in the PowerPC manual) looked more
    wrong to me than for byte ordering; I thought that that was just a
    matter of getting used to the unfamiliar bit ordering, but maybe the
    advantage of little-endian becomes more apparent in bit ordering, and
    maybe that's why Motorola and Sun chose little-endian bit ordering
    despite having big-endian byte ordering.

    For both bit and byte ordering, the advantage of little-endian shows
    up when there are several widths involved. So why is it more obvious
    for bit-ordering?

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is
    a 64-bit architecture, and that the manual describes only the 32-bit
    subset. Maybe the original Power was 32-bit.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Thu Oct 3 14:34:35 2024
    On 03/10/2024 11:39, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    Since you mentioned POWER and PowerPC elsewhere, the bit numbering
    challenges of the m68k world are nothing compared to the PowerPC world.
    Originally, the PowerPC was 32-bit and numbered bits from 0 as the MSB
    down to 31 as the LSB. So your 32-bit address bus had lines from A0
    down to A31. Then it got extended to 64-bit (some devices had only
    partial 64-bit extensions), and the chips got a wider address bus (you
    never need all 64-bit lines physically) - the pins for the higher
    address lines were numbered A-1, A-2, and so on. For the internal
    registers, some are 64-bit numbered bit 0 (MSB) down to bit 63. Some
    were original 32-bit and got extended to 64-bit, and so are numbered bit
    -32 down to bit 31 for consistency. Others are 32-bit but numbered from
    bit 32 down to bit 63.

    Maybe they should have started with the MSB as bit -31 or -63, which
    would have allowed them to always use bit 0 for the LSB while having big-endian bit ordering.


    That's a very "outside the box thinking" solution!

    For bit ordering big-endian (as in the PowerPC manual) looked more
    wrong to me than for byte ordering; I thought that that was just a
    matter of getting used to the unfamiliar bit ordering, but maybe the advantage of little-endian becomes more apparent in bit ordering, and
    maybe that's why Motorola and Sun chose little-endian bit ordering
    despite having big-endian byte ordering.

    For both bit and byte ordering, the advantage of little-endian shows
    up when there are several widths involved. So why is it more obvious
    for bit-ordering?

    I have certainly found big-endian bit numbering harder to get my head
    around than big-endian byte ordering. One possible explanation is that
    with little-endian ordering, (1 << bit_no) gives you a 1 in the right
    bit number. Another is that with little-endian bit ordering, the same
    bit number has the same value regardless of the size of the type. And I
    work with electronics as well as software - virtually everything in
    hardware (except PowerPC microcontrollers!) uses little-endian bit
    numbering. Smallest to largest, counting upwards from 0 or 1, is just
    more natural.
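
    A small, self-contained C illustration of that point (names and values
    are purely for the example):

    /* With LSB-is-bit-0 numbering, bit n has value 2^n in every operand
     * width and (1 << n) builds the mask directly; with MSB-is-bit-0
     * numbering the mask depends on the operand size. */
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* Little-endian bit numbering: "bit 5" means the same thing in a
         * byte and in a 32-bit word. */
        uint8_t  b = (uint8_t)(1u << 5);
        uint32_t w = UINT32_C(1) << 5;
        assert(b == 32 && w == 32);

        /* Big-endian (MSB = bit 0) numbering: "bit 5" of a byte is 2^2,
         * but "bit 5" of a 32-bit word is 2^26. */
        assert((1u << (8 - 1 - 5)) == 4);
        assert((UINT32_C(1) << (32 - 1 - 5)) == 0x04000000);
        return 0;
    }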


    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is
    a 64-bit architecture, and that the manual describes only the 32-bit
    subset. Maybe the original Power was 32-bit.


    As I understand it - and my history here might not be completely
    accurate - PowerPC was specified for both 32-bit and 64-bit from early
    on, but first made in the 32-bit version. There were quite a number of optional parts of the PowerPC architecture, including 64-bit width,
    floating point units, and support for little-endian data modes - IIRC
    these were referred to as books of various colours. And yes, you could
    then have weird things like it being a 64-bit architecture where none of
    the 64-bit features were actually implemented. One microcontroller I
    used had 64-bit GPRs, but almost no 64-bit instructions - I don't think
    you could even load or save all 64 bits at a time. The only use of them
    was for transferring to and from the 64-bit double precision floating
    point registers (which could be loaded and saved in full).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Thu Oct 3 23:49:00 2024
    In article <2024Oct3.113903@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I expect that the s390x uses the same bit numbering as Power.

    You're correct. I started reading up on the architecture a few years ago,
    and found this very confusing.

    Does it have instructions where that matters?

    I didn't get far enough to find out.

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC
    is a 64-bit architecture, and that the manual describes only the
    32-bit subset. Maybe the original Power was 32-bit.

    It was. It stayed that way for a while, but grew 64-bit extensions in the
    late 1990s.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Thu Oct 3 22:17:23 2024
    On Thu, 03 Oct 2024 09:39:03 GMT, Anton Ertl wrote:

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is a 64-bit architecture, and that the manual describes only the 32-bit
    subset. Maybe the original Power was 32-bit.

    I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.

    The PowerPC 601 was first shown publicly in 1993; I can’t remember when
    the fully 64-bit 620 came out, but it can’t have been long after.

    Motorola did a similar thing with the 68000 family: if you compare the
    original 68000 instruction set with the 68020, you will see the latter
    only needed to fill in a few gaps to become fully 32-bit.

    Compare this with the pain the x86 world went through, over a much longer
    time, to move to 32-bit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 15:33:54 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.

    The PowerPC 601 was first shown publicly in 1993; I can’t remember when
    the fully 64-bit 620 came out, but it can’t have been long after.

    Motorola did a similar thing with the 68000 family: if you compare the original 68000 instruction set with the 68020, you will see the latter
    only needed to fill in a few gaps to become fully 32-bit.

    Compare this with the pain the x86 world went through, over a much longer time, to move to 32-bit.

    power/pc was done at Somerset ... part of AIM; apple, ibm, motorola ...
    and some amount of motorola risc 88k contributed to power/pc https://en.wikipedia.org/wiki/PowerPC
    https://en.wikipedia.org/wiki/PowerPC_600 https://en.wikipedia.org/wiki/PowerPC_600#60x_bus
    Using the 88110 bus as the basis for the 60x bus helped schedules in a
    number of ways. It helped the Apple Power Macintosh team by reducing the
    amount of redesign of their support ASICs and it reduced the amount of
    time required for the processor designers and architects to propose,
    document, negotiate, and close a new bus interface (successfully
    avoiding the "Bus Wars" expected by the 601 management team if the 88110
    bus or the previous RSC buses hadn't been adopted). Worthy to note is
    that accepting the 88110 bus for the benefit of Apple's efforts and the alliance was at the expense of the first IBM RS/6000 system design
    team's efforts who had their support ASICs already implemented around
    the RSC's totally different bus structure.

    ... note that RS/6000 didn't have a design that supported cache
    consistency or shared-memory multiprocessing ... (one of the reasons ha/cmp
    had to resort to cluster operation for scale-up)

    https://en.wikipedia.org/wiki/PowerPC_600#PowerPC_620 https://wiki.preterhuman.net/The_Somerset_Design_Center

    the executive we reported to when we were doing HA/CMP https://en.wikipedia.org/wiki/IBM_High_Availability_Cluster_Multiprocessing went over to head up Somerset

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Fri Oct 4 11:23:50 2024
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    On Thu, 03 Oct 2024 09:39:03 GMT, Anton Ertl wrote:

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is a
    64-bit architecture, and that the manual describes only the 32-bit
    subset. Maybe the original Power was 32-bit.

    I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.

    The PowerPC 601 was first shown publicly in 1993; I can’t remember when
    the fully 64-bit 620 came out, but it can’t have been long after.


    I don't remember the history well enough here.

    Motorola did a similar thing with the 68000 family: if you compare the original 68000 instruction set with the 68020, you will see the latter
    only needed to fill in a few gaps to become fully 32-bit.


    The m68k was always designed as a 32-bit ISA. But the first 68000 implementation used a 16-bit ALU and internal buses for size and cost
    reasons. I would not describe the additional instructions in the 68020
    as "filling gaps to 32-bit", but merely a natural expansion of the ISA
    with a few more useful instructions.

    Compare this with the pain the x86 world went through, over a much longer time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which
    is a very different path.

    And much of the reason for it being a slow development is that the world
    was held back by MS's lack of progress in using new features. The 80386
    was produced in 1986, but the MS world was firmly at 16-bit until it
    gained a bit of 32-bit features with Windows 95. (Windows NT was 32-bit
    from 1993, and Win32s was from around the same time, but these were
    relatively small in the market.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Fri Oct 4 17:30:07 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    Compare this with the pain the x86 world went through, over a much longer
    time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which
    is a very different path.

    Still, the question is why they did the 286 (released 1982) with its
    protected mode instead of adding IA-32 to the architecture, maybe at
    the start with a 386SX-like package and with real-mode only, or with
    the MMU in a separate chip (like the 68020/68851).

    And much of the reason for it being a slow development is that the world
    was held back by MS's lack of progress in using new features. The 80386
    was produced in 1986, but the MS world was firmly at 16-bit until it
    gained a bit of 32-bit features with Windows 95. (Windows NT was 32-bit
    from 1993, and Win32s was from around the same time, but these were
    relatively small in the market.)

    At that time the market was moving much slower than nowadays. Systems
    with a 286 (and maybe even the 8088) were sold for a long time after
    the 386 was introduced. E.g., the IBM PS/1 Model 2011 was released in
    1990 with a 10MHz 286, and the successor Model 2121 with a 386SX was
    not introduced until 1992. I think it's hard to blame MS for
    targeting the machines that were out there. And looking at <https://en.wikipedia.org/wiki/Windows_2.1x>, Windows 2.1 in 1988
    already was available in a Windows/386 version (but the programs were
    running in virtual 8086 mode, i.e., were still 16-bit programs).

    And it was not just MS who was going in that direction. MS and IBM
    worked on OS/2, and despite ambitious goals IBM insisted that the
    software had to run on a 286.

    The fact that the 386SX only appeared in 1988 also did not help.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Oct 4 23:06:21 2024
    On Fri, 4 Oct 2024 19:05:15 +0000, BGB wrote:

    On 10/4/2024 12:30 PM, Anton Ertl wrote:

    Say, pretty much none of the modern graphics programs (that I am aware
    of) really support working with 16-color and 256-color bitmap images
    with a manually specified color palette.

    Typically, any modern programs are true-color internally, typically only supporting 256-color as an import/export format with an automatically generated "optimized" palette, and often not bothering with 16-color
    images at all. Not so useful if one is doing something that does
    actually have a need for an explicit color palette (and does not have so
    much need for any "photo manipulation" features).

    The 1996 version of CorelDraw 3 suffers from none of this, supporting all
    sorts of palettes {RGB, CMY, CMYK, at least 3 more} with various
    user-specified limitations, 24-bit, 32-bit, ... with all sorts of
    fills mixing any 2 of the colors previously mentioned with various patterns
    {gradient, polka dot, you define which pixel gets which color}.

    Still have the CD-ROM if anyone wants to try.


    And, most people generally haven't bothered with this stuff since the
    Win16 era (even the people doing "pixel art" are still generally doing
    so using true-color PNGs or similar).

    Blame PowerPoint ... No more evil tool ever existed.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Oct 5 06:34:28 2024
    On Fri, 4 Oct 2024 23:06:21 +0000, MitchAlsup1 wrote:

    Blame PowerPoint ... No more evil tool ever existed.

    Competitors existed, at one time, e.g. Adobe Persuasion, Harvard Graphics, others I’ve forgotten.

    Somehow Microsoft made PowerPoint the most attractive of the lot ... were
    the others even worse?

    Actually, it’s not that it doesn’t produce pretty graphics, it’s that people end up believing in the prettiness of the graphics, instead of considering the facts they’re supposed to (mis)represent.

    Edward Tufte, come back, all is forgiven!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sat Oct 5 06:31:02 2024
    On Fri, 04 Oct 2024 17:30:07 GMT, Anton Ertl wrote:

    The fact that the 386SX only appeared in 1988 also did not help.

    As a software guy, I liked the idea of the 386SX, and encouraged friends/ colleagues to choose it over a 286.

    Of course, they wanted to compare price/performance, but I saw things in
    terms of future software compatibility, and the sooner the move away from braindead x86 segmentation towards a nice, flat, expansive, linear address space, the better for everybody.

    Sometimes I felt like a voice crying in the wilderness ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Oct 5 06:35:56 2024
    On Fri, 4 Oct 2024 19:44:40 -0500, BGB wrote:

    MS PaintBrush became MS Paint and seemingly mostly got dumbed down as
    time went on.

    Side excursion into 3D Paint (or is that Paint 3D?), which failed to take
    off, and is now being abandoned.

    Closest modern alternative is Paint.NET, but still doesn't allow manual palette control in the same way as BitEdit.

    Inkscape has good palette control. It does scalable vector graphics
    natively. Give it a try.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Lawrence D'Oliveiro on Sat Oct 5 17:52:09 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Fri, 04 Oct 2024 17:30:07 GMT, Anton Ertl wrote:

    The fact that the 386SX only appeared in 1988 also did not help.

    As a software guy, I liked the idea of the 386SX, and encouraged friends/ colleagues to choose it over a 286.

    Of course, they wanted to compare price/performance, but I saw things in terms of future software compatibility, and the sooner the move away from braindead x86 segmentation towards a nice, flat, expansive, linear address space, the better for everybody.

    Sometimes I felt like a voice crying in the wilderness ...


    Didn’t it take a decade for the 386 to get a 32 bit OS, by which time the early machines were long since in the garbage bin, making the extra cost a waste.

    The AMD 286 was faster and cheaper, better lifetime value for the money.

    You were a voice crying in the wilderness, because you were wrong. ;)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Brett on Sat Oct 5 18:11:55 2024
    Brett <ggtgp@yahoo.com> writes:
    Didn’t it take a decade for the 386 to get a 32 bit OS

    386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and Windows/386 appeared in 1987.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sat Oct 5 22:53:35 2024
    On Sat, 05 Oct 2024 18:11:55 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Brett <ggtgp@yahoo.com> writes:
    Didn’t it take a decade for the 386 to get a 32 bit OS

    386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and Windows/386 appeared in 1987.

    - anton

    SunOS for i386 in 1988.
    Netware 3x in 1990.
    The latter sold in very high volumes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Sun Oct 6 11:58:01 2024
    On 04/10/2024 19:30, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    Compare this with the pain the x86 world went through, over a much longer
    time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which
    is a very different path.

    Still, the question is why they did the 286 (released 1982) with its protected mode instead of adding IA-32 to the architecture, maybe at
    the start with a 386SX-like package and with real-mode only, or with
    the MMU in a separate chip (like the 68020/68851).


    I can only guess the obvious - it is what some big customer(s) were
    asking for. Maybe Intel didn't see the need for 32-bit computing in the markets they were targeting, or at least didn't see it as worth the cost.

    And much of the reason for it being a slow development is that the world
    was held back by MS's lack of progress in using new features. The 80386
    was produced in 1986, but the MS world was firmly at 16-bit until it
    gained a bit of 32-bit features with Windows 95. (Windows NT was 32-bit
    from 1993, and Win32s was from around the same time, but these were
    relatively small in the market.)

    At that time the market was moving much slower than nowadays. Systems
    with a 286 (and maybe even the 8088) were sold for a long time after
    the 386 was introduced. E.g., the IBM PS/1 Model 2011 was released in
    1990 with a 10MHz 286, and the successor Model 2121 with a 386SX was
    not introduced until 1992. I think it's hard to blame MS for
    targeting the machines that were out there.

    It is fair enough to target the existing market, but they were also slow
    (IMHO) to take advantage of new opportunities in hardware, reinforcing
    the situation. I think MS and their monopoly on markets caused a
    stagnation - lack of real competition meant lack of progress.

    And looking at
    <https://en.wikipedia.org/wiki/Windows_2.1x>, Windows 2.1 in 1988
    already was available in a Windows/386 version (but the programs were
    running in virtual 8086 mode, i.e., were still 16-bit programs).

    And it was not just MS who was going in that direction. MS and IBM
    worked on OS/2, and despite ambitious goals IBM insisted that the
    software had to run on a 286.


    IBM were famous for poor (and perhaps cowardly) decisions at the time,
    and MS happily screwed them over again and again in regards to OS/2. It
    takes a special kind of bad management for a company of IBM's size to
    make PC's, and to make a PC OS, and yet that OS could not run on their
    own PC's. Later, once OS/2 /did/ run on IBM PC's, they would not sell computers with their own OS pre-installed - you had to first buy the
    machine with the competitor's OS, then buy IBM's OS at retail prices,
    and install it yourself (from some 50-60 floppy disks, IIRC).

    The fact that the 386SX only appeared in 1988 also did not help.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Sun Oct 6 13:04:15 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 19:30, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    Compare this with the pain the x86 world went through, over a much longer
    time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which
    is a very different path.

    Still, the question is why they did the 286 (released 1982) with its
    protected mode instead of adding IA-32 to the architecture, maybe at
    the start with a 386SX-like package and with real-mode only, or with
    the MMU in a separate chip (like the 68020/68851).


    I can only guess the obvious - it is what some big customer(s) were
    asking for. Maybe Intel didn't see the need for 32-bit computing in the
    markets they were targeting, or at least didn't see it as worth the cost.

    Anyone could see the problems that the PDP-11 had with its 16-bit
    limitation. Intel saw it in the iAPX 432 starting in 1975. It is
    obvious that, as soon as memory grows beyond 64KB (and already the
    8086 catered for that), the protected mode of the 80286 would be more
    of a hindrance than even the real mode of the 8086. I find it hard to
    believe that many customers would ask Intel for something like the 80286
    protected mode with segments limited to 64KB, and even if they did, that
    Intel would listen to them. This looks much more like an idee fixe to me
    that one or more of the 286 project leaders had, and all customer
    input was made to fit into this idea, or was ignored.

    Concerning the cost, the 80286 has 134,000 transistors, compared to
    supposedly 68,000 for the 68000, and the 190,000 of the 68020. I am
    sure that Intel could have managed a 32-bit 8086 (maybe even with the
    nice addressing modes that the 386 has in 32-bit mode) with those
    134,000 transistors if Motorola could build the 68000 with half of
    that.

    It is fair enough to target the existing market, but they were also slow
    (IMHO) to take advantage of new opportunities in hardware, reinforcing
    the situation.

    They introduced Windows/386 in 1987.

    I think MS and their monopoly on markets caused a
    stagnation - lack of real competition meant lack of progress.

    Monopoly? These were the times with lots of competition from
    different hardware and software manufacturers. Apple with the Apple
    II, Lisa, and Macintosh, Atari with their 8-bit line and their Atari ST
    line, Commodore with their 8-bit line and their Amiga line, and, on
    the software side, Digital Research with CP/M(-86/68K) and GEM, and
    various Unix offerings, including Xenix. Were they all no real
    competition? Not in my book. It's just that Microsoft eventually
    won, maybe accidentally (as it happens in a winner-takes-all market).

    IBM were famous for poor (and perhaps cowardly) decisions at the time,
    and MS happily screwed them over again and again in regards to OS/2.

    Another interpretation is that MS went faithfully into OS/2, assigning
    not just their Xenix team to it (although according to Wikipedia the
    Xenix abandonment by MS was due to AT&T entering the Unix market) and reportedly also assigned the best MS-DOS developers to OS/2. They
    tried to stick to OS/2 for several years, but eventually were fed up
    with all the bad decisions coming from IBM, and bowed out.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sun Oct 6 16:34:00 2024
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idee fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and
    considered viable a decade after the original inventors had learned
    better.

    Another interpretation is that MS went faithfully into OS/2,
    assigning not just their Xenix team to it (although according
    to Wikipedia the Xenix abandonment by MS was due to AT&T
    entering the Unix market) and reportedly also assigned the best
    MS-DOS developers to OS/2. They tried to stick to OS/2 for
    several years, but eventually were fed up with all the bad
    decisions coming from IBM, and bowed out.

    It's known that they split the work with IBM, such that MS would do a
    redesigned OS/2 that was intended to be version 3.0, while IBM
    concentrated on 2.0. A friend of mine was working on OS/2 within IBM at
    the time, until he left with serious stress and depression: the people management was not good.

    Then MS switched emphasis, so that the Windows API was the primary
    personality of OS/2 3.0, and renamed it Windows NT. That also had an OS/2 personality at the start, along with a POSIX personality.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 18:50:01 2024
    On Thu, 3 Oct 2024 22:17:23 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Thu, 03 Oct 2024 09:39:03 GMT, Anton Ertl wrote:

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC
    is a 64-bit architecture, and that the manual describes only the
    32-bit subset. Maybe the original Power was 32-bit.

    I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.

    The PowerPC 601 was first shown publicly in 1993; I can’t remember
    when the fully 64-bit 620 came out, but it can’t have been long after.


    For all practical purposes, PPC/MPC620 never came out.
    That is, the chip was formally shipped 2-3 years later than originally
    planned in order to fulfill contractual obligations by somebody I don't remember to somebody else I also don't remember. But that 2nd somebody
    barely used it.
    By that time, IBM itself had another 64-bit POWER CPU working. That one
    had a less ambitious microarchitecture than the 620, but also had one advanced feature that didn't appear in "big" POWER CPUs until POWER5 7 years
    later - multi-threading.

    https://en.wikipedia.org/wiki/IBM_RS64


    Motorola did a similar thing with the 68000 family: if you compare
    the original 68000 instruction set with the 68020, you will see the
    latter only needed to fill in a few gaps to become fully 32-bit.


    Not similar at all.

    Compare this with the pain the x86 world went through, over a much
    longer time, to move to 32-bit.

    As far as Intel is responsible, it took only one year longer - 8 years vs 7
    years.
    And if we count only mainline CPUs, then it would be 8 years vs 8 years
    (from POWER to POWER3).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Sun Oct 6 22:07:11 2024
    Michael S wrote:
    On Sat, 05 Oct 2024 18:11:55 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Brett <ggtgp@yahoo.com> writes:
    Didn’t it take a decade for the 386 to get a 32 bit OS

    386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and
    Windows/386 appeared in 1987.

    - anton

    SunOS for i386 in 1988.
    NetWare 3.x in 1990.
    The latter sold in very high volumes.

    It deserved to do so:

    For what it was doing (file/print service), it was by far the most
    efficient product I've ever heard of!

    Drew Major managed to get the total latency of the "ack network
    interrupt, parse incoming packet, determine that it is a read request
    for which the client has the required access rights, locate the
    requested data somewhere in the memory cache, construct a response
    packet and hand it off to the network card" down to 300 clock cycles.

    Those clock cycles _might_ have been measured on a 486 with mostly single-cycle instructions, instead of the original 386, which needed 2+
    clock cycles for lots of things. The point still stands: it was amazingly efficient.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Michael S on Sun Oct 6 21:53:36 2024
    Michael S <already5chosen@yahoo.com> wrote:
    On Sat, 05 Oct 2024 18:11:55 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Brett <ggtgp@yahoo.com> writes:
    Didn’t it take a decade for the 386 to get a 32 bit OS

    386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and
    Windows/386 appeared in 1987.

    - anton

    SunOS for i386 in 1988.
    NetWare 3.x in 1990.
    The latter sold in very high volumes.

    The first 32-bit Windows was Windows 95, a full decade later.
    Windows/386 was 16-bit, as was Windows 2.x.

    I do concede to being wrong about the unix ports.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Brett on Mon Oct 7 06:29:12 2024
    On Sun, 6 Oct 2024 21:53:36 -0000 (UTC), Brett wrote:

    The first 32 bit windows was Windows 95 ...

    Windows NT 3.1, 1993.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Mon Oct 7 06:32:46 2024
    On Sun, 6 Oct 2024 16:34 +0100 (BST), John Dallman wrote:

    Then MS switched emphasis, so that the Windows API was the primary personality of OS/2 3.0, and renamed it Windows NT.

    Dave Cutler came from DEC (where he was one of the resident Unix-haters)
    to mastermind the Windows NT project in 1988. When did the OS/2→NT pivot
    take place?

    That also had an OS/2 personality at the start, along with a POSIX personality.

    Funny, you’d think they would use that same “personality” system to implement WSL1, the Linux-emulation layer. But they didn’t.

    I think the whole “personality” concept, along with the supposed portability to non-x86 architectures, had just bit-rotted away by that
    point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Oct 7 06:33:59 2024
    On Sun, 6 Oct 2024 18:50:01 +0300, Michael S wrote:

    Motorola did a similar thing with the 68000 family: if you compare the
    original 68000 instruction set with the 68020, you will see the latter
    only needed to fill in a few gaps to become fully 32-bit.

    Not similar at all.

    Planning for the next major leap forward, and building the current
    generation as a cut-down version of that? Of course it is similar.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Mon Oct 7 07:33:14 2024
    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.
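
    The 8192-segment figure falls out of the 16-bit selector format: 13 bits
    of table index, one table-indicator bit, and two privilege bits. A minimal
    sketch of that decoding in C (the field and function names here are
    invented for illustration, not Intel's):

        #include <stdint.h>
        #include <stdio.h>

        /* A 286 protected-mode selector is 16 bits:
         * index (13) | table indicator (1) | requested privilege level (2).
         * 2^13 = 8192 possible descriptors per table, hence the limit of
         * 8192 segments discussed above. */
        struct selector_fields {
            unsigned index; /* descriptor index, 0..8191       */
            unsigned ti;    /* 0 = GDT, 1 = LDT                */
            unsigned rpl;   /* requested privilege level, 0..3 */
        };

        static struct selector_fields decode_selector(uint16_t sel)
        {
            struct selector_fields f;
            f.rpl   = sel & 0x3;
            f.ti    = (sel >> 2) & 0x1;
            f.index = sel >> 3;
            return f;
        }

        int main(void)
        {
            struct selector_fields f = decode_selector(0x001F); /* index 3, LDT, RPL 3 */
            printf("index=%u ti=%u rpl=%u\n", f.index, f.ti, f.rpl);
            return 0;
        }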

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lars Poulsen@21:1/5 to Anton Ertl on Mon Oct 7 12:42:20 2024
    On 2024-10-07, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.

    I completely agree. Back when the 8086 was designed, 640K seemed like a
    lot. They never expected it to grow beyond the mainframes of their time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Lars Poulsen on Mon Oct 7 15:17:36 2024
    Lars Poulsen wrote:
    On 2024-10-07, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.

    I completely agree. Back when the 8086 was designed, 640K seemed like a
    lot. They never expected it to grow beyond the mainframes of their time.

    640K was an artifact of the frame buffer placement selected by the
    original IBM PC; it could just as well have been 900+ K.

    AFAIR the PC also mishandled interrupts?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Mon Oct 7 17:45:19 2024
    On Mon, 7 Oct 2024 15:17:36 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Lars Poulsen wrote:
    On 2024-10-07, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode,
    and in the 80286 they finally got around to it. And the idea was
    (like AFAIK in the iAPX432) to have one segment per object and per
    procedure, i.e., the large memory model. The smaller memory
    models were possible, but not really intended. The Huge memory
    model was completely alien to protected mode, as was direct
    hardware access, as was common on the IBM PC. And computing with
    segment register contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would
    have been too inefficient, and also that 8192 segments is not
    enough for that kind of usage, given 640KB of RAM (not to mention
    the 16MB that the 286 supported); and with 640KB having the
    segments limited to 64KB is too restrictive for a number of
    applications.

    I completely agree. Back when the 8086 was designed, 640K seemed
    like a lot. They never expected it to grow beyond the mainframes of
    their time.

    640K was an artifact of the frame buffer placement selected by the
    original IBM PC; it could just as well have been 900+ K.

    AFAIR the PC also mishandled interrupts?

    Terje


    Yes it did.
    IIRC, Intel reserved interrupt vectors 0 to 31 for CPU exceptions and
    future use, but IBM ignored that and put the hardware IRQs at vectors
    8 to 15 and various BIOS services at vectors 16 to 31.
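
    To make the clash concrete, a short sketch of a few of the overlapping
    vector assignments (numbers from Intel's reservation of vectors 0-31 and
    the original PC BIOS; the enum itself is just illustrative):

        /* Intel reserved vectors 0-31 for CPU exceptions and future use;
         * IBM put the 8259 hardware IRQs at 8-15 and BIOS services at 16-31,
         * so e.g. vector 0x0D is both IRQ5 and, on the 286 and later, the
         * general protection fault. */
        enum pc_vectors {
            VEC_DIVIDE_ERROR = 0x00, /* Intel: CPU exception                        */
            VEC_IRQ0_TIMER   = 0x08, /* IBM: timer tick; Intel: double fault (286+) */
            VEC_IRQ1_KBD     = 0x09, /* IBM: keyboard                               */
            VEC_IRQ5         = 0x0D, /* IBM: IRQ5; Intel: #GP on 286+               */
            VEC_BIOS_VIDEO   = 0x10, /* IBM: INT 10h video services                 */
            VEC_BIOS_DISK    = 0x13  /* IBM: INT 13h disk services                  */
        };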

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 16:16:32 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 6 Oct 2024 21:53:36 -0000 (UTC), Brett wrote:

    The first 32 bit windows was Windows 95 ...

    Windows NT 3.1, 1993.

    So 8 years; that PC would still be in the trash can by then.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Anton Ertl on Mon Oct 7 16:32:34 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and
    considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.

    I have for decades pointed out that the four-bit segment shift of the 8086
    was planned obsolescence. An eight-bit shift, giving 16 megabytes of address
    space, would have kept the low end alive for too long in Intel's eyes. To
    control the market you need to drive complexity onto the users, which weeds
    out licensed competition.

    Everything Intel did drove needless patentable complexity into the follow
    on CPUs.
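
    To put numbers on that: real-mode 8086 addresses are formed by shifting
    the 16-bit segment value left four bits and adding the 16-bit offset,
    giving a 20-bit (1 MB) space; the eight-bit shift described above would
    have given 24 bits (16 MB). A small sketch (the "hypothetical" variant is
    of course not real hardware):

        #include <stdint.h>
        #include <stdio.h>

        /* Actual 8086/8088 address formation: 16-byte segment granularity,
         * 20-bit physical addresses, 1 MB address space. */
        static uint32_t phys_8086(uint16_t seg, uint16_t off)
        {
            return (((uint32_t)seg << 4) + off) & 0xFFFFF;
        }

        /* The alternative suggested above: 256-byte granularity would have
         * produced 24-bit addresses, i.e. a 16 MB address space. */
        static uint32_t phys_hypothetical_8bit(uint16_t seg, uint16_t off)
        {
            return (((uint32_t)seg << 8) + off) & 0xFFFFFF;
        }

        int main(void)
        {
            /* The CGA text frame buffer mentioned elsewhere in the thread
             * sits at segment B800h. */
            printf("4-bit shift: %05lX\n",
                   (unsigned long)phys_8086(0xB800, 0x0000));              /* B8000  */
            printf("8-bit shift: %06lX\n",
                   (unsigned long)phys_hypothetical_8bit(0xB800, 0x0000)); /* B80000 */
            return 0;
        }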

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Brett on Mon Oct 7 19:57:44 2024
    On Mon, 7 Oct 2024 16:16:32 -0000 (UTC)
    Brett <ggtgp@yahoo.com> wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 6 Oct 2024 21:53:36 -0000 (UTC), Brett wrote:

    The first 32 bit windows was Windows 95 ...

    Windows NT 3.1, 1993.

    So 8 years, that PC would still be


    Wikipedia:
    Development of i386 technology began in 1982 under the internal name of
    P3.[4] The tape-out of the 80386 development was finalized in July
    1985.[4] The 80386 was introduced as pre-production samples for
    software development workstations in October 1985.[5] Manufacturing of
    the chips in significant quantities commenced in June 1986.

    in the trash can by then.

    Not every PC made in those years was crap. Some of them were quite
    reliable and lasted long.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Brett on Mon Oct 7 20:03:35 2024
    On Mon, 7 Oct 2024 16:32:34 -0000 (UTC)
    Brett <ggtgp@yahoo.com> wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and
    considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and
    in the 80286 they finally got around to it. And the idea was (like
    AFAIK in the iAPX432) to have one segment per object and per
    procedure, i.e., the large memory model. The smaller memory models
    were possible, but not really intended. The Huge memory model was completely alien to protected mode, as was direct hardware access,
    as was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would
    have been too inefficient, and also that 8192 segments is not
    enough for that kind of usage, given 640KB of RAM (not to mention
    the 16MB that the 286 supported); and with 640KB having the
    segments limited to 64KB is too restrictive for a number of
    applications.

    I have for decades pointed out that the four-bit segment shift of the
    8086 was planned obsolescence. An eight-bit shift, giving 16 megabytes
    of address space, would have kept the low end alive for too long in
    Intel's eyes. To control the market you need to drive complexity onto
    the users, which weeds out licensed competition.

    Everything Intel did drove needless patentable complexity into the
    follow on CPUs.

    - anton




    You forget that Intel didn't and couldn't expect that the 8088 would be
    such a stunning success. Not just that: according to oral histories, they
    didn't realize what they had on their hands until 1983.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Michael S on Mon Oct 7 17:40:04 2024
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 7 Oct 2024 16:32:34 -0000 (UTC)
    Brett <ggtgp@yahoo.com> wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and
    considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and
    in the 80286 they finally got around to it. And the idea was (like
    AFAIK in the iAPX432) to have one segment per object and per
    procedure, i.e., the large memory model. The smaller memory models
    were possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access,
    as was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would
    have been too inefficient, and also that 8192 segments is not
    enough for that kind of usage, given 640KB of RAM (not to mention
    the 16MB that the 286 supported); and with 640KB having the
    segments limited to 64KB is too restrictive for a number of
    applications.

    I have for decades pointed out that the four-bit segment shift of the
    8086 was planned obsolescence. An eight-bit shift, giving 16 megabytes
    of address space, would have kept the low end alive for too long in
    Intel's eyes. To control the market you need to drive complexity onto
    the users, which weeds out licensed competition.

    Everything Intel did drove needless patentable complexity into the
    follow on CPUs.

    You forget that Intel didn't and couldn't expect that the 8088 would be
    such a stunning success. Not just that: according to oral histories, they
    didn't realize what they had on their hands until 1983.

    Today the 8088 would be a joke of a microcontroller, but that was not the case when it was introduced. The 8088 was a major project with major profits, not some afterthought.

    Yes, the success eventually dwarfed expectations, but that was a lightning strike; the plan was in place, so the lightning strike could be taken advantage of to build a monopoly instead of the small walled fortress with a moat that was planned.

    You saw what happened to the MC680x0 series, which did not have a moat or a
    good plan.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Oct 7 16:00:38 2024
    Not every PC made in those years was crap. Some of them were quite
    reliable and lasted long.

    But back then, Dennard scaling meant that an 8-year-old PC was so much
    slower than a current PC that it was difficult to find people willing to
    still use it.

    Nowadays, for a large proportion of tasks, you can't really tell the
    difference between a last-generation CPU and an 8-year-old CPU, so reliability is much more of a factor.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Stefan Monnier on Tue Oct 8 00:11:31 2024
    On Mon, 07 Oct 2024 16:00:38 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    Not every PC made in those years was crap. Some of them were quite reliable and lasted long.

    But back then, Dennard scaling meant that an 8 year-old PC was so much
    slower than a current PC that it was difficult to find people willing
    to still use it.

    Nowadays, for a large proportion of tasks, you can't really tell the difference between a last-generation CPU and an 8 year-old CPU, so the reliability is much more of a factor.


    Stefan

    In March 1992, as a new employee, I was given a PC based on a 386SX.
    I don't remember whether the clock was 16 MHz or 20 MHz, but it was no more than 20.
    A year and a half later, when I started working at a client's site most of
    the time, this PC was still my only desktop whenever I came back to the
    office.
    A high-end PC made in 1986, e.g. a Compaq Deskpro 386, would have been
    non-trivially faster than this cheap (but far from the cheapest)
    computer that I was still using daily 7.5 years later.

    Did it feel so slow that it was difficult to use? No, not for what I was
    doing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Oct 7 21:46:28 2024
    On Mon, 7 Oct 2024 19:57:44 +0300, Michael S wrote:

    The 80386 was introduced as pre-production samples for software
    development workstations in October 1985.[5] Manufacturing of the chips
    in significant quantities commenced in June 1986.

    And the first vendor to offer a Microsoft-compatible PC product based on
    that chip? Compaq, with its “Deskpro 386” that same year, I believe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Oct 7 21:52:31 2024
    On Mon, 07 Oct 2024 07:33:14 GMT, Anton Ertl wrote:

    Here's another speculation: The 286 protected mode was what they already
    had in mind when they built the 8086, but there were not enough
    transistors to do it in the 8086, so they did real mode, and in the
    80286 they finally got around to it.

    Nah. Intel were never capable of thinking that far ahead. Each bodge to
    the x80/x86 line was made just to take the product one step further,
    without regard to any future growth. The 8086/8088 was designed to make it
    easy to port across 8080/8085 code, with the segment registers tacked on
    to give you a bit more address space if you needed it, if you could figure
    out how to use it -- that was their idea of “technological progress”.

    And then the next step was to add this new-fangled “hardware memory protection” that the folks using the Big Computers were always going on about, so they bodged the 8086 segmentation scheme into kind of a memory-management scheme in the 80286.

    Finally, in the 80386, they gave everybody the large, linear address space
    they had been crying out for. That is, everybody who wasn’t already using more sensibly-designed chips from companies like Motorola, NatSemi and the
    RISC vendors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Mon Oct 7 21:55:05 2024
    On Mon, 7 Oct 2024 15:17:36 +0200, Terje Mathisen wrote:

    640K was an artifact of the frame buffer placement selected by the
    original IBM PC, it could just as well have been 900+ K.

    Another MS-DOS machine, the DEC Rainbow, could be upgraded to 896KiB of
    RAM. I know because our Comp Sci department had one.

    That was the one with the dual Z80 and 8086 (8088?) chips, so it could run
    3 different OSes: CP/M-80, CP/M-86, and MS-DOS. Not more than one at once, though (that would have been some trick).

    But it was not fully hardware-compatible with the IBM PC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Oct 7 23:13:25 2024
    On Mon, 7 Oct 2024 7:33:14 +0000, Anton Ertl wrote:

    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    Is protected mode not "how Pascal" thinks of memory and objects
    in memory ??

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Whereas by the time 286 got out, everybody was wanting flat
    memory ala C.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Oct 8 06:16:12 2024
    On Mon, 7 Oct 2024 23:13:25 +0000, MitchAlsup1 wrote:

    Is protected mode not "how Pascal" thinks of memory and objects in
    memory ??

    How is that different from C?

    Whereas by the time 286 got out, everybody was wanting flat memory ala
    C.

    When did they not want that?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Oct 8 07:28:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Is protected mode not "how Pascal" thinks of memory and objects
    in memory ??

    You can map the objects of standard C (essentially what one malloc()
    call gives you, or what a variable contains) and the equivalent in
    Pascal into a segment on the 286, but then that object is limited to
    64KB in size, and the program is limited to 8192 objects. And the
    program runs very slowly. So I doubt that Pascal compiler
    implementors thought that the 80286 was their dream machine, especially
    given that you can implement Pascal just as well, with fewer performance disadvantages, on hardware with flat memory (and still perform bounds
    checking where necessary). In particular, I doubt that Turbo Pascal
    used the 286 in this way.

    Whereas by the time 286 got out, everybody was wanting flat
    memory ala C.

    It's interesting that, when C was standardized, segmentation found
    its way into it through the rule that disallows subtracting and comparing
    addresses in different objects. This rules out performing certain
    forms of induction variable elimination by hand. So while flat memory
    is so much a part of C culture that you write "flat memory ala C", the
    standardized subset of C (what standard C fanatics claim is the only
    meaning of "C") actually specifies a segmented memory model.

    An interesting case is the Forth standard. It specifies "contiguous
    regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue Oct 8 10:40:09 2024
    Lawrence D'Oliveiro wrote:
    On Mon, 7 Oct 2024 19:57:44 +0300, Michael S wrote:

    The 80386 was introduced as pre-production samples for software
    development workstations in October 1985.[5] Manufacturing of the chips
    in significant quantities commenced in June 1986.

    And the first vendor to offer a Microsoft-compatible PC product based on
    that chip? Compaq, with its “Deskpro 386” that same year, I believe.

    I got one of those that fall; most impressive was the fact that you
    could order it with a 130 MB hard drive, an almost unheard-of size at
    the time.

    Even though this was an expensive PC, it cost no more with that drive
    (i.e. the highest-end version) than a Micropolis hard drive of the same
    size on its own. I.e. the PC was effectively free. :-)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue Oct 8 10:44:54 2024
    Lawrence D'Oliveiro wrote:
    On Mon, 7 Oct 2024 15:17:36 +0200, Terje Mathisen wrote:

    640K was an artifact of the frame buffer placement selected by the
    original IBM PC, it could just as well have been 900+ K.

    Another MS-DOS machine, the DEC Rainbow, could be upgraded to 896KiB of
    RAM. I know because our Comp Sci department had one.

    That was the one with the dual Z80 and 8086 (8088?) chips, so it could run
    3 different OSes: CP/M-80, CP/M-86, and MS-DOS. Not more than one at once, though (that would have been some trick).

    But it was not fully hardware-compatible with the IBM PC.

    When I was hired by Hydro in 1984, my boss decided that he liked the
    Rainbow best, so he took responsibility for that model, while I got all
    the IBM compatibles: hardware, software, add-on cards, etc. for a company
    with 77K employees in 130 countries.

    By the time Hydro was broken up into 3-4 separate companies (around
    2008?), I had to share that responsibility with 200-300 other people.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Oct 8 16:00:17 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model.

    If you look at the 8086 manuals, that's clearly what they had in mind.

    What I don't get is that the 286's segment stuff was so slow. Even
    if you wanted to write code that put objects in segments, you really
    couldn't. You had to minimize the number of segment switches to get
    decent performance.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, ...

    That's all true, but I'd think the slowness of segment switches would
    be even more of a problem.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Tue Oct 8 16:23:32 2024
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model.

    If you look at the 8086 manuals, that's clearly what they had in mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.
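
    For a sense of what that load involves: each 286 descriptor is an 8-byte
    table entry that has to be fetched and validated on every segment-register
    write. A rough sketch of the layout (field names are mine, not Intel's):

        #include <stdint.h>

        /* Approximate layout of an 80286 segment descriptor, as stored in
         * the GDT or LDT and loaded into a hidden descriptor cache whenever
         * a segment register is written. Fetching and checking these extra
         * bytes is the setup cost being discussed. */
        struct descriptor_286 {
            uint16_t limit;     /* segment limit, 0..65535                   */
            uint16_t base_low;  /* base address bits 0..15                   */
            uint8_t  base_high; /* base address bits 16..23                  */
            uint8_t  access;    /* present bit, DPL, type, accessed bit      */
            uint16_t reserved;  /* must be zero on the 286 (used by the 386) */
        };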

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Oct 8 20:53:00 2024
    On Tue, 8 Oct 2024 6:16:12 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 7 Oct 2024 23:13:25 +0000, MitchAlsup1 wrote:

    Is protected mode not "how Pascal" thinks of memory and objects in
    memory ??

    How is that different from C?

    In Pascal you cannot subtract pointers to different objects;
    in C you can.

    Whereas by the time 286 got out, everybody was wanting flat memory ala
    C.

    When did they not want that?

    The Algol family of block structure gave the illusion that flat memory
    was less necessary and that it could all be done with lexical
    addressing and block scoping rules.

    Then malloc() and mmap() came along.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Oct 8 21:03:40 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    If you look at the 8086 manuals, that's clearly what they had in mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.

    Right, and they appeared not to care or realize it was a performance problem.

    They didn't even do obvious things like checking whether you're reloading the same value into the segment register and skipping the rest of the setup. Sure, you could put such checks in your code and skip the segment load yourself, but that would make your code a lot bigger and uglier.
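
    In source form, the check being described would look roughly like this
    (load_es() is a stand-in for whatever inline assembly actually writes the
    segment register; it is not a real API):

        #include <stdint.h>

        /* Hypothetical helper: the real thing would be a MOV ES,reg or
         * similar in assembly. */
        extern void load_es(uint16_t selector);

        static uint16_t cached_es; /* last selector loaded into ES */

        /* Skip the slow protected-mode descriptor load when the selector has
         * not changed - the software workaround for a check the 286 could
         * have done in hardware. */
        static void set_es(uint16_t selector)
        {
            if (selector != cached_es) {
                load_es(selector);
                cached_es = selector;
            }
        }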

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to D'Oliveiro on Tue Oct 8 22:28:00 2024
    In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    Dave Cutler came from DEC (where he was one of the resident
    Unix-haters) to mastermind the Windows NT project in 1988. When did
    the OS/2-NT pivot take place?

    1990, after the release of Windows 3.0, which was an immediate commercial success. It was the first version that you could get serious work out of.
    It's been compared to a camel: a vicious brute at times, but capable of
    doing a lot of carrying.

    <https://en.wikipedia.org/wiki/OS/2#1990:_Breakup>

    Funny, you'd think they would use that same _personality_ system to
    implement WSL1, the Linux-emulation layer. But they didn't.

    They were called subsystems in Windows NT, and ran on top of the NT
    kernel. The POSIX one came first, and was very limited, followed by the
    Interix one that was called Windows Services for Unix. Programs for both
    of these were in PE-COFF format, not ELF. There was also the OS/2
    subsystem, but it only ran text-mode programs.

    The POSIX subsystem was there to meet US government purchasing
    requirements, not to be used for anything serious. I can't imagine Dave
    Cutler was keen on it.

    WSL1 seems to have been something odd: rather than a single subsystem, a
    bunch of mini-subsystems. However, VMS/NT-style kernels just make different assumptions about programs than Unix-style kernels do, so they went to
    lightweight virtualisation in WSL2.

    <https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux#History>

    The same problem seems to have messed up all the attempts to provide good
    Unix emulation on VMS. It's notable that MICA started out trying to
    provide both VMS and Unix APIs, but this was dropped in favour of a
    separate Unix OS before MICA was cancelled.

    <https://en.wikipedia.org/wiki/DEC_MICA#Design_goals>

    I think the whole _personality_ concept, along with the supposed
    portability to non-x86 architectures, had just bit-rotted away by
    that point.

    Some combination of that, Microsoft confidence that "of course we can do something better now!" - they are very prone to overconfidence - and the terrible tendency of programmers to ignore the details of the old code.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Wed Oct 9 08:48:26 2024
    On 08/10/2024 22:53, MitchAlsup1 wrote:
    On Tue, 8 Oct 2024 6:16:12 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 7 Oct 2024 23:13:25 +0000, MitchAlsup1 wrote:

    Is protected mode not "how Pascal" thinks of memory and objects in
    memory ??

    How is that different from C?

    In pascal you cannot subtract pointers to different objects,
    in C you can.

    No, you can't - unless the pointers are of compatible types, and each
    points to a sub-object within the same encompassing object. So if you
    have two pointers that point within the same array, you can subtract
    them. If they point to different objects, trying to subtract them is
    undefined behaviour.


    Whereas by the time 286 got out, everybody was wanting flat memory ala
    C.

    When did they not want that?

    The Algol family of block structure gave the illusion that flat
    was less necessary and it could all be done with lexical address-
    ing and block scoping rules.

    Then malloc() and mmap() came along.

    malloc() does not need a flat memory space. C does not need a flat
    memory space. Indeed, people use C all the time on systems where memory
    is disjoint.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Wed Oct 9 10:24:34 2024
    On 08/10/2024 09:28, Anton Ertl wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    Whereas by the time 286 got out, everybody was wanting flat
    memory ala C.

    It's interesting that, when C was standardized, the segmentation found
    its way into it by disallowing subtracting and comparing between
    addresses in different objects.

    It is difficult to talk about the timing of features (either things that
    are allowed, or things explicitly disallowed) before the standardisation
    of C, as there was no single language "C". Different variants supported
    by different compilers had different rules.

    This disallows performing certain
    forms of induction variable elimination by hand. So while flat memory
    is C culture so much that you write "flat memory ala C", the
    standardized subset of C (what standard C fanatics claim is the only
    meaning of "C") actually specifies a segmented memory model.


    No, the C standard does not in any sense specify a segmented memory
    model. Nor does it specify a non-segmented or flat or contiguous memory.

    The nearest it gets is the description of converting between pointers
    and integers, where it says that the result of converting a pointer to an
    integer might not fit in any integer type, in which case the conversion
    is undefined behaviour - but if pointers /are/ convertible, the intention
    is that the value (of type "uintptr_t") should be consistent with "the addressing structure of the execution environment".

    The way C is specified is intended to be strong enough to allow
    programmers to do all they generally need to do using portable code
    (i.e., code that doesn't rely on anything other than standard
    behaviour), without unnecessarily restricting the kinds of systems that
    can implement C, and without unnecessarily restricting what people can
    write in non-portable code.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library. (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    In practice, on all but the most niche or specialised platforms, if you
    do feel you need to compare random pointers, you can cast them to
    uintptr_t and compare these. That will generally work on segmented, non-contiguous or flat memories.
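
    A sketch of that idiom; as noted, it is not guaranteed by the standard,
    but it works on typical implementations where pointers convert cleanly to
    integers:

        #include <stdint.h>
        #include <stdbool.h>

        /* Compare two possibly-unrelated pointers by converting them to
         * uintptr_t. Not strictly portable C - the standard only promises
         * that converting back yields the original pointer - but fine on the
         * usual flat-memory (and many segmented) implementations. */
        static bool ptr_before(const void *a, const void *b)
        {
            return (uintptr_t)a < (uintptr_t)b;
        }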


    An interesting case is the Forth standard. It specifies "contiguous regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.


    Forth does not require a flat memory model in the hardware, as far as I
    am aware, any more than C does. (I appreciate that your knowledge of
    Forth is /vastly/ greater than mine.) A Forth implementation could
    interpret part of the address value as the segment or other memory block identifier and part of it as an index into that block, just as a C implementation can.

    A flat address model is almost certainly more /efficient/, for C, Forth
    and many other languages. But that does not mean a particular model is /required/ or specified by the language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Wed Oct 9 16:28:19 2024
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library. (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Oct 9 16:42:38 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library. (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    In almost every mainstream implementation, memmove() is written
    in assembler in order to inject the appropriate prefetches and
    follow the recommended instruction usage per the target architecture's
    software optimization guide.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Wed Oct 9 18:10:44 2024
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot: their own implementation
    details. (Some do not, such as f2c.)
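
    One common way to get such a sentinel without relying on implementation
    details is to reserve a distinct static object for it, so that only
    pointer equality (which is always defined) is needed; the names below are
    invented for the example:

        #include <stdio.h>

        struct expr { int op; /* ... */ };

        /* A distinct object whose address can never compare equal to a real
         * node or to NULL, used as an "error occurred while parsing" marker. */
        static struct expr parse_error_obj;
        #define PARSE_ERROR (&parse_error_obj)

        static struct expr *parse_subexpr(const char *src)
        {
            if (src == NULL || *src == '\0')
                return PARSE_ERROR; /* an error, distinct from NULL */
            /* ... real parsing would go here ... */
            return NULL;            /* "no subexpression" in this toy example */
        }

        int main(void)
        {
            if (parse_subexpr("") == PARSE_ERROR)
                puts("parse error");
            return 0;
        }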

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to John Dallman on Wed Oct 9 13:37:41 2024
    John Dallman wrote:
    In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:

    Dave Cutler came from DEC (where he was one of the resident
    Unix-haters) to mastermind the Windows NT project in 1988. When did
    the OS/2-NT pivot take place?

    1990, after the release of Windows 3.0, which was an immediate commercial success. It was the first version that you could get serious work out of. It's been compared to a camel: a vicious brute at times, but capable of
    doing a lot of carrying.

    <https://en.wikipedia.org/wiki/OS/2#1990:_Breakup>

    Funny, you'd think they would use that same _personality_ system to
    implement WSL1, the Linux-emulation layer. But they didn't.

    They were called subsystems in Windows NT, and ran on top of the NT
    kernel. The POSIX one came first, and was very limited, followed by the Interix one that was called Windows Services for Unix. Programs for both
    of these were in PE-COFF format, not ELF. There was also the OS/2
    subsystem, but it only ran text-mode programs.

    The POSIX subsystem was there to meet US government purchasing
    requirements, not to be used for anything serious. I can't imagine Dave Cutler was keen on it.

    The Posix interface support was there so *MS* could bid on US government
    and military contracts which, in that time frame, were making noise about
    it being standard for all their contracts.
    The Posix DLLs didn't come with WinNT; you had to ask MS for them specially.

    The US government eventually stopped pushing for Posix and Windows
    support for it quietly disappeared.

    WinNT's OS2 subsystem also quietly disappeared.

    WSL1 seems to have been something odd: rather than a single subsystem, a bunch of mini-subsystems. However, VMS/NT kernels just have different assumptions about programs from Unix-style kernels, so they went to lightweight virtualisation in WSL2.

    Yes. VMS and WinNT handle memory sections differently than *nix.
    That difference makes the fork() system call essentially impossible to
    implement on VMS/WinNT except by copying the address space.

    Note that back then Posix did not require that fork() be supported,
    just fork-exec (aka spawn), which does not require duplicating the memory space, just carrying file and socket handles over to the child process, which
    NT handles natively.
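
    That fork-exec-without-fork path is roughly what posix_spawn(), added to
    POSIX much later, packages up; a minimal sketch with error handling
    trimmed:

        #include <spawn.h>
        #include <stdio.h>
        #include <sys/types.h>
        #include <sys/wait.h>

        extern char **environ;

        /* Launch a child without first duplicating the parent's address
         * space - the "spawn" model that maps naturally onto VMS/NT process
         * creation. */
        int main(void)
        {
            pid_t pid;
            char *argv[] = { "ls", "-l", NULL };

            if (posix_spawnp(&pid, "ls", NULL, NULL, argv, environ) != 0) {
                perror("posix_spawnp");
                return 1;
            }
            waitpid(pid, NULL, 0);
            return 0;
        }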

    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations
    in page table and page file handling.

    Whereas in *nix a process can map a file, with just one user of that section; then it forks, and now there are multiple section users. Then that child can
    change the address space and fork again. *nix needs to maintain various
    data structures to support forking memory just in case it happens.

    WSL1 was an _emulation_ of Linux, essentially as a subsystem in the way OS/2 and
    Posix were supported. WSL1 apparently supported fork(), but did so by
    copying the memory space, making it slow, whereas fork-exec/spawn would be fast. Trying to emulate Linux with a privileged subsystem of helper processes
    was likely (I never used it) a lot of work, slow, and flaky.

    WSL2 sounds like they tossed the whole WSL1 thing and built a Hyper-V
    virtual machine to run native Linux on top of WinNT as a host.

    <https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux#History>

    The same problem seems to have messed up all the attempts to provide good Unix emulation on VMS. It's notable that MICA started out trying to
    provide both VMS and Unix APIs, but this was dropped in favour of a
    separate Unix OS before MICA was cancelled.

    <https://en.wikipedia.org/wiki/DEC_MICA#Design_goals>

    I think the whole _personality_ concept, along with the supposed
    portability to non-x86 architectures, had just bit-rotted away by
    that point.

    Some combination of that, Microsoft confidence that "of course we can do something better now!" - they are very prone to overconfidence - and the terrible tendency of programmers to ignore the details of the old code.

    John

    Back then "object oriented" and "micro-kernel" buzzwords were all the rage.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Oct 9 16:01:42 2024
    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations
    in page table and page file handling.

    Interesting. Do you happen to have a pointer for further reading
    about it?

    *nix needs to maintain various data structures to support forking
    memory just in case it happens.

    I can't imagine what those data structures would be (which might be just
    another way to say that I was brought up on POSIX and can't imagine the
    world differently).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Wed Oct 9 22:22:16 2024
    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Wed Oct 9 22:20:42 2024
    On 09/10/2024 18:28, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".  Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.  (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    They don't have to write it in standard, portable C. Standard libraries
    will, sometimes, use "magic" - they can be in assembly, or use compiler extensions, or target-specific features, or "-fsecret-flag-for-std-lib" compiler flags, or implementation-dependent features, or whatever they want.

    You will find that most implementations of memmove() are done by
    converting the pointers to an unsigned integer type and comparing those
    values. The type chosen may be implementation-dependent, or it may be "uintptr_t" (even if you are using C90 for your code, the library
    writers can use C99 for theirs).
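
    A sketch of the kind of implementation being described, with the same
    caveat: it leans on pointer-to-integer conversion behaving sensibly, which
    a library author may assume for their own platform but portable code may
    not (this is a simplified illustration, not any particular libc's code):

        #include <stddef.h>
        #include <stdint.h>

        /* Simplified byte-at-a-time memmove: pick the copy direction by
         * comparing the addresses as integers. Real implementations copy in
         * wider units and add prefetching. */
        void *my_memmove(void *dst, const void *src, size_t n)
        {
            unsigned char *d = dst;
            const unsigned char *s = src;

            if ((uintptr_t)d < (uintptr_t)s) {
                for (size_t i = 0; i < n; i++)      /* copy forwards  */
                    d[i] = s[i];
            } else {
                for (size_t i = n; i > 0; i--)      /* copy backwards */
                    d[i - 1] = s[i - 1];
            }
            return dst;
        }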

    Such implementations will not be portable to all systems. They won't
    work on a target that has some kind of "fat" pointers or segmented
    pointers that can't be translated properly to integers.

    That's okay, of course. For targets that have such complications, that standard library function will be written in a different way.

    The avrlibc library used by gcc for the AVR has its memmove()
    implemented in assembly for speed, as does musl for some architectures.

    There are lots of parts of the standard C library that cannot be written completely in portable standard C. (How would you write a function that handles files? You need non-portable OS calls.) That's why these
    things are in the standard library in the first place.
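
    For concreteness, a minimal sketch of the technique described above - not any particular library's actual code. The direction test converts the pointers to uintptr_t, which relies on implementation-defined behaviour, so this is library/implementation code rather than portable standard C:

    #include <stddef.h>
    #include <stdint.h>

    void *my_memmove(void *s1, const void *s2, size_t n)
    {
        unsigned char *d = s1;
        const unsigned char *s = s2;
        if ((uintptr_t)d < (uintptr_t)s) {
            for (size_t i = 0; i < n; i++)       /* destination below source: copy forwards */
                d[i] = s[i];
        } else {
            for (size_t i = n; i > 0; i--)       /* destination above source: copy backwards */
                d[i - 1] = s[i - 1];
        }
        return s1;
    }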

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Wed Oct 9 21:37:30 2024
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to David Brown on Wed Oct 9 14:52:39 2024
    On 10/9/2024 1:20 PM, David Brown wrote:
    On 09/10/2024 18:28, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".  Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.  (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    They don't have to write it in standard, portable C.  Standard libraries will, sometimes, use "magic" - they can be in assembly, or use compiler extensions, or target-specific features, or "-fsecret-flag-for-std-lib" compiler flags, or implementation-dependent features, or whatever they
    want.

    You will find that most implementations of memmove() are done by
    converting the pointers to an unsigned integer type and comparing those values.  The type chosen may be implementation-dependent, or it may be "uintptr_t" (even if you are using C90 for your code, the library
    writers can use C99 for theirs).

    Such implementations will not be portable to all systems.  They won't
    work on a target that has some kind of "fat" pointers or segmented
    pointers that can't be translated properly to integers.

    That's okay, of course.  For targets that have such complications, that standard library function will be written in a different way.

    The avrlibc library used by gcc for the AVR has its memmove()
    implemented in assembly for speed, as does musl for some architectures.

    There are lots of parts of the standard C library that cannot be written completely in portable standard C.  (How would you write a function that handles files?  You need non-portable OS calls.)  That's why these
    things are in the standard library in the first place.

    I agree with everything you say up until the last sentence. There are
    several languages, mostly older ones like Fortran and COBOL, where the
    file handling/I/O are defined portably within the language proper, not
    in a separate library. It just moves the non-portable stuff from the
    library writer (as in C) to the compiler writer (as in Fortran, COBOL, etc.)


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stefan Monnier on Wed Oct 9 23:16:34 2024
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations
    in page table and page file handling.

    Interesting. Do you happen to have a pointer for further reading
    about it?

    *nix needs to maintain various data structures to support forking
    memory just in case it happens.

    I can't imagine what those datastructures would be (which might be just another way to say that I was brought up on POSIX and can't imagine the
    world differently).


    http://bitsavers.org/pdf/dec/vax/vms/training/EY-8264E-DP_VMS_Internals_and_Data_Structures_4.4_1988.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Oct 10 00:33:41 2024
    On Wed, 9 Oct 2024 21:52:39 +0000, Stephen Fuld wrote:

    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be written
    completely in portable standard C.  (How would you write a function that
    handles files? 

    Do you mean things other than open(), close(), read(), write(), lseek()
    ??

    You need non-portable OS calls.)  That's why these
    things are in the standard library in the first place.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Thu Oct 10 08:30:37 2024
    On 10/10/2024 02:33, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 21:52:39 +0000, Stephen Fuld wrote:

    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be written completely in portable standard C.  (How would you write a function that handles files?

    Do you mean things other than open(), close(), read(), write(), lseek()
    ??


    The C standard library provides functions like fopen(), fclose(),
    fwrite(), etc. It provides them because programs often need such functionality, and you cannot write them yourself in portable standard
    C. (As Stephen pointed out, C could have had them built into the
    language - for many good reasons, C did not go that route.)

    The functions you list here are the POSIX names - not the C standard
    library names. Those POSIX functions cannot be implemented in portable standard C either if you exclude making wrappers around the standard
    library functions.

    In both cases - implementing the standard library functions or
    implementing the POSIX functions - you need something beyond standard C,
    such as a way to call OS APIs.

    You need non-portable OS calls.)  That's why these
    things are in the standard library in the first place.
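
    As a sketch of that point (illustrative only; my_FILE and my_fopen are hypothetical and hugely simplified): even a bare-bones fopen() has to bottom out in something outside standard C, here the POSIX open()/close() calls, which in turn wrap OS system calls.

    #include <fcntl.h>     /* POSIX open() - already beyond standard C */
    #include <stdlib.h>
    #include <unistd.h>    /* POSIX close() */

    struct my_FILE { int fd; };            /* hypothetical, no buffering etc. */

    struct my_FILE *my_fopen(const char *path, const char *mode)
    {
        int flags = (mode[0] == 'r') ? O_RDONLY
                  : (mode[0] == 'w') ? (O_WRONLY | O_CREAT | O_TRUNC)
                  :                    (O_WRONLY | O_CREAT | O_APPEND);  /* "a" */
        int fd = open(path, flags, 0666);
        if (fd < 0)
            return NULL;
        struct my_FILE *f = malloc(sizeof *f);
        if (f == NULL) {
            close(fd);
            return NULL;
        }
        f->fd = fd;
        return f;
    }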

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Stephen Fuld on Thu Oct 10 08:24:32 2024
    On 09/10/2024 23:52, Stephen Fuld wrote:
    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be
    written completely in portable standard C.  (How would you write a
    function that handles files?  You need non-portable OS calls.)  That's
    why these things are in the standard library in the first place.

    I agree with everything you say up until the last sentence.  There are several languages, mostly older ones like Fortran and COBOL, where the
    file handling/I/O are defined portably within the language proper, not
    in a separate library.  It just moves the non-portable stuff from the library writer (as in C) to the compiler writer (as in Fortran, COBOL,
    etc.)



    I meant that this is why these features have to be provided, rather than
    left for the user to implement themselves. They could also have been
    provided in the language itself (as was done in many other languages) -
    the point is that you cannot write the file access functions in pure
    standard C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Thu Oct 10 08:31:52 2024
    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not. It has absolutely /nothing/ to do with the ISA.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Thu Oct 10 18:38:55 2024
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects? For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not. It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Thu Oct 10 20:00:29 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:


    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C. That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers. Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code generator,
    its run-time support library, and C standard libraries that can work
    better if they are optimised for each new generation of processor.
    Sometimes you just need to re-compile the library with a newer compiler
    and appropriate flags, other times you need to modify the library source code. None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcopy() if you have an ISA that supports scalable vector
    processing of some kind, such as ARM and RISC-V have, rather than
    explicitly sized SIMD registers.


    Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle memcpy
    and memset.

    They're three-instruction sets; prolog/body/epilog. There are separate
    sets for forward vs. forward-or-backward copies.

    The prolog instruction preconditions the copy and copies
    an IMPDEF portion.

    The body instruction performs an IMPDEF portion, and
    the epilog instruction finalizes the copy.

    The three instructions are issued consecutively.
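
    For concreteness, a hedged sketch of how the forward-only forms (CPYFP/CPYFM/CPYFE) might be wrapped from C with inline assembly, assuming a GCC/Clang toolchain and assembler with FEAT_MOPS support (e.g. -march=armv8.8-a) and a core that implements the instructions:

    #include <stddef.h>

    static void mops_memcpy_fwd(void *dst, const void *src, size_t n)
    {
        void *d = dst;
        const void *s = src;
        __asm__ volatile(
            "cpyfp [%0]!, [%1]!, %2!\n\t"   /* prolog: precondition, copy an IMPDEF portion */
            "cpyfm [%0]!, [%1]!, %2!\n\t"   /* body:   copy an IMPDEF portion */
            "cpyfe [%0]!, [%1]!, %2!"       /* epilog: finalize the copy */
            : "+r"(d), "+r"(s), "+r"(n)
            :
            : "memory");
    }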

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Thu Oct 10 21:21:20 2024
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects? For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C. That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers. Often that can be eliminated when the compiler optimises the functions inline - when the
    compiler knows the size of the move/copy, it can optimise directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code generator,
    its run-time support library, and C standard libraries that can work
    better if they are optimised for each new generation of processor.
    Sometimes you just need to re-compile the library with a newer compiler
    and appropriate flags, other times you need to modify the library source
    code. None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcpy() if you have an ISA that supports scalable vector
    processing of some kind, such as ARM and RISC-V have, rather than
    explicitly sized SIMD registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Thu Oct 10 23:54:15 2024
    On Thu, 10 Oct 2024 20:00:29 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:


    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C. That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly
    or using inline assembly, rather than in non-portable C (which is
    the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers. Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise directly.

    The use of wider register sizes can help to some extent, but not
    once you have reached the width of the internal buses or cache
    bandwidth.

    In general, there will be many aspects of a C compiler's code
    generator, its run-time support library, and C standard libraries
    that can work better if they are optimised for each new generation
    of processor. Sometimes you just need to re-compile the library with
    a newer compiler and appropriate flags, other times you need to
    modify the library source code. None of this is specific to
    memmove().

    But it is true that you get an easier and more future-proof
    memmove() and memcopy() if you have an ISA that supports scalable
    vector processing of some kind, such as ARM and RISC-V have, rather
    than explicitly sized SIMD registers.


    Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
    memcpy and memset.

    They're three-instruction sets; prolog/body/epilog. There are
    separate sets for forward vs. forward-or-backward copies.

    The prolog instruction preconditions the copy and copies
    an IMPDEF portion.

    The body instruction performs an IMPDEF Portion and

    the epilog instruction finalizes the copy.

    The three instructions are issued consecutively.

    People with more of a clue about Arm Inc.'s schedule than I have
    expect Arm Cortex cores that implement these instructions to be
    announced next May and to appear in actual [expensive] phones in 2026.
    That probably means 2027 at best for Neoverse cores.

    It's hard to make an educated guess about the schedules of other Arm core
    designers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Thu Oct 10 21:03:33 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 10 Oct 2024 20:00:29 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:


    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C. That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly
    or using inline assembly, rather than in non-portable C (which is
    the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers. Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    The use of wider register sizes can help to some extent, but not
    once you have reached the width of the internal buses or cache
    bandwidth.

    In general, there will be many aspects of a C compiler's code
    generator, its run-time support library, and C standard libraries
    that can work better if they are optimised for each new generation
    of processor. Sometimes you just need to re-compile the library with
    a newer compiler and appropriate flags, other times you need to
    modify the library source code. None of this is specific to
    memmove().

    But it is true that you get an easier and more future-proof
    memmove() and memcopy() if you have an ISA that supports scalable
    vector processing of some kind, such as ARM and RISC-V have, rather
    than explicitly sized SIMD registers.


    Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
    memcpy and memset.

    They're three-instruction sets; prolog/body/epilog. There are
    separate sets for forward vs. forward-or-backward copies.

    The prolog instruction preconditions the copy and copies
    an IMPDEF portion.

    The body instruction performs an IMPDEF Portion and

    the epilog instruction finalizes the copy.

    The three instructions are issued consecutively.

    People that have more clue about Arm Inc schedule than myself
    expect Arm Cortex cores that implement these instructions to be
    announced next May and to appear in actual [expensive] phones in 2026.
    Which probably means 2027 at best for Neoverse cores.

    It's hard to make an educated guess about schedule of other Arm core designers.

    In the meantime, they've had "DC ZVA" for the special case of
    memset(,0,) since ARMv8.0.
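
    A minimal sketch of that special case, assuming AArch64 with GCC/Clang inline assembly; the pointer and length are assumed to be already aligned to the ZVA block size, and the DCZID_EL0.DZP "prohibited" check is omitted for brevity:

    #include <stddef.h>
    #include <stdint.h>

    /* Block size in bytes zeroed by one DC ZVA, from DCZID_EL0 (the BS
       field is log2 of the size in 4-byte words). */
    static size_t zva_block_size(void)
    {
        uint64_t dczid;
        __asm__("mrs %0, dczid_el0" : "=r"(dczid));
        return (size_t)4 << (dczid & 0xf);
    }

    static void zero_aligned(char *p, size_t len)
    {
        size_t blk = zva_block_size();
        for (size_t i = 0; i + blk <= len; i += blk)
            __asm__ volatile("dc zva, %0" : : "r"(p + i) : "memory");
    }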

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Thu Oct 10 21:30:38 2024
    On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:

    On 10/10/2024 20:38, MitchAlsup1 wrote:
    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.

    {
    memmove( p, q, size );
    }

    Where the compiler produces the MM instruction itself. Looks damn
    close to standard C to me !!
    OR
    for( int i = 0; i < size; i++ )
    p[i] = q[i];

    Which gets compiled to memcpy()--also looks to be standard C.
    OR

    p_struct = q_struct;

    gets compiled to::

    memmove( &p_struct, &q_struct, sizeof( q_struct ) );

    also looks to be std C.

    The thing is you are no longer writing memmove(), you are simply
    teaching the compiler to recognize its _use_ cases directly. In
    addition, these will always be within spitting distance of as fast
    as one can perform those activities.

    That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers.

    Given that we are talking about GBOoO machines here, the several
    AGEN units[1,2,3] have plenty of calculation BW to determine order
    without wasting cycles getting started.

    But given an LBIO machine, the ability to process memory-to-memory moves
    at cache port width is always an advantage except for cases needing
    only 1 read or 1 write--if you build the HW with these in mind.

    Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.

    In HW they should always be optimized.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Fri Oct 11 14:10:13 2024
    On 10/10/2024 23:30, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:

    On 10/10/2024 20:38, MitchAlsup1 wrote:
    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.

          {
               memmove( p, q, size );
          }


    What is that circular reference supposed to do? The whole discussion
    has been about the /fact/ that you cannot implement the "memmove"
    function in a C standard library using fully portable standard C code.

    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?


    You can implement "memcpy" in portable standard C, using a loop and
    array or pointer syntax (somewhat like your loop below, but with the
    correct type for the index). But you cannot do so for memmove() because
    you cannot identify the direction you need to run your loop in an
    efficient and fully portable manner.
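
    For instance, this kind of copy /is/ expressible in pure standard C (a sketch, with size_t as the index type); the same loop cannot be turned into a conforming memmove() because nothing in standard C tells you which direction is safe when the objects overlap:

    #include <stddef.h>

    void *my_memcpy(void *s1, const void *s2, size_t n)
    {
        unsigned char *d = s1;
        const unsigned char *s = s2;
        for (size_t i = 0; i < n; i++)       /* forward copy; fine when there is no overlap */
            d[i] = s[i];
        return s1;
    }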

    It does not matter what the target is - the target is totally irrelevant
    for /portable/ standard C code. If the target made a difference, it
    would not be portable!

    I can't understand why this is causing you difficulty.

    Perhaps you simply didn't understand what you wrote a few posts back,
    when you claimed that the reason people writing portable standard C code
    cannot write an efficient memmove() implementation is "a symptom of bad
    ISA design".


    Where the compiler produces the MM instruction itself. Looks damn
    close to standard C to me !!
    OR
          for( int i = 0, i < size; i++ )
               p[i] = q[i];

    Which gets compiled to memcpy()--also looks to be standard C.
    OR

          p_struct = q_struct;

    gets compiled to::

          memmove( &p_struct, &q_struct, sizeof( q_struct ) );

    also looks to be std C.


    Those are standard C, yes. And a good compiler will optimise such code.
    And if the target has some kind of scalable vector support or other
    dedicated instructions for moving or copying memory, it can do a better
    job of optimising the code.

    That has /nothing/ to do with the point under discussion.


    I think you are simply confused about what you are talking about here.
    Either you don't know what is meant by writing portable standard C, or
    you don't know what is meant by implementing a C standard library, or
    you haven't actually been reading the posts you replied to. You seem determined to make the point that /your/ ISA has useful and efficient instructions and features for memory copy functionality, while the x86
    ISA does not, and that means /your/ ISA is good design and the x86 ISA
    is bad design.

    Now, I will fully agree with you that the x86 is not a good design. The
    modern x86 processor devices are proof that you /can/ polish a turd.
    And I fully agree with you that arbitrary-length vector instructions of various sorts (of which memory copying is the simplest operation) have many advantages over SIMD using fixed-size vector
    registers. (ARM and RISC-V also agree with you there.)

    But that is all irrelevant to the discussion.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Brian G. Lucas on Fri Oct 11 13:37:03 2024
    On 10/10/2024 23:19, Brian G. Lucas wrote:
    On 10/10/24 2:21 PM, David Brown wrote:
    [ SNIP]

    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said there
    was no connection between the two concepts.

    If the compiler generates the memmove instruction, then one doesn't
    have to write memmove() in C - it is never called/used.


    The common case is that a good compiler will generate inline code for
    some cases - typically known (at compile-time) small sizes - and call a
    generic library function when the size is not known or is over a certain
    size. Then there are some targets where it will always call the library
    code, and some where it will always generate inline code.

    Even if the compiler /can/ generate inline code, there can be
    circumstances when it will not do so - such as if you have not enabled optimisation, or are optimising for size, or using a weaker compiler, or calling the function indirectly.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code
    generator, its run-time support library, and C standard libraries that
    can work better if they are optimised for each new generation of
    processor. Sometimes you just need to re-compile the library with a
    newer compiler and appropriate flags, other times you need to modify
    the library source code.  None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcopy() if you have an ISA that supports scalable vector
    processing of some kind, such as ARM and RISC-V have, rather than
    explicitly sized SIMD registers.


    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable to
    /what/ ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Fri Oct 11 15:13:17 2024
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his my66k
    LLVM back end.

    But I am pretty sure that what you suggest is applicable, though a bad idea
    for a memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification, i.e.
    exactly the same mechanism that is done on "non-scalable"
    architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
    effort.
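
    A minimal sketch of that dispatch idea; all the names here (cpu_has_sve(), memcpy_sve(), memcpy_generic()) are hypothetical placeholders rather than any real library's API, and a production library would more likely resolve the choice at load time (e.g. via an ifunc) than with a lazy check:

    #include <stddef.h>

    void *memcpy_sve(void *d, const void *s, size_t n);      /* hypothetical SVE version */
    void *memcpy_generic(void *d, const void *s, size_t n);  /* hypothetical fallback */
    int cpu_has_sve(void);                                    /* hypothetical probe, e.g. via hwcaps */

    static void *(*memcpy_impl)(void *, const void *, size_t);

    void *my_memcpy(void *d, const void *s, size_t n)
    {
        if (!memcpy_impl)                     /* resolve once, on first use */
            memcpy_impl = cpu_has_sve() ? memcpy_sve : memcpy_generic;
        return memcpy_impl(d, s, n);
    }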

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Fri Oct 11 16:54:13 2024
    On 11/10/2024 14:13, Michael S wrote:
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that that it is not applicable to his my66k
    LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad idea
    for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification, i.e.
    exactly the same mechanism that is done on "non-scalable"
    architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
    effort.


    That explanation helps a little, but only a little. I wasn't suggesting anything - or if I was, it was several posts ago and the context has
    long since been snipped. Can you be more explicit about what you think
    I was suggesting, and why it might not be a good idea for targeting a
    "my66k" ISA? (That is not a processor I have heard of, so you'll have
    to give a brief summary of any particular features that are relevant here.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Stephen Fuld on Fri Oct 11 08:15:29 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:

    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be
    written completely in portable standard C. (How would you write
    a function that handles files? You need non-portable OS calls.)
    That's why these things are in the standard library in the first
    place.

    I agree with everything you say up until the last sentence. There
    are several languages, mostly older ones like Fortran and COBOL,
    where the file handling/I/O are defined portably within the
    language proper, not in a separate library. It just moves the
    non-portable stuff from the library writer (as in C) to the
    compiler writer (as in Fortran, COBOL, etc.)

    What I think you mean is that I/O and file handling are defined as
    part of the language rather than being written in the language.
    Assuming that's true, what you're saying is not at odds with what
    David said. I/O and so forth cannot be written in unaugmented
    standard C without changing the language. Given the language as
    it is, these things must be put in the standard library, because
    they cannot be provided in the existing language.

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library. In
    particular, it makes for a very clean distinction between two
    kinds of implementation, what the C standard calls a freestanding implementation (which excludes most of the library) and a hosted
    implementation (which includes the whole library). This facility
    is what allows C to run easily on very small processors, because
    there is no overhead for non-essential language features. That is
    not to say such things couldn't be arranged for Fortran or COBOL,
    but it would be harder, because those languages are not designed
    to be separable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Fri Oct 11 18:55:29 2024
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Fri Oct 11 15:21:47 2024
    Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations in page table and page file handling.
    Interesting. Do you happen to have a pointer for further reading
    about it?

    *nix needs to maintain various data structures to support forking
    memory just in case it happens.
    I can't imagine what those datastructures would be (which might be just
    another way to say that I was brought up on POSIX and can't imagine the
    world differently).


    http://bitsavers.org/pdf/dec/vax/vms/training/EY-8264E-DP_VMS_Internals_and_Data_Structures_4.4_1988.pdf

    Yeah, that's a great book on how VMS works in detail.
    My copy is v1.0 from 1981.
    It describes the various data structures, some down to the bit level.
    Then chapter 15 Paging Dynamics walks through the details of how
    paging works.

    A book of comparable detail on Linux (but dated) would be:

    Understanding the Linux Virtual Memory Manager, Gorman, 2007 https://www.kernel.org/doc/gorman/pdf/understand.pdf

    Of a similar nature on Windows but without the detail of the above two is:

    (this appears to be two volumes jammed together)
    Windows Internals 6th ed vol 1&2, 2012 https://empyreal96.github.io/nt-info-depot/Windows-Internals-PDFs/Windows%20Internals%206e%20Part1%2B2.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Sat Oct 12 00:02:32 2024
    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

          .global memmove
    memmove:
          MM     R2,R1,R3
          RET

    sure !

    You are either totally clueless, or you are trolling. And I know you
    are not clueless.

    This discussion has become pointless.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Fri Oct 11 23:32:20 2024
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

          .global memmove
    memmove:
          MM     R2,R1,R3
          RET

    sure !

    You are either totally clueless, or you are trolling. And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed, these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers'
    efforts over decades, so they don't have to re-write libc every
    time a new set of instructions comes out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sat Oct 12 05:06:05 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const? That causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Sat Oct 12 05:11:44 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:

    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different
    objects? For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they
    can implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard
    library memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    Throughout this long thread you keep missing the point. Having
    different instructions available doesn't change the definition
    of the C language. It is possible to write code in standard C
    (which means, C that does NOT depend on any internal details of
    any implementation) to copy bytes from one place to another with
    semantics matching those of memmove(), BUT that code is clunky.
    To get a decent implementation of memmove() semantics requires
    knowledge of some internal implementation details that are not
    part of standard C. Whether those details are part of the
    compiler or part of the runtime environment (the library) is
    irrelevant - they still aren't part of standard C. Adding new
    instructions to the ISA, no matter what those new instructions
    are, cannot change that.
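
    For example, the "clunky" fully portable route might look like this sketch: correct in pure standard C, but it pays for a temporary buffer and an extra pass, and it can fail where a real memmove() cannot.

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    void *portable_memmove(void *s1, const void *s2, size_t n)
    {
        if (n == 0)
            return s1;
        unsigned char *tmp = malloc(n);
        if (tmp == NULL)
            return NULL;             /* the real memmove() has no failure case */
        memcpy(tmp, s2, n);          /* source to scratch buffer */
        memcpy(s1, tmp, n);          /* scratch buffer to destination */
        free(tmp);
        return s1;
    }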

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Sat Oct 12 15:20:13 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations in page table and page file handling.
    Interesting. Do you happen to have a pointer for further reading
    about it?

    *nix needs to maintain various data structures to support forking
    memory just in case it happens.
    I can't imagine what those datastructures would be (which might be just
    another way to say that I was brought up on POSIX and can't imagine the
    world differently).


    http://bitsavers.org/pdf/dec/vax/vms/training/EY-8264E-DP_VMS_Internals_and_Data_Structures_4.4_1988.pdf

    Yeah, that's a great book on how VMS works in detail.
    My copy is v1.0 from 1981.

    I also have a printed copy from 1981, along with the
    internals class notes and the microfiche.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Sat Oct 12 17:16:44 2024
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of different ways that functions like memcpy() and memset() can be implemented in different architectures and optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bernd Linsel@21:1/5 to David Brown on Sat Oct 12 19:26:30 2024
    On 12.10.24 17:16, David Brown wrote:


    [snip rant]



    You are aware that this is c.arch, not c.lang.c?



    --
    Bernd Linsel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Brian G. Lucas on Sat Oct 12 18:17:18 2024
    Brian G. Lucas <bagel99@gmail.com> wrote:
    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Excellent!

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    What is the default virtual loop count if the register count is not
    available?

    Worst case the source and dest are in cache, and the count is 150 cycles
    away in memory. So hundreds of chars could be copied until the value is
    loaded and that count value could be say 5. Lots of work and time
    discarded, so you play the odds, perhaps to the low side and over prefetch
    to cover being wrong.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to David Brown on Sat Oct 12 18:33:17 2024
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would enjoy hearing about comparisons of different ways things functions like memcpy() and memset() can be implemented in different architectures and optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.

    There are only two decisions to make in memcpy: are the copies less than
    copy-size aligned, and do the pointers overlap within the copy size.

    For hardware this simplifies down to perhaps two types of copies, easy and hard.

    If you make hard fast, and you will, then two versions is all you need, not
    the dozens of choices with 1k of code you need in C.

    Often you know which of the two you want at compile time from the pointer
    type.

    In short, your complaints are wrong-headed in not understanding what
    hardware memcpy can do.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sat Oct 12 18:32:48 2024
    On Sat, 12 Oct 2024 5:06:05 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch prediction is easier and the code is shorter.

    The 3rd Operand can, indeed, be a constant.
    That causes no restartability problem when you have a place to
    store the current count==index, so that when control returns
    and you re-execute MM, it sees that x amount has already been
    done, and C-X is left.

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    That is what Predication is for.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sat Oct 12 18:37:35 2024
    On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:

    Brian G. Lucas <bagel99@gmail.com> wrote:
    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Excellent!

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    What is the default virtual loop count if the register count is not available?

    There is always a count available; it can come from a register or an
    immediate.

    Worst case the source and dest are in cache, and the count is 150 cycles
    away in memory. So hundreds of chars could be copied until the value is loaded and that count value could be say 5.

    The instruction cannot start until the count is known. You don't start
    an FMAC until all 3 operands are ready, either.

    Lots of work and time
    discarded, so you play the odds, perhaps to the low side and over
    prefetch to cover being wrong.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sun Oct 13 01:25:13 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:

    Brian G. Lucas <bagel99@gmail.com> wrote:
    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Excellent!

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    What is the default virtual loop count if the register count is not
    available?

    There is always a count available; it can come from a register or an immediate.

    Worst case the source and dest are in cache, and the count is 150 cycles
    away in memory. So hundreds of chars could be copied until the value is
    loaded and that count value could be say 5.

    The instruction cannot start until the count is known. You don't start
    an FMAC until all 3 operands are ready, either.

    That simplifies a lot of issues, thanks!

    Lots of work and time
    discarded, so you play the odds, perhaps to the low side and over
    prefetch to cover being wrong.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Sun Oct 13 10:56:20 2024
    On Sat, 12 Oct 2024 18:32:48 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Sat, 12 Oct 2024 5:06:05 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const? That causes issues for restartability, but branch prediction is easier and the code is shorter.

    The 3rd Operand can, indeed, be a constant.
    That causes no restartability problem when you have a place to
    store the current count==index, so that when control returns
    and you re-execute MM, it sees that x amount has already been
    done, and C-X is left.

    I don't understand this paragraph.
    Does a constant as the 3rd operand cause a restartability problem?
    Or does it not?
    If it does not, then how?
    Do you have a private field in thread state? Saved on the stack by
    interrupt uCode?
    OS people would not like it. They prefer to have full control even when
    they don't use it 99.999% of the time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Brett on Sun Oct 13 10:31:49 2024
    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of different ways things functions like
    memcpy() and memset() can be implemented in different architectures and
    optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.



    [ snip discussion of HW ]


    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that David
    is defending is that memmove() cannot be implemented "efficiently" in /standard/ C source code, on /any/ HW, because it would require
    comparing /C pointers/ that point to potentially different /C objects/,
    which is not defined behavior in standard C, whether compiled to machine
    code, or executed by an interpreter of C code, or executed by a human programmer performing what was called "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficiently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU instructions, or by dedicated instructions such as Mitch's MM, and David
    is not disputing that. But Mitch seems not to understand or not to see
    the issue about standard C vs memmove().
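
    For illustration, a typical non-standard formulation is a sketch like the
    following, which assumes that converting pointers to uintptr_t yields a
    meaningful address order (something the C standard does not guarantee):

    #include <stddef.h>
    #include <stdint.h>

    void *memmove_sketch(void *dest, const void *src, size_t n)
    {
        unsigned char *d = dest;
        const unsigned char *s = src;
        if ((uintptr_t)d < (uintptr_t)s) {
            for (size_t i = 0; i < n; i++)      /* copy upwards */
                d[i] = s[i];
        } else {
            for (size_t i = n; i > 0; i--)      /* copy downwards */
                d[i - 1] = s[i - 1];
        }
        return dest;
    }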

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Sun Oct 13 12:00:48 2024
    On Fri, 11 Oct 2024 16:54:13 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 11/10/2024 14:13, Michael S wrote:
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his
    my66k LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad
    idea for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification,
    i.e. exactly the same mechanism that is done on "non-scalable" architectures, would provide better performance. And memcpy/memmove
    is certainly sufficiently important to justify an additional
    development effort.


    That explanation helps a little, but only a little. I wasn't
    suggesting anything - or if I was, it was several posts ago and the
    context has long since been snipped.

    You suggested that "scalable" vector extensions are preferable for a memcpy/memmove implementation over "non-scalable" SIMD.

    Can you be more explicit about
    what you think I was suggesting, and why it might not be a good idea
    for targeting a "my66k" ISA? (That is not a processor I have heard
    of, so you'll have to give a brief summary of any particular features
    that are relevant here.)


    The proper spelling appears to be My 66000.
    For starters, My 66000 has no SIMD. It does not even have a dedicated FP
    register file. Both FP and Int share a common 32 x 64-bit register space.

    More importantly, it has a dedicated instruction with exactly the same
    semantics as memmove(). Pretty much the same as ARM64. In both cases the
    instruction is defined, but not yet implemented in production silicon.
    The difference is that in the case of ARM64 we can be reasonably sure
    that it will eventually be implemented in production silicon. Which means
    that in at least several out of the multitude of implementations it will
    suck.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Niklas Holsti on Sun Oct 13 12:26:22 2024
    On Sun, 13 Oct 2024 10:31:49 +0300
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I
    know you are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply
    to? Are you interested in replying, and engaging in the
    discussion? Or are you just looking for a chance to promote your
    own architecture, no matter how tenuous the connection might be to
    other posts?

    Again, let me say that I agree with what you are saying - I agree
    that an ISA should have instructions that are efficient for what
    people actually want to do. I agree that it is a good thing to
    have instructions that let performance scale with advances in
    hardware ideally without needing changes in compiled binaries, and
    at least without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I
    would enjoy hearing about comparisons of different ways things
    functions like memcpy() and memset() can be implemented in
    different architectures and optimised for different sizes, or how
    scalable vector instructions can work in comparison to fixed-size
    SIMD instructions.

    But at the moment, this potential is lost because you are posting
    total shite about implementing memmove() in standard C. It is
    disappointing that someone with your extensive knowledge and
    experience cannot see this. I am finding it all very frustrating.




    [ snip discussion of HW ]


    In short your complaints are wrong headed in not understanding what hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that
    David is defending is that memmove() cannot be implemented
    "efficiently" in /standard/ C source code, on /any/ HW, because it
    would require comparing /C pointers/ that point to potentially
    different /C objects/, which is not defined behavior in standard C,
    whether compiled to machine code, or executed by an interpreter of C
    code, or executed by a human programmer performing what was called
    "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficiently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU instructions, or by dedicated instructions such as Mitch's MM, and
    David is not disputing that. But Mitch seems not to understand or not
    to see the issue about standard C vs memmove().


    A sufficiently advanced compiler can recognize patterns and replace them
    with built-in sequences.

    In the case of memmove() the most easily recognizable pattern in 100%
    standard C99 appears to be:

    void *memmove(void *dest, const void *src, size_t count)
    {
        if (count > 0) {
            char tmp[count];
            memcpy(tmp, src, count);
            memcpy(dest, tmp, count);
        }
        return dest;
    }

    I don't suggest that the real implementation in Brian's compiler is like
    that. Much more likely his implementation uses non-standard C and looks
    approximately like:
    void *memmove(void *dest, const void *src, size_t count)
    {
        return __builtin_memmove(dest, src, count);
    }

    However, implementing the first variant efficiently is well within the
    abilities of a good compiler.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Bernd Linsel on Sun Oct 13 12:57:06 2024
    On 12/10/2024 19:26, Bernd Linsel wrote:
    On 12.10.24 17:16, David Brown wrote:


    [snip rant]



    You are aware that this is c.arch, not c.lang.c?


    Absolutely, yes.

    But in a thread branch discussing C, details of C are relevant.

    I don't expect any random regular here to know "language lawyer" details
    of the C standards. I don't expect people here to care about them.
    People in comp.lang.c care about them - for people here, the main
    interest in C is for programs to run on the computer architectures that
    are the real interest.


    But if someone engages in a conversation about C, I /do/ expect them to understand some basics, and I /do/ expect them to read and think about
    what other posters write. The point under discussion was that you
    cannot implement an efficient "memmove()" function in fully portable
    standard C. That's a fact - it is a well-established fact. Another
    clear and inarguable fact is that particular ISAs or implementations are completely irrelevant to fully portable standard C - that is both the
    advantage and the disadvantage of writing code in portable standard C.

    All I am asking Mitch to do is to understand this, and to stop saying
    silly things (such as implementing memmove() by calling memmove(), or
    that the /reason/ you can't implement memmove() efficiently in portable standard C is weaknesses in the x86 ISA), so that we can clear up his misunderstandings and move on to the more interesting computer
    architecture discussions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Michael S on Sun Oct 13 13:33:55 2024
    On 2024-10-13 12:26, Michael S wrote:
    On Sun, 13 Oct 2024 10:31:49 +0300
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:

    [ snip ]

    But at the moment, this potential is lost because you are posting
    total shite about implementing memmove() in standard C. It is
    disappointing that someone with your extensive knowledge and
    experience cannot see this. I am finding it all very frustrating.




    [ snip discussion of HW ]


    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that
    David is defending is that memmove() cannot be implemented
    "efficiently" in /standard/ C source code, on /any/ HW, because it
    would require comparing /C pointers/ that point to potentially
    different /C objects/, which is not defined behavior in standard C,
    whether compiled to machine code, or executed by an interpreter of C
    code, or executed by a human programmer performing what was called
    "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficiently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU
    instructions, or by dedicated instructions such as Mitch's MM, and
    David is not disputing that. But Mitch seems not to understand or not
    to see the issue about standard C vs memmove().


    A sufficiently advanced compiler can recognize patterns and replace them
    with built-in sequences.


    Sure.


    In the case of memmove() the most easily recognizable pattern in 100%
    standard C99 appears to be:

    void *memmove(void *dest, const void *src, size_t count)
    {
        if (count > 0) {
            char tmp[count];
            memcpy(tmp, src, count);
            memcpy(dest, tmp, count);
        }
        return dest;
    }


    Yes.


    I don't suggest that the real implementation in Brian's compiler is like
    that. Much more likely his implementation uses non-standard C and looks
    approximately like:
    void *memmove(void *dest, const void *src, size_t count)
    {
        return __builtin_memmove(dest, src, count);
    }

    However, implementing the first variant efficiently is well within the
    abilities of a good compiler.


    Yes, but it is not required by the C standard, so the fact remains that
    there is no standard way of implementing memmove() in a way that is
    "efficient" in the sense that it ensures that a copy to and from a
    temporary will /not/ happen.

    In practice, of course, memmove() is implemented in a non-portable way
    or by in-line code, as everybody understands.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Sun Oct 13 14:10:20 2024
    On 13/10/2024 11:00, Michael S wrote:
    On Fri, 11 Oct 2024 16:54:13 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 11/10/2024 14:13, Michael S wrote:
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his
    my66k LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad
    idea for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification,
    i.e. exactly the same mechanism that is done on "non-scalable"
    architectures, would provide better performance. And memcpy/memmove
    is certainly sufficiently important to justify an additional
    development effort.


    That explanation helps a little, but only a little. I wasn't
    suggesting anything - or if I was, it was several posts ago and the
    context has long since been snipped.

    You suggested that "scalable" vector extensions are preferable for a memcpy/memmove implementation over "non-scalable" SIMD.

    I certainly suggested that they have some advantages, yes. I don't know
    nearly enough details about implementations and practical usage to know
    if scalable vector instructions are /always/ better than non-scalable
    SIMD with fixed-size registers, either from the viewpoint of their
    efficiency at runtime or their implementation in hardware.

    It seems to me that if the compiler knows the size of a memcpy/memmove,
    then the best results would probably be achieved by the compiler
    inlining the copy using fixed size registers of a suitable size. If it
    does not know the size, then I would expect (but I don't know for sure)
    that a hardware scalable vector instruction should be more efficient
    than using fixed-size registers. If that were not the case, then I
    wonder why scalable vector hardware has become popular recently in ISAs.

    If you - or someone else - knows enough to say more about this, then I'd
    be glad to learn about it.


    Can you be more explicit about
    what you think I was suggesting, and why it might not be a good idea
    for targeting a "my66k" ISA? (That is not a processor I have heard
    of, so you'll have to give a brief summary of any particular features
    that are relevant here.)


    The proper spelling appears to be My 66000.
    For starters, My 66000 has no SIMD. It does not even have a dedicated FP register file. Both FP and Int share a common 32 x 64-bit register space.


    OK.

    More importantly, it has a dedicated instruction with exactly the same
    semantics as memmove(). Pretty much the same as ARM64. In both cases the
    instruction is defined, but not yet implemented in production silicon.
    The difference is that in the case of ARM64 we can be reasonably sure
    that it will eventually be implemented in production silicon. Which means
    that in at least several out of the multitude of implementations it will
    suck.


    So if I understand you correctly, your argument is that scalable vector instructions - at least for copying memory - are slow in hardware implementations, and thus it would be better to simply copy memory in a
    loop using larger fixed-size registers? I would find that surprising,
    but as I said, I don't know the details of implementations.

    (I do know that in the 68k family, the hardware division instruction was dropped for later devices after it was realised that a software division routine was faster than the hardware instruction. So such strange
    things have happened.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Brett on Sun Oct 13 13:58:14 2024
    On 12/10/2024 20:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of different ways things functions like
    memcpy() and memset() can be implemented in different architectures and
    optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.

    There are only two decisions to make in memcpy, are the copies less than
    copy sized aligned, and do the pointers overlap in copy size.


    Are you confused about memcpy() and memmove()? If so, let's clear that
    one up from the start. For memcpy(), there are no overlap issues - the
    person using it promises that the source and destination areas do not
    overlap, and no one cares what might happen if they do. For memmove(),
    the areas /may/ overlap, and the copy is done as though the source were
    copied first to a temporary area, and then from the temporary area to
    the destination.

    For memcpy(), there can be several issues to consider for efficient implementations that can be skipped for a simple loop copying byte for
    byte. An efficient implementation will probably want to copy with
    larger sizes, such as using 32-bit, 64-bit, or bigger registers. For
    some targets, that is only possible for aligned data (and for some,
    unaligned accesses may be allowed but emulated by traps, making them
    massively slower than byte-by-byte accesses). The best choice of size
    will be implementation and target dependent, as will methods of
    determining alignment (if that is relevant). I'm guessing that by your somewhat muddled phrase "are the copies less than copy sized aligned",
    you meant something on those lines.
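
    To make the larger-size point concrete, a minimal sketch of such an
    implementation-specific memcpy() (illustrative only; it assumes 8-byte
    accesses are the profitable width and that the casts are acceptable on
    the target, as they usually are inside a C library) might test alignment
    once and then copy in 64-bit chunks:

    #include <stddef.h>
    #include <stdint.h>

    void *memcpy_sketch(void *dest, const void *src, size_t n)
    {
        unsigned char *d = dest;
        const unsigned char *s = src;
        /* Copy 64-bit chunks while both pointers are 8-byte aligned. */
        if (((uintptr_t)d % 8 == 0) && ((uintptr_t)s % 8 == 0)) {
            uint64_t *dw = (uint64_t *)d;
            const uint64_t *sw = (const uint64_t *)s;
            while (n >= 8) {
                *dw++ = *sw++;
                n -= 8;
            }
            d = (unsigned char *)dw;
            s = (const unsigned char *)sw;
        }
        while (n--)                /* remaining tail, byte by byte */
            *d++ = *s++;
        return dest;
    }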

    For memmove(), you generally also need to decide if your copy loop
    should run upwards or downwards, and that must be done in an implementation-dependent manner. It is conceivable that for a target
    with more complex memory setups - perhaps allowing the same memory to be accessible in different ways via different segments - that this is not
    enough.

    For hardware this simplifies down to perhaps two types of copies, easy and hard.

    For most targets, yes.


    If you make hard fast, and you will, then two versions are all you need, not the dozens of choices with 1k of code you need in C.


    That makes little sense. What "1k of code" do you need in C?
    Implementations of memcpy() and memmove() are implementation and target-specific, not general portable standard C. There is no single C implementation of these functions.

    It is an obvious truism that if you have hardware instructions that can implement an efficient memcpy() and/or memmove() on a target, then the implementation-specific implementations of these functions on that
    target will be small, simple and efficient.

    Often you know which of the two you want at compile time from the pointer type.

    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.


    What complaints? I haven't made any complaints about implementing these functions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Sun Oct 13 15:45:37 2024
    David Brown <david.brown@hesbynett.no> writes:
    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.

    When you implement something like, say

    vsum(double *a, double *b, double *c, size_t n);

    where a, b, and c may point to arrays in different objects, or may
    point to overlapping parts of the same object, and the result vector c
    in the overlap case should be the same as in the no-overlap case
    (similar to memmove()), being able to compare pointers to possibly
    different objects also comes in handy.
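
    A sketch of such a vsum (illustrative; the overlap test converts the
    pointers to uintptr_t, which is an implementation-specific assumption
    rather than portable standard C) falls back to a temporary buffer when
    the output aliases an input, so the result matches the no-overlap case:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static int overlaps(const void *p, const void *q, size_t bytes)
    {
        uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
        return a < b + bytes && b < a + bytes;
    }

    void vsum(double *a, double *b, double *c, size_t n)
    {
        size_t bytes = n * sizeof *c;
        if (n == 0)
            return;
        if (overlaps(c, a, bytes) || overlaps(c, b, bytes)) {
            double *tmp = malloc(bytes);   /* error handling omitted */
            for (size_t i = 0; i < n; i++)
                tmp[i] = a[i] + b[i];
            memcpy(c, tmp, bytes);
            free(tmp);
        } else {
            for (size_t i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }
    }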

    Another example is when the programmer uses the address as a key in,
    e.g., a binary search tree. And, as you write, casting to intptr_t is
    not guaranteed to work by the C standard, either.

    An example that probably compares pointers to the same object as far
    as the C standard is concerned, but feel like different objects to the programmer, is logic variables (in, e.g., a Prolog implementation).
    When you have two free variables, and you unify them, in the
    implementation one variable points to the other one. Now which should
    point to which? The younger variable should point to the older one,
    because it will die sooner. How do you know which variable is
    younger? You compare the addresses; the variables reside on a stack,
    so the younger one is closer to the top.

    If that stack is one object as far as the C standard is concerned,
    there is no problem with that solution. If the stack is implemented
    as several objects (to make it easier growable; I don't know if there
    is a Prolog implementation that does that), you first have to check in
    which piece it is (maybe with a binary search), and then possibly
    compare within the stack piece at hand.
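
    As a sketch of that binding step (illustrative names; it assumes the
    variable cells live in one stack array, i.e. a single C object growing
    towards higher addresses, so the element comparison is well defined):

    typedef struct cell {
        struct cell *ref;   /* NULL while the variable is unbound */
    } cell;

    /* Bind the younger variable (nearer the top of the stack, here the
       higher address) to the older one, since it will die sooner. */
    void bind_free_vars(cell *x, cell *y)
    {
        if (x > y)
            x->ref = y;
        else
            y->ref = x;
    }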

    An interesting case is the Forth standard. It specifies "contiguous
    regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.


    Forth does not require a flat memory model in the hardware, as far as I
    am aware, any more than C does. (I appreciate that your knowledge of
    Forth is /vastly/ greater than mine.) A Forth implementation could
    interpret part of the address value as the segment or other memory block identifier and part of it as an index into that block, just as a C implementation can.

    I.e., what you are saying is that one can simulate a flat-memory model
    on a segmented memory model. Certainly. In the case of the 8086 (and
    even more so on the 286) the costs of that are so high that no
    widely-used Forth system went there.

    One can also simulate segmented memory (a natural fit for many
    programming languages) on flat memory. In this case the cost is much
    smaller, plus it gives the maximum flexibility about segment/object
    sizes and numbers. That is why flat memory has won.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to David Brown on Sun Oct 13 19:36:04 2024
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 19:26, Bernd Linsel wrote:
    On 12.10.24 17:16, David Brown wrote:


    [snip rant]



    You are aware that this is c.arch, not c.lang.c?


    Absolutely, yes.

    But in a thread branch discussing C, details of C are relevant.

    I don't expect any random regular here to know "language lawyer" details
    of the C standards. I don't expect people here to care about them.
    People in comp.lang.c care about them - for people here, the main
    interest in C is for programs to run on the computer architectures that
    are the real interest.


    But if someone engages in a conversation about C, I /do/ expect them to understand some basics, and I /do/ expect them to read and think about
    what other posters write. The point under discussion was that you
    cannot implement an efficient "memmove()" function in fully portable
    standard C. That's a fact - it is a well-established fact. Another
    clear and inarguable fact is that particular ISAs or implementations are completely irrelevant to fully portable standard C - that is both the advantage and the disadvantage of writing code in portable standard C.

    All I am asking Mitch to do is to understand this, and to stop saying
    silly things (such as implementing memmove() by calling memmove(), or
    that the /reason/ you can't implement memmove() efficiently in portable standard C is weaknesses in the x86 ISA), so that we can clear up his misunderstandings and move on to the more interesting computer
    architecture discussions.

    MemMove in C is fundamentally two void pointers and a count of bytes to
    move.

    C does not care what the alignment of those two void pointers is.

    ALUs are so cheap as to be free: a dedicated MM unit can have a shifter
    and mask with a buffer, and happily copy aligned chunks from the source
    and write aligned chunks to the dest, even though both are oddly aligned
    in different ways, and overlapping the same buffer.

    Note that writes have byte enables; you can write 5 bytes in one go to
    cache, to finish off the end of a series of aligned writes.

    My 66000 only has one MM instruction because when you throw enough hardware
    at the problem, one instruction is all you need.

    And it also covers MemCopy, and yes there is a backwards copy version.

    I detailed the hardware to do this several years ago on Real World Tech.
    And such hardware has been available for many decades in DMA units.

    The 6502 based GameBoy had a MemMove DMA unit as it was many times faster copying bytes than the 6502 was, and doubled the overall performance of the GameBoy.

    One ring to rule them all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Sun Oct 13 21:21:11 2024
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.  That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers.  Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.

    What you are missing here David is the fact that Mitch's MM is a single instruction which does the entire memmove() operation, and has the
    inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very close
    to optimal manner, for both short and long transfers.

    I.e. totally removing the need for compiler tricks or wide register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today), and
    then the memmove() calls will usually be inlined.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Brett on Sun Oct 13 19:43:34 2024
    Brett <ggtgp@yahoo.com> writes:
    David Brown <david.brown@hesbynett.no> wrote:

    All I am asking Mitch to do is to understand this, and to stop saying
    silly things (such as implementing memmove() by calling memmove(), or
    that the /reason/ you can't implement memmove() efficiently in portable
    standard C is weaknesses in the x86 ISA), so that we can clear up his
    misunderstandings and move on to the more interesting computer
    architecture discussions.
    <snip>
    My 66000 only has one MM instruction because when you throw enough hardware at the problem, one instruction is all you need.

    And it also covers MemCopy, and yes there is a backwards copy version.

    I detailed the hardware to do this several years ago on Real World Tech.

    Such hardware (memcpy/memmove/memfill) was available in 1965 on the Burroughs medium systems mainframes. In the 80s, support was added for hashing
    strings as well.

    It's not a new concept. In fact, there were some tricks that could
    be used with overlapping source and destination buffers that would
    replicate chunks of data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Sun Oct 13 23:01:53 2024
    On Sun, 13 Oct 2024 19:43:34 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Brett <ggtgp@yahoo.com> writes:
    David Brown <david.brown@hesbynett.no> wrote:

    All I am asking Mitch to do is to understand this, and to stop
    saying silly things (such as implementing memmove() by calling
    memmove(), or that the /reason/ you can't implement memmove()
    efficiently in portable standard C is weaknesses in the x86 ISA),
    so that we can clear up his misunderstandings and move on to the
    more interesting computer architecture discussions.
    <snip>
    My 66000 only has one MM instruction because when you throw enough
    hardware at the problem, one instruction is all you need.

    And it also covers MemCopy, and yes there is a backwards copy
    version.

    I detailed the hardware to do this several years ago on Real World
    Tech.

    Such hardware (memcpy/memmove/memfill) was available in 1965 on the
    Burroughs medium systems mainframes. In the 80s, support was added
    for hashing strings as well.

    It's not a new concept. In fact, there were some tricks that could
    be used with overlapping source and destination buffers that would
    replicate chunks of data).

    The difference is that today, for strings of a certain size, say from 200
    bytes to half of your L1D cache, if your precious HW copies less than 50
    bytes per clock then people would complain that it is slower than a snail.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Mon Oct 14 15:19:32 2024
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different
    objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    What you are missing here David is the fact that Mitch's MM is a single instruction which does the entire memmove() operation, and has the
    inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very close
    to optimal manner, for both short and long transfers.

    I am not missing that at all. And I agree that an advanced hardware MM instruction could be a very efficient way to implement both memcpy and
    memmove. (For my own kind of work, I'd worry about such looping
    instructions causing an unbounded increase in interrupt latency, but
    that too is solvable given enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction, you
    don't need to re-write the implementation for your memmove() and
    memcpy() library functions for every new generation of processors of a
    given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time. You will /sometimes/ get benefits from doing so, but it is not as simple as Mitch made out.


    I.e. totally removing the need for compiler tricks or wide register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today), and
    then the memmove() calls will usually be inlined.


    The original compiler library issue was that it is impossible to write an efficient memmove() implementation using pure portable standard C. That
    is independent of any ISA, any specialist instructions for memory moves,
    and any compiler optimisations. And it is independent of the fact that
    some good compilers can inline at least some calls to memcpy() and
    memmove() today, using whatever instructions are most efficient for the
    target.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Mon Oct 14 17:04:28 2024
    On 13/10/2024 17:45, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.

    When you implements something like, say

    vsum(double *a, double *b, double *c, size_t n);

    where a, b, and c may point to arrays in different objects, or may
    point to overlapping parts of the same object, and the result vector c
    in the overlap case should be the same as in the no-overlap case
    (similar to memmove()), being able to compare pointers to possibly
    different objects also comes in handy.


    OK, I can agree with that - /if/ you need such a function. I'd suggest
    that when you are writing code that might call such a function, you've a
    very good idea whether you want to do "vec_c = vec_a + vec_b;", or
    "vec_c += vec_a;" (that is, "b" and "c" are the same). In other words,
    the programmer calling vsum already knows if there are overlaps, and
    you'd get the best results if you had different functions for the
    separate cases.

    It is conceivable that you don't know if there is an overlap, especially
    if you are only dealing with parts of arrays rather than full arrays,
    but I think such cases will be rare.

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it. Since a fully
    defined portable method might not be possible (or at least, not
    efficiently possible) for some weird targets, and it's a good thing that
    C supports weird targets, I think perhaps the ideal would be to have
    some feature that exists if and only if you can do sensible comparisons.
    This could be an additional <stdint.h> pointer type, or some pointer
    compare macros, or a pre-defined macro to say if you can simply use
    uintptr_t for the purpose (as you can on most modern C implementations).

    Another example is when the programmer uses the address as a key in,
    e.g., a binary search tree. And, as you write, casting to intptr_t is
    not guaranteed to work by the C standard, either.

    Casting to uintptr_t (why would one want a /signed/ address?) is all you
    need for most systems - and for any target where casting to uintptr_t
    will not be sufficient here, the type uintptr_t will not exist and you
    get a nice, safe hard compile-time error rather than silently UB code.
    For uses like this, you don't need to compare pointers - comparing the
    integers converted from the pointers is fine. (Imagine a system where converted addresses consist of a 16-bit segment number and a 16-bit
    offset, where the absolute address is the segment number times a scale
    factor, plus the offset. You can't easily compare two pointers for real address ordering by converting them to an integer type, but the result
    of casting to uintptr_t is still fine for your binary tree.)
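
    A minimal sketch of that binary-tree use (illustrative names; the only
    assumptions are that uintptr_t exists and that distinct pointers convert
    to distinct integers, as they do on most modern implementations):

    #include <stddef.h>
    #include <stdint.h>

    struct node {
        const void  *key;
        struct node *left, *right;
    };

    /* Order nodes by (uintptr_t)key; the ordering need not correspond to
       any physical address order, it only has to be consistent. */
    struct node **find_slot(struct node **root, const void *key)
    {
        uintptr_t k = (uintptr_t)key;
        while (*root != NULL) {
            uintptr_t cur = (uintptr_t)(*root)->key;
            if (k == cur)
                break;
            root = (k < cur) ? &(*root)->left : &(*root)->right;
        }
        return root;
    }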


    An example that probably compares pointers to the same object as far
    as the C standard is concerned, but feel like different objects to the programmer, is logic variables (in, e.g., a Prolog implementation).
    When you have two free variables, and you unify them, in the
    implementation one variable points to the other one. Now which should
    point to which? The younger variable should point to the older one,
    because it will die sooner. How do you know which variable is
    younger? You compare the addresses; the variables reside on a stack,
    so the younger one is closer to the top.

    If that stack is one object as far as the C standard is concerned,
    there is no problem with that solution. If the stack is implemented
    as several objects (to make it easier growable; I don't know if there
    is a Prolog implementation that does that), you first have to check in
    which piece it is (maybe with a binary search), and then possibly
    compare within the stack piece at hand.


    My only experience of Prolog was working through a short tutorial
    article when I was a teenager - I have no idea about implementations!

    But again I come back to the same conclusion - there are situations
    where being able to compare addresses can be useful, but it is very rare
    for most programmers to ever actually need to do so. And I think it is
    good that there is a widely portable way to achieve this, by casting to uintptr_t and comparing those integers. There are things that people
    want to do with C programming that can be done with
    implementation-specific code, but which cannot be done with fully
    portable standard code. While it is always nice if you /can/ use fully portable solutions (while still being clear and efficient), it's okay to
    have non-portable code when you need it.

    An interesting case is the Forth standard. It specifies "contiguous
    regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.


    Forth does not require a flat memory model in the hardware, as far as I
    am aware, any more than C does. (I appreciate that your knowledge of
    Forth is /vastly/ greater than mine.) A Forth implementation could
    interpret part of the address value as the segment or other memory block
    identifier and part of it as an index into that block, just as a C
    implementation can.

    I.e., what you are saying is that one can simulate a flat-memory model
    on a segmented memory model.

    Yes.

    Certainly. In the case of the 8086 (and
    even more so on the 286) the costs of that are so high that no
    widely-used Forth system went there.


    OK.

    That's much the same as C on segmented targets.

    One can also simulate segmented memory (a natural fit for many
    programming languages) on flat memory. In this case the cost is much smaller, plus it gives the maximum flexibility about segment/object
    sizes and numbers. That is why flat memory has won.


    Sure, flat memory is nicer in many ways.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Mon Oct 14 16:40:26 2024
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.
    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and has
    the inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very close
    to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced hardware MM instruction could be a very efficient way to implement both memcpy and memmove.  (For my own kind of work, I'd worry about such looping instructions causing an unbounded increase in interrupt latency, but
    that too is solvable given enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction, you
    don't need to re-write the implementation for your memmove() and
    memcpy() library functions for every new generation of processors of a
    given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will /sometimes/ get benefits from doing so, but it is not as simple as Mitch made out.


    I.e. totally removing the need for compiler tricks or wide register
    operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today),
    and then the memmove() calls will usually be inlined.


    The original compiler library issue was that it is impossible to write an
    is independent of any ISA, any specialist instructions for memory moves,
    and any compiler optimisations.  And it is independent of the fact that some good compilers can inline at least some calls to memcpy() and
    memmove() today, using whatever instructions are most efficient for the target.

    David, you and Mitch are among my most cherished writers here on c.arch.
    I really don't think any of us really disagree; it is just that we have
    been discussing two (mostly) orthogonal issues.

    a) memmove/memcpy are so important that people have been spending a lot
    of time & effort trying to make them faster, with the complication that
    in general they cannot be implemented in pure C (which disallows direct
    comparison of arbitrary pointers).

    b) Mitch has, like Andy ("Crazy") Glew many years before, realized that
    if a cpu architecture actually has an instruction designed to do this particular job, it behooves cpu architects to make sure that it is in
    fact so fast that it obviates any need for tricky coding to replace it.

    Ideally, it should be able to copy a single object, up to a cache line
    in size, in the same or less time needed to do so manually with a SIMD
    512-bit load followed by a 512-bit store (both ops masked to not touch anything it shouldn't)

    REP MOVSB on x86 does the canonical memcpy() operation, originally by
    moving single bytes, and this was so slow that we also had REP MOVSW
    (moving 16-bit entities) and then REP MOVSD on the 386 and REP MOVSQ on
    64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in fact
    handle any kinds of alignments and sizes, while doing the actual
    transfer at maximum bus speeds, i.e. at least one cache line/cycle for
    things already in $L1.
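    (As a concrete illustration of handing the whole job to the hardware, a
    memcpy() wrapper around REP MOVSB might look like the sketch below on
    x86-64 with GCC-style inline assembly.  The function name is mine and
    this is not the code of any particular libc; it merely shows why such
    routines live outside portable standard C.)

    #include <stddef.h>

    /* Minimal sketch: RDI = destination, RSI = source, RCX = count.
       REP MOVSB updates all three, so they are read-write operands.
       The SysV ABI guarantees the direction flag is clear on entry. */
    static void *memcpy_rep_movsb(void *dst, const void *src, size_t n)
    {
        void *d = dst;
        __asm__ volatile ("rep movsb"
                          : "+D" (d), "+S" (src), "+c" (n)
                          :
                          : "memory");
        return dst;
    }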

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Mon Oct 14 19:08:56 2024
    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to
    different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in
    pointers, rather than having only a valid pointer or
    NULL.  A compiler, for example, might want to store the >>>>>>>>> fact that an error occurred while parsing a subexpression
    as a special pointer constant.

    Compilers often have the unfair advantage, though, that
    they can rely on what application programmers cannot, their >>>>>>>>> implementation details.  (Some do not, such as f2c). >>>>>>>>
    Standard library authors have the same superpowers, so that
    they can
    implement an efficient memmove() even though a pure standard >>>>>>>> C programmer cannot (other than by simply calling the
    standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of
    libc writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the >>>>>> ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let
    you write an efficient memmove() in standard C.  That's why I
    said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in
    assembly or using inline assembly, rather than in non-portable C
    (which is the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things
    up that is proportionally more costly for small transfers.
    Often that can be eliminated when the compiler optimises the
    functions inline - when the compiler knows the size of the
    move/copy, it can optimise directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and
    has the inside knowledge about cache (residency at level x? width
    in bytes)/memory ranges/access rights/etc needed to do so in a
    very close to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced
    hardware MM instruction could be a very efficient way to implement
    both memcpy and memmove.  (For my own kind of work, I'd worry
    about such looping instructions causing an unbounded increased in
    interrupt latency, but that too is solvable given enough hardware
    effort.)

    And I agree that once you have an "MM" (or similar) instruction,
    you don't need to re-write the implementation for your memmove()
    and memcpy() library functions for every new generation of
    processors of a given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will
    /sometimes/ get benefits from doing so, but it is not as simple as
    Mitch made out.

    I.e. totally removing the need for compiler tricks or wide
    register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and
    to recognize common patterns (just as most compilers already do
    today), and then the memmove() calls will usually be inlined.


    The original compile library issue was that it is impossible to
    write an efficient memmove() implementation using pure portable
    standard C. That is independent of any ISA, any specialist
    instructions for memory moves, and any compiler optimisations.
    And it is independent of the fact that some good compilers can
    inline at least some calls to memcpy() and memmove() today, using
    whatever instructions are most efficient for the target.

    David, you and Mitch are among my most cherished writers here on
    c.arch, I really don't think any of us really disagree, it is just
    that we have been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" situation.

    I have a huge respect for Mitch, his knowledge and experience, and
    his willingness to share that freely with others. That's why I have
    found this very frustrating.


    a) memmove/memcpy are so important that people have been spending a
    lot of time & effort trying to make it faster, with the
    complication that in general it cannot be implemented in pure C
    (which disallows direct comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a
    simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation-dependent behaviour to determine alignments, or it
    must rely on unaligned accesses being allowed by the implementation.)
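    (As a concrete form of that parenthetical point, the fully portable
    byte-at-a-time version is sketched below; the name is mine, chosen only
    to avoid colliding with the library function.)

    #include <stddef.h>

    /* Portable byte copy: legal in standard C because the caller of
       memcpy() guarantees the regions do not overlap, so no pointer
       comparison is needed. */
    void *my_memcpy(void *restrict dst, const void *restrict src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dst;
    }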

    b) Mitch have, like Andy ("Crazy") Glew many years before, realized
    that if a cpu architecture actually has an instruction designed to
    do this particular job, it behooves cpu architects to make sure
    that it is in fact so fast that it obviates any need for tricky
    coding to replace it.

    Yes.

    Ideally, it should be able to copy a single object, up to a cache
    line in size, in the same or less time needed to do so manually
    with a SIMD 512-bit load followed by a 512-bit store (both ops
    masked to not touch anything it shouldn't)


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally
    by moving single bytes, and this was so slow that we also had REP
    MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
    REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in
    fact handle any kinds of alignments and sizes, while doing the
    actual transfer at maximum bus speeds, i.e. at least one cache
    line/cycle for things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do
    these basic operations faster than a software loop or the x86 "rep" instructions.

    No, that's not true. And according to my understanding, that's not what
    Terje wrote.
    REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
    details - fixed registers for src, dest, len and the Direction flag in
    the PSW instead of being part of the opcode).
    REP MOVSW/D/Q were introduced because back then processors were small
    and stupid. When your processor is big and smart you don't need them
    any longer. REP MOVSB is sufficient.
    The new Arm64 instructions that are hopefully coming next year are akin
    to REP MOVSB rather than to MOVSW/D/Q.
    Instructions for memmove, also defined by Arm and by Mitch, are the next
    logical step. IMHO, the main gain here is not a measurable improvement
    in performance, but a saving of code size when inlined.

    Now, is all that a good idea? I am not 100% convinced.
    One can argue that the streaming alignment hardware that is necessary
    for a 1st-class implementation of these instructions is useful not only
    for memory copy.
    So, maybe, it makes sense to expose this hardware in more generic ways.
    Maybe via Load Multiple Register? It was present in Arm's A32/T32,
    but didn't make it into ARM64. Or maybe there are even better ways
    that I was not thinking about.

    And I fully agree that these would be useful features
    in general-purpose processors.

    My only point of contention is that the existence or lack of such instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving the goalposts.
    One does not need a "good implementation" in the sense you have in mind.
    All one needs is an implementation that the pattern-matching logic of a
    compiler unmistakably recognizes as memmove/memcpy. That is very easily
    done in standard C. For memmove, I had shown how to do it in one of the
    posts below. For memcpy it's very obvious, so no need to show.
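    (The implementation referred to is not reproduced in this part of the
    thread; one plausible shape of it, copying through a temporary so that
    no pointer comparison is needed, is sketched below.  Whether a compiler
    actually recognises this as memmove() is exactly what is disputed in
    the replies.)

    #include <stddef.h>

    /* Sketch of a pattern-matchable memmove() in plain standard C: copy via
       a temporary buffer instead of comparing the pointers.  A recognising
       compiler may collapse this to a single memmove; one that does not
       recognise it copies the data twice, and the VLA can overflow the
       stack for large n - the objections raised later in the thread. */
    void *memmove_via_tmp(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = (const unsigned char *) src;
        if (n == 0)
            return dst;
        unsigned char tmp[n];              /* C99 variable-length array */
        for (size_t i = 0; i < n; i++)
            tmp[i] = s[i];
        for (size_t i = 0; i < n; i++)
            d[i] = tmp[i];
        return dst;
    }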

    They would make it easier to write efficient
    implementations of these standard library functions for targets that
    had such instructions - but that would be implementation-specific
    code. And that is one of the reasons that C standard library
    implementations are tied to the specific compiler and target, and the
    writers of these libraries have "superpowers" and are not limited to
    standard C.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Mon Oct 14 17:19:40 2024
    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different >>>>>>>>>> objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers, >>>>>>>>> rather than having only a valid pointer or NULL.  A compiler, >>>>>>>>> for example, might want to store the fact that an error occurred >>>>>>>>> while parsing a subexpression as a special pointer constant. >>>>>>>>>
    Compilers often have the unfair advantage, though, that they can >>>>>>>>> rely on what application programmers cannot, their implementation >>>>>>>>> details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they >>>>>>>> can
    implement an efficient memmove() even though a pure standard C >>>>>>>> programmer cannot (other than by simply calling the standard
    library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA. >>>>>
    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said
    there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly
    or using inline assembly, rather than in non-portable C (which is
    the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that >>>> can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and has
    the inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very
    close to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced hardware
    MM instruction could be a very efficient way to implement both memcpy
    and memmove.  (For my own kind of work, I'd worry about such looping
    instructions causing an unbounded increased in interrupt latency, but
    that too is solvable given enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction, you
    don't need to re-write the implementation for your memmove() and
    memcpy() library functions for every new generation of processors of a
    given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will /sometimes/
    get benefits from doing so, but it is not as simple as Mitch made out.


    I.e. totally removing the need for compiler tricks or wide register
    operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today),
    and then the memmove() calls will usually be inlined.


    The original compile library issue was that it is impossible to write
    an efficient memmove() implementation using pure portable standard C.
    That is independent of any ISA, any specialist instructions for memory
    moves, and any compiler optimisations.  And it is independent of the
    fact that some good compilers can inline at least some calls to
    memcpy() and memmove() today, using whatever instructions are most
    efficient for the target.

    David, you and Mitch are among my most cherished writers here on c.arch,
    I really don't think any of us really disagree, it is just that we have
    been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" situation.

    I have a huge respect for Mitch, his knowledge and experience, and his willingness to share that freely with others. That's why I have found
    this very frustrating.


    a) memmove/memcpy are so important that people have been spending a lot
    of time & effort trying to make it faster, with the complication that in general it cannot be implemented in pure C (which disallows direct
    comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation dependent behaviour to determine alignments, or it must
    rely on unaligned accesses being allowed by the implementation.)

    b) Mitch have, like Andy ("Crazy") Glew many years before, realized that
    if a cpu architecture actually has an instruction designed to do this particular job, it behooves cpu architects to make sure that it is in
    fact so fast that it obviates any need for tricky coding to replace it.


    Yes.

    Ideally, it should be able to copy a single object, up to a cache line
    in size, in the same or less time needed to do so manually with a SIMD 512-bit load followed by a 512-bit store (both ops masked to not touch anything it shouldn't)


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally by
    moving single bytes, and this was so slow that we also had REP MOVSW
    (moving 16-bit entities) and then REP MOVSD on the 386 and REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in fact handle any kinds of alignments and sizes, while doing the actual
    transfer at maximum bus speeds, i.e. at least one cache line/cycle for
    things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do these
    basic operations faster than a software loop or the x86 "rep"
    instructions. And I fully agree that these would be useful features in general-purpose processors.

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C. They would make it easier to write efficient
    implementations of these standard library functions for targets that had
    such instructions - but that would be implementation-specific code. And
    that is one of the reasons that C standard library implementations are
    tied to the specific compiler and target, and the writers of these
    libraries have "superpowers" and are not limited to standard C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Mon Oct 14 19:02:51 2024
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Mon Oct 14 22:20:42 2024
    On Mon, 14 Oct 2024 19:02:51 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for equality).
    Rarely needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    That's their problem. The rest of the C world shouldn't suffer because
    of odd birds.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Oct 14 19:39:41 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).
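    (Purely to illustrate that layout, a decoder might look like the sketch
    below.  Packing one BCD digit per nibble of a 32-bit word is my
    assumption for the sketch, not a statement about the real hardware.)

    #include <stdint.h>

    /* Hypothetical decode of the 8-digit BCD pointer described above:
       digit 1 = sign (C or D), digit 2 = segment (0-7), digits 3-8 =
       decimal offset within the segment. */
    struct b_ptr { unsigned sign; unsigned segment; unsigned long offset; };

    static struct b_ptr decode_bcd_pointer(uint32_t p)
    {
        struct b_ptr r;
        r.sign    = (p >> 28) & 0xF;       /* sign digit, 0xC or 0xD */
        r.segment = (p >> 24) & 0xF;       /* segment number, 0-7    */
        r.offset  = 0;
        for (int d = 5; d >= 0; d--)       /* six BCD offset digits  */
            r.offset = r.offset * 10 + ((p >> (4 * d)) & 0xF);
        return r;
    }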

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Mon Oct 14 23:46:10 2024
    On Tue, 8 Oct 2024 20:53:00 +0000, MitchAlsup1 wrote:

    The Algol family of block structure gave the illusion that flat was less
    necessary and it could all be done with lexical addressing and block
    scoping rules.

    Then malloc() and mmap() came along.

    Algol-68 already had heap allocation and flex arrays. (The folks over in MULTICS land were working on mmap.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to EricP on Mon Oct 14 23:55:59 2024
    On Wed, 09 Oct 2024 13:37:41 -0400, EricP wrote:

    The Posix interface support was there so *MS* could bid on US government
    and military contracts which, at that time frame, were making noise
    about it being standard for all their contracts.
    The Posix DLLs didn't come with WinNT, you had to ask MS for them
    specially.

    And that whole POSIX subsystem was so sadistically, unusably awful, it
    just had to be intended for show as a box-ticking exercise, nothing more.

    <https://www.youtube.com/watch?v=BOeku3hDzrM>

    Back then "object oriented" and "micro-kernel" buzzwords were all the
    rage.

    OO still lives on in higher-level languages. Microsoft’s one attempt to incorporate its OO architecture--Dotnet--into the lower layers of the OS,
    in Windows Vista, was an abject, embarrassing failure which hopefully
    nobody will try to repeat.

    On the other hand, some stubborn holdouts are still fond of microkernels
    -- you just have to say the whole idea is pointless, and they come out of
    the woodwork in a futile attempt to disagree ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Mon Oct 14 23:51:27 2024
    On Tue, 8 Oct 2024 22:28 +0100 (BST), John Dallman wrote:

    The same problem seems to have messed up all the attempts to provide
    good Unix emulation on VMS.

    Was it the Perl build scripts that, at some point in their compatibility
    tests on a *nix system, would announce “Congratulations! You’re not
    running EUNICE!”?

    In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:

    I think the whole _personality_ concept, along with the supposed
    portability to non-x86 architectures, had just bit-rotted away by that
    point.

    Some combination of that, Microsoft confidence that "of course we can do something better now!" - they are very prone to overconfidence - and the terrible tendency of programmers to ignore the details of the old code.

    It was the Microsoft management that did it -- the culmination of a whole sequence of short-term, profit-oriented decisions over many years ...
    decades. What may have started out as an “elegant design” finally became unrecognizable as such.

    Compare what was happening to Linux over the same time interval, where the programmers were (largely) not beholden to managers and bean counters.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Oct 15 00:14:25 2024
    On Mon, 14 Oct 2024 19:20:42 +0000, Michael S wrote:

    On Mon, 14 Oct 2024 19:02:51 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for equality).
    Rarely needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    That's their problem. The rest of the C world shouldn't suffer because
    of odd birds.

    So, you are saying that the 286 in its heyday was/is odd ?!?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Oct 15 00:17:04 2024
    On Mon, 14 Oct 2024 23:51:27 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 8 Oct 2024 22:28 +0100 (BST), John Dallman wrote:

    The same problem seems to have messed up all the attempts to provide
    good Unix emulation on VMS.

    Was it the Perl build scripts that, at some point their compatibility
    tests on a *nix system, would announce “Congratulations! You’re not running EUNICE!”.

    In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    I think the whole _personality_ concept, along with the supposed
    portability to non-x86 architectures, had just bit-rotted away by that
    point.

    Some combination of that, Microsoft confidence that "of course we can do
    something better now!" - they are very prone to overconfidence - and the
    terrible tendency of programmers to ignore the details of the old code.

    It was the Microsoft management that did it -- the culmination of a
    whole
    sequence of short-term, profit-oriented decisions over many years ... decades. What may have started out as an “elegant design” finally became unrecognizable as such.

    Compare what was happening to Linux over the same time interval, where
    the
    programmers were (largely) not beholden to managers and bean counters.

    Last 5 words are unnecessary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Oct 15 00:15:49 2024
    On Mon, 14 Oct 2024 19:39:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    Stick to the question asked. Registers were 16-binary digits,
    and segment registers enabled access to 24-bit address space.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue Oct 15 05:20:10 2024
    On Tue, 8 Oct 2024 21:03:40 -0000 (UTC), John Levine wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    If you look at the 8086 manuals, that's clearly what they had in mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.

    Right, and they appeared not to care or realize it was a performance
    problem.

    They didn’t expect anybody to make serious use of it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Oct 15 10:41:41 2024
    On Tue, 15 Oct 2024 00:14:25 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 19:20:42 +0000, Michael S wrote:

    On Mon, 14 Oct 2024 19:02:51 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    That's their problem. The rest of the C world shouldn't suffer
    because of odd birds.

    So, you are saying that 286 in its hey-day was/is odd ?!?

    In its heyday the 80286 was used as a MUCH faster 8088.
    The 286-as-286 was/is an odd creature. I'd dare to say that it had no heyday.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Tue Oct 15 11:16:55 2024
    On Mon, 14 Oct 2024 23:55:59 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 09 Oct 2024 13:37:41 -0400, EricP wrote:

    The Posix interface support was there so *MS* could bid on US
    government and military contracts which, at that time frame, were
    making noise about it being standard for all their contracts.
    The Posix DLLs didn't come with WinNT, you had to ask MS for them specially.

    And that whole POSIX subsystem was so sadistically, unusably awful,
    it just had to be intended for show as a box-ticking exercise,
    nothing more.

    <https://www.youtube.com/watch?v=BOeku3hDzrM>

    Back then "object oriented" and "micro-kernel" buzzwords were all
    the rage.

    OO still lives on in higher-level languages. Microsoft’s one attempt
    to incorporate its OO architecture--Dotnet--into the lower layers of
    the OS, in Windows Vista, was an abject, embarrassing failure which
    hopefully nobody will try to repeat.


    It sounds like you are confusing .net with something unrelated.
    Probably with Microsoft's failed WinFS filesystem.
    WinFS was *not* object-oriented.

    AFAIK, .net is a hugely successful application development technology
    that was never incorporated into the lower layers of the OS.

    If you are interested in failed attempts to incorporate .net into
    something it does not fit then please consider Silverlight.
    But then, the story of Silverlight is not dissimilar to the story of
    in-browser Java, with the main difference that the latter was more
    harmful to the industry.

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless, and
    they come out of the woodwork in a futile attempt to disagree ...

    Seems, you are not ashamed to admit your trolling tactics.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Tue Oct 15 10:53:30 2024
    On 14/10/2024 18:08, Michael S wrote:
    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to
    different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in
    pointers, rather than having only a valid pointer or
    NULL.  A compiler, for example, might want to store the >>>>>>>>>>> fact that an error occurred while parsing a subexpression >>>>>>>>>>> as a special pointer constant.

    Compilers often have the unfair advantage, though, that
    they can rely on what application programmers cannot, their >>>>>>>>>>> implementation details.  (Some do not, such as f2c). >>>>>>>>>>
    Standard library authors have the same superpowers, so that >>>>>>>>>> they can
    implement an efficient memmove() even though a pure standard >>>>>>>>>> C programmer cannot (other than by simply calling the
    standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of
    libc writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the >>>>>>>> ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let
    you write an efficient memmove() in standard C.  That's why I
    said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in
    assembly or using inline assembly, rather than in non-portable C
    (which is the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things
    up that is proportionally more costly for small transfers.
    Often that can be eliminated when the compiler optimises the
    functions inline - when the compiler knows the size of the
    move/copy, it can optimise directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and
    has the inside knowledge about cache (residency at level x? width
    in bytes)/memory ranges/access rights/etc needed to do so in a
    very close to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced
    hardware MM instruction could be a very efficient way to implement
    both memcpy and memmove.  (For my own kind of work, I'd worry
    about such looping instructions causing an unbounded increased in
    interrupt latency, but that too is solvable given enough hardware
    effort.)

    And I agree that once you have an "MM" (or similar) instruction,
    you don't need to re-write the implementation for your memmove()
    and memcpy() library functions for every new generation of
    processors of a given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will
    /sometimes/ get benefits from doing so, but it is not as simple as
    Mitch made out.

    I.e. totally removing the need for compiler tricks or wide
    register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and
    to recognize common patterns (just as most compilers already do
    today), and then the memmove() calls will usually be inlined.


    The original compile library issue was that it is impossible to
    write an efficient memmove() implementation using pure portable
    standard C. That is independent of any ISA, any specialist
    instructions for memory moves, and any compiler optimisations.
    And it is independent of the fact that some good compilers can
    inline at least some calls to memcpy() and memmove() today, using
    whatever instructions are most efficient for the target.

    David, you and Mitch are among my most cherished writers here on
    c.arch, I really don't think any of us really disagree, it is just
    that we have been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" situation.

    I have a huge respect for Mitch, his knowledge and experience, and
    his willingness to share that freely with others. That's why I have
    found this very frustrating.


    a) memmove/memcpy are so important that people have been spending a
    lot of time & effort trying to make it faster, with the
    complication that in general it cannot be implemented in pure C
    (which disallows direct comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a
    simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation dependent behaviour to determine alignments, or it
    must rely on unaligned accesses being allowed by the implementation.)

    b) Mitch have, like Andy ("Crazy") Glew many years before, realized
    that if a cpu architecture actually has an instruction designed to
    do this particular job, it behooves cpu architects to make sure
    that it is in fact so fast that it obviates any need for tricky
    coding to replace it.

    Yes.

    Ideally, it should be able to copy a single object, up to a cache
    line in size, in the same or less time needed to do so manually
    with a SIMD 512-bit load followed by a 512-bit store (both ops
    masked to not touch anything it shouldn't)


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally
    by moving single bytes, and this was so slow that we also had REP
    MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
    REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in
    fact handle any kinds of alignments and sizes, while doing the
    actual transfer at maximum bus speeds, i.e. at least one cache
    line/cycle for things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do
    these basic operations faster than a software loop or the x86 "rep"
    instructions.

    No, that's not true. And according to my understanding, that's not what
    Terje wrote.
    REP MOVSB _is_ almost ideal instruction for memcpy (modulo minor
    details - fixed registers for src, dest, len and Direction flag in PSW instead of being part of the opcode).

    My understanding of what Terje wrote is that REP MOVSB /could/ be an
    efficient solution if it were backed by a hardware block to run well
    (i.e., transferring as many bytes per cycle as memory bus bandwidth
    allows). But REP MOVSB is /not/ efficient - and rather than making it
    work faster, Intel introduced variants with wider fixed sizes.

    Could REP MOVSB realistically be improved to be as efficient as the instructions in ARMv9, RISC-V, and Mitch's "MM" instruction? I don't
    know. Intel and AMD have had many decades to do so, so I assume it's
    not an easy improvement.

    REP MOVSW/D/Q were introduced because back then processors were small
    and stupid. When your processor is big and smart you don't need them
    any longer. REP MOVSB is sufficient.
    New Arm64 instruction that are hopefully coming next year are akin to
    REP MOVSB rather than to MOVSW/D/Q.
    Instructions for memmove, also defined by Arm and by Mitch, is the next logical step. IMHO, the main gain here is not measurable improvement in performance, but saving of code size when inlined.

    Now, is all that a good idea?

    That's a very important question.

    I am not 100% convinced.
    One can argue that streaming alignment hardware that is necessary for 1st-class implementation of these instructions is useful not only for
    memory copy.
    So, may be, it makes sense to expose this hardware in more generic ways.

    I believe that is the idea of "scalable vector" instructions as an
    alternative philosophy to wide explicit SIMD registers. My expectation
    is that SVE implementations will be more effort in the hardware than
    SIMD for any specific SIMD-friendly size point (i.e., power-of-two
    widths). That usually corresponds to lower clock rates and/or higher
    latency and more coordination from extra pipeline stages.

    But once you have SVE support in place, then memcpy() and memset() are
    just examples of vector operations that you get almost for free when you
    have hardware for vector MACs and other operations.
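    (A sketch of what that looks like with the Arm SVE ACLE intrinsics,
    assuming a compiler and target with SVE support; the function name is
    mine.  The loop is length-agnostic, so the same object code serves any
    vector width.)

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Length-agnostic byte copy: the predicate from whilelt masks off the
       tail, so no scalar clean-up loop is needed. */
    void *sve_memcpy(void *dst, const void *src, size_t n)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;
        for (size_t i = 0; i < n; i += svcntb()) {
            svbool_t pg = svwhilelt_b8((uint64_t) i, (uint64_t) n);
            svuint8_t v = svld1_u8(pg, s + i);
            svst1_u8(pg, d + i, v);
        }
        return dst;
    }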

    May be, via Load Multiple Register? It was present in Arm's A32/T32,
    but didn't make it into ARM64. Or, may be, there are even better ways
    that I was not thinking about.

    And I fully agree that these would be useful features
    in general-purpose processors.

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving a goalpost.

    No, my goalposts have been in the same place all the time. Some others
    have been kicking the ball at a completely different set of goalposts,
    but I have kept the same point all along.

    One does not need "good implementation" in a sense you have in mind.

    Maybe not - but /that/ would be moving the goalposts.

    All one needs is an implementation that pattern matching logic of
    compiler unmistakably recognizes as memove/memcpy. That is very easily
    done in standard C. For memmove, I had shown how to do it in one of the
    posts below. For memcpy its very obvious, so no need to show.


    But that would /not/ be an efficient implementation of memmove() in
    plain portable standard C.

    What do I mean by an "efficient" implementation in fully portable
    standard C? There are two possible ways to think about that. One is
    that the operations on the abstract machine are efficient. The other is
    that the code is likely to result in efficient code over a wide range of real-world compilers, options, and targets. And I think it goes without
    saying that the implementation must not rely on any
    implementation-defined behaviour or anything beyond the minimal limits
    given in the C standards, and it must not introduce any new real or
    potential UB.

    Your "memmove()" implementation fails on several counts. It is
    inefficient in the abstract machine - it copies everything twice instead
    of once. It is inefficient in real-world implementations of all sorts
    and countless targets - being efficient for some compilers with some
    options on some targets (most of them hypothetical) does /not/ qualify
    as an efficient implementation. And quite clearly it risks causing
    failures from stack overflow in situations where the user would normally
    expect memmove() to function safely (on implementations other than those
    few that turn it into efficient object code).

    They would make it easier to write efficient
    implementations of these standard library functions for targets that
    had such instructions - but that would be implementation-specific
    code. And that is one of the reasons that C standard library
    implementations are tied to the specific compiler and target, and the
    writers of these libraries have "superpowers" and are not limited to
    standard C.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to John Levine on Tue Oct 15 11:59:27 2024
    On Tue, 8 Oct 2024 21:03:40 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    If you look at the 8086 manuals, that's clearly what they had in
    mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.

    Right, and they appeared not to care or realize it was a performance
    problem.

    They didn't even do obvious things like see if you're reloading the
    same value into the segment register and skip the rest of the setup.
    Sure, you could put checks in your code and skip the segment load but
    that would make your code a lot bigger and uglier.


    The question is how the slowness of 80286 segments compares to that of
    contemporaries that used segment-based protected memory.
    Wikipedia lists following machines as examples of segmentation:
    - Burroughs B5000 and following Burroughs Large Systems
    - GE 645 -> Honeywell 6080
    - Prime 400 and successors
    - IBM System/38
    They also mention S/370, but to me segmentation in S/370 looks very
    different and probably not intended for fine-grained protection.

    Of those, the Burroughs B5900 looks to me the most comparable to the 80286.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Tue Oct 15 12:38:40 2024
    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality).  Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual
    addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to
    statically allocated data. Then you would expect the segment to be the
    same in each case, and the uintptr_t comparison will be fine for
    memmove(). Maybe they come from malloc() and are in different segments.
    Then the comparison here might not give the same result as a full
    virtual address comparison - but that does not matter. If the pointers
    came from different mallocs, they could not overlap and memmove() can
    run either direction.
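    (For instance, a memmove() built around that comparison might look like
    the sketch below; the name is mine, and it relies on the point above
    that the uintptr_t ordering only has to be meaningful within a single
    object, which is all that overlap handling needs.)

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: pick the copy direction from a uintptr_t comparison.  For
       pointers into the same object the ordering is meaningful; for
       pointers into different objects either direction is safe anyway. */
    void *memmove_by_compare(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = (const unsigned char *) src;
        if ((uintptr_t) d <= (uintptr_t) s) {
            for (size_t i = 0; i < n; i++)      /* copy forwards  */
                d[i] = s[i];
        } else {
            for (size_t i = n; i-- > 0; )       /* copy backwards */
                d[i] = s[i];
        }
        return dst;
    }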

    The same applies to other uses, such as indexing in a binary search tree
    or a hash map - the comparison above will be correct when it matters.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Tue Oct 15 14:22:46 2024
    On Tue, 15 Oct 2024 12:38:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality).  Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to statically allocated data. Then you would expect the segment to be
    the same in each case, and the uintptr_t comparison will be fine for memmove(). Maybe they come from malloc() and are in different
    segments. Then the comparison here might not give the same result as
    a full virtual address comparison - but that does not matter. If the pointers came from different mallocs, they could not overlap and
    memmove() can run either direction.

    The same applies to other uses, such as indexing in a binary search
    tree or a hash map - the comparison above will be correct when it
    matters.




    It's all fine for as long as there are no objects bigger than 64KB.
    But with 16MB of virtual memory and with several* MB of physical memory
    one does want objects that are bigger than 64KB!

    ---
    * https://theretroweb.com/motherboards/s/compaq-deskpro-286e-p-n-001226

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Tue Oct 15 14:09:58 2024
    On 15/10/2024 13:22, Michael S wrote:
    On Tue, 15 Oct 2024 12:38:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality).  Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual
    addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to
    statically allocated data. Then you would expect the segment to be
    the same in each case, and the uintptr_t comparison will be fine for
    memmove(). Maybe they come from malloc() and are in different
    segments. Then the comparison here might not give the same result as
    a full virtual address comparison - but that does not matter. If the
    pointers came from different mallocs, they could not overlap and
    memmove() can run either direction.

    The same applies to other uses, such as indexing in a binary search
    tree or a hash map - the comparison above will be correct when it
    matters.




    It's all fine for as long as there are no objects bigger than 64KB.
    But with 16MB of virtual memory and with several* MB of physical memory
    one does want objects that are bigger than 64KB!


    I don't know how such objects would be allocated and addressed in such a system. (I didn't do much DOS/Win16 programming, and on the few
    occasions when I needed structures bigger than 64KB in total, they were structured in multiple levels.)

    But I would expect that in almost any practical system where you can use
    "p++" to step through big arrays, you can also convert the pointer to a uintptr_t and compare as shown above.

    The exceptions would be systems where pointers hold more than just
    addresses, such as access control information or bounds that mean they
    are larger than the largest integer type on the target.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to D'Oliveiro on Tue Oct 15 18:40:00 2024
    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS components, but not many people actually want that.

    <https://en.wikipedia.org/wiki/Hybrid_kernel>

    Windows NT and Apple's XNU, used in all their operating systems, are both hybrid kernels, so the idea is somewhat practical.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Michael S on Tue Oct 15 18:40:00 2024
    In article <20241015111655.000064b3@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    OO still lives on in higher-level languages. Microsoft's one
    attempt to incorporate its OO architecture--Dotnet--into the
    lower layers of the OS, in Windows Vista, was an abject,
    embarrassing failure which hopefully nobody will try to repeat.

    AFAIK, .net is hugely successful application development technology
    that was never incorporated into lower layers of the OS.

    You're correct. There was an experimental Microsoft OS that was almost
    entirely written in .NET but it was never commercialised.

    <https://en.wikipedia.org/wiki/Singularity_(operating_system)>

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Tue Oct 15 18:57:07 2024
    jgd@cix.co.uk (John Dallman) writes:
    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    It's useful to note that the primary shortcoming of a
    microkernel (domain crossing latency) is mostly not a problem
    on RISC processors (like ARM64) where the ring change
    takes about the same amount of time as a function call.

    One might also argue that in many aspects, a hypervisor is
    a 'microkernel' with some hardware support on most modern
    CPUs.

    Disclaimer: I spent most of the 90's working with the
    Chorus microkernel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to David Brown on Tue Oct 15 19:46:23 2024
    David Brown <david.brown@hesbynett.no> wrote:
    On 15/10/2024 13:22, Michael S wrote:
    On Tue, 15 Oct 2024 12:38:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality).  Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual
    addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to
    statically allocated data. Then you would expect the segment to be
    the same in each case, and the uintptr_t comparison will be fine for
    memmove(). Maybe they come from malloc() and are in different
    segments. Then the comparison here might not give the same result as
    a full virtual address comparison - but that does not matter. If the
    pointers came from different mallocs, they could not overlap and
    memmove() can run either direction.

    The same applies to other uses, such as indexing in a binary search
    tree or a hash map - the comparison above will be correct when it
    matters.
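
    For concreteness, a minimal sketch of a memmove-style copy built on exactly
    that comparison (the helper name is mine; the only assumption is that
    converting to uintptr_t preserves the relative order of pointers into the
    same object, which is the practical case discussed above):

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: pick the copy direction from a uintptr_t comparison.  If dst
       and src point into different objects they cannot overlap, so either
       direction is safe and the comparison result does not matter. */
    static void *move_bytes(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        if ((uintptr_t) d < (uintptr_t) s) {
            for (size_t i = 0; i < n; i++)         /* copy forwards  */
                d[i] = s[i];
        } else if ((uintptr_t) d > (uintptr_t) s) {
            for (size_t i = n; i > 0; i--)         /* copy backwards */
                d[i - 1] = s[i - 1];
        }
        return dst;
    }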




    It's all fine for as long as there are no objects bigger than 64KB.
    But with 16MB of virtual memory and with several* MB of physical memory
    one does want objects that are bigger than 64KB!


    I don't know how such objects would be allocated and addressed in such a system. (I didn't do much DOS/Win16 programming, and on the few
    occasions when I needed structures bigger than 64KB in total, they were structured in multiple levels.)

    But I would expect that in almost any practical system where you can use "p++" to step through big arrays, you can also convert the pointer to a uintptr_t and compare as shown above.

    The exceptions would be systems where pointers hold more than just
    addresses, such as access control information or bounds that mean they
    are larger than the largest integer type on the target.

    EGA graphics had more than 64K of display memory; smart software would group
    one or more scan lines per segment when bit-mapping the array. A bit mapper
    works a scan line at a time, so segment changes were not that expensive. This
    was profoundly faster than using pixel pokes and the other default methods of
    changing bits.
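
    A rough sketch of that scan-line grouping (illustrative only - the sizes,
    names and band count here are made up, and allocating the bands is not
    shown): keep the big bitmap as one allocation per band of scan lines, each
    band small enough for a single segment, so the "segment" only changes
    between bands, never in the middle of a scan line.

    #include <stddef.h>

    /* Illustrative: a 1 bpp bitmap split into bands of scan lines,
       each band kept under 64KB so it fits in one segment. */
    #define BYTES_PER_LINE 80u        /* e.g. 640 pixels / 8 */
    #define LINES_PER_BAND 512u       /* 512 * 80 = 40960 bytes < 64KB */

    struct bitmap { unsigned char *band[8]; };    /* up to 4096 lines */

    static unsigned char *scan_line(struct bitmap *bm, unsigned y)
    {
        /* Everything within one scan line stays inside one band. */
        return bm->band[y / LINES_PER_BAND]
             + (size_t)(y % LINES_PER_BAND) * BYTES_PER_LINE;
    }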

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Oct 15 17:26:29 2024
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Oct 15 21:55:44 2024
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.
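
    A cut-down sketch of that point in portable C with <stdarg.h> - only %c,
    %s and %d, no field widths, but enough to show that nothing magic is
    needed (the name tiny_printf is just for illustration):

    #include <stdarg.h>
    #include <stdio.h>

    /* Toy printf: handles %c, %s, %d and literal characters only. */
    static void tiny_printf(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        for (; *fmt; fmt++) {
            if (*fmt != '%') { putchar(*fmt); continue; }
            if (*++fmt == '\0') break;             /* stray trailing '%' */
            switch (*fmt) {
            case 'c': putchar(va_arg(ap, int)); break;
            case 's': for (const char *s = va_arg(ap, char *); *s; s++)
                          putchar(*s);
                      break;
            case 'd': {
                int v = va_arg(ap, int);
                unsigned u = v < 0 ? (putchar('-'), 0u - (unsigned) v)
                                   : (unsigned) v;
                char buf[24]; int i = 0;
                do { buf[i++] = '0' + u % 10; u /= 10; } while (u);
                while (i) putchar(buf[--i]);
                break;
            }
            default: putchar(*fmt); break;         /* e.g. "%%" */
            }
        }
        va_end(ap);
    }

    Called as tiny_printf("%s = %d%c", "x", 42, '\n') it prints "x = 42" and a
    newline; the real printf() differs mainly in how much formatting it layers
    on top of the same va_arg machinery.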

    In an ideal world, it would be better if we could define `malloc` and `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Oct 15 22:05:56 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/malloc.html

    POSIX adds some extensions (marked 'CX').

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Tue Oct 15 19:51:27 2024
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.


    <https://en.wikipedia.org/wiki/Hybrid_kernel>

    Windows NT and Apple's XNU, used in all their operating systems, are both
    hybrid kernels, so the idea is somewhat practical.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Oct 16 00:24:07 2024
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/malloc.html

    POSIX adds some extensions (marked 'CX').

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to George Neuner on Wed Oct 16 07:36:29 2024
    George Neuner wrote:
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.

    This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
    As soon as you have more than a single instance of a particular
    server/service, then you replace them in groups so that the service sees
    zero downtime even though all the servers have been updated/replaced.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Wed Oct 16 09:17:03 2024
    On 16/10/2024 07:36, Terje Mathisen wrote:
    George Neuner wrote:
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of
    the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.

    This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
    As soon as you have more than a single instance of a particular server/service, then you replace them in groups so that the service sees
    zero downtime even though all the servers have been updated/replaced.


    That's fine - /if/ you have a service that can easily be spread across
    multiple systems, and you can justify the cost of that. Setting up a
    database server is simple enough.

    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    But clouds /do/ mean that your virtual machine can be migrated (with
    zero, or almost zero, downtime) to another physical server if there are hardware problems or during hardware maintenance. And if you can do
    easy snapshots with your cloud / VM infrastructure, then you can roll
    back if things go badly wrong. So you have a single server instance,
    you plan a short period of downtime, take a snapshot, stop the service, upgrade, restart. That's what almost everyone does, other than the
    /really/ big or /really/ critical service providers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Stefan Monnier on Wed Oct 16 09:21:59 2024
    On 15/10/2024 23:26, Stefan Monnier wrote:
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    In an ideal world, it would be better if we could define `malloc` and `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.


    I don't see an advantage in being able to implement them in standard C.
    I /do/ see an advantage in being able to do so well in non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require specific
    time constraints on these functions. In such cases, you are not
    interested in writing fully portable software - it will already contain
    many implementation-specific features or use compiler extensions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Wed Oct 16 09:38:20 2024
    On 15/10/2024 23:55, MitchAlsup1 wrote:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    It's a very good philosophy in programming language design that the core language should only contain what it has to contain - if a desired
    feature can be put in a library and be equally efficient and convenient
    to use, then it should be in the standard library, not the core
    language. It is much easier to develop, implement, enhance, adapt, and otherwise change things in libraries than the core language.

    And it is also fine, IMHO, that some things in the standard library need non-standard C - the standard library is part of the implementation.


    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    The function has always been available in C since the language was standardised, and AFAIK it was in K&R C. But no one (in authority) ever claimed it could be implemented purely in standard C. What do you think
    has changed?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Oct 16 11:18:19 2024
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.
    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".
    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.
    I don't see an advantage in being able to implement them in standard C.

    It means you can likely also implement a related yet different API
    without having your code "demoted" to non-standard.
    E.g. say if your application wants to use a region/pool/zone-based
    memory management.

    The fact that malloc can't be implemented in standard C is evidence
    that standard C may not be general-purpose enough to accommodate an
    application that wants to use a custom-designed allocator.

    I don't disagree with you, from a practical perspective:

    - in practice, C serves us well for Emacs's GC, even though that can't
    be written in standard C.
    - it's not like there are lots of other languages out there that offer
    you portability together with the ability to define your own `malloc`.

    But it's still a weakness, just a fairly minor one.

    The reason why you might want your own special memmove, or your own special malloc, is that you are doing niche and specialised software.

    Region/pool/zone-based memory management is common enough that I would
    not call it "niche", FWIW, and it's also used in applications that do want portability (GCC and Apache come to mind).
    Can't think of a practical reason to implement my own `memmove`, OTOH.
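
    For what it's worth, the pooling part itself can be written in plain ISO C
    once malloc() is taken as given - only obtaining fresh memory from the
    system is implementation territory. A minimal region sketch (C11 for
    max_align_t; the names are mine, not GCC's obstacks or Apache's pools):

    #include <stddef.h>
    #include <stdlib.h>

    /* A trivial growing region: chunks come from malloc() and the whole
       region is released at once - there is no per-object free(). */
    struct chunk  { struct chunk *next; unsigned char *mem; size_t used, size; };
    struct region { struct chunk *head; };       /* initialise with { NULL } */

    static void *region_alloc(struct region *r, size_t n)
    {
        size_t a = _Alignof(max_align_t);
        n = (n + a - 1) / a * a;             /* keep every object aligned */

        if (!r->head || r->head->size - r->head->used < n) {
            size_t csz = n > 4096 ? n : 4096;
            struct chunk *c = malloc(sizeof *c);
            unsigned char *m = c ? malloc(csz) : NULL;
            if (!m) { free(c); return NULL; }
            c->next = r->head; c->mem = m; c->used = 0; c->size = csz;
            r->head = c;
        }
        void *p = r->head->mem + r->head->used;
        r->head->used += n;
        return p;
    }

    static void region_free_all(struct region *r)
    {
        for (struct chunk *c = r->head, *next; c; c = next) {
            next = c->next; free(c->mem); free(c);
        }
        r->head = NULL;
    }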


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Oct 16 15:38:47 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Stefan Monnier on Wed Oct 16 19:57:03 2024
    (Please do not snip or omit attributions. There are Usenet standards
    for a reason.)

    On 16/10/2024 17:18, Stefan Monnier wrote:
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.
    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".
    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.
    I don't see an advantage in being able to implement them in standard C.

    It means you can likely also implement a related yet different API
    without having your code "demoted" to non-standard.

    That makes no sense to me. We are talking about implementing standard
    library functions. If you want to implement other functions, go ahead.

    Or do you mean that it is only possible to implement related functions
    (such as memory pools) if you also can implement malloc in fully
    portable standard C? That would make a little more sense if it were
    true, but it is not. First, you can implement such functions in implementation-specific code, so you are not hindered from writing the
    code you want. Secondly, standard C provides functions such as malloc()
    and aligned_alloc() that give you the parts you need - the fact that you
    need something outside of standard C to implement malloc() does not
    imply that you need those same features to implement your additional
    functions.

    E.g. say if your application wants to use a region/pool/zone-based
    memory management.

    The fact that malloc can't be implemented in standard C is evidence
    that standard C may not be general-purpose enough to accommodate an application that wants to use a custom-designed allocator.


    No, it is not - see above.

    And remember how C was designed and how it was intended to be used. The
    aim was to be able to write portable code that could be reused on many
    systems, and /also/ implementation, OS and target specific code for
    maximum efficiency, systems programming, and other non-portable work. A typical C program combines these - some parts can be fully portable,
    other parts are partially portable (such as to any POSIX system, or
    targets with 32-bit int and 8-bit char), and some parts may be very compiler-specific or target specific.

    That's not an indication of failure of C for general-purpose
    programming. (But I would certainly not suggest that C is the best
    choice of language for many "general" programming tasks.)

    I don't disagree with you, from a practical perspective:

    - in practice, C serves us well for Emacs's GC, even though that can't
    be written in standard C.
    - it's not like there are lots of other languages out there that offer
    you portability together with the ability to define your own `malloc`.

    But it's still a weakness, just a fairly minor one.

    The reason why you might want your own special memmove, or your own special
    malloc, is that you are doing niche and specialised software.

    Region/pool/zone-based memory management is common enough that I would
    not call it "niche", FWIW, and it's also used in applications that do want portability (GCC and Apache come to mind).
    Can't think of a practical reason to implement my own `memove`, OTOH.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Oct 16 20:00:27 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    But more problematic is the implementation of free() without
    knowing how to compare pointers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Oct 16 22:18:49 2024
    On Wed, 16 Oct 2024 20:00:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    Agreed, but once you HAVE a way of getting memory (by whatever name)
    you can write malloc in std. C.
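
    As a toy illustration of that position - a static array stands in for
    whatever brk()/mmap()-style source actually provides the memory, and it
    never frees, which dodges the free() question quoted just below. (Whether
    re-using declared char storage for objects of other types is strictly
    conforming is, of course, exactly the sort of language-lawyer question
    this subthread is about.)

    #include <stddef.h>

    /* Bump allocator over a fixed pool; never frees, never reuses. */
    static _Alignas(max_align_t) unsigned char pool[1u << 16];
    static size_t pool_used;

    static void *toy_malloc(size_t n)
    {
        size_t a = _Alignof(max_align_t);
        n = (n + a - 1) / a * a;            /* round up for alignment */
        if (n > sizeof pool - pool_used)
            return NULL;                    /* pool exhausted */
        void *p = pool + pool_used;
        pool_used += n;
        return p;
    }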

    But more problematic is the implementation of free() without
    knowing how to compare pointers.

    Never wrote a program that actually needs free--I have re-written
    programs that used free to avoid using free, though.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to david.brown@hesbynett.no on Wed Oct 16 21:19:34 2024
    On Wed, 16 Oct 2024 09:17:03 +0200, David Brown
    <david.brown@hesbynett.no> wrote:

    On 16/10/2024 07:36, Terje Mathisen wrote:
    George Neuner wrote:
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.

    This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
    As soon as you have more than a single instance of a particular
    server/service, then you replace them in groups so that the service sees
    zero downtime even though all the servers have been updated/replaced.


    That's fine - /if/ you have a service that can easily be spread across
    multiple systems, and you can justify the cost of that. Setting up a
    database server is simple enough.

    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to
    duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    But clouds /do/ mean that your virtual machine can be migrated (with
    zero, or almost zero, downtime) to another physical server if there are
    hardware problems or during hardware maintenance. And if you can do
    easy snapshots with your cloud / VM infrastructure, then you can roll
    back if things go badly wrong. So you have a single server instance,
    you plan a short period of downtime, take a snapshot, stop the service,
    upgrade, restart. That's what almost everyone does, other than the
    /really/ big or /really/ critical service providers.

    For various definitions of "short period of downtime". 8-)

    Fortunately, Linux installs updates - or stages updates for restart -
    much faster than Windoze. But rebooting to the point that all the
    services are running still can take several minutes.

    That can feel like an eternity when it's the only <whatever> server in
    a small business.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Oct 17 02:35:07 2024
    According to David Brown <david.brown@hesbynett.no>:
    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to
    duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    AWS provides a database service that does most of that. You can spin
    up databases, read-only mirrors, failover from one region to another,
    staging environments to test upgrades. They offer MySQL and
    PostgreSQL, as well as Oracle and DB2.

    It's still a fair amount of work, but way less than doing it all yourself.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Wed Oct 16 23:06:24 2024
    On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    Yes, but malloc() subdivides an already provided space. Because that
    space can be treated as a single array of char, and comparing pointers
    to elements of the same array is legal, the only thing I can see that
    prevents writing malloc() in standard C would be the need to somehow
    define the array from the /language's/ POV (not the compiler's) prior
    to using it.

    Which circles back to why something like

    char (*heap)[ULONG_MAX] = ... ;

    would/does not satisfy the language's requirement. All the compilers
    I have ever seen would have been happy with it, but none of them ever
    needed something like it anyway. Conversion to <an integer type> also
    would always work, but also was never needed.

    I am not a language lawyer - I don't even pretend to understand the
    arguments against allowing general pointer comparison.


    Aside: I have worked on architectures (DSPs) having disjoint memory
    spaces, spaces with differing bit widths, and even spaces where [sans
    MMU] the same physical address had multiple logical addresses whose
    use depended on the type of access.

    I have written allocators and even a GC for such architectures. Never
    had a problem convincing C compilers to compare pointers - the only
    issue I ever faced was whether the result actually was meaningful to
    the program.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to david.brown@hesbynett.no on Wed Oct 16 23:32:41 2024
    On Wed, 16 Oct 2024 09:38:20 +0200, David Brown
    <david.brown@hesbynett.no> wrote:


    It's a very good philosophy in programming language design that the core
    language should only contain what it has to contain - if a desired
    feature can be put in a library and be equally efficient and convenient
    to use, then it should be in the standard library, not the core
    language. It is much easier to develop, implement, enhance, adapt, and
    otherwise change things in libraries than the core language.

    And it is also fine, IMHO, that some things in the standard library need
    non-standard C - the standard library is part of the implementation.

    But it is a problem if the library has to be written using a different compiler. [For this purpose I would consider specifying different
    compiler flags to be using a different compiler.]

    Why? Because once these things are discovered, many programmers will
    see their advantages and lack the discipline to avoid using them for
    more general application work.


    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    The function has always been available in C since the language was
    standardised, and AFAIK it was in K&R C. But no one (in authority) ever
    claimed it could be implemented purely in standard C. What do you think
    has changed?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Thu Oct 17 00:40:34 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    Right. And that is why malloc(), or some essential internal component
    of malloc(), has to be platform specific, and thus malloc() must be
    supplied by the implementation (which means both the compiler and the
    standard library).

    But more problematic is the implementation of free() without knowing
    how to compare pointers.

    Once there is a way to get additional memory from whatever underlying environment is there, malloc() and free() can be implemented (and I
    believe most often are implemented) without needing to compare
    pointers. Note: pointers can be tested for equality without having
    to compare them relationally, and testing pointers for equality is
    well-defined between any two pointers (which may need to be converted
    to 'void *' to avoid a type mismatch).
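
    A minimal sketch of that style of free() (hypothetical names; it assumes
    the allocation side placed a small header immediately before every payload
    it handed out): releasing a block is a pure push onto an unordered free
    list, so only equality and null tests are ever needed - address-ordered
    coalescing, which does compare pointers relationally, is a separate design
    choice.

    #include <stddef.h>

    /* Header written by the allocator just before each payload. */
    struct hdr { size_t size; struct hdr *next; };

    static struct hdr *free_list;          /* singly linked, unordered */

    static void toy_free(void *p)
    {
        if (p == NULL)
            return;
        struct hdr *h = (struct hdr *) p - 1;   /* back up to the header */
        h->next = free_list;                    /* push: no <, >, only = */
        free_list = h;
    }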

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Thu Oct 17 01:18:04 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    On Wed, 16 Oct 2024 20:00:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written in
    standard C. It used to be written in standard K&R C. I am not
    asking if it is still in the std libraries, I am asking what
    happened to make it impossible to write malloc() in standard C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    Agreed, but once you HAVE a way of getting memory (by whatever name)
    you can write malloc in standard C.

    The point is that getting more memory is inherently platform
    specific, which is why malloc() must be defined by each particular implementation, and so was put in the standard library.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Thu Oct 17 02:48:49 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any
    existing functionality that cannot be written using the language
    is a sign of a weakness because it shows that despite being
    "general purpose" it fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc`
    and `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be standard K&R C--what dropped it from the
    standard??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written in
    standard C. It used to be written in standard K&R C.

    No, it didn't. In the original book (my copy is from the third
    printing of the first edition, copyright 1978), on page 175 there
    is a function 'alloc()' that shows how to write a memory allocator.
    The code in alloc() calls 'morecore()', described as follows:

    The function morecore obtains storage from the operating system.
    The details of how this is done of course vary from system to
    system. In UNIX, the system entry sbrk() returns a pointer to n
    more bytes of storage. [...]

    An implementation of morecore() is shown on the next page, and
    it indeed uses sbrk() to get more memory. That makes it UNIX
    specific, not portable standard C. Both alloc() and morecore()
    are part of chapter 8, "The UNIX System Interface".

    Note also that chapter 7, titled "Input and Output" and describing
    the standard library, mentions in section 7.9, "Some Miscellaneous
    Functions", the function calloc() as part of the standard library.
    (There is no mention of malloc().) The point of having a standard
    library is that the functions it contains depend on details of the
    underlying OS and thus cannot be written in platform-agnostic code.
    Being platform portable is the defining property of "standard C".

    (Amusing aside: the entire standard library seems to be covered by
    just #include <stdio.h>.)

    I am not
    asking if it is still in the standard libraries, I am asking what
    happened to make it impossible to write malloc() in standard C ?!?

    Functions such as sbrk() are not part of the C language. Whether
    it's called calloc() or malloc(), memory allocation has always
    needed access to some facilities not provided by the C language
    itself. The function malloc() is not any more writable in standard
    K&R C than it is in standard ISO C (except of course malloc() can
    be implemented by using calloc() internally, but that depends on
    calloc() being part of the standard library).
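
    Spelling out that parenthetical: with calloc() taken as given, a
    conforming malloc() can be a one-liner, at the cost of calloc()'s
    mandatory zero-fill (a sketch, not how real libraries do it):

    #include <stdlib.h>

    /* malloc() in terms of calloc(); calloc(1, 0) has the same latitude
       (NULL or a unique pointer) that malloc(0) does. */
    void *my_malloc(size_t n)
    {
        return calloc(1, n);
    }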

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Thu Oct 17 03:16:13 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    [...]

    malloc() used to be standard K&R C--what dropped it from the
    standard ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written in
    standard C. It used to be written in standard K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    Yes, but malloc() subdivides an already provided space.

    Not necessarily.

    Because that space can be treated as a single array of char,

    Not always.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Thu Oct 17 03:17:33 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    That is a foolish statement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to John Levine on Thu Oct 17 14:41:06 2024
    On 17/10/2024 04:35, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to
    duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    AWS provides a database service that does most of that. You can spin
    up databases, read-only mirrors, failover from one region to another,
    staging environments to test upgrades. They offer MySQL and
    PostgreSQL, as well as Oracle and DB2.

    It's still a fair amount of work, but way less than doing it all yourself.


    That's an additional service they provide - it's not an inherent part of
    a cloud infrastructure. Still, it sounds like a useful service, and one
    that I might find useful in the future.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to George Neuner on Thu Oct 17 14:39:45 2024
    On 17/10/2024 03:19, George Neuner wrote:
    On Wed, 16 Oct 2024 09:17:03 +0200, David Brown
    <david.brown@hesbynett.no> wrote:

    On 16/10/2024 07:36, Terje Mathisen wrote:
    George Neuner wrote:
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.

    This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
    As soon as you have more than a single instance of a particular
    server/service, then you replace them in groups so that the service sees
    zero downtime even though all the servers have been updated/replaced.


    That's fine - /if/ you have a service that can easily be spread across
    multiple systems, and you can justify the cost of that. Setting up a
    database server is simple enough.

    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to
    duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    But clouds /do/ mean that your virtual machine can be migrated (with
    zero, or almost zero, downtime) to another physical server if there are
    hardware problems or during hardware maintenance. And if you can do
    easy snapshots with your cloud / VM infrastructure, then you can roll
    back if things go badly wrong. So you have a single server instance,
    you plan a short period of downtime, take a snapshot, stop the service,
    upgrade, restart. That's what almost everyone does, other than the
    /really/ big or /really/ critical service providers.

    For various definitions of "short period of downtime". 8-)

    Yes, indeed.


    Fortunately, Linux installs updates - or stages updates for restart -
    much faster than Windoze. But rebooting to the point that all the
    services are running still can take several minutes.


    My experience is that the updates on Linux servers are usually fast (for desktops they can be slow, but that is usually because you have far more
    and bigger programs). Updates for virtual machines are particularly
    fast because you generally have a minimum of programs in the VM.
    Restarts are also fast for virtual machines - physical servers are often
    slow to restart, sometimes taking many minutes before they get to the
    point of starting the OS boot.

    That can feel like an eternity when it's the only <whatever> server in
    a small business.

    Sure. But for most small businesses, it's not hard to find off-peak
    times when you can have hours of downtime without causing a problem.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to George Neuner on Thu Oct 17 16:16:42 2024
    On 17/10/2024 05:06, George Neuner wrote:
    On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    Yes, but malloc() subdivides an already provided space. Because that
    space can be treated as a single array of char, and comparing pointers
    to elements of the same array is legal, the only thing I can see that
    prevents writing malloc() in standard C would be the need to somehow
    define the array from the /language's/ POV (not the compiler's) prior
    to using it.


    It is common for malloc() implementations to ask the OS for large chunks
    of memory, then subdivide it and pass it out to the application. When
    the chunk(s) it has run out, it will ask for more from the OS. You
    could reasonably argue that each chunk it gets may be considered a
    single unsigned char array, but that is certainly not true for
    additional chunks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to George Neuner on Thu Oct 17 16:25:01 2024
    On 17/10/2024 05:32, George Neuner wrote:
    On Wed, 16 Oct 2024 09:38:20 +0200, David Brown
    <david.brown@hesbynett.no> wrote:


    It's a very good philosophy in programming language design that the core
    language should only contain what it has to contain - if a desired
    feature can be put in a library and be equally efficient and convenient
    to use, then it should be in the standard library, not the core
    language. It is much easier to develop, implement, enhance, adapt, and
    otherwise change things in libraries than the core language.

    And it is also fine, IMHO, that some things in the standard library need
    non-standard C - the standard library is part of the implementation.

    But it is a problem if the library has to be written using a different compiler. [For this purpose I would consider specifying different
    compiler flags to be using a different compiler.]

    Specifying different flags would technically give you a different /implementation/, but it would not normally be considered a different /compiler/. I see no problem at all if libraries (standard library or otherwise) are compiled with different flags. I can absolutely
    guarantee that the flags I use for compiling my application code are not
    the same as those used for compiling the static libraries that came with
    my toolchains. Using different /compilers/ could be a significant inconvenience, and might mean you lose additional features (such as
    link-time optimisation), but as long as the ABI is consistent then they
    should work fine.


    Why? Because once these things are discovered, many programmers will
    see their advantages and lack the discipline to avoid using them for
    more general application work.


    Really? Have you ever looked at the source code for a library such as
    glibc or newlib? Most developers would look at that and quickly shy
    away from all the macros, additional compiler-specific attributes,
    conditional compilation, and the rest of it. Very, very few would look
    into the details to see if there are any "tricks" or "secret" compiler extensions they can copy. And with very few exceptions, all the compiler-specific features will already be documented and available to programmers enthusiastic enough to RTFM.


    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    The function has always been available in C since the language was
    standardised, and AFAIK it was in K&R C. But no one (in authority) ever
    claimed it could be implemented purely in standard C. What do you think
    has changed?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Oct 18 05:56:03 2024
    On Tue, 15 Oct 2024 11:16:55 +0300, Michael S wrote:

    AFAIK, .net is hugely successful application development technology that
    was never incorporated into lower layers of the OS.

    Look up the infamous “Longhorn reset”. Microsoft had to chuck away a bunch of low-performance, high-overhead code and try again, and Dotnet was the reason. This delayed Windows Vista by about a year and a half, and it was
    still a rush to get it out.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Fri Oct 18 05:57:14 2024
    On Tue, 15 Oct 2024 18:40 +0100 (BST), John Dallman wrote:

    Windows NT and Apple's XNU, used in all their operating systems, are
    both hybrid kernels, so the idea is somewhat practical.

    The fact that both are regularly outperformed (and outfeatured) by Linux,
    on hardware that is supposedly optimized for those specific proprietary
    OSes, just reinforces my point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Oct 18 07:01:08 2024
    On Tue, 15 Oct 2024 11:59:27 +0300, Michael S wrote:

    The question is how slowness of 80286 segments compares to
    contemporaries that used segment-based protected memory.
    Wikipedia lists following machines as examples of segmentation:
    - Burroughs B5000 and following Burroughs Large Systems
    - GE 645 -> Honeywell 6080
    - Prime 400 and successors
    - IBM System/38

    Certainly the first two had “segmentation” that was nothing like the Intel (mis)interpretation of the concept.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Fri Oct 18 12:47:53 2024
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?
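
    As a rough illustration of the 8-digit "S s OOOOOO" pointer layout quoted
    above, a small decoder is sketched below. The field widths follow the
    description (sign digit C or D, one segment digit, six offset digits);
    treating D as the negative sign and passing the digits as a plain array,
    most significant digit first, are assumptions made here rather than
    details taken from the thread.

    #include <stdio.h>

    /* Decoded form of the "S s OOOOOO" pointer described above. */
    struct mcp_ptr {
        int  negative;   /* sign digit: 0xD taken as negative (assumption) */
        int  segment;    /* 0..7 */
        long offset;     /* 0..999999, digit offset within the segment */
    };

    /* digit[0] = sign (0xC or 0xD), digit[1] = segment, digit[2..7] = offset. */
    static struct mcp_ptr decode_ptr(const unsigned char digit[8])
    {
        struct mcp_ptr p;
        p.negative = (digit[0] == 0xD);
        p.segment  = digit[1];
        p.offset   = 0;
        for (int i = 2; i < 8; i++)
            p.offset = p.offset * 10 + digit[i];   /* six decimal digits */
        return p;
    }

    int main(void)
    {
        unsigned char d[8] = { 0xC, 3, 1, 2, 3, 4, 5, 6 };  /* C 3 123456 */
        struct mcp_ptr p = decode_ptr(d);
        printf("seg %d offset %ld%s\n", p.segment, p.offset,
               p.negative ? " (negative)" : "");
        return 0;
    }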

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Fri Oct 18 06:00:54 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    [...]

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving a goalpost.

    No, he isn't.

    One does not need a "good implementation" in the sense you have in mind.
    All one needs is an implementation that the compiler's pattern-matching
    logic unmistakably recognizes as memmove/memcpy. That is very easily
    done in standard C. For memmove, I had shown how to do it in one of the
    posts below. For memcpy it's very obvious, so there is no need to show it.

    You have misunderstood the meaning of "standard C", which means
    code that does not rely on any implementation-specific behavior.
    "All one needs is an implementation that ..." already invalidates
    the requirement that the code not rely on implementation-specific
    behavior.
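
    As an aside on the claim quoted above: one way to get an overlap-safe
    copy in strictly portable C is to use only pointer *equality*, which the
    standard always permits, to discover whether the destination lies inside
    the source region. The sketch below is my own illustration of that idea,
    not the code referred to in the earlier post, and there is no guarantee
    that any particular compiler will pattern-match it into a memmove call.

    #include <stddef.h>

    /* Overlap-safe copy using only pointer equality tests (always defined
       in standard C). If dst points into the source region at offset k > 0,
       the regions overlap with dst above src, so copy backwards; in every
       other case a forward copy is safe. */
    void *portable_memmove(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t k = n;                      /* n means "dst not inside src" */

        for (size_t i = 0; i < n; i++) {
            if (s + i == d) {
                k = i;
                break;
            }
        }

        if (k == n || k == 0) {
            for (size_t i = 0; i < n; i++)     /* forward copy */
                d[i] = s[i];
        } else {
            for (size_t i = n; i-- > 0; )      /* backward copy */
                d[i] = s[i];
        }
        return dst;
    }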

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to All on Fri Oct 18 05:39:02 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [ISA support for copying possibly overlapping regions of memory]

    [Separately, what is possible to do in portable standard C]

    [...] I really don't think any of us really disagree, it is just
    that we have been discussing two (mostly) orthogonal issues.

    I would summarize the string of conversations as follows.

    It started with talking about what is or is not possible in
    "standard C", by which is meant C that does not rely on any implementation-specific behavior. (Topic A.)

    The discussion shifted after a comment about how to provide
    architectural support for copying one region of memory to
    another, where the areas of memory might overlap. (Topic B.)

    After the introduction of Topic B, most of the subsequent
    conversation either ignored Topic A or conflated the two
    topics.

    The key point is that Topic B has nothing to do with Topic A,
    and vice versa. It's like asking why it's colder in the
    mountains than it is in the summer: both parts have something
    to do with temperature, but in spite of that there is no
    meaningful relationship between them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Fri Oct 18 14:06:17 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    Unisys discontinued that line of systems in 1992.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Fri Oct 18 17:34:16 2024
    On Fri, 18 Oct 2024 14:06:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/
    needing it.

    OK, take a segmented memory model with 16-bit pointers and a
    24-bit virtual address space. How do you actually compare two
    segmented pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    So, can it be said that at least some of the B6500-compatible models
    suffered from the same problem as the 80286 - the segment of maximal size
    didn't cover all of the linear (or physical) address space?
    Or was their index register width increased to accommodate 1e9 digits in
    a single segment?


    Unisys discontinued that line of systems in 1992.

    I thought it lasted longer. My impression was that hardware
    implementations (alongside emulation on Xeons) were still being sold
    up until 15 years ago.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to David Brown on Fri Oct 18 17:38:55 2024
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard C.
    I /do/ see an advantage in being able to do so well in non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require specific
    time constraints on these functions.  In such cases, you are not
    interested in writing fully portable software - it will already contain
    many implementation-specific features or use compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.

    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Fri Oct 18 16:19:08 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 18 Oct 2024 14:06:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/
    needing it.

    OK, take a segmented memory model with 16-bit pointers and a
    24-bit virtual address space. How do you actually compare two
    segmented pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    So, can it be said that at least some of the B6500-compatible models

    No. The systems I described above are from the medium
    systems family (B2000/B3000/B4000). The B5000/B6000/B7000
    (large) family systems were a completely different stack based
    architecture with a 48-bit word size. The Small systems (B1000)
    supported task-specific dynamic microcode loading (different
    microcode for a cobol app vs. a fortran app).

    Medium systems evolved from the Electrodata Datatron and 220 (1954) through
    the Burroughs B300 to the Burroughs B3500 by 1965. The B5000
    was also developed at the old Electrodata plant in Pasadena
    (where I worked in the 80s) - eventually large systems moved
    out - the more capable large systems (B7XXX) were designed in Tredyffrin
    Pa, the less capable large systems (B5XXX) were designed in Mission Viejo, Ca.

    suffered from the same problem as the 80286 - the segment of maximal size
    didn't cover all of the linear (or physical) address space?
    Or was their index register width increased to accommodate 1e9 digits in
    a single segment?


    Unisys discontinued that line of systems in 1992.

    I thought it lasted longer. My impression was that hardware
    implementations (alongside emulation on Xeons) were still being sold
    up until 15 years ago.

    Large systems still exist today in emulation[*], as do the
    former Univac (Sperry 2200) systems. The last medium system
    (V380) was retired by the City of Santa Ana in 2010 (almost two
    decades after Unisys cancelled the product line) and was moved
    to the Living Computer Museum.

    City of Santa Ana replaced the single 1980 vintage V380 with
    29 Windows servers.

    After the merger of Burroughs and Sperry in '86 there were six
    different mainframe architectures - by 1990, all but
    two (2200 and large systems) had been terminated.

    [*] Clearpath Libra https://www.unisys.com/client-education/clearpath-forward-libra-servers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Vir Campestris on Fri Oct 18 21:45:37 2024
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard
    C. I /do/ see an advantage in being able to do so well in
    non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require
    specific time constraints on these functions.  In such cases, you are
    not interested in writing fully portable software - it will already
    contain many implementation-specific features or use compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C. You are relying on implementation details, or writing code that is only suitable for a
    particular implementation (or set of implementations). It is normal to
    write this kind of thing in C, but it is non-portable C. (Or at least,
    not fully portable C.)

    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.


    It would normally be written in C, and the compiler will generate the
    "rep" assembly. The bit you can't write in fully portable standard C is
    the comparison of the pointers so you know which direction to do the
    copying.
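
    To make that concrete: a common "widely portable but not strictly
    standard" way of choosing the copy direction is to compare the pointers
    after converting them to uintptr_t, relying on the flat-address-space
    behaviour most implementations give that conversion. The sketch below is
    an illustration of that idiom, not the code of any particular C library;
    whether a given compiler turns the loops into rep movs or a memmove call
    is an observation about current compilers, not a guarantee.

    #include <stddef.h>
    #include <stdint.h>

    /* Direction-aware copy: the uintptr_t comparison is
       implementation-defined in general, but behaves as expected on the
       usual flat-address-space targets being discussed. */
    void *impl_memmove(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        if ((uintptr_t)d < (uintptr_t)s) {
            for (size_t i = 0; i < n; i++)     /* copy forwards */
                d[i] = s[i];
        } else {
            for (size_t i = n; i-- > 0; )      /* copy backwards */
                d[i] = s[i];
        }
        return dst;
    }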

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Sat Oct 19 19:46:41 2024
    On Fri, 18 Oct 2024 16:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 18 Oct 2024 14:06:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully
    standard way to compare independent pointers (other than
    just for equality). Rarely needing something does not mean
    /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a
    24-bit virtual address space. How do you actually compare two
    segmented pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    So, can it be said that at least some of the B6500-compatible models

    No. The systems I described above are from the medium
    systems family (B2000/B3000/B4000).

    I didn't realize that you were not talking about Large Systems.
    I didn't even know that Medium Systems used segmented memory.
    Sorry.

    The B5000/B6000/B7000
    (large) family systems were a completely different stack based
    architecture with a 48-bit word size. The Small systems (B1000)
    supported task-specific dynamic microcode loading (different
    microcode for a cobol app vs. a fortran app).

    Medium systems evolved from the Electrodata Datatron and 220 (1954)
    through the Burroughs B300 to the Burroughs B3500 by 1965. The B5000
    was also developed at the old Electrodata plant in Pasadena
    (where I worked in the 80s) - eventually large systems moved
    out - the more capable large systems (B7XXX) were designed in
    Tredyffrin Pa, the less capable large systems (B5XXX) were designed
    in Mission Viejo, Ca.

    suffered from the same problem as the 80286 - the segment of maximal
    size didn't cover all of the linear (or physical) address space?
    Or was their index register width increased to accommodate 1e9 digits
    in a single segment?


    Unisys discontinued that line of systems in 1992.

    I thought it lasted longer. My impression was that hardware
    implementations (alongside emulation on Xeons) were still being sold
    up until 15 years ago.

    Large systems still exist today in emulation[*], as do the
    former Univac (Sperry 2200) systems. The last medium system
    (V380) was retired by the City of Santa Ana in 2010 (almost two
    decades after Unisys cancelled the product line) and was moved
    to the Living Computer Museum.

    City of Santa Ana replaced the single 1980 vintage V380 with
    29 Windows servers.

    After the merger of Burroughs and Sperry in '86 there were six
    different mainframe architectures - by 1990, all but
    two (2200 and large systems) had been terminated.

    [*] Clearpath Libra https://www.unisys.com/client-education/clearpath-forward-libra-servers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to David Brown on Sun Oct 20 21:51:30 2024
    On 18/10/2024 20:45, David Brown wrote:
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard
    C. I /do/ see an advantage in being able to do so well in non-
    standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require
    specific time constraints on these functions.  In such cases, you are
    not interested in writing fully portable software - it will already
    contain many implementation-specific features or use compiler
    extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C. You are relying on
    implementation details, or writing code that is only suitable for a
    particular implementation (or set of implementations). It is normal to
    write this kind of thing in C, but it is non-portable C. (Or at least,
    not fully portable C.)

    Ah, I see your point. Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.


    It would normally be written in C, and the compiler will generate the
    "rep" assembly.  The bit you can't write in fully portable standard C is
    the comparison of the pointers so you know which direction to do the
    copying.

    It's a long time since I had to mistrust a compiler so much that I was
    pulling the assembler apart. It sounds as though they have got smarter
    in the meantime.

    I just checked BTW, and you are correct.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Vir Campestris on Mon Oct 21 08:58:05 2024
    On 20/10/2024 22:51, Vir Campestris wrote:
    On 18/10/2024 20:45, David Brown wrote:
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard
    C. I /do/ see an advantage in being able to do so well in non-
    standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised
    software. For example, you might be making real-time software and
    require specific time constraints on these functions.  In such
    cases, you are not interested in writing fully portable software -
    it will already contain many implementation-specific features or use
    compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C.  You are relying
    on implementation details, or writing code that is only suitable for a
    particular implementation (or set of implementations).  It is normal
    to write this kind of thing in C, but it is non-portable C.  (Or at
    least, not fully portable C.)

    Ah, I see your point. Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    Yes.

    I think /every/ implementation will require communication with the OS,
    if there is an OS - otherwise it will need support from other parts of
    the toolchain (such as symbols created in a linker script to define the
    heap area - that's the typical implementation in small embedded systems).

    The nearest you could get to a portable implementation would be using a
    local unsigned char array as the heap, but I don't believe that would be
    fully correct according to the effective type rules (or the "strict
    aliasing" or type-based aliasing rules, if you prefer those terms). It
    would also not be good enough for the needs of many programs.

    Of course, a fair amount of the code for malloc/free can written in
    fully portable C - and almost all of it can be written in a somewhat
    vaguely defined "widely portable C" where you can mask pointer bits to
    handle alignment, and other such conveniences.
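
    For what it's worth, a minimal sketch of that "array as heap" idea is
    shown below: a bump allocator carved out of a static unsigned char array
    (the 16 KB size and the function name are arbitrary choices for the
    sketch). As noted above, the effective-type question makes this dubious
    as a fully conforming general-purpose allocator, and this toy version has
    no free() at all; it only illustrates why no OS call is needed.

    #include <stddef.h>

    #define HEAP_SIZE (16u * 1024u)        /* arbitrary size for the sketch */

    static unsigned char heap[HEAP_SIZE];
    static size_t heap_used;

    /* Hand out maximally-aligned chunks from the static array; returns NULL
       when the array is exhausted. No free(), no coalescing - just enough
       to show the idea. */
    void *bump_alloc(size_t size)
    {
        size_t align = _Alignof(max_align_t);               /* C11 */
        size_t start = (heap_used + align - 1) & ~(align - 1);

        if (start > HEAP_SIZE || size > HEAP_SIZE - start)
            return NULL;                    /* out of heap space */
        heap_used = start + size;
        return &heap[start];
    }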


    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.


    It would normally be written in C, and the compiler will generate the
    "rep" assembly.  The bit you can't write in fully portable standard C
    is the comparison of the pointers so you know which direction to do
    the copying.

    It's a long time since I had to mistrust a compiler so much that I was pulling the assembler apart. It sounds as though they have got smarter
    in the meantime.

    I just checked BTW, and you are correct.


    Looking at the generated assembly is usually not a matter of mistrusting
    the compiler. One of the reasons I do so is to check that the compiler
    can generate efficient object code from my source code, in cases where I
    need maximal efficiency. I'd rather not write assembly unless I really
    have to!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Mon Oct 21 09:21:42 2024
    David Brown wrote:
    On 20/10/2024 22:51, Vir Campestris wrote:
    On 18/10/2024 20:45, David Brown wrote:
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in
    standard C. I /do/ see an advantage in being able to do so well in
    non- standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised
    software. For example, you might be making real-time software and
    require specific time constraints on these functions.  In such
    cases, you are not interested in writing fully portable software -
    it will already contain many implementation-specific features or
    use compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C. You are relying
    on implementation details, or writing code that is only suitable for
    a particular implementation (or set of implementations). It is
    normal to write this kind of thing in C, but it is non-portable C.
    (Or at least, not fully portable C.)

    Ah, I see your point. Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    Yes.

    I think /every/ implementation will require communication with the OS,
    if there is an OS - otherwise it will need support from other parts of
    the toolchain (such as symbols created in a linker script to define the
    heap area - that's the typical implementation in small embedded systems).

    The nearest you could get to a portable implementation would be using a
    local unsigned char array as the heap, but I don't believe that would be
    fully correct according to the effective type rules (or the "strict
    aliasing" or type-based aliasing rules, if you prefer those terms). It
    would also not be good enough for the needs of many programs.

    Of course, a fair amount of the code for malloc/free can written in
    fully portable C - and almost all of it can be written in a somewhat
    vaguely defined "widely portable C" where you can mask pointer bits to
    handle alignment, and other such conveniences.


    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.

    It would normally be written in C, and the compiler will generate the
    "rep" assembly.  The bit you can't write in fully portable standard
    C is the comparison of the pointers so you know which direction to do
    the copying.

    It's a long time since I had to mistrust a compiler so much that I was
    pulling the assembler apart. It sounds as though they have got smarter
    in the meantime.

    I just checked BTW, and you are correct.


    Looking at the generated assembly is usually not a matter of mistrusting
    the compiler. One of the reasons I do so is to check that the compiler
    can generate efficient object code from my source code, in cases where I
    need maximal efficiency. I'd rather not write assembly unless I really
    have to!

    For near-light-speed code I used to write it first in C, optimize that,
    then I would translate it into (inline) asm and re-optimize based on
    having the full cpu architecture available, before in the final stage I
    would use the asm experience to tweak the C just enough to let the
    compiler generate machine code quite close (90+%) to my best asm, while
    still being portable to any cpu with more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C version
    was still fast enough that a couple of years later I got a prize in the
    mail: Someone in France had submitted my C code, with my name & address,
    to a similar competition there and it was still faster than anyone else. :-)

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Oct 21 14:04:42 2024
    >>> I don't see an advantage in being able to implement them in standard C.
    >> It means you can likely also implement a related yet different API
    >> without having your code "demoted" to non-standard.
    > That makes no sense to me. We are talking about implementing standard
    > library functions. If you want to implement other functions, go ahead.

    No, I'm talking about a very general principle that applies to
    languages, libraries, etc...

    For example, in Emacs I always try [and don't always succeed] to make
    sure that the default behavior for a given functionality can be
    implemented using the official API entry points of the underlying
    library, because it makes it more likely that whoever wants to replace
    that behavior with something else will be able to do it without having
    to break abstraction barriers.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Vir Campestris on Mon Oct 21 23:17:10 2024
    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Oct 21 23:52:59 2024
    On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    POSIX is an environment not an OS.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Oct 22 01:09:49 2024
    On Mon, 21 Oct 2024 23:52:59 +0000, MitchAlsup1 wrote:

    On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require communication with the OS
    there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    POSIX is an environment not an OS.

    Guess what the “OS” part of “POSIX” stands for.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Mon Oct 21 18:32:27 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such a book I guarantee I will want to buy one.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Oct 22 08:27:12 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such I book I guarantee I will want to buy one.

    Thank you Tim!

    Probably not a book but I would consider writing a series of blog posts
    similar to that, now that I am about to retire: My wife and I will both
    go on "permanent vacation" starting a week before Christmas. :-)

    I already know that this will give me more time to work on digital
    mapping projects (ref my https://mapant.no/ Norwegian topo map generated
    from ~50 TB of LiDAR), but if there's an interest in optimization I
    might do that as well.

    BTW, I am also open to doing some consulting work, if the problems are interesting enough. :-)

    Regards,
    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to ldo@nz.invalid on Tue Oct 22 17:26:06 2024
    On Tue, 22 Oct 2024 01:09:49 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Mon, 21 Oct 2024 23:52:59 +0000, MitchAlsup1 wrote:

    On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require communication with the OS
    there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    POSIX is an environment not an OS.

    Guess what the “OS” part of “POSIX” stands for.

    It's still just an environment - POSIX defines only an interface, not
    an implementation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Wed Oct 23 07:25:42 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such a book I guarantee I will want to buy one.

    Thank you Tim!

    I know from past experience you are good at this. I would love
    to hear what you have to say.

    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    P.S. Is the email address in your message a good way to reach you?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Wed Oct 23 18:11:57 2024
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Oct 23 18:27:06 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    And start working for "HER". (Honeydew list).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Scott Lurndal on Wed Oct 23 21:12:57 2024
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    And start working for "HER". (Honeydew list).

    My wife does have a small list of things that we (i.e. I) could do when we retire...

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Wed Oct 23 21:11:59 2024
    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    I'm still connected to Mill Computing as well.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Wed Oct 23 21:01:01 2024
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    I'm still connected to Mill Computing as well.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Wed Oct 23 21:09:47 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such a book I guarantee I will want to buy one.

    Thank you Tim!

    I know from past experience you are good at this. I would love
    to hear what you have to say.

    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    I'm sure you're right!

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    P.S. Is the email address in your message a good way to reach you?

    Yes, that is my personal domain, so it won't be affected by my retirement.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Thu Oct 24 07:39:52 2024
    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    I don't know that usage, I thought quires was a typesetting/printing
    measure?

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Tim Rentsch on Thu Oct 24 06:55:20 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    One thing I have thought of is a wiki of optimization techniques that
    contains descriptions of the techniques and case studies, but I have
    not yet implemented this idea.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Thu Oct 24 10:00:16 2024
    On 24/10/2024 08:55, Anton Ertl wrote:
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    One thing I have thought of is a wiki of optimization techniques that contains descriptions of the techniques and case studies, but I have
    not yet implemented this idea.


    Would it make sense to start something under Wikibooks on Wikipedia? I
    have no experience with it myself, but it looks to me like a way to have
    a collaborative collection of related knowledge. It could provide the structure and framework, saving you (plural) from having to set up a
    wiki, blog, or whatever.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Thu Oct 24 16:34:45 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 24/10/2024 08:55, Anton Ertl wrote:
    One thing I have thought of is a wiki of optimization techniques that
    contains descriptions of the techniques and case studies, but I have
    not yet implemented this idea.


    Would it make sense to start something under Wikibooks on Wikipedia?

    Yes, I was thinking about that. In the bookshelf on computer
    programming <https://en.wikibooks.org/wiki/Shelf:Computer_programming>
    there are two "Books nearing completion" that have "Opti" in the
    title:

    https://en.wikibooks.org/wiki/Optimizing_Code_for_Speed
    https://en.wikibooks.org/wiki/Optimizing_C%2B%2B

    Looking at the contents of the former, it's rather short and
    high-level, and I don't think it's intended for the kind of project we
    have in mind.

    The latter is more in the direction I have in mind, but the limitation
    to C++ is, well, limiting.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Thu Oct 24 18:32:22 2024
    On Thu, 24 Oct 2024 5:39:52 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    I don't know that usage, I thought quires was a typesetting/printing
    measure?

    Terje


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Lawrence D'Oliveiro on Sun Oct 27 20:42:09 2024
    On 22/10/2024 00:17, Lawrence D'Oliveiro wrote:
    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    One of the other groups I'm following just for the hell of it is
    comp.os.cpm. I'm pretty sure you don't get POSIX in your 64kb (max).

    "cannot be a _truly_ portable" is what I meant. Portable to most machine
    is easy - just write for Windows. POSIX will give you a larger subset -
    but still a subset.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Terje Mathisen on Sun Oct 27 20:45:09 2024
    On 23/10/2024 20:12, Terje Mathisen wrote:

    My wife do have a small list of things that we (i.e. I) could do when we retire...

    Since I retired the garden is looking much better, I've started to win
    the odd trophy sailing, most of the house has been redecorated...

    But best of all - I've lost 5 kg and been able to stop worrying about my
    weight!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Vir Campestris on Sun Oct 27 21:04:49 2024
    On Sun, 27 Oct 2024 20:42:09 +0000, Vir Campestris wrote:

    I'm pretty sure you don't get POSIX in your 64kb (max).

    <https://news.ycombinator.com/item?id=34981059>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Schultz@21:1/5 to Vir Campestris on Sun Oct 27 17:55:52 2024
    On 10/27/24 3:42 PM, Vir Campestris wrote:
    On 22/10/2024 00:17, Lawrence D'Oliveiro wrote:
    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    One of the other groups I'm following just for the hell of it is
    comp.os.cpm/ I'm pretty sure you don't get POSIX in your 64kb (max).

    Ignores the 16 bit versions of CP/M: 8086, 68000, Z8000.

    --
    http://davesrocketworks.com
    David Schultz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Oct 27 23:58:05 2024
    According to Vir Campestris <vir.campestris@invalid.invalid>:
    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    One of the other groups I'm following just for the hell of it is
    comp.os.cpm/ I'm pretty sure you don't get POSIX in your 64kb (max).

    Mini-Unix got nearly all of v6 Unix in 56K bytes.

    See https://gunkies.org/wiki/MINI-UNIX

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Mon Oct 28 11:39:57 2024
    MitchAlsup1 wrote:
    On Thu, 24 Oct 2024 5:39:52 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    OK, I have seen and used "Super-accumulator" as the term for those. I
    have thought about implementing one in carry-save redundant form, but
    that might be more redundancy than is really needed?

    Having a carry bit for every byte should still make it possible to
    handle several additions/cycle, right?

    I'm assuming the real cost is in the alignment network needed to route incoming addends into the right slice.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Mon Oct 28 16:30:46 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming courses on a Pascal compiler
    which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ;
    a hardware implementation was available on the 4361 as an option:
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Mon Oct 28 10:12:08 2024
    On 10/28/2024 9:30 AM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    Another newer alternative. This came up on my news feed. I haven't
    looked at the details at all, so I can't comment on it.

    https://arxiv.org/abs/2410.03692

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Mon Oct 28 18:14:20 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 10/28/2024 9:30 AM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    Another newer alternative. This came up on my news feed. I haven't
    looked at the details at all, so I can't comment on it.

    https://arxiv.org/abs/2410.03692

    That is about another number representation for AI, trying to squeeze
    more AI performance out of a few bits.

    Personally, I like the approach of doing analog calculation for
    the low-accuracy dot products that they do, followed by an A/D
    converter. There is a company doing that, but I forget its name.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Mon Oct 28 15:24:18 2024
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    These would be very large registers. You'd need some way to store and
    load these for register spills, fills and task switches, as well as to
    move and manage them.

    Karlsruhe above has a link to http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Of course, once you have 168-byte registers people are going to
    think of new uses for them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Tue Oct 29 06:33:50 2024
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    These would be very large registers. You'd need some way to store and load the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I cannot find the
    number of cycles that the instructions took.

    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.


    Of course, once you have 168-byte registers people are going to
    think of new uses for them.

    SIMD from hell? Pretend that a CPU is a graphics card? :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Tue Oct 29 08:07:50 2024
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    These would be very large registers. You'd need some way to store and load >> the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    If I was implementing this I would probably want some redundant storage
    to limit carry propagation, so maybe 48 bits per 64-bit chunk, in which
    case I would need about 2800 bits or 6 of those 512-bit SIMD regs.

    SIMD from hell? Pretend that a CPU is a graphics card? :-)

    Writing this as a throughput task could make it fit better within a GPU?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Oct 29 14:19:13 2024
    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    IIUC you can already implement such a thing with standard IEEE
    operations, based on the "standard" Knuth approach of computing the
    exact result of `a + b` as the sum of x + y where x is the "normal" sum
    of a + b (and hence y holds the remaining bits lost to rounding).

    I wonder how often this is used in practice.

    Intuitively it should be possible to make it reasonably efficient, where
    you first compute the "naive" sum but also keep the N remaining numbers representing the bits lost to each of the N roundings. I.e. you take in
    a vector "as" of N numbers and return a pair of the "naive" sum plus
    a vector of N rounding errors.

    Σ as => (round(Σ As), rs)
    such that round(Σ As) = the naive IEEE sum of as
    and Σ as = round(Σ As) + Σ rs

    You can then recursively compute "Σ rs" in the same way. At each step of
    the recursion you can compute round(Σ |rs|) to estimate an upper bound
    on the remaining error and thus stop when that error is smaller than
    1 ULP or somesuch.

    AFAICT, if your sum is well-conditioned you should need at most 2 steps
    of the recursion, and I suspect you can predict when the next estimated
    error will be too small before you start the last recursion, so the last recursion might skip the generation of the last "rs" vector.
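
    For concreteness, a minimal C sketch of the scheme described above;
    the names (two_sum, sum_with_errors, accurate_sum) and the fixed pass
    limit are illustrative rather than taken from any particular library,
    and the simple "the correction no longer changes the total" test
    stands in for the round(Σ |rs|) error estimate:

        #include <stddef.h>

        /* Knuth's TwoSum: s = fl(a+b) and err is the exact rounding error,
           so a + b == s + err holds exactly for finite doubles
           (assumes round-to-nearest).                                      */
        static void two_sum(double a, double b, double *s, double *err)
        {
            double sum = a + b;
            double bb  = sum - a;
            *err = (a - (sum - bb)) + (b - bb);
            *s = sum;
        }

        /* One pass: return the naive left-to-right sum of as[0..n-1] and
           overwrite as[] with the rounding error of each addition.        */
        static double sum_with_errors(double *as, size_t n)
        {
            double s = 0.0;
            for (size_t i = 0; i < n; i++) {
                double e;
                two_sum(s, as[i], &s, &e);
                as[i] = e;            /* keep the error term for the next pass */
            }
            return s;
        }

        /* Recurse on the error vector until the correction stops mattering. */
        static double accurate_sum(double *as, size_t n)
        {
            double total = 0.0;
            for (int pass = 0; pass < 10; pass++) {   /* well-conditioned: ~2 passes */
                double prev = total;
                total += sum_with_errors(as, n);
                if (total == prev)
                    break;            /* correction fell below 1/2 ULP of the total */
            }
            return total;
        }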


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Tue Oct 29 14:29:28 2024
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.
    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
    These would be very large registers. You'd need some way to store and load >> the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    Right, something like 2048+52+3 = 2103 bits for data, plus some status bits. For x64 they could overlay it onto AVX-512 register file in groups of 5
    and use existing SIMD instructions for management.
    That would allow them to pack 3 accumulators into registers z0..z14.

    For RISC-V they have the large vector registers, 32 * 256-bits each I think,
    so again 3 accumulators.

    So it's a plausible proposition.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Tue Oct 29 19:57:25 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility >>>
    These would be very large registers. You'd need some way to store and load >>> the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.
    (Insert fear and loathing for hex float here).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Oct 29 20:30:12 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility >>>>
    These would be very large registers. You'd need some way to store and load >>>> the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit: >>>>
    "A floating-point accumulator occupies a 168-byte storage area that is >>>> aligned on a 256-byte boundary. An accumulator consists of a four-byte >>>> status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    IBM format had one sign bit, seven exponent bits and six or fourteen >hexadecimal digits for single and double precision, respectively.
    (Insert fear and loathing for hex float here).

    Burroughs Medium systems had four exponent sign bits, eight exponent bits,
    four mantissa sign bits, and up to 400 mantissa bits. BCD, so that's an exponent range of -99 to +99 and a 1 to 100 digit mantissa.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Oct 29 20:21:11 2024
    On Tue, 29 Oct 2024 19:57:25 +0000, Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility >>>>
    These would be very large registers. You'd need some way to store and
    load
    the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit: >>>>
    "A floating-point accumulator occupies a 168-byte storage area that is >>>> aligned on a 256-byte boundary. An accumulator consists of a four-byte >>>> status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory
    accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    Terje--IEEE is all capitals.

    IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.

    The span of an IEEE double "quire" would be twice the exponent range plus the fraction:
    a) The most significant non-infinity has an exponent of +1023
    b) The least significant non-underflow has an exponent of -1023
    Leaving a span of 2046 bits plus 52 denormalized bits or 2098-bits
    or 262 bytes.

    One note: When left in memory, one indexes the accumulator with
    the (exponent>>6) and fetches 2 doublewords. A carry out requires
    accessing the 3rd doubleword (possibly transitively).
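
    As a rough software illustration of that layout, here is a C sketch
    for non-negative finite doubles: the accumulator is an array of 64-bit
    words, an incoming significand is placed at the word selected by
    (roughly) exponent>>6, and any carry is propagated into the following
    words. The limb count, the bias constant, the names, and the omission
    of signs and of the final rounding back to a double are my
    simplifications, not part of any standard.

        #include <stdint.h>
        #include <string.h>

        #define ACC_LIMBS 36    /* 36*64 = 2304 bits: the ~2098-bit span plus headroom */
        #define ACC_BIAS  1074  /* accumulator bit 0 represents 2^-1074                */

        typedef struct { uint64_t limb[ACC_LIMBS]; } superacc;  /* zero-initialised */

        /* Add a non-negative finite double exactly into the accumulator. */
        static void superacc_add(superacc *acc, double x)
        {
            uint64_t bits;
            memcpy(&bits, &x, sizeof bits);
            uint64_t frac   = bits & 0x000FFFFFFFFFFFFFull;
            int      biased = (int)((bits >> 52) & 0x7FF);
            uint64_t sig    = biased ? (frac | (1ull << 52)) : frac; /* implicit bit */
            int      e      = (biased ? biased : 1) - 1023;          /* unbiased exp */

            int pos   = (e - 52) + ACC_BIAS;  /* bit position of significand bit 0  */
            int limb  = pos >> 6;             /* i.e. indexed by the exponent >> 6  */
            int shift = pos & 63;

            uint64_t lo = sig << shift;                     /* spans at most 2 limbs */
            uint64_t hi = shift ? (sig >> (64 - shift)) : 0;

            acc->limb[limb] += lo;
            uint64_t carry = (acc->limb[limb] < lo);        /* carry out of low limb */
            for (int i = limb + 1; i < ACC_LIMBS && (hi | carry); i++) {
                uint64_t sum = acc->limb[i] + hi;           /* add the high part ... */
                uint64_t c   = (sum < hi);
                sum += carry;                               /* ... then the carry    */
                c   += (sum < carry);
                acc->limb[i] = sum;
                hi = 0;                                     /* then carries only     */
                carry = c;
            }
        }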

    (Insert fear and loathing for hex float here).

    Heck, watching Kahan's notes on FP problems leaves one in fear of
    binary floating point representations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Oct 29 21:27:29 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    (Insert fear and loathing for hex float here).

    Heck, watching Kahan's notes on FP problems leaves one in fear of
    binary floating point representations.

    True, but... hex float is so much worse.

    "Hacker's delight" has some choice words there, and the
    author worked for IBM :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Fri Jan 3 03:37:50 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 19:30, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    Compare this with the pain the x86 world went through, over a much longer >>>>> time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which >>>> is a very different path.

    Still, the question is why they did the 286 (released 1982) with its
    protected mode instead of adding IA-32 to the architecture, maybe at
    the start with a 386SX-like package and with real-mode only, or with
    the MMU in a separate chip (like the 68020/68551).


    I can only guess the obvious - it is what some big customer(s) were
    asking for. Maybe Intel didn't see the need for 32-bit computing in the >>markets they were targeting, or at least didn't see it as worth the cost.

    Anyone could see the problems that the PDP-11 had with its 16-bit
    limitation. Intel saw it in the iAPX 432 starting in 1975. It is
    obvious that, as soon as memory grows beyond 64KB (and already the
    8086 catered for that), the protected mode of the 80286 would be more
    of a hindrance than even the real mode of the 8086. I find it hard to believe that many customers would ask Intel for something the 80286
    protected mode with segments limited to 64KB, and even if, that Intel
    would listen to them. This looks much more like an idee fixe to me
    that one or more of the 286 project leaders had, and all customer
    input was made to fit into this idea, or was ignored.

    From my point of view the main drawbacks of the 286 are poor support
    for large arrays and problems for Lisp-like systems which have a lot
    of small data structures and traverse them via pointers.

    However, playing devil's advocate, I can see sense in the 286. IMO
    Intel targeted quite a different market. IIUC the main intended market
    for the 8086 was industrial control and various embedded applications.
    The 286 was probably intended for similar markets, but with a stronger
    emphasis on security. In control applications it is typical to
    have several cooperating processes. The 286 allows separate local
    descriptor tables for each task, so a multitasking program may easily
    have, say, 30000 descriptors. Getting a similar number of separately
    protected objects using paging would require a similar number of
    pages, which with a 16 MB total address space leads to 512-byte
    pages. For smaller paged systems the situation is even worse: with
    512 kB of memory, 512-byte pages give 1024 pages in total, which
    means that access control cannot be very granular and one would get
    significant memory fragmentation for objects smaller than a page.
    I can guess that Intel rejected very small pages as problematic to
    implement. So if the goal is fine-grained access control, then
    segmentation for a machine of the 286's size looks better than paging.

    Concerning code "model", I think that Intel assumend that
    most procedures would fit in a single segment and that
    average procedure will be of order of single kilobytes.
    Using 16-bit offsets for jumps inside procedure and
    segment-offset pair for calls is likely to lead to better
    or similar performance as purely 32-bit machine. For
    control applications it is likely that each procedure
    will access moderate number of segments and total amount
    of accessed data will be moderate. In other words, Intel
    probably considerd "mostly medium" model where procedure
    mainly accesses it data segment using just 16-bit offsets
    and occasionally accesses other segments.

    Compared to PDP-11 this leads to resonably natural
    code that use some hundreds of kilobytes of memory,
    much better than 128 kB limit of PDP-11 with separate
    code and data areas. And segment maniputlation allows
    also bigger programs.

    What went wrong? IIUC there were several control systems
    using 286 features, so there was some success. But PCs
    became the main user of x86 chips, and a significant fraction
    of PCs was used for gaming. Game authors wanted direct
    access to hardware, which in the case of the 286 forced real mode.
    Also, for a long time the 8088 played a major role and PC software
    "had" to run on the 8088. Software vendors theoretically could
    release separate versions for each processor or do some
    runtime switching of critical procedures, but the easiest way
    was to depend on compatibility with the 8088. "Better" OSes
    went the Unix way, depending on paging and not using segmentation.
    But IIUC the first paging Unix appeared _after_ the release of the 286.
    In 286 times Multics was highly regarded and it depended heavily
    on segmentation. MVS was using paging hardware, but was
    talking about segments, except that MVS segmentation
    was flawed because some addresses far outside a segment were
    considered part of a different segment. I think that
    in VMS there was also some talk about segments. So the creators
    of the 286 could believe that they were providing the "right thing"
    and not a fake made possible by paging hardware.

    Concerning the cost, the 80286 has 134,000 transistors, compared to supposedly 68,000 for the 68000, and the 190,000 of the 68020. I am
    sure that Intel could have managed a 32-bit 8086 (maybe even with the
    nice addressing modes that the 386 has in 32-bit mode) with those
    134,000 transistors if Motorola could build the 68000 with half of
    that.

    I think that Intel could have managed to build a "mostly" 32-bit
    processor in the transistor budget of the 8086, that is, with say 7
    registers of 32 bits each, where each register could be treated as a
    pair of 16-bit registers and 32-bit operations would take twice as
    much time as 16-bit operations. But I think that such a processor
    would be slower (say 10% slower) than the 8086, mostly because of the
    need to use longer addresses more often. Similarly, a hypothetical
    32-bit 286 would be slower than the real 286. And I do not think they
    could have made a 32-bit processor with segmentation in the available
    transistor budget, and even if they had managed it, it would have been
    slowed down by too-long addresses (segment + 32-bit offset).

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Fri Jan 3 08:38:49 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    Concerning code "model", I think that Intel assumend that
    most procedures would fit in a single segment and that
    average procedure will be of order of single kilobytes.
    Using 16-bit offsets for jumps inside procedure and
    segment-offset pair for calls is likely to lead to better
    or similar performance as purely 32-bit machine.

    With the 80286's segments and their slowness, that is very doubtful.
    The 8086 has branches with 8-bit offsets and branches and calls with
    16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
    offsets and branches and calls with 32-bit offsets; if 16-bit offsets
    for branches would be useful enough for performance, they could
    instead have designed the longer branch length to be 16 bits, and
    maybe a prefix for 32-bit branch offsets. That would be faster than
    what you outline, as soon as one call happens. But apparently 16-bit
    branches are not that beneficial, or they would have gone that way on
    the 386.

    Another usage of segments for code would be to put the code segment of
    a shared object (known as DLL among Windowsheads) in a segment, and
    use far calls to call functions in other shared objects, while using
    near calls within a shared object. This allows to share the code
    segments between different programs, and to locate them anywhere in
    physical memory. However, AFAIK shared objects were not a thing in
    the 80286 timeframe; Unix only got them in the late 1980s.

    I used Xenix on a 286 in 1986 or 1987; my impression is that programs
    were limited to 64KB code and 64KB data size, exactly the PDP-11 model
    you denounce.

    What went wrong? IIUC there were several control systems
    using 286 features, so there was some success. But PC-s
    became main user of x86 chips and significant fraction
    of PC-s was used for gaming. Game authors wanted direct
    access to hardware which in case of 286 forced real mode.

    All successful software used direct access to hardware because of
    performance; the rest waned. Using BIOS calls was just too slow.
    Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster thanks
    to writing directly to video memory.
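
    "Writing directly to video" here means poking character/attribute
    pairs straight into the text-mode frame buffer at segment B800h
    instead of going through INT 10h; a sketch in 16-bit Borland-style C,
    where far and MK_FP() are compiler extensions and the function name
    is illustrative:

        #include <dos.h>   /* Borland/Turbo C: MK_FP() builds a far pointer */

        /* Store one character and its colour attribute straight into the
           80x25 colour text-mode buffer at B800:0000, bypassing the BIOS. */
        static void put_cell(int row, int col, char ch, unsigned char attr)
        {
            unsigned int far *vram = (unsigned int far *) MK_FP(0xB800, 0);
            vram[row * 80 + col] = ((unsigned int) attr << 8) | (unsigned char) ch;
        }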

    But IIUC first paging Unix appeared _after_ release of 286.

    From <https://en.wikipedia.org/wiki/History_of_the_Berkeley_Software_Distribution#3BSD>:

    |The kernel of 32V was largely rewritten by Berkeley graduate student
    |Özalp Babaoğlu to include a virtual memory implementation, and a
    |complete operating system including the new kernel, ports of the 2BSD
    |utilities to the VAX, and the utilities from 32V was released as 3BSD
    |at the end of 1979.

    The 80286 was introduced on February 1, 1982.

    In 286 time Multics was highly regarded and it heavily depended
    on segmentaion. MVS was using paging hardware, but was
    talking about segments, except for that MVS segmentation
    was flawed because some addresses far outside a segment were
    considered as part of different segment. I think that also
    in VMS there was some taliking about segments. So creators
    of 286 could believe that they are providing "right thing"
    and not a fake possible with paging hardware.

    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competitive in protected mode compared to real mode.

    So if they really had wanted protected mode to succeed, they should
    have designed in 32-bit data segments (and maybe also 32-bit code
    segments). Alternatively, if protected mode and the 32-bit addresses
    do not fit in the 286 transistor budget, a CPU that implements the
    32-bit feature and leaves away protected mode would have been more
    popular than the 80286; and (depending on how the 32-bit extension was implemented) it might have been a better stepping stone towards the
    kind of CPU with protected mode that they imagined; but the alt-386
    designers probably would not have designed in this kind of protected
    mode that they did.

    Concerning paging, all these scenarios are without paging. Paging was primarily a virtual-memory feature, not a memory-protection feature.
    It acquired memory protection only as far as it was easy with pages
    (i.e., at page granularity). So paging was not designed as a
    competition to segments as far as protection was concerned. If
    computer architects had managed to design segmentation with
    competitive performance, we would be seeing hardware with both paging
    and segmentation nowadays. Or maybe even without paging, now that
    memories tend to be big enough to make virtual memory mostly
    unnecessary.

    And I do not think thay could make
    32-bit processor with segmentation in available transistor
    buget,

    Maybe not.

    and even it they managed it would be slowed down by too
    long addresses (segment + 32-bit offset).

    On the contrary, every program that does not fit in the medium memory
    model on the 80286 would run at least as fast on such a CPU in real
    mode and significantly faster in protected mode.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Jan 3 18:11:53 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    But IIUC first paging Unix appeared _after_ release of 286.

    From ><https://en.wikipedia.org/wiki/History_of_the_Berkeley_Software_Distribution#3BSD>:

    |The kernel of 32V was largely rewritten by Berkeley graduate student
    |Özalp Babaoğlu to include a virtual memory implementation, and a
    |complete operating system including the new kernel, ports of the 2BSD >|utilities to the VAX, and the utilities from 32V was released as 3BSD
    |at the end of 1979.

    There was also a version of Western Electric unix that ran on the VAX in that time frame.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Sat Jan 4 22:40:51 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    In the second case one can pack several objects into a single
    segment, so apart from the lost security properties this is not
    a big problem. But there is a lot of segment-register loading,
    and slow loading is a problem.

    Concerning code "model", I think that Intel assumend that
    most procedures would fit in a single segment and that
    average procedure will be of order of single kilobytes.
    Using 16-bit offsets for jumps inside procedure and
    segment-offset pair for calls is likely to lead to better
    or similar performance as purely 32-bit machine.

    With the 80286's segments and their slowness, that is very doubtful.
    The 8086 has branches with 8-bit offsets and branches and calls with
    16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
    offsets and branches and calls with 32-bit offsets; if 16-bit offsets
    for branches would be useful enough for performance, they could
    instead have designed the longer branch length to be 16 bits, and
    maybe a prefix for 32-bit branch offsets.

    At that time Intel apparently wanted to avoid having too many
    instructions.

    That would be faster than
    what you outline, as soon as one call happens. But apparently 16-bit branches are not that beneficial, or they would have gone that way on
    the 386.

    For a machine with a 32-bit bus the benefit is much smaller.

    Another usage of segments for code would be to put the code segment of
    a shared object (known as DLL among Windowsheads) in a segment, and
    use far calls to call functions in other shared objects, while using
    near calls within a shared object. This allows to share the code
    segments between different programs, and to locate them anywhere in
    physical memory. However, AFAIK shared objects were not a thing in
    the 80286 timeframe; Unix only got them in the late 1980s.

    IIUC shared segments were widely used on Multics.

    I used Xenix on a 286 in 1986 or 1987; my impression is that programs
    were limited to 64KB code and 64KB data size, exactly the PDP-11 model
    you denounce.

    Maybe. I have seen many cases where software essentially "wastes"
    good things offered by the hardware.

    What went wrong? IIUC there were several control systems
    using 286 features, so there was some success. But PC-s
    became main user of x86 chips and significant fraction
    of PC-s was used for gaming. Game authors wanted direct
    access to hardware which in case of 286 forced real mode.

    Every successful software used direct access to hardware because of performance; the rest waned. Using BIOS calls was just too slow.
    Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster from
    writing directly to video.

    For most early graphics cards direct screen access could have been
    allowed just by allocating an appropriate segment. And most non-games
    could have gained good performance with a better system interface.
    I think that the variety of tricks used in games and their
    popularity made protected-mode systems much less appealing
    to vendors. And that discouraged work on better interfaces
    for non-games.

    More generally, vendors could have released separate versions of
    programs for the 8086 and the 286, but few did so. And users having
    only binaries wanted to run their 8086 programs on their new systems,
    which led to heroic efforts like the OS/2 DOS box and later Linux
    dosemu. But integration of 8086 programs with protected mode was
    solved too late for the 286 model to gain traction (and on the 286 a
    "DOS box" had to run in real mode, breaking normal system protection).

    But IIUC first paging Unix appeared _after_ release of 286.

    From <https://en.wikipedia.org/wiki/History_of_the_Berkeley_Software_Distribution#3BSD>:

    |The kernel of 32V was largely rewritten by Berkeley graduate student
    |Özalp Babaoğlu to include a virtual memory implementation, and a
    |complete operating system including the new kernel, ports of the 2BSD
    |utilities to the VAX, and the utilities from 32V was released as 3BSD
    |at the end of 1979.

    The 80286 was introduced on February 1, 1982.

    OK

    In 286 time Multics was highly regarded and it heavily depended
    on segmentaion. MVS was using paging hardware, but was
    talking about segments, except for that MVS segmentation
    was flawed because some addresses far outside a segment were
    considered as part of different segment. I think that also
    in VMS there was some taliking about segments. So creators
    of 286 could believe that they are providing "right thing"
    and not a fake possible with paging hardware.

    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competetive in protected mode compared to real mode.

    AFAICS that covered the vast majority of programs during the eighties.
    Turbo Pascal offered only the medium memory model and was quite
    popular. Its code generator produced mediocre output, but
    real Turbo Pascal programs used a lot of inline assembly
    and performance was not bad.

    Intel apparently assumed that programmers were willing to spend
    extra work to get good performance, and IMO this was right
    as a general statement. Intel probably did not realize that
    programmers would be very reluctant to spend work on security
    features, and in particular to spend work on making programs
    fast in 286 protected mode.

    So if they really had wanted protected mode to succeed, they should
    have designed in 32-bit data segments (and maybe also 32-bit code
    segments). Alternatively, if protected mode and the 32-bit addresses
    do not fit in the 286 transistor budget, a CPU that implements the
    32-bit feature and leaves away protected mode would have been more
    popular than the 80286; and (depending on how the 32-bit extension was implemented) it might have been a better stepping stone towards the
    kind of CPU with protected mode that they imagined; but the alt-386
    designers probably would not have designed in this kind of protected
    mode that they did.

    Intel probably assumed that the 286 would cover most needs, especially
    given that most systems had much less memory than the 16 MB theoretically
    allowed by the 286. And for bigger systems they released the 386.

    Concerning paging, all these scenarios are without paging. Paging was primarily a virtual-memory feature, not a memory-protection feature.

    Yes, exactly.

    It acquired memory protection only as far as it was easy with pages
    (i.e., at page granularity). So paging was not designed as a
    competition to segments as far as protection was concerned. If
    computer architects had managed to design segmentation with
    competetive performance, we would be seeing hardware with both paging
    and segmentation nowadays. Or maybe even without paging, now that
    memories tend to be big enough to make virtual memory mostly
    unnecessary.

    And I do not think thay could make
    32-bit processor with segmentation in available transistor
    buget,

    Maybe not.

    and even it they managed it would be slowed down by too
    long addresses (segment + 32-bit offset).

    On the contrary, every program that does not fit in the medium memory
    model on the 80286 would run at least as fast on such a CPU in real
    mode and significantly faster in protected mode.

    I think that Intel considered "programs that do not fit in the medium
    memory model" a tiny minority. IMO this is partially true: there
    is a class of programs which with some work fit into the medium
    model, but using a flat address space is easier. I think that
    on the 286 (that is, with a 16-bit bus) those programs (assuming enough
    tuning) run faster than a flat 32-bit version. But naive compilation
    in the large (or huge) model leads to worse speed than flat mode.

    In a somewhat different spirit, for programs that do not fit in
    64kB but are not too large, there is a natural temptation to
    have some "compression" scheme for pointers and use mostly
    16-bit pointers. That can be done without special hardware
    support. OTOH Intel segmentation is a specific proposal
    in that direction with hardware support. Clearly it is
    less flexible than software schemes based on native 32-bit
    addressing. But I think that Intel segmentation had some
    attractive features during the eighties.
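
    A minimal sketch of such a software compression scheme, assuming all
    compressible objects live in a single arena; the arena size, the
    4-byte granule and the names are illustrative, not anything period
    compilers actually provided:

        #include <stdint.h>

        /* Data structures store 16-bit offsets (scaled by 4) instead of
           full pointers, so a stored pointer costs 2 bytes yet can
           address up to 256 KiB worth of objects in the arena.           */
        static unsigned char arena[256u * 1024]; /* objects carved out at 4-byte granularity */

        typedef uint16_t cptr;                   /* compressed pointer */

        static cptr compress(void *p)
        {
            return (cptr)(((unsigned char *)p - arena) >> 2);
        }

        static void *decompress(cptr c)
        {
            return arena + ((uint32_t)c << 2);
        }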

    Another thing is the 386. I think that the designers of the 286 thought
    that the 386 would remove some limitations. And the 386 allowed
    bigger segments, removing one major limitation. OTOH
    for a 32-bit processor with segmentation it would be natural
    to have 32-bit segment registers. It is not clear to
    me whether the 16-bit segment registers in the 386 were deemed necessary
    for backward compatibility, or whether by the 386 period the flat-memory
    faction in Intel had won and they kept segmentation mostly
    for compatibility.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jan 5 02:56:08 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    >antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    Intel clearly had some strong opinions about how people would program
    the 286, which turned out to bear almost no relation to the way we
    actually wanted to program it.

    Some of the stuff they did was just perverse, like putting flag
    bits in the low part of the segment number rather than the high
    bit. If you had objects bigger than 64K, you had to shift
    the segment number three bits to the left when computing
    addresses.
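
    In code, that looks roughly like the following hypothetical
    huge-pointer increment for 286 protected mode, assuming the OS hands
    out consecutive descriptors that each map the next 64 KiB chunk of a
    large object (the types and names are mine):

        #include <stdint.h>

        /* A 286 selector is index:TI:RPL, with the table index in bits
           15..3, so "the next segment" means selector + 8, not + 1.      */
        typedef struct { uint16_t sel; uint16_t off; } farptr;

        static farptr huge_advance(farptr p, uint32_t delta)
        {
            uint32_t linear = (uint32_t)p.off + delta;  /* may cross 64K chunks       */
            p.sel += (uint16_t)((linear >> 16) << 3);   /* segment number shifted by 3 */
            p.off  = (uint16_t)linear;
            return p;
        }

    In real mode the corresponding step is a shift by 12 instead, since
    consecutive 64 KiB chunks are 4096 paragraphs apart.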

    They also apparently didn't expect people to switch segments much.
    If you loaded a segment register with the value it already contained,
    it still fetched all of the stuff from memory. How many gates would
    it have taken to check for the same value and bypass the loads? If
    they had done that, we could have used large model calls everywhere
    since long and short calls would be about the same speed, and not
    had to screw around deciding what was a long call and what was short
    and writing glue code to allow both kinds.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Sun Jan 5 03:55:39 2025
    On Sun, 5 Jan 2025 2:56:08 +0000, John Levine wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    Intel clearly had some strong opinions about how people would program
    the 286, which turned out to bear almost no relation to the way we
    actually wanted to program it.

    Clearly, Intel thought that .text, .data, .stack, and .heap were
    about all anyone would ever need.

    Some of the stuff they did was just perverse, like putting flag
    bits in the low part of the segment number rather than the high
    bit. If you had objects bigger than 64K, you had to shift
    the segment number three bits to the left when computing
    addresses.

    Let us just agree that whatever they were thinking, they blew it.

    They also apparently didn't expect people to switch segments much.

    Clearly. They expected segments to be essentially stagnant--unlike
    the people trying to do things with x86s...

    If you loaded a segment register with the value it already contained,
    it still fetched all of the stuff from memory.

    If segment LD was 1 cycle, the number of segment changes would be
    fine. But since LDing a segment was so expensive, they probably
    could not afford the transistors and wires to do what you suggest.

    How many gates would
    it have taken to check for the same value and bypass the loads?

    286:: 180 gates per segment register
    386:: 360 gates per segment register

    If
    they had done that, we could have used large model calls everywhere
    since long and short calls would be about the same speed, and not
    had to screw around deciding what was a long call and what was short
    and writing glue codes to allow both kinds.

    If you had 32 segment registers, it probably would not have mattered
    that segment LD was slow. And if you had 32 pointing registers, you
    would not have needed a LD-OP ISA.

    Sigh ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Waldek Hebisch on Sun Jan 5 08:54:29 2025
    Waldek Hebisch wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competitive in protected mode compared to real mode.

    AFAICS that covered the vast majority of programs during the eighties.
    Turbo Pascal offered only medium memory model and was quite
    popular. Its code generator produced mediocre output, but
    real Turbo Pascal programs used a lot of inline assembly
    and performance was not bad.

    As someone who wrote megabytes of that asm, I feel qualified to comment:

    Turbo Pascal 1.0 itself ran in the Small model (64kB code & data) AFAIR, but
    since the compiler/editor/linker/loader/debugger only used 35 kB (37 kB
    if you also loaded the text error messages), it had enough room left
    over for the source code it compiled.

    From the very beginning it supported Medium as you state, with separate
    code in the CS reg and data+stack (DS+SS) sharing a single segment.

    This way you had to use ES for all cross-segment operations,
    particularly REP MOVSB block moves.

    Later versions supported the Large model where all addresses were segment+offset pairs, as well as Huge where the segment was pointing at
    the object, rounded down to the nearest 16-byte boundary, and the offset (typically BX) was always [0-15].
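
    [Aside: a sketch, in portable C, of the real-mode "huge" normalization
    described above: fold the offset into the 16-byte-granular segment so the
    remaining offset is always 0-15. The function name is invented; an actual
    huge pointer would of course live in a 16-bit compiler.]

        #include <stdint.h>
        #include <stdio.h>

        /* Fold a 20-bit linear address into segment:offset with offset 0..15,
         * so objects can span more than 64K without the offset wrapping. */
        static void huge_normalize(uint32_t linear, uint16_t *seg, uint16_t *off)
        {
            *seg = (uint16_t)(linear >> 4);   /* paragraph (16-byte) granularity */
            *off = (uint16_t)(linear & 0xF);  /* what remains: always 0..15 */
        }

        int main(void)
        {
            uint16_t seg, off;
            huge_normalize(0x12345u, &seg, &off);
            printf("%04X:%04X\n", seg, off);  /* prints 1234:0005 */
            return 0;
        }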

    Intel apparently assumed that programmers are willing to spend
    extra work to get good performance and IMO this was right
    as a general statement. Intel probably did not realize that
    programmers will be very reluctant to spend work on security
    features and in particular to spend work on making programs
    fast in 286 protected mode.

    Protected mode could only be fast if segment reloads were rare; in my own
    code I would allocate arrays of largish objects, as many as would fit
    in 64K, then grab the next segment.
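
    [Aside: a rough sketch in C of that allocation pattern, with invented
    names; the point is simply that every object in a chunk shares one 64K
    segment, so a segment register reload is only paid when crossing to the
    next chunk.]

        #include <stdlib.h>

        #define SEG_BYTES 65536u                    /* one 64K segment per chunk */

        typedef struct { char payload[512]; } Obj;  /* some "largish" object */
        #define OBJS_PER_CHUNK (SEG_BYTES / sizeof(Obj))

        typedef struct Chunk {
            struct Chunk *next;
            size_t        used;
            Obj           objs[OBJS_PER_CHUNK];
        } Chunk;

        /* Allocate from the current chunk; only when it fills up do we "grab
         * the next" -- i.e. pay for another segment (here just another malloc). */
        static Obj *alloc_obj(Chunk **head)
        {
            if (*head == NULL || (*head)->used == OBJS_PER_CHUNK) {
                Chunk *c = malloc(sizeof *c);
                if (!c) return NULL;
                c->next = *head;
                c->used = 0;
                *head = c;
            }
            return &(*head)->objs[(*head)->used++];
        }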

    Terje
    PS. Happy New Year!

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sun Jan 5 14:48:00 2025
    In article <2025Jan3.093849@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    The 8086 has branches with 8-bit offsets and branches and calls
    with 16-bit offsets. The 386 in 32-bit mode has branches with
    8-bit offsets and branches and calls with 32-bit offsets; if
    16-bit offsets for branches would be useful enough for performance,
    they could instead have designed the longer branch length to be
    16 bits, and maybe a prefix for 32-bit branch offsets. That would
    be faster than what you outline, as soon as one call happens.
    But apparently 16-bit branches are not that beneficial, or they
    would have gone that way on the 386.

    Don't assume that Intel of the early 1980s would have done enough
    simulation to explore those possibilities thoroughly. Given the mistakes
    they made in the 1970s with iAPX 432 and in the 1990s with Itanium, both through lack of simulation with varying workloads, they may well have
    been working by rules of thumb and engineering "intuition."

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Sun Jan 5 11:10:28 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view the main drawbacks of the 286 are poor support for
    large arrays and problems for Lisp-like systems which have a lot
    of small data structures and traverse them via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    In the second case one can pack several objects into a single
    segment, so except for lost security properties this is not
    a big problem.

    If you go that way, you lose all the benefits of segments, and run
    into the "segments too small" problem. Which you then want to
    circumvent by using segment and offset in your addressing of the small
    data structures, which leads to:

    But there is a lot of loading segment registers
    and slow loading is a problem.

    ...
    Using 16-bit offsets for jumps inside a procedure and
    segment-offset pairs for calls is likely to lead to better
    or similar performance compared to a purely 32-bit machine.

    With the 80286's segments and their slowness, that is very doubtful.
    The 8086 has branches with 8-bit offsets and branches and calls with
    16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
    offsets and branches and calls with 32-bit offsets; if 16-bit offsets
    for branches would be useful enough for performance, they could
    instead have designed the longer branch length to be 16 bits, and
    maybe a prefix for 32-bit branch offsets.

    At that time Intel apparently wanted to avoid having too many
    instructions.

    Looking in my Pentium manual, the section on CALL has 20 lines for
    "call intersegment", "call gate" (with privilege variants) and "call
    to task" instructions, 10 of which probably already existed on the 286
    (compared to 2 lines for "call near" instructions that existed on the
    286), and the "Operation" section (the specification in pseudocode)
    consumes about 4 pages, followed by a 1.5 page "Description" section.

    9 of these 10 far call variants deal with protected-mode things, so
    Intel obviously had no qualms about adding instruction variants. If
    they instead had no protected mode, but some 32-bit support, including
    the near call with 32-bit offset that I suggest, that would have
    reduced the number of instruction variants.

    I used Xenix on a 286 in 1986 or 1987; my impression is that programs
    were limited to 64KB code and 64KB data size, exactly the PDP-11 model
    you denounce.

    Maybe. I have seen many cases where software essentially "wastes"
    good things offered by hardware.

    Which "good things offered by hardware" do you see "wasted" by this
    usage in Xenix? To me this seems to be the only workable way to use
    the 286 protected mode. Ok, the medium model (near data, far code)
    may also have been somewhat workable, but looking at the cycle counts
    for the protected-mode far calls on the Pentium (and on the 286 they
    were probably even more costly), which start at 22 cycles for a "call
    gate, same privilege" (compared to 1 cycle on the Pentium for a
    direct call near), one would strongly prefer the small model.

    Every successful piece of software used direct access to hardware because of
    performance; the rest waned. Using BIOS calls was just too slow.
    Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster from
    writing directly to video.

    For most early graphics cards direct screen access could be allowed
    just by allocating an appropriate segment. And most non-games
    could gain good performance with a better system interface.
    I think that the variety of tricks used in games and their
    popularity made protected-mode systems much less appealing
    to vendors. And that discouraged work on better interfaces
    for non-games.
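
    [Aside: for the record, the kind of direct screen access under discussion
    looked roughly like this (a sketch for a 16-bit DOS compiler such as Turbo
    C; 'far' is a compiler extension and B800h is the colour text-mode segment,
    so this will not build with a modern 32/64-bit compiler). Under a
    protected-mode OS the same code could work if the system handed the program
    a selector mapping the B8000h region, which is the "allocating an
    appropriate segment" mentioned above.]

        /* Write a character and attribute straight into colour text-mode video
         * memory at B800:0000 -- what Lotus 1-2-3 did instead of BIOS calls.
         * 16-bit DOS C only: casting a long to a far pointer puts the high
         * word into the segment. */
        static void put_cell(int row, int col, char ch, unsigned char attr)
        {
            unsigned int far *video = (unsigned int far *)0xB8000000UL;
            video[row * 80 + col] = ((unsigned int)attr << 8) | (unsigned char)ch;
        }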

    MicroSoft and IBM invested lots of work in a 286 protected-mode
    interface: OS/2 1.x. It was limited to the 286 at the insistence of
    IBM, even though work started in August 1985, when they already knew
    that the 386 was coming soon. OS/2 1.0 was released in April 1987,
    1.5 years after the 386.

    OS/2 1.x flopped, and by the time OS/2 was adjusted to the 386, it was
    too late, so the 286 killed OS/2; here we have a case of a software
    project being death-marched by tying itself to "good things offered by hardware" (except that Microsoft defected from the death march after a
    few years).

    Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
    addition to the base (8086) variant of Windows 2.0, which was released
    in December 1987), which used 386 protected mode and virtual 8086 mode
    (which was missing in the "brain-damaged" (Bill Gates) 286). So
    Windows completely ignored 286 protected mode. Windows eventually
    became a big success.

    Also, Microsoft started NT OS/2 in November 1988 to target the 386
    while IBM was still working on 286 OS/2. Eventually Microsoft and IBM
    parted ways, NT OS/2 became Windows NT, which is the starting point of
    all remaining Windowses from Windows XP onwards.

    Xenix, apart from OS/2 the only other notable protected-mode OS for
    the 286, was ported to the 386 in 1987, after SCO secured "knowledge
    from Microsoft insiders that Microsoft was no longer developing
    Xenix", so SCO (or Microsoft) might have done it even earlier if the
    commercial situation had been less muddled; in any case, Xenix jumped
    the 286 ship ASAP.

    The verdict is: The only good use of the 286 is as a faster 8086;
    small memory model multi-tasking use is possible, but the 64KB
    segments are so limiting that everybody who understood software either
    decided to skip this twist (MicroSoft, except on their OS/2 death
    march), or jumped ship ASAP (SCO).

    More generally, vendors could release separate versions of
    programs for 8086 and 286 but few did so.

    Were there any who released software both as 8086 and a protected-mode
    80286 variants? Microsoft/SCO with Xenix, anyone else?

    And users having
    only binaries wanted to use 8086 on their new systems which
    led to heroic efforts like OS/2 DOS box and later Linux
    dosemu. But integration of 8086 programs with protected
    mode was solved too late for 286 model to gain traction
    (and on 286 "DOS box" had to run in real mode, breaking
    normal system protection).

    Linux never ran on a 80286, and DOSemu uses the virtual 8086 mode,
    which does not require heroic efforts AFAIK.

    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competitive in protected mode compared to real mode.

    AFAICS that covered the vast majority of programs during the eighties.

    The "vast majority" is not enough; if a key application like Lotus
    1-2-3 or Wordperfect did not work on the DOS alternative, the DOS
    alternative was not used. And Lotus 1-2-3 and Wordperfect certainly
    did not limit themselves to 64KB of data.

    Turbo Pascal offered only medium memory model

    According to Terje Mathisen, it also offered the large memory model.
    On its Wikipedia page, I find: "Besides allowing applications larger
    than 64 KB, Byte in 1988 reported ... for version 4.0". So apparently
    Turbo Pascal 4.0 introduced support for the large memory model in
    1988.

    Intel apparently assumed that programmers are willing to spend
    extra work to get good performance and IMO this was right
    as a general statement. Intel probably did not realize that
    programmers will be very reluctant to spend work on security
    features and in particular to spend work on making programs
    fast in 286 protected mode.

    80286 protected mode is never faster than real mode on the same CPU,
    so the way to make programs fast on the 286 is to stick with real
    mode; using the small memory model is an alternative, but as
    mentioned, the memory limits are too restrictive.

    Intel probably assumed that the 286 would cover most needs,

    As far as protected mode was concerned, they hardly could have been
    more wrong.

    especially
    given that most systems had much less memory than the 16 MB theoretically
    allowed by the 286.

    They provided 24 address pins, so they obviously assumed that there
    would be 80286 systems with >8MB. 64KB segments are already too
    limiting on systems with 1MB (which was supported by the 8086),
    probably even for anything beyond 128KB.

    IMO this is partially true: there
    is a class of programs which with some work fit into medium
    model, but using flat address space is easier. I think that
    on 286 (that is with 16 bit bus) those programs (assuming enough
    tuning) run faster than flat 32-bit version.

    Maybe in real mode. Certainly not in protected mode. Just run your
    tuned large-model protected-mode program against a 32-bit small-model
    program for the same task on a 386SX (which is reported as having a
    very similar speed to the 80286 on 16-bit programs). And even if you
    find one case where the protected-mode program wins, nobody found it
    worth their time to do this nonsense. And so OS/2 flopped despite
    being backed by IBM and, until 1990, Microsoft.

    But I think that Intel segmentation had some
    attractive features during eighties.

    You are one of a tiny minority. Even Intel finally saw the light, as
    did everybody else, and nowadays segments are just a bad memory.

    Another thing is the 386. I think that the designers of the 286 thought
    that the 386 would remove some limitations. And the 386 allowed
    bigger segments, removing one major limitation. OTOH
    for a 32-bit processor with segmentation it would be natural
    to have 32-bit segment registers. It is not clear to
    me if 16-bit segment registers in the 386 were deemed necessary
    for backward compatibility, or maybe by the 386 period the flat-memory
    faction in Intel had won and they kept segmentation mostly
    for compatibility.

    The latter (read the 386 oral history). The 386 designers knew that
    segments have no future, and they were right, so they kept them at a
    minimum.

    If they had gone for 32-bit segment registers (and 64-bit segment
    registers for AMD64), would segments have fared any better? I doubt
    it. Using segments would have stayed slow, and would have been
    ignored by nearly all programmers.

    These days we see segment-like things in security extensions of
    instruction sets, but slowness still plagues these extensions, and
    security researchers often find ways to subvert the promised security
    (and sometimes even more).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Waldek Hebisch on Sun Jan 5 15:20:31 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    Another usage of segments for code would be to put the code segment of
    a shared object (known as DLL among Windowsheads) in a segment, and
    use far calls to call functions in other shared objects, while using
    near calls within a shared object. This allows to share the code
    segments between different programs, and to locate them anywhere in
    physical memory. However, AFAIK shared objects were not a thing in
    the 80286 timeframe; Unix only got them in the late 1980s.

    IIUC shared segments were widely used on Multics.

    They were widely used on both the Burroughs large systems
    and the HP-3000 as well, both exemplars of segmentation
    done right, in so far as it can be.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Sun Jan 5 15:23:38 2025
    jgd@cix.co.uk (John Dallman) writes:
    In article <6d5fa21e63e14491948ffb6a9d08485a@www.novabbs.org>,
    mitchalsup@aol.com (MitchAlsup1) wrote:
    On Sun, 5 Jan 2025 2:56:08 +0000, John Levine wrote:
    They also apparently didn't expect people to switch segments much.

    Clearly. They expected segments to be essentially stagnant--unlike
    the people trying to do things with x86s...

    An idea: The target markets for the 8080 and 8085 were clearly embedded
    systems. The Z80 and 6502 rapidly became popular in the micro-computer
    market, but the 808[05] did not. Intel may still have been thinking in
    terms of embedded systems when designing the 80286.

    I suspect that we don't, today, recall all the constraints that
    were facing Intel with respect to processor ASIC development in the
    late 70's and early 80's.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to mitchalsup@aol.com on Sun Jan 5 15:15:00 2025
    In article <6d5fa21e63e14491948ffb6a9d08485a@www.novabbs.org>, mitchalsup@aol.com (MitchAlsup1) wrote:
    On Sun, 5 Jan 2025 2:56:08 +0000, John Levine wrote:
    They also apparently didn't expect people to switch segments much.

    Clearly. They expected segments to be essentially stagnant--unlike
    the people trying to do things with x86s...

    An idea: The target markets for the 8080 and 8085 were clearly embedded systems. The Z80 and 6502 rapidly became popular in the micro-computer
    market, but the 808[05] did not. Intel may still have been thinking in
    terms of embedded systems when designing the 80286.

    The IBM PC was launched in August 1981 and around a year passed before it became clear that this machine was having a huge and lasting effect on
    the market. The 80286 was released on February 1st 1982, although it
    wasn't used much in PCs until the IBM PC/AT in August 1984.

    The 80386 sampled in 1985 and was mass-produced in 1986. That would seem
    to have been the first version of x86 where it was obvious at the start
    of design that use in general-purpose computers would be important. It
    was far more suitable for the job than the 80286, and things developed
    from there.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to Anton Ertl on Sun Jan 5 18:30:41 2025
    On Sun, 05 Jan 2025 11:10:28 GMT, Anton Ertl wrote:

    Xenix, apart from OS/2 the only other notable protected-mode OS for the
    286, was ported to the 386 in 1987, after SCO secured "knowledge from Microsoft insiders that Microsoft was no longer developing Xenix", so
    SCO (or Microsoft) might have done it even earlier if the commercial situation had been less muddled; in any case, Xenix jumped the 286 ship
    ASAP.

    Microport Systems had UNIX System V for the 286.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Sun Jan 5 17:51:34 2025
    jgd@cix.co.uk (John Dallman) writes:
    An idea: The target markets for the 8080 and 8085 were clearly embedded
    systems. The Z80 and 6502 rapidly became popular in the micro-computer
    market, but the 808[05] did not.

    The 8080 was used in the first microcomputers, e.g., the 1974 Altair
    8800 and the IMSAI 8080. It was important for all the CP/M machines,
    because the CP/M software (both the OS and the programs running on it)
    were written to use the 8080 instruction set, not the Z80 instruction
    set. And CP/M was the professional microcomputer OS before the IBM PC compatible market took off, despite the fact that the most popular microcomputers of the time (such as the Apple II, TRS-80 and PET) did
    not use it; there was an add-on card for the Apple II with a Z80 for
    running CP/M, though, which shows the importance of CP/M.

    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there. It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    Intel released the MCS-51 (aka 8051) in 1980 for embedded systems, and
    it's very successful there, and before that came the MCS-48 (8048) in
    1976.

    Intel may still have been thinking in
    terms of embedded systems when designing the 80286.

    I very much doubt that the segments and the 24 address bits were
    designed for embedded systems. The segments look more like an echo of
    the iAPX432 than of anything designed for embedded systems.

    The idea of some static allocation of memory for which segments might
    work may come from old mainframe systems, where programs were (in my
    impression) more static than PC programs and modern computing. Even
    Unix programs, which were more dynamic than mainframe programs, had
    quite a bit of static allocation in the early days; this is reflected
    in the paragraph in the GNU coding standards:

    |Avoid arbitrary limits on the length or number of any data structure,
    |including file names, lines, files, and symbols, by allocating all
    |data structures dynamically. In most Unix utilities, "long lines are
    |silently truncated". This is not acceptable in a GNU utility.

    The IBM PC was launched in August 1981 and around a year passed before it
    became clear that this machine was having a huge and lasting effect on
    the market. The 80286 was released on February 1st 1982, although it
    wasn't used much in PCs until the IBM PC/AT in August 1984.

    The 80286 project was started in 1978, before any use of the 8086. <https://timeline.intel.com/1978/kicking-off-the-80286> claims that
    they "spent six months on field research into customers' needs alone";
    Judging by the results, maybe the customers were clueless, or maybe
    Intel asked the wrong questions.

    The 80386 sampled in 1985 and was mass-produced in 1986. That would seem
    to have been the first version of x86 where it was obvious at the start
    of design that use in general-purpose computers would be important.

    Actually, reading the oral history of the 386, at the start the 386
    project was just an unimportant followon of the 286, while the main
    action was expected to be on the BiiN project (from which the i960
    came). Only sometime during that project the IBM PC market exploded
    and the 386 became the most important project of the company.

    But yes, they were very much aware of the needs of programmers in the
    386 project, and would probably have done something with just paging
    and no segments if they did not have the 8086 and 80286 legacy.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jan 5 20:01:25 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there. It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    I have heard that the IBM PC was originally designed with a Z80, and
    fairly late in the process someone decided (not unreasonably) that it
    wouldn't be different enough from all the other Z80 boxes to be an
    interesting product. They wanted a 16 bit processor but for time and
    money reasons they stayed with the 8 bit bus they already had. The
    options were 68008 and 8088. Moto was only shipping samples of the
    68008 while Intel could provide 8088 in quantity, so they went with
    the 8088.

    If Moto had been a little farther along, the history of the PC industry
    could have been quite different.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Jan 5 19:40:42 2025
    On Sun, 5 Jan 2025 17:51:34 +0000, Anton Ertl wrote:

    jgd@cix.co.uk (John Dallman) writes:
    An idea: The target markets for the 8080 and 8085 were clearly embedded
    systems. The Z80 and 6502 rapidly became popular in the micro-computer
    market, but the 808[05] did not.

    The 8080 was used in the first microcomputers, e.g., the 1974 Altair
    8800 and the IMSAI 8080. It was important for all the CP/M machines,
    because the CP/M software (both the OS and the programs running on it)
    were written to use the 8080 instruction set, not the Z80 instruction
    set. And CP/M was the professional microcomputer OS before the IBM PC compatible market took off, despite the fact that the most popular microcomputers of the time (such as the Apple II, TRS-80 and PET) did
    not use it; there was an add-on card for the Apple II with a Z80 for
    running CP/M, though, which shows the importance of CP/M.

    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there. It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    Intel released the MCS-51 (aka 8051) in 1980 for embedded systems, and
    it's very successful there, and before that came the MCS-48 (8048) in
    1976.

    Intel may still have been thinking in
    terms of embedded systems when designing the 80286.

    I very much doubt that the segments and the 24 address bits were
    designed for embedded systems. The segments look more like an echo of
    the iAPX432 than of anything designed for embedded systems.

    The idea of some static allocation of memory for which segments might
    work may come from old mainframe systems, where programs were (in my
    impression) more static than PC programs and modern computing. Even
    Unix programs, which were more dynamic than mainframe programs, had
    quite a bit of static allocation in the early days; this is reflected
    in the paragraph in the GNU coding standards:

    |Avoid arbitrary limits on the length or number of any data structure,
    |including file names, lines, files, and symbols, by allocating all
    |data structures dynamically. In most Unix utilities, "long lines are
    |silently truncated". This is not acceptable in a GNU utility.

    The IBM PC was launched in August 1981 and around a year passed before it
    became clear that this machine was having a huge and lasting effect on
    the market. The 80286 was released on February 1st 1982, although it
    wasn't used much in PCs until the IBM PC/AT in August 1984.

    The 80286 project was started in 1978, before any use of the 8086. <https://timeline.intel.com/1978/kicking-off-the-80286> claims that
    they "spent six months on field research into customers' needs alone"; Judging by the results, maybe the customers were clueless, or maybe
    Intel asked the wrong questions.

    Or perhaps what Intel thought they heard was not closely related
    to what the customers were actually saying.

    The 80386 sampled in 1985 and was mass-produced in 1986. That would seem
    to have been the first version of x86 where it was obvious at the start
    of design that use in general-purpose computers would be important.

    Actually, reading the oral history of the 386, at the start the 386
    project was just an unimportant followon of the 286, while the main
    action was expected to be on the BiiN project (from which the i960
    came). Only sometime during that project the IBM PC market exploded
    and the 386 became the most important project of the company.

    But yes, they were very much aware of the needs of programmers in the
    386 project, and would probably have done something with just paging
    and no segments if they did not have the 8086 and 80286 legacy.

    Oh well ...

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Sun Jan 5 20:55:20 2025
    On Sun, 5 Jan 2025 20:01:25 +0000, John Levine wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the >>iAPX432 clearly showed that they wanted to be dominant there. It's an >>irony of history that the 8086/8088 actually went where the action
    was.

    I have heard that the IBM PC was originally designed with a Z80, and
    fairly late in the process someone decided (not unreasonably) that it
    wouldn't be different enough from all the other Z80 boxes to be an
    interesting product. They wanted a 16 bit processor but for time and
    money reasons they stayed with the 8 bit bus they already had. The
    options were 68008 and 8088. Moto was only shipping samples of the
    68008 while Intel could provide 8088 in quantity, so they went with
    the 8088.

    If Moto had been a little farther along, the history of the PC industry
    could have been quite different.

    If Moto had done 68008 first, it may very well have turned out
    differently.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to John Levine on Sun Jan 5 20:46:43 2025
    John Levine <johnl@taugh.com> wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there. It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    I have heard that the IBM PC was originally designed with a Z80, and
    fairly late in the process someone decided (not unreasonably) that it
    wouldn't be different enough from all the other Z80 boxes to be an
    interesting product. They wanted a 16 bit processor but for time and
    money reasons they stayed with the 8 bit bus they already had. The
    options were 68008 and 8088. Moto was only shipping samples of the
    68008 while Intel could provide 8088 in quantity, so they went with
    the 8088.

    If Moto had been a little farther along, the history of the PC industry
    could have been quite different.

    The 8088 was not a threat to any of IBM’s existing products, the 68x00 was.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Sun Jan 5 22:01:36 2025
    MitchAlsup1 wrote:
    On Sun, 5 Jan 2025 20:01:25 +0000, John Levine wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there.  It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    I have heard that the IBM PC was originally designed with a Z80, and
    fairly late in the process someone decided (not unreasonably) that it
    wouldn't be different enough from all the other Z80 boxes to be an
    interesting product. They wanted a 16 bit processor but for time and
    money reasons they stayed with the 8 bit bus they already had. The
    options were 68008 and 8088. Moto was only shipping samples of the
    68008 while Intel could provide 8088 in quantity, so they went with
    the 8088.

    If Moto had been a little farther along, the history of the PC industry
    could have been quite different.

    If Moto had done 68008 first, it may very well have turned out
    differently.

    But neither of these were possible (i.e. available) when IBM picked
    their CPU.

    I do believe that IBM did seriously consider the risk of making the PC
    too good, so that it would compete directly with their low-end systems (8100?).

    At least, that's what I assumed when the PC-AT only ran at 6MHz on a CPU
    which was designed for 8 MHz. I fondly remember a bunch of overclocking
    hacks on various 286 machines, most of them ran at 9 MHz, and I don't
    think I saw any that didn't handle 8 MHz.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Sun Jan 5 21:49:20 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view the main drawbacks of the 286 are poor support for
    large arrays and problems for Lisp-like systems which have a lot
    of small data structures and traverse them via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    In the second case one can pack several objects into a single
    segment, so except for lost security properties this is not
    a big problem.

    If you go that way, you lose all the benefits of segments, and run
    into the "segments too small" problem. Which you then want to
    circumvent by using segment and offset in your addressing of the small
    data structures, which leads to:

    But there is a lot of loading segment registers
    and slow loading is a problem.

    ...
    Using 16-bit offsets for jumps inside a procedure and
    segment-offset pairs for calls is likely to lead to better
    or similar performance compared to a purely 32-bit machine.

    With the 80286's segments and their slowness, that is very doubtful.
    The 8086 has branches with 8-bit offsets and branches and calls with
    16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
    offsets and branches and calls with 32-bit offsets; if 16-bit offsets
    for branches would be useful enough for performance, they could
    instead have designed the longer branch length to be 16 bits, and
    maybe a prefix for 32-bit branch offsets.

    At that time Intel apparently wanted to avoid having too many
    instructions.

    Looking in my Pentium manual, the section on CALL has 20 lines for
    "call intersegment", "call gate" (with privilege variants) and "call
    to task" instructions, 10 of which probably already existed on the 286
    (compared to 2 lines for "call near" instructions that existed on the
    286), and the "Operation" section (the specification in pseudocode)
    consumes about 4 pages, followed by a 1.5 page "Description" section.

    9 of these 10 far call variants deal with protected-mode things, so
    Intel obviously had no qualms about adding instruction variants. If
    they instead had no protected mode, but some 32-bit support, including
    the near call with 32-bit offset that I suggest, that would have
    reduced the number of instruction variants.

    I wrote "instructions". Intel clearly used modes and variants,
    but different call would lead to new opcode.

    I used Xenix on a 286 in 1986 or 1987; my impression is that programs
    were limited to 64KB code and 64KB data size, exactly the PDP-11 model
    you denounce.

    Maybe. I have seen many cases where software essentially "wastes"
    good things offered by hardware.

    Which "good things offered by hardware" do you see "wasted" by this
    usage in Xenix?

    Medium model and shared segments. Plus an escape for programs needing
    more memory (traditional Unix programs by necessity fit in 64kB
    limits).

    To me this seems to be the only workable way to use
    the 286 protected mode. Ok, the medium model (near data, far code)
    may also have been somewhat workable, but looking at the cycle counts
    for the protected-mode far calls on the Pentium (and on the 286 they
    were probably even more costly), which start at 22 cycles for a "call
    gate, same privilege" (compared to 1 cycle on the Pentium for a
    direct call near), one would strongly prefer the small model.

    I have found an instruction list on the web which claims 26 + m cycles,
    where m is the "length of next instruction" (whatever that means), for a
    protected-mode call using a segment. A real-mode call using a segment
    is 13 + m cycles. A near call is 7 + m cycles.

    Intel clearly expected that segment-changing calls would be infrequent.
    AFAICS this was better than the system conventions on IBM mainframes,
    where a "standard" call normally called a memory-allocation function
    to allocate the stack frame. I do not have data for the VAX handy, but
    VAX calls were quite complex, so probably also not fast.

    And modern data at least partially confirms Intel's beliefs. When
    AMD introduced 64-bit mode they also introduced a complex calling
    convention intended to optimize the speed of calls. Later there
    was a paper by Intel folks essentially claiming that this
    calling convention does not matter: C compilers inline small
    routines, so the cost of calls relative to other things is quite
    small. I think that what was inlined in 2010 would have been called
    using near calls in 1982.

    Every successful piece of software used direct access to hardware because of
    performance; the rest waned. Using BIOS calls was just too slow.
    Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster from
    writing directly to video.

    For most early graphics cards direct screen access could be allowed
    just by allocating an appropriate segment. And most non-games
    could gain good performance with a better system interface.
    I think that the variety of tricks used in games and their
    popularity made protected-mode systems much less appealing
    to vendors. And that discouraged work on better interfaces
    for non-games.

    MicroSoft and IBM invested lots of work in a 286 protected-mode
    interface: OS/2 1.x. It was limited to the 286 at the insistence of
    IBM, even though work started in August 1985, when they already knew
    that the 386 was coming soon. OS/2 1.0 was released in April 1987,
    1.5 years after the 386.

    OS/2 1.x flopped, and by the time OS/2 was adjusted to the 386, it was
    too late, so the 286 killed OS/2; here we have a case of a software
    project being death-marched by tying itself to "good things offered by hardware" (except that Microsoft defected from the death march after a
    few years).

    Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
    addition to the base (8086) variant of Windows 2.0, which was released
    in December 1987), which used 386 protected mode and virtual 8086 mode
    (which was missing in the "brain-damaged" (Bill Gates) 286). So
    Windows completely ignored 286 protected mode. Windows eventually
    became a big success.

    What I recall is a bit different. IIRC the first successful version of
    Windows, that is Windows 3.0, had 3 modes of operation: 8086 compatible,
    286 protected mode and 386 protected mode. Only later did Microsoft
    drop the requirement for 8086 compatibility. I think still later
    it dropped 286 support. Windows 95 was supposed to be 32-bit,
    but contained quite a lot of 16-bit code. IIRC the system interface
    to Windows 3.0 and 3.1 was 16-bit and only later did Microsoft
    release an extension allowing 32-bit system calls.

    I have no information about Windows internals except for some
    public statements by Microsoft and other people, but I think
    it reasonable to assume that Windows was actually a successful
    example of 8086/286/386 compatibility. That is, their 16-bit
    code could use real-mode segmentation or protected-mode
    segmentation, the latter both on the 286 and the 386. For the 32-bit
    version they added a translation layer to transform arguments
    between the 16-bit world and the 32-bit world. It is possible
    that this translation layer involved a lot of effort. IIUC
    DEC, when porting VMS to Alpha, essentially gave up using
    32-bit pointers as the main interface.

    Anyway, it seems that Windows was at least as tied to the 286
    as OS/2 when it became successful, and dropped 286 support
    later. And for a long time after dropping 286 support
    Windows massively used 16-bit segments.

    Also, Microsoft started NT OS/2 in November 1988 to target the 386
    while IBM was still working on 286 OS/2. Eventually Microsoft and IBM
    parted ways, NT OS/2 became Windows NT, which is the starting point of
    all remaining Windowses from Windows XP onwards.

    Xenix, apart from OS/2 the only other notable protected-mode OS for
    the 286, was ported to the 386 in 1987, after SCO secured "knowledge
    from Microsoft insiders that Microsoft was no longer developing
    Xenix", so SCO (or Microsoft) might have done it even earlier if the commercial situation had been less muddled; in any case, Xenix jumped
    the 286 ship ASAP.

    The verdict is: The only good use of the 286 is as a faster 8086;
    small memory model multi-tasking use is possible, but the 64KB
    segments are so limiting that everybody who understood software either decided to skip this twist (MicroSoft, except on their OS/2 death
    march), or jumped ship ASAP (SCO).

    As I mentioned above, I do not believe your claim about Microsoft.
    There were DOS extenders which allowed use of 286 protected mode
    under DOS. They were used by several software vendors. Clearly,
    programming for flat 32-bit mode is easier, and on the software market
    that matters more than other factors.

    I think that 286 protected mode is good for its intended use, that
    is protected multitasking systems having more than 64 kB but less
    than, say, 4 MB. Of course, if you have a lot of hardware resources,
    then a 32-bit system using paging may be easier to create. Also,
    speed is tricky: on the 486 (and possibly the 386) the hardware task
    switch was easy to use, but slower than a tuned purely software
    implementation. In other parts reloading of segment registers
    could slow down things quite a lot, so 16-bit protected mode
    required a lot of tuning to minimize the number of times when
    segment registers were reloaded.

    I do not know if people used the 286 in this way, but a natural use
    of the 286 is as a debugger for 8086 programs. That is, use segment
    protection to catch stray accesses. Once the program works OK,
    deliver it as a real-mode program on the 8086, gaining speed and
    a bigger market.

    AFAIK Linux started out using 32-bit mode but heavily depending on
    386 segmentation. Rather quickly the dependence on segments was
    limited, and what remained was well isolated. But I think that
    Linux shows that _creating_ a simple multitasking system is
    easier using the hardware properties that come along with 286
    segmentation.

    Intel misjudged what is typical in programs. But they were not
    alone in this. I have a translation of Tanenbaum's book on computer
    architecture from 1976 (the original; the translation is from 1983).
    Tanenbaum is very positive about segmentation, descriptors and
    "high level machines". He gave simple examples where descriptors
    and a microprogrammed "high level machine" are supposed to give
    better performance than a more conventional machine.

    And as I already wrote, Intel misjudged the market for the 286. They
    could guess that a 286 system would be too expensive for the home
    market for a long time. They probably did not expect that
    the 286 would find its way into PCs.

    More generally, vendors could release separate versions of
    programs for 8086 and 286 but few did so.

    Were there any who released software both as 8086 and a protected-mode
    80286 variants? Microsoft/SCO with Xenix, anyone else?

    IIUC Microsoft with Windows up to 3.0, and probably everybody who wanted
    to say "supported on Windows". That is, Windows 3.0 on a 286 almost
    surely used 286 protected mode and probably ran "Windows" programs
    in protected mode. But Windows also supported the 8086, and Microsoft
    guidelines insisted that a proper "Windows program" should run on
    the 8086.

    On DOS I do not remember the names of specific programs. I remember
    Phar Lap, who provided a 286 DOS extender, and quite a few programs
    used it. Browsing through binaries on machines that I used I saw
    the name several times. Sometimes a program using a DOS extender
    would clearly say that it requires a 286, but I vaguely remember
    cases with separate 286 binaries and 8086 binaries where startup
    code loaded the right binary. Probably there were also cases where
    the needed switching was hidden inside a single binary.

    And users having
    only binaries wanted to use 8086 on their new systems which
    led to heroic efforts like OS/2 DOS box and later Linux
    dosemu. But integration of 8086 programs with protected
    mode was solved too late for 286 model to gain traction
    (and on 286 "DOS box" had to run in real mode, breaking
    normal system protection).

    Linux never ran on a 80286, and DOSemu uses the virtual 8086 mode,
    which does not require heroic efforts AFAIK.

    Well, besides virtual 8086 mode there is tricky code to get
    the right effect. A lot of late "DOS" programs depended on DOS
    extenders, and a significant fraction of such programs run fine
    under dosemu. I do not know if Windows ever got its DOS box
    to the level of dosemu, but when I used dosemu I heard that
    various things did not work in the Windows DOS box.

    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competitive in protected mode compared to real mode.

    AFAICS that covered the vast majority of programs during the eighties.

    The "vast majority" is not enough; if a key application like Lotus
    1-2-3 or Wordperfect did not work on the DOS alternative, the DOS
    alternative was not used. And Lotus 1-2-3 and Wordperfect certainly
    did not limit themselves to 64KB of data.

    I do not know if they offered protected-mode versions. But it
    is likely that they did once machines with more than 640 kB
    formed a reasonable fraction of the PC market.

    Turbo Pascal offered only medium memory model

    According to Terje Mathisen, it also offered the large memory model.
    On its Wikipedia page, I find: "Besides allowing applications larger
    than 64 KB, Byte in 1988 reported ... for version 4.0". So apparently
    Turbo Pascal 4.0 introduced support for the large memory model in
    1988.

    I am not entirely sure, but probably I used 4.0. I certainly used
    5.0 and later versions. AFAIR all versions that I used limited
    "static" data to 64 kB, which together with no such limit for code
    I take as the definition of the "medium" model. I do not remember the
    explicit model switches which were common on PC C compilers. PC compilers
    allowed far/near qualifiers on pointers and I do not remember
    significant restrictions on this (but other folks reported that
    some combinations did not work): for data the model set defaults,
    but the programmer could override them. So in Turbo Pascal one could
    use large pointers if desired (or maybe even by default), but
    static data was in a single 64 kB segment.

    Intel apparently assumed that programmers are willing to spend
    extra work to get good performance and IMO this was right
    as a general statement. Intel probably did not realize that
    programmers will be very reluctant to spend work on security
    features and in particular to spend work on making programs
    fast in 286 protected mode.

    80286 protected mode is never faster than real mode on the same CPU,
    so the way to make programs fast on the 286 is to stick with real
    mode; using the small memory model is an alternative, but as
    mentioned, the memory limits are too restrictive.

    Well, if a program needs more than 1 MB total, workarounds on the 286
    may be more expensive than the cost of protected mode. More to
    the point, if one needs security features, then doing them
    in real mode via software is likely to take more time than a 286
    version. Intel clearly did not anticipate that a large fraction
    of 286s would be used in PCs and that in PCs vendors/developers
    would prefer the speed gain (modest when the protected-mode version
    has enough tuning) to protection.

    Intel probably assumed that the 286 would cover most needs,

    As far as protected mode was concerned, they hardly could have been
    more wrong.

    especially
    given that most systems had much less memory than the 16 MB theoretically
    allowed by the 286.

    They provided 24 address pins, so they obviously assumed that there
    would be 80286 systems with >8MB. 64KB segments are already too
    limiting on systems with 1MB (which was supported by the 8086),
    probably even for anything beyond 128KB.

    IMO this is partially true: there
    is a class of programs which with some work fit into medium
    model, but using flat address space is easier. I think that
    on 286 (that is with 16 bit bus) those programs (assuming enough
    tuning) run faster than flat 32-bit version.

    Maybe in real mode. Certainly not in protected mode. Just run your
    tuned large-model protected-mode program against a 32-bit small-model
    program for the same task on a 386SX (which is reported as having a
    very similar speed to the 80286 on 16-bit programs).

    My instruction table shows _longer_ times for several instructions
    on the 386 compared to the 286. For example a real-mode far call on the
    286 has 13 clocks + penalty, on the 386 17 clocks + the same penalty;
    a protected-mode call on the 286 has 26 clocks + penalty, on the 386
    34 clocks + penalty. A near call on both is 7 clocks + penalty.

    Anyway, if a program consists of several procedures (or clusters
    of closely related procedures) each having a few kilobytes, then
    it can easily happen that there are thousands of instructions
    between far calls, so the cost of far calls is going to be
    negligible (19 clocks per thousands of instructions). If the
    program manages to do its work in a single 64 kB data segment (not
    unreasonable for 1 MB of code), then it will be faster than a
    program using 32-bit addresses. More relevantly, in a multitasking
    situation with each task having its own data segment there
    will be reloading of segment registers on task switch,
    which is likely to be negligible. Again, each task will
    gain due to smaller pointers. With an OS present there will
    be segment reloading due to system calls and this may
    be more significant. However, this is mostly due to protection
    and not segmentation.
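
    [Aside: a back-of-the-envelope check of that claim in C, using the cycle
    counts quoted earlier in the thread; the instruction count between far
    calls and the 2-clocks-per-instruction average are assumptions for
    illustration only.]

        #include <stdio.h>

        int main(void)
        {
            /* 286 numbers quoted above: near call 7 + m clocks, protected-mode
             * far call 26 + m clocks, so a far call costs about 19 extra clocks.
             * Assume a few thousand instructions between far calls at roughly
             * 2 clocks each (assumed average, not a measurement). */
            const double extra_clocks    = 26.0 - 7.0;
            const double insns_between   = 2000.0;
            const double clocks_per_insn = 2.0;

            double overhead = extra_clocks / (insns_between * clocks_per_insn);
            printf("far-call overhead: about %.2f%% of run time\n", overhead * 100.0);
            return 0;
        }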

    And even if you
    find one case where the protected-mode program wins, nobody found it
    worth their time to do this nonsense.

    That is largely true. I wonder what will happen with the x32 mode
    on x86_64. AFAIK x32 mode showed measurable gains,
    20-30% smaller programs and similar speed gains. In principle
    it should be cheap to support, as it is "just another 32-bit
    target". But some (for me important) programs do not work
    in this mode and there are voices calling to drop it completely.

    And so OS/2 flopped despite
    being backed by IBM and, until 1990, Microsoft.

    But I think that Intel segmentation had some
    attractive features during eighties.

    You are one of a tiny minority. Even Intel finally saw the light, as
    did everybody else, and nowadays segments are just a bad memory.

    Well, 16-bit segments clearly are too limited when one has several
    megabytes of memory. And a consistently 32-bit segmented system
    increases memory use, which is a nontrivial cost. OTOH there is the
    question of how much customers are going to pay for security
    features. I think recent times show that security has significant
    costs. But lack of security may lead to big losses. So
    there is no easy choice.

    Now people talk more about capabilities. AFAICS capabilities
    offer more than segments, but are going to have a higher cost.
    So abstractly, for some systems segments may still look
    attractive. OTOH we now understand that the software ecosystem
    is much more varied than the prevalent view in the seventies
    allowed, so systems that fit segments well are a tiny part.

    And speaking of bad memories, do you remember PAE? That had
    a similar spirit to 8086 segmentation. I guess that due
    to the bad feeling about segments among programmers (and possibly,
    more relevantly, compatibility troubles) Intel did not extend
    this to segments, but the spirit was still there.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Mathisen on Mon Jan 6 00:35:00 2025
    In article <vlervh$174vb$1@dont-email.me>, terje.mathisen@tmsw.no (Terje Mathisen) wrote:

    I do believe that IBM did seriously consider the risk of making the
    PC too good, so that it would compete directly with their low-end
    systems (8100?).

    I recall reading back in the 1980s that the PC was intended to be
    incapable of competing with the System/36 minis, and the previous
    System/34 and /32 machines. It rather failed at that.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Dallman on Mon Jan 6 03:02:22 2025
    On Thu, 1 Jan 1970 0:00:00 +0000, John Dallman wrote:

    In article <vlervh$174vb$1@dont-email.me>, terje.mathisen@tmsw.no (Terje Mathisen) wrote:

    I do believe that IBM did seriously consider the risk of making the
    PC too good, so that it would compete directly with their low-end
    systems (8100?).

    I recall reading back in the 1980s that the PC was intended to be
    incapable of competing with the System/36 minis, and the previous
    System/34 and /32 machines. It rather failed at that.

    Perhaps IBM should have made them more performant !?!

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Hebisch on Sun Jan 5 23:01:29 2025
    On Sun, 5 Jan 2025 21:49:20 -0000 (UTC), antispam@fricas.org (Waldek
    Hebisch) wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
    addition to the base (8086) variant of Windows 2.0, which was released
    in December 1987), which used 386 protected mode and virtual 8086 mode
    (which was missing in the "brain-damaged" (Bill Gates) 286). So
    Windows completely ignored 286 protected mode. Windows eventually
    became a big success.

    What I recall is a bit different. IIRC first successful version of
    Windows, that is Windows 3.0 had 3 modes of operation: 8086 compatible,
    286 protected mode and 386 protected mode. Only later Microsoft
    dropped requirement for 8086 compatiblity.

    They didn't drop the 8086 so much as require the 386. Windows and "DOS box"
    required the CPU to have "virtual 8086" mode.

    I think still later it dropped 286 support.

    I know 286 protected mode support continued at least through NT. Not
    sure about 2K.


    Windows 95 was supposed to be 32-bit, but contained quite a lot
    of 16-bit code.

    The GUI was 32-bit, the kernel and drivers were 16-bit. Weird, but it
    made some hardware interfacing easier.

    IIRC the system interface to Windows 3.0 and 3.1 was 16-bit, and only
    later did Microsoft release an extension allowing 32-bit system calls.

    Never programmed 3.0.

    3.1 and 3.11 (WfW) had a combination 16/32-bit kernel in which most
    device drivers were 16-bit, but the disk driver could be either 16 or
    32 bit. In WfW the network stack also was 32-bit and the NIC driver
    could be either.

    However the GUI in all 3.x versions was 16-bit 286 protected mode.

    You could run 32-bit "Win32s" programs (Win32s being a subset of
    Win32), but Win32s programs could not use graphics.


    I have no information about Windows internals except for some
    public statements by Microsoft and other people, but I think
    it reasonable to assume that Windows was actually a successful
    example of 8086/286/386 compatibility. That is, their 16-bit
    code could use real-mode segmentation or protected-mode
    segmentation, the latter on both the 286 and the 386. For the
    32-bit version they added a translation layer to transform
    arguments between the 16-bit world and the 32-bit world. It is
    possible that this translation layer involved a lot of effort.

    For a number of years I worked on Windows based image processing
    systems that used OTS ISA-bus acceleration hardware. The drivers were
    16-bit DLLs, and /non-reentrant/. There was one "general" purpose
    board and several special purpose boards that could be combined with
    the general board in "stacks" that communicated via a private high
    speed bus. There could be multiple stacks of boards in the same
    system.

    [Our most complicated system had 7 boards in 2 stacks, one with 5
    boards and the other with 2. Our biggest system had 18 boards: 6
    stacks of 3 boards each. Ever see a 20 slot ISA backplane?]

    The non-reentrant driver made it difficult to simultaneously control
    separate stacks to do different tasks. We created a (reentrant)
    16 bit dispatching "thunk" DLL to translate calls for every
    function of every board that we might possibly want to use ...
    hundreds in all ... and then dynamically loaded multiple instances of
    the driver as required. PITA !!! Worked fine but very hard to debug, particularly when doing several different operations simultaneously.

    On 3.x we simulated threading in the shared 16-bit application space
    using multiple processes, messaging with hidden windows, and "far
    call" IPC using the main program as a kind of "shared library". Having
    real threads on 95 and later allowed actually consolidating everything
    into the same program and (at least initially) made everything easier.
    But then NT forced dealing with protected mode interrupts, while at
    the same time still using 16-bit drivers for everything else - and
    that became yet another PITA.

    We continued to use the image hardware until SIMD became fast enough
    to compete (circa GHz Pentium4 being available on SBC). Excepting
    NT3.x we had systems based on every Windows from 3.1 to NT4.


    Anyway, it seems that Windows was at least as tied to the 286
    as OS/2 when it became successful, and dropped 286 support
    later. And for a long time after dropping 286 support
    Windows still made massive use of 16-bit segments.

    I don't know exactly when 286 protected mode was dropped. I do know
    that, at least through NT4, 16-bit DOS mode and GUI applications would
    run so long as they relied on system calls and didn't directly try to
    touch hardware.

    I occasionally needed to run 16-bit VC++ on my NT4 machine.


    IIUC Microsoft supported the 8086 in Windows up to 3.0, and probably
    so did everybody who wanted to say "supported on Windows". That is,
    Windows 3.0 on a 286 almost surely used 286 protected mode and
    probably ran "Windows" programs in protected mode. But Windows also
    supported the 8086, and Microsoft guidelines insisted that a proper
    "Windows program" should run on the 8086.

    Yes. I used - but never programmed - 3.0 on a V20 (8086 clone). It
    was painfully slow even with 1MB of RAM.


    ... Even Intel finally saw the light, as
    did everybody else, and nowadays segments are just a bad memory.

    Well, 16-bit segments clearly are too limited when one has several
    megabytes of memory. And a consistently 32-bit segmented system
    increases memory use, which is a nontrivial cost. OTOH there is
    the question of how much customers are going to pay for security
    features. I think recent times show that security has significant
    costs. But lack of security may lead to big losses. So
    there is no easy choice.

    Now people talk more about capabilities. AFAICS capabilities
    offer more than segments, but are going to have a higher cost.
    So, abstractly, for some systems segments may still look
    attractive. OTOH we now understand that the software ecosystem
    is much more varied than the prevalent view in the seventies
    assumed, so systems that fit segments well are a tiny part.

    And speaking of bad memories, do you remember PAE? That had
    a similar spirit to 8086 segmentation. I guess that due
    to programmers' bad feelings about segments (and possibly
    more relevant compatibility troubles) Intel did not extend
    this to segments, but the spirit was still there.

    The bad taste of segments is from exposure to Intel's half-assed
    implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    Intel had a chance to do it right with the 386, but instead they
    doubled down and expanded the existing poor implementation to support
    larger segments.

    I realize that transistor counts at the time might have made an
    on-chip SMU impossible, but ISTM the SMU would have been a very small
    component that (if necessary) could have been implemented on-die as a coprocessor.
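    For concreteness, here is a minimal software model of that idea: flat
    32-bit addresses, with every access checked against a sorted "segment
    database" of base/limit/permission entries. All names are made up for
    the illustration; this is a sketch of the lookup, not a real design.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    struct seg_entry {                     /* one entry in the segment "database" */
        uint32_t base;                     /* first byte covered by the segment   */
        uint32_t limit;                    /* last byte covered by the segment    */
        unsigned writable : 1;
    };

    /* The hypothetical SMU would do this lookup in hardware on every access. */
    static const struct seg_entry *
    smu_lookup(const struct seg_entry *tab, size_t n, uint32_t addr)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {                  /* binary search, entries sorted by base */
            size_t mid = lo + (hi - lo) / 2;
            if (addr < tab[mid].base)
                hi = mid;
            else if (addr > tab[mid].limit)
                lo = mid + 1;
            else
                return &tab[mid];          /* address falls inside this segment */
        }
        return NULL;                       /* no segment covers the address: fault */
    }

    int main(void)
    {
        static const struct seg_entry tab[] = {
            { 0x00001000u, 0x00001fffu, 0 },   /* "code" */
            { 0x00100000u, 0x0017ffffu, 1 },   /* "heap" */
        };
        uint32_t probe[] = { 0x00001234u, 0x00180000u };
        for (size_t i = 0; i < 2; i++) {
            const struct seg_entry *s = smu_lookup(tab, 2, probe[i]);
            printf("0x%08lx -> %s\n", (unsigned long)probe[i],
                   s ? "ok" : "segment fault");
        }
        return 0;
    }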

    <grin>Maybe my de-deuces are wild ...</grin>
    but there they are nonetheless.

    YMMV.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to George Neuner on Mon Jan 6 08:24:43 2025
    George Neuner <gneuner2@comcast.net> writes:
    The bad taste of segments is from exposure to Intel's half-assed implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    What benefits do you expect from segments? One benefit usually
    mentioned is security, in particular, protection against out-of-bounds
    accesses (e.g., buffer overflow exploits).

    If the program uses 32-bit (or nowadays 64-bit) addresses, and the
    segment number is just part of that, you don't get that protection: An out-of-bounds access could not be distinguished from a valid access to
    a different segment. There might be some addresses that are in no
    segment, and that would lead to a segment violation, but the same is
    true for paging-based "security" now; the important part is that there
    would be (and is) no guarantee that an out-of-bounds access is caught.

    The 286 segments catch out-of-segment accesses. The size granularity
    of the 386's 32-bit segments is coarse, but at least out-of-bounds
    accesses do not intrude into other segments.

    On the 286 and 386 segment numbers are stored in memory just like any
    other data, so an attacker may be able to change the segment number in
    addition to (or instead of) the offset, and thus gain access to
    sensitive data, so the security provided by 286/386 segments is
    limited. I have not looked closely into CHERI, but I dimly remember
    some claims that they protect against manipulation of the extra data
    (what would be the segment number in the 286) in the 128-bit address.

    Intel had a chance to do it right with the 386, but instead they
    doubled down and expanded the existing poor implementation to support
    larger segments.

    It looks to me that they took the right choices: Support 286 protected
    mode, add virtual 8086 mode, support a flat memory model like
    everybody else has done in modern computers (S/360, PDP-11); to
    combine these requirements, they added support for segments up to 4GB
    in size, so people wanting to use flat 32-bit addressing could just
    use the tiny memory model (CS=DS=SS) and forget about segments.

    I realize that transistor counts at the time might have made an
    on-chip SMU impossible, but ISTM the SMU would have been a very small component that (if necessary) could have been implemented on-die as a coprocessor.

    How would the addresses be divided into segment and offset in your
    model? What kind of addresses would you have used on the 286? What
    would the SMU have to do? Would a PC have used such an SMU if it was
    a separate chip?

    If they had made the 286 a kind of real-mode-only 386SX-like CPU, I
    think that PCs would have been designed without SMU. And one problem
    would have been that you probably would want 32 address bits to flow
    from the CPU to the SMU, but the 286 and 386SX only have 24 address
    pins, and additional pins are expensive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Mon Jan 6 14:41:22 2025
    On Mon, 06 Jan 2025 08:24:43 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    How would the addresses be divided into segment and offset in your
    model? What would the SMU have to do?


    - anton

    Those are the sort of questions that I asked Nick Maclaren several times in the past, when he was still active on c.a. I never got an answer that I
    was able to understand.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Mon Jan 6 15:19:32 2025
    On Mon, 6 Jan 2025 03:02:22 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Thu, 1 Jan 1970 0:00:00 +0000, John Dallman wrote:

    In article <vlervh$174vb$1@dont-email.me>, terje.mathisen@tmsw.no
    (Terje Mathisen) wrote:

    I do believe that IBM did seriously consider the risk of making the
    PC too good, so that it would compete directly with their low-end
    systems (8100?).

    I recall reading back in the 1980s that the PC was intended to be
    incapable of competing with the System/36 minis, and the previous
    System/34 and /32 machines. It rather failed at that.

    Perhaps IBM should have made them more performant !?!


    Impossible. A more performant S/36 would undermine the S/38.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Jan 6 16:05:02 2025
    Anton Ertl wrote:
    George Neuner <gneuner2@comcast.net> writes:
    The bad taste of segments is from exposure to Intel's half-assed
    implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    What benefits do you expect from segments? One benefit usually
    mentioned is security, in particular, protection against out-of-bounds accesses (e.g., buffer overflow exploits).

    The best idea I have seen to help detect out-of-bounds accesses is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last)
    allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)
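
    For concreteness, a minimal sketch of that skewed-pointer guard-page
    allocator (in the spirit of tools like Electric Fence), assuming POSIX
    mmap/mprotect; the function name is made up, and error handling and
    freeing are omitted:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *guard_alloc(size_t size)
    {
        size_t page   = (size_t)sysconf(_SC_PAGESIZE);
        size_t usable = (size + page - 1) & ~(page - 1);  /* round up to pages   */
        size_t total  = usable + page;                    /* plus one guard page */

        unsigned char *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        /* Make the page just past the usable region inaccessible. */
        if (mprotect(p + usable, page, PROT_NONE) != 0)
            return NULL;
        /* Skew the pointer so that byte size-1 is the last byte before the
           guard page.  Note the skewed pointer is not necessarily aligned
           for all types. */
        return p + (usable - size);
    }

    int main(void)
    {
        char *buf = guard_alloc(100);
        memset(buf, 'x', 100);     /* within bounds: fine                 */
        buf[100] = 'y';            /* one past the end: traps immediately */
        puts("not reached");
        return 0;
    }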

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Mon Jan 6 16:36:41 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last) allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    It also does not help for out-of-bounds accesses that are not just
    adjacent to an earlier in-bounds access. That may also be a less
    common vulnerability than adjacent positive-stride buffer overflows.
    But if we throw hardware on the problem, do we want to spend hardware
    on something that does not catch all out-of-bounds accesses?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Jan 6 18:58:16 2025
    According to George Neuner <gneuner2@comcast.net>:
    The bad taste of segments is from exposure to Intel's half-assed implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    The whole point of a segmented architecture is that the segments are visible and
    meaningful. You put a thing (for some definition of thing) in a segment to
    control access to the thing. So if it's an array, all of the address
    calculations are relative to the segment and out-of-bounds references fail
    because they point to a non-existent part of the segment. Similarly, if it's
    code, a jump outside the segment's boundaries fails.

    Multics and the Burroughs machines had (still have, I suppose, for emulated
    Burroughs) visible segments and programmers liked them just fine. The problems
    were that the segment sizes were too small as memories got bigger, and that
    they weren't byte-addressed, which these days is practically mandatory. The 286
    added the additional flaws that there weren't enough segment registers and that
    segment loads were very slow.

    What you're describing is multi-level page tables. Every virtual memory system has them. Sometimes the operating systems make the higher level tables visible to applications, sometimes they don't. For example, in IBM mainframes the second
    level page table entries, which they call segments, can be shared between applications.




    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jan 6 19:49:34 2025
    On Mon, 6 Jan 2025 16:36:41 +0000, Anton Ertl wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    The best idea I have seen to help detect out of bounds accesses, is to round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that the end of the requested region coincides with the end of the (last) allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    It also does not help for out-of-bounds accesses that are not just
    adjacent to an earlier in-bounds access. That may also be a less
    common vulnerability than adjacent positive-stride buffer overflows.
    But if we throw hardware on the problem, do we want to spend hardware
    on something that does not catch all out-of-bounds accesses?

    An IBM guy once told me::

    "If you are going to put it in HW, put it in in such a way that you
    never have to change the definition of what you put in.

    So, to answer the above question:: you want to check absolutely
    all boundaries on all multi-container data objects, including
    array bounds within a structure::

    struct { integer  a,b,c,d;
             double   l[max],m[max],n[max][max]; } k;

    Any access to m[] is checked to be within the substructure
    of m[*], so you cannot touch l[] or n[][], or a,b,c, or d.

    Try doing that with segmentation bounds checking...or
    capabilities...
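
    A purely software illustration of what that per-member check would
    mean (made-up helper name, plain C with an assert standing in for the
    hardware check):

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX 4

    struct k_t {
        int a, b, c, d;
        double l[MAX], m[MAX], n[MAX][MAX];
    };

    /* Checked accessor for k.m: the index is checked against the bound of
       the member itself, so an access cannot spill into l[] or n[][] or
       the scalars, even though they live in the same structure. */
    static double *k_m(struct k_t *k, size_t i)
    {
        assert(i < MAX);
        return &k->m[i];
    }

    int main(void)
    {
        struct k_t k = {0};
        *k_m(&k, 2) = 1.5;         /* fine             */
        printf("%g\n", k.m[2]);
        *k_m(&k, MAX) = 2.0;       /* trips the assert */
        return 0;
    }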

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Mon Jan 6 19:45:43 2025
    John Levine <johnl@taugh.com> writes:
    According to George Neuner <gneuner2@comcast.net>:
    The bad taste of segments is from exposure to Intel's half-assed implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    The whole point of a segmented architecture is that the segments are visible and
    meaningful. You put a thing (for some definition of thing) in a segment to control access to the thing. So if it's an array, all of the address calculations are relative to the segment and out of bounds references fail because they point to a non-existent part of the segment. Similarly, if it's code, a jump outside the segment's boundaries fails.

    Multics and the Burroughs machines had (still have, I suppose for emulated

    The original HP-3000 also had segments.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Mon Jan 6 19:41:49 2025
    On Mon, 6 Jan 2025 15:05:02 +0000, Terje Mathisen wrote:

    Anton Ertl wrote:
    George Neuner <gneuner2@comcast.net> writes:
    The bad taste of segments is from exposure to Intel's half-assed
    implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    What benefits do you expect from segments? One benefit usually
    mentioned is security, in particular, protection against out-of-bounds
    accesses (e.g., buffer overflow exploits).

    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last) allocated page.
    This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    You allocate no more actual memory, but you do consume an additional
    virtual address PTE on those pages marked no-access. If, later, you
    expand that allocated area, you can THEN allocate a page and update
    the PTE.

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    Use an unallocated page prior to the buffer, too.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Mon Jan 6 19:48:46 2025
    John Levine <johnl@taugh.com> writes:
    According to George Neuner <gneuner2@comcast.net>:
    The bad taste of segments is from exposure to Intel's half-assed implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    The whole point of a segmented architecture is that the segments are visible and
    meaningful. You put a thing (for some definition of thing) in a segment to control access to the thing. So if it's an array, all of the address calculations are relative to the segment and out of bounds references fail because they point to a non-existent part of the segment. Similarly, if it's code, a jump outside the segment's boundaries fails.

    Multics and the Burroughs machines had (still have, I suppose for emulated
    Burroughs) visible segments and programmers liked them just fine. The problems
    were that the segment sizes were too small as memories got bigger, and that they
    weren't byte addressed which these days is practically mandatory. The 286 added
    additional flaws that there weren't enough segment registers and segment loads
    were very slow.

    What you're describing is multi-level page tables. Every virtual memory system
    has them. Sometimes the operating systems make the higher level tables visible
    to applications, sometimes they don't. For example, in IBM mainframes the second
    level page table entries, which they call segments, can be shared between
    applications.

    There have been a number of attempts to use capabilities to describe
    individual data items (the aforementioned Burrougsh systems are the
    canonical examples).

    There are investigations into adapting such schemes to modern
    microprocessors, one of which is CHERI which uses 128-bit
    pointers to encode various attributes, including the size
    of the object.

    https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Mon Jan 6 22:02:30 2025
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last) allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    It is also problematic to allocate 8K (or more) for a small entity, or
    on the stack.

    Bounds checking should ideally impart minimum overhead so that it
    can be enabled in production code.

    Hmm... a beginning of an idea (for which I am ready to be shot
    down, this is comp.arch :-)

    This would work best for languages which explicitly pass
    array bounds or sizes (like Fortran's assumed size arrays,
    or, if I read this correctly, Rust's slices).

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    Memory access is to base + index, with one additional point:
    If index > ubound, then the instruction raises an exception.

    This works less well with C's pointers, for which you would have
    to pass some sort of fat pointer. Compilers would have to make
    sure that the address of the base object is passed.
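
    A plain-C rendering of what such a checked load would do (names made
    up; the "exception" is just a trap function here):

    #include <stdio.h>
    #include <stdlib.h>

    static void bounds_trap(size_t index, size_t ubound)
    {
        fprintf(stderr, "bounds violation: index %zu > ubound %zu\n",
                index, ubound);
        abort();
    }

    /* Access goes to base + index; index > ubound raises the exception. */
    static double checked_load(const double *base, size_t index, size_t ubound)
    {
        if (index > ubound)
            bounds_trap(index, ubound);
        return base[index];
    }

    int main(void)
    {
        double a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        printf("%g\n", checked_load(a, 3, 7));   /* ok    */
        printf("%g\n", checked_load(a, 8, 7));   /* traps */
        return 0;
    }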

    Comments?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Jan 6 22:57:11 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last)
    allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    It is also problematic to allocate 8K (or more) for a small entity, or
    on the stack.

    Bounds checking should ideally impart minimum overhead so that it
    can be enabled in production code.

    Hmm... a beginning of an idea (for which I am ready to be shot
    down, this is comp.arch :-)

    This would work best for languages which explicitly pass
    array bounds or sizes (like Fortran's assumed size arrays,
    or, if I read this correctly, Rust's slices).

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Mon Jan 6 23:41:41 2025
    On Mon, 6 Jan 2025 22:02:30 +0000, Thomas Koenig wrote:

    Hmm... a beginning of an idea (for which I am ready to be shot
    down, this is comp.arch :-)

    This would work best for languages which explicitly pass
    array bounds or sizes (like Fortran's assumed size arrays,
    or, if I read this correctly, Rust's slices).

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    Memory access is to base + index, with one additional point:
    If index > ubound, then the instruction raises an exception.

    Now, you are only checking the ubound and not the lbound; so,
    you only stumble over ½ the bound errors.

    Where you should START is with a data structure that defines
    the memory region::

    First Byte accessible     Possibly lbound
    Last  Byte accessible     Possibly ubound
    other stuff as needed

    Then figure out how to efficiently perform the checks in ISA
    of choice (or add to ISA).

    This works less well with C's pointers, for which you would have
    to pass some sort of fat pointer. Compilers would have to make
    sure that the address of the base object is passed.

    I blame the programmers for not using FAT pointers (and then
    teaching the compilers how to get rid of most of the checks).
    Nothing is preventing C programmers from using FAT pointers,
    and thereby avoiding all those buffer overruns.
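
    One possible shape of such a fat pointer in plain C (struct and helper
    names invented for the sketch; a compiler that knew about the layout
    could hoist or elide most of the checks):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        double *ptr;       /* current position          */
        double *base;      /* first accessible element  */
        double *limit;     /* one past the last element */
    } fat_double_ptr;

    static double fat_load(fat_double_ptr p)
    {
        if (p.ptr < p.base || p.ptr >= p.limit) {
            fprintf(stderr, "fat pointer out of bounds\n");
            abort();
        }
        return *p.ptr;
    }

    int main(void)
    {
        double a[4] = { 10, 20, 30, 40 };
        fat_double_ptr p = { a, a, a + 4 };

        p.ptr = a + 2;
        printf("%g\n", fat_load(p));   /* ok    */
        p.ptr = a + 4;
        printf("%g\n", fat_load(p));   /* traps */
        return 0;
    }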

    Comments?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to John Levine on Mon Jan 6 17:28:11 2025
    John Levine <johnl@taugh.com> writes:
    What you're describing is multi-level page tables. Every virtual
    memory system has them. Sometimes the operating systems make the
    higher level tables visible to applications, sometimes they don't. For example, in IBM mainframes the second level page table entries, which
    they call segments, can be shared between applications.

    initial adding virtual memory to all IBM 370s was similar to 24bit
    360/67 but had options for 16 1mbyte segments or 256 64kbyte segments
    and either 4kbyte or 2kbyte pages. Initial mapping of 360 MVT to VS2/SVS
    was single 16mbyte address space ... very similar to running MVT in a
    CP/67 16mbyte virtual machine.

    The upgrade to VS2/MVS gave each region its own 16mbyte virtual address
    space. However, OS/360 MVT API heritage was pointer passing API ... so
    they mapped a common 8mbyte image of the "MVS" kernel into every 16mbyte virtual address space (leaving 8mbytes for application code), kernel API
    call code could still directly access user code API parameters
    (basically same code from MVT days).

    However, MVT subsystems were also moved into their separate 16mbyte
    virtual address space ... making it harder to access application API
    calling parameters. So they defined a common segment area (CSA), 1mbyte
    segment mapped into every 16mbyte virtual address space, application
    code would get space in the CSA for API parameter information when calling subsystems.

    The problem was that the requirement for subsystem API parameter (CSA) space was proportional to the number of concurrent applications plus the number of
    subsystems and quickly exceeded 1mbyte ... and it morphed into the
    multi-megabyte common system area. By the end of the 70s, CSAs were
    running 5-6mbytes (leaving 2-3mbytes for programs) and threatening to
    become 8mbytes (leaving zero mbytes for programs)... part of the mad
    rush to XA/370 and 31-bit virtual addressing (as well as access
    registers, and multiple concurrent virtual address spaces ... "Program
    Call" instruction had a table of MVS/XA address space pointers for
    subsystems; the PC instruction would move the caller's address space
    pointer to secondary and load the subsystem address space pointer into
    primary ... the program return instruction reversed the process and moved
    the secondary pointer back to primary).

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 7 11:05:20 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERI targets C, which, on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating-point-style size encoding is weird (and also does not
    catch all errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
       do i=1,11
          a(i,j) = 42.
       end do
    end do

    interact with CHERI if a were a 10*10 array? Would it be
    necessary to create a capability for a(:,j)?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Tue Jan 7 11:45:25 2025
    MitchAlsup1 wrote:
    On Mon, 6 Jan 2025 15:05:02 +0000, Terje Mathisen wrote:

    Anton Ertl wrote:
    George Neuner <gneuner2@comcast.net> writes:
    The bad taste of segments is from exposure to Intel's half-assed
    implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    What benefits do you expect from segments?  One benefit usually
    mentioned is security, in particular, protection against out-of-bounds
    accesses (e.g., buffer overflow exploits).

    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last)
    allocated page.
                   This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    You allocate no more actual memory, but you do consume an additional
    virtual address PTE on those pages marked no-access. If, later, you
    expand that allocated area, you can THEN allocate a page and update
    the PTE.

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    Use an unallocated page prior to the buffer, too.

    Yeah, of course, but you do lose the exact trap ability.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Jan 7 10:53:17 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Mon, 6 Jan 2025 22:02:30 +0000, Thomas Koenig wrote:

    Hmm... a beginning of an idea (for which I am ready to be shot
    down, this is comp.arch :-)

    This would work best for languages which explicitly pass
    array bounds or sizes (like Fortran's assumed size arrays,
    or, if I read this correctly, Rust's slices).

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    Memory access is to base + index, with one additional point:
    If index > ubound, then the instruction raises an exception.

    Now, you are only checking the ubound and not the lbound; so,
    you only stumble over ½ the bound errors.

    If the base register does point to the start of the entity,
    that would also be covered, at least for one-dimensional
    arrays.


    Where you should START is with a data structure that defines
    the memory region::

    First Byte accessible Possibly lbound
    Last Byte accessible Possibly ubound
    other stuff as needed

    Then figure out how to efficiently perform the checks in ISA
    of choice (or add to ISA).

    One such example is defined in the Fortran standard, in the C
    descriptors from ISO_Fortran_binding.h. There are two data
    structures: the CFI_dim_t structure, which describes (in integer
    variables) the lower bound, the extent (a.k.a. the number of elements)
    and the stride. The CFI_cdesc_t structure then describes the
    base address (a void *), the length of an individual element,
    the version, the rank, the type, several attributes (is it
    allocatable or a pointer) and the number of dimensions.

    You can see an example at

    https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgfortran/ISO_Fortran_binding.h;hb=refs/heads/master

    (Unfortunately, for historical reasons, gfortran uses another
    format internally for array descriptors).
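
    For concreteness, a small C sketch that uses such a descriptor to
    bounds-check an element access. It assumes the standard
    ISO_Fortran_binding.h interface (CFI_CDESC_T, CFI_establish, base_addr,
    dim[].lower_bound / extent / sm) and a compiler installation that ships
    the header; checked_get2 itself is just an illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include "ISO_Fortran_binding.h"

    /* Fetch a(i,j) from a rank-2 float descriptor, checking both
       subscripts against the bounds recorded in the descriptor. */
    static float checked_get2(const CFI_cdesc_t *a, CFI_index_t i, CFI_index_t j)
    {
        CFI_index_t sub[2] = { i, j };
        for (int d = 0; d < 2; d++) {
            CFI_index_t lb = a->dim[d].lower_bound;
            CFI_index_t ub = lb + a->dim[d].extent - 1;
            if (sub[d] < lb || sub[d] > ub) {
                fprintf(stderr, "subscript %d out of range\n", d + 1);
                abort();
            }
        }
        const char *p = (const char *)a->base_addr
                        + (i - a->dim[0].lower_bound) * a->dim[0].sm
                        + (j - a->dim[1].lower_bound) * a->dim[1].sm;
        return *(const float *)p;
    }

    int main(void)
    {
        float data[6] = { 42.0f };         /* storage for a 2x3 array */
        CFI_CDESC_T(2) desc;
        CFI_index_t extents[2] = { 2, 3 };

        /* CFI_establish sets the lower bounds to 0 for CFI_attribute_other. */
        CFI_establish((CFI_cdesc_t *)&desc, data, CFI_attribute_other,
                      CFI_type_float, sizeof(float), 2, extents);
        printf("%g\n", checked_get2((CFI_cdesc_t *)&desc, 0, 0));  /* 42    */
        printf("%g\n", checked_get2((CFI_cdesc_t *)&desc, 2, 0));  /* traps */
        return 0;
    }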

    Hmm... let's look at a simplified example.

    void bounds_error (const char *fmt, ...) __attribute__ ((format (printf, 1,2)))
                                             __attribute__ ((noreturn));

    void set_element (double *a, unsigned long lower, unsigned long upper,
                      unsigned long n)
    {
      if (n < lower || n > upper)
        bounds_error ("Error: %lu not between %lu and %lu", n, lower, upper);

      a[n - lower] = 1.0;
    }

    it is hard to avoid two comparisons and branches without having
    some sort of range comparison, something like

        cmpr  Rdst,Rsrc,Rlow,Rhigh

    which would then set condition bits according to the different
    ranges that Rsrc can be in.
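
    One standard software idiom gets close to this with a single
    compare-and-branch: subtract the lower bound and compare unsigned, so
    that a value below the lower bound wraps around to a huge number. A
    minimal variant of the set_element example above (assuming
    upper >= lower):

    #include <stdio.h>
    #include <stdlib.h>

    static void set_element (double *a, unsigned long lower, unsigned long upper,
                             unsigned long n)
    {
      /* If n < lower, n - lower wraps to a huge value, so one unsigned
         comparison covers both out-of-range cases. */
      if (n - lower > upper - lower)
        {
          fprintf (stderr, "Error: %lu not between %lu and %lu\n",
                   n, lower, upper);
          exit (EXIT_FAILURE);
        }
      a[n - lower] = 1.0;
    }

    int main (void)
    {
      double a[10];
      set_element (a, 5, 14, 7);    /* in range: sets a[2]   */
      set_element (a, 5, 14, 3);    /* below lower: rejected */
      return 0;
    }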

    This works less well with C's pointers, for which you would have
    to pass some sort of fat pointer. Compilers would have to make
    sure that the address of the base object is passed.

    I blame the programmers for not using FAT pointers (and then
    teaching the compilers how to get rid of most of the checks.)
    Nothing is preventing C programmers from using FAT pointers,
    and thereby avoid all those buffer overruns.

    They still have to do it by hand, it is much easier to do if
    the language they use would offer it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Tue Jan 7 17:04:29 2025
    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then an out-of-bounds access like a(9,11) would not be caught.
    I don't know whether it has to be caught by Fortran rules. It is certainly
    caught in both Matlab and Octave. And Matlab has Fortran roots.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Jan 7 14:43:02 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Tue Jan 7 15:28:07 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly
    caught both in Matlab and in Octave. And Matlab has Fortran roots.

    A compiler is free to create row or column capabilities for C or
    FORTRAN if the goal is more than just memory safety.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 7 16:41:36 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly caught both in Matlab and in Octave. And Matlab has Fortran roots.

    And vice versa, Fortran 90 borrowed heavily from Matlab :-)

    Out-of-bounds accesses are an error in Fortran, but the language
    does not require that they be detected.

    A compiler is free to create row or column capabilities for C or
    FORTRAN if the goal is more than just memory safety.

    Would this be reasonably efficient?

    Looking at an extreme case, the straightforward matrix
    multiplication below (including some initialization so it's
    self-contained)

    program main
      implicit none
      real a(0:2,0:2), b(0:2,0:2), c(0:2,0:2)
      integer i,j,k
      do i=0,2
         do j=0,2
            a(i,j) = 2*i + j
            b(i,j) = i - j
            c(i,j) = 0
         end do
      end do
      do j = 0, 2
         do k = 0,2
            do i = 0, 2
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
         end do
      end do
      print *,c
    end

    gives us, in f2c translation to C of the three nested matmul loops,

    for (j = 0; j <= 2; ++j) {
        for (k = 0; k <= 2; ++k) {
            for (i__ = 0; i__ <= 2; ++i__) {
                c__[i__ + j * 3] += a[i__ + k * 3] * b[k + j * 3];
            }
        }
    }

    (yes, any bounds checking should have been moved outside the loops :-)
    how could capabilities be used to detect bounds violations for all
    indices on each access?

    And what would the effort be? Can they be created in the time
    for a simple pointer addition, or is something from the OS required?
    Would this require something like a "pointer to pointer"?

    (I have to admit that I haven't read a lot about CHERI, but what I
    have read makes me suspect that the designers didn't really have
    multi-dimensional arrays in mind; but then neither did the
    designers of C, and CHERI is certainly C-centered.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Jan 7 20:16:57 2025
    On Tue, 7 Jan 2025 15:04:29 +0000, Michael S wrote:

    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly
    caught both in Matlab and in Octave. And Matlab has Fortran roots.

    WATFIV would catch a(9,11)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jan 7 21:26:11 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 7 Jan 2025 15:04:29 +0000, Michael S wrote:

    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly
    caught both in Matlab and in Octave. And Matlab has Fortran roots.

    WATFIV would catch a(9,11)

    So would any compiler that generates bounds checking code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Jan 7 22:01:55 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 7 Jan 2025 15:04:29 +0000, Michael S wrote:

    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly
    caught both in Matlab and in Octave. And Matlab has Fortran roots.

    WATFIV would catch a(9,11)

    Every Fortran compiler I know has bounds checking, but it is
    optional for all of them. People still leave it off for production
    code because it is too slow.

    It would really be great if the performance degradation was
    small enough so people would simply leave on the checks
    for production code. (Side question: What does CHERI
    do with SIMD-vectorized code?)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Jan 7 23:16:07 2025
    On Tue, 7 Jan 2025 22:01:55 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    -----------------
    WATFIV would catch a(9,11)

    Every Fortran compiler I know has a bounds checking, but it is
    optional for all of them. People still leave it off for production
    code because it is too slow.

    It would really be great if the performance degradation was
    small enough so people would simply leave on the checks
    for production code. (Side question: What does CHERI
    do with SIMD-vectorized code?)

    How long does it take SW to construct the kinds of slices
    a FORTRAN subroutine can hand off to another subroutine ??
    That is, a CHERI capability that allows access to every
    even byte in a structure but no odd byte in the same structure ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Jan 8 11:53:51 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 7 Jan 2025 22:01:55 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    -----------------
    WATFIV would catch a(9,11)

    Every Fortran compiler I know has a bounds checking, but it is
    optional for all of them. People still leave it off for production
    code because it is too slow.

    It would really be great if the performance degradation was
    small enough so people would simply leave on the checks
    for production code. (Side question: What does CHERI
    do with SIMD-vectorized code?)

    How long does it take SW to construct the kinds of slices
    a FORTRAN subroutine can hand off to another subroutine ??

    For the snippet

    subroutine bar
      interface
         subroutine foo(a)
           real, intent(in), dimension(:,:) :: a
         end subroutine foo
      end interface
      real, dimension(10,10) :: a
      call foo(a)
    end

    (which calls foo with an assumed-shape array) gfortran hands this
    to the middle end:

    __attribute__((fn spec (". ")))
    void bar ()
    {
      real(kind=4) a[100];

      {
        struct array02_real(kind=4) parm.0;

        parm.0.span = 4;
        parm.0.dtype = {.elem_len=4, .version=0, .rank=2, .type=3};
        parm.0.dim[0].lbound = 1;
        parm.0.dim[0].ubound = 10;
        parm.0.dim[0].stride = 1;
        parm.0.dim[1].lbound = 1;
        parm.0.dim[1].ubound = 10;
        parm.0.dim[1].stride = 10;
        parm.0.data = (void *) &a[0];
        parm.0.offset = -11;
        foo (&parm.0);
      }
    }


    The middle and back ends are then free to optimize.

    That is a CHERRI capability that allows for access to every
    even byte in a structure but no odd byte in the same structure ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Valencia@21:1/5 to Terje Mathisen on Sat Jan 11 13:59:21 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last) allocated page.

    I think I've mentioned this once before, but I did precisely this during my time at Sequent, and the C library blew up. Turned out the C string support routines were pulling in cache line lengths at a time, and it was such a win they didn't want to observe "strict" C string access rules. I assume they padded things such that no "real life" string could end up against a page boundary abutted to an invalid page address, but since they weren't
    interested in fixing it, I stopped worrying about it.

    A kinder, gentler time. I wonder if such things still lurk out there.

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    Fediverse: @vandys@goto.vsta.org

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Jan 11 22:31:15 2025
    On Wed, 8 Jan 2025 11:53:51 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 7 Jan 2025 22:01:55 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    -----------------
    WATFIV would catch a(9,11)

    Every Fortran compiler I know has a bounds checking, but it is
    optional for all of them. People still leave it off for production
    code because it is too slow.

    It would really be great if the performance degradation was
    small enough so people would simply leave on the checks
    for production code. (Side question: What does CHERI
    do with SIMD-vectorized code?)

    How long does it take SW to construct the kinds of slices
    a FORTRAN subroutine can hand off to another subroutine ??

    For the snippet

    subroutine bar
    interface
    subroutine foo(a)
    real, intent(in), dimension(:,:) :: a
    end subroutine foo
    end interface
    real, dimension(10,10) :: a
    call foo(a)
    end

    (which calls foo with an assumed-shape array) gfortran hands this
    to the middle end:

    __attribute__((fn spec (". ")))
    void bar ()
    {
    real(kind=4) a[100];

    {
    struct array02_real(kind=4) parm.0;

    parm.0.span = 4;
    parm.0.dtype = {.elem_len=4, .version=0, .rank=2, .type=3};
    parm.0.dim[0].lbound = 1;
    parm.0.dim[0].ubound = 10;
    parm.0.dim[0].stride = 1;
    parm.0.dim[1].lbound = 1;
    parm.0.dim[1].ubound = 10;
    parm.0.dim[1].stride = 10;
    parm.0.data = (void *) &a[0];
    parm.0.offset = -11;
    foo (&parm.0);
    }
    }


    The middle and back ends are then free to optimize.

    Thank you for this example.

    That is a CHERRI capability that allows for access to every
    even byte in a structure but no odd byte in the same structure ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Keith Thompson on Wed Jan 15 07:09:38 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Wed Jan 15 14:00:26 2025
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    The C language has always had multi-dimensional arrays, with the
    limitation that the dimensions have to be known at compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 made that lifting optional.
    Now C23 makes part of the lifting (variably-modified types)
    mandatory again.
    Relative to F90, support for multi-dimensional arrays in C23 is
    primitive. There are no array descriptors generated automatically by
    the compiler. But saying that there is no support is incorrect.
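
    For readers who only know the fixed-size form, a small sketch of a
    C99/C23 variably-modified parameter type; the function names are
    illustrative, and as noted there is no descriptor and no bounds check:

    #include <stdio.h>

    /* the dimensions travel as ordinary parameters; a[i][j] indexing works,
       but nothing checks i or j at run time */
    static double sum(int rows, int cols, double a[rows][cols])
    {
        double s = 0.0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                s += a[i][j];
        return s;
    }

    int main(void)
    {
        double m[3][4] = { {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12} };
        printf("%g\n", sum(3, 4, m));   /* prints 78 */
        return 0;
    }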

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Wed Jan 15 18:00:34 2025
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 has now reached the level of Fortran II, released in
    1958. Only a bit more than six decades - can't complain
    about that.

    There are no array descriptors generated automatically by
    compiler. But saying that there is no support is incorrect.

    What happens for mismatched array bounds between caller
    and callee? Nothing, I guess?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Wed Jan 15 22:28:24 2025
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike
    C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not, the
    array stuff in C has not changed (AFAIK) since C99. So, more like four
    decades. Or 33 years since Fortran got its first standard.


    There are no array descriptors generated automatically by
    compiler. But saying that there is no support is incorrect.

    What happens for mismatched array bounds between caller
    and callee? Nothing, I guess?

    I don't know. I didn't read this part of the standard. Or any part of
    any C standard past C89.

    Never used them, either. For me, multi-dimensional arrays look mostly
    like a source of confusion rather than a useful feature, at least as
    long as there are no automatically generated descriptors - with the
    exception of VERY conservative cases like array fields in a structure,
    with all dimensions fixed at compile time.

    I don't know, but I can guess. And in case I am wrong Keith Thompson
    will correct me.
    Most likely the standard says that mismatched array bounds between
    caller and callee are UB.
    And most likely in practice it works as expected. I.e. if the caller
    defined the matrix as X[M][N] and the callee is treating it as
    Y[P][Q], then an access to Y[i][j] will go to X[k/N][k%N] as long as
    k = i*Q + j < M*N.

    However, note that in practice something like that happening by
    mistake is far less likely with variably-modified types than it is
    with classic C multi-dimensional arrays.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Wed Jan 15 20:59:15 2025
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike
    C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that
    dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not, arrays
    stuff in C had not changed (AFAIK) since C99.

    It's not mandatory, so compilers are free to ignore it (and a
    major compiler, from a certain company in Redmond, does
    not support it). That's as good as saying that it does not
    exist in the language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Keith Thompson on Wed Jan 15 22:39:57 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    In most but not all contexts. For example, `sizeof arr` yields the size
    of the array, not the size of a pointer.

    Jep.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    In C, multidimensional arrays are nothing more or less than arrays of
    arrays. You can also build data structures using pointers that are
    accessed using the same a[i][j] syntax as is used for a multidimensional array. And yes, they can be difficult to work with.

    A pointer forest is also Not Good (TM) for efficiency...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Thu Jan 16 10:11:36 2025
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header
    which contains the starting point and current length, along with
    allocated size. For multidimensional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication
    to get the actual position:

    array[y][x] -> array[y*width + x]
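
    A minimal C sketch of the same flattening, assuming row-major layout
    and one contiguous allocation; names are illustrative:

    #include <stdlib.h>

    /* one contiguous block instead of a vector of row vectors */
    double *make_grid(size_t height, size_t width)
    {
        return malloc(height * width * sizeof(double));
    }

    /* row-major: the element formerly written array[y][x] */
    static inline double *cell(double *grid, size_t width, size_t y, size_t x)
    {
        return &grid[y * width + x];
    }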

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Thu Jan 16 11:43:48 2025
    On 15/01/2025 21:28, Michael S wrote:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike
    C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that
    dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    It's not a big thing. VLA's were added in C99, but one big and
    influential compiler supplier didn't want to bother supporting them
    (there's lots in C99 that they didn't bother supporting) so they argued
    for it to be optional in C11. By the time C23 was in planning, they had finally got around to supporting most of C99, so it is no longer
    optional for standards compliance. But basically the situation is the
    same as it always has been - if you use a solid C compiler like gcc,
    clang, icc, etc., you can freely use VLA's. If you use MS's half-done
    effort, you can't. (MS's compiler has much better support for newer C++ standards - they just seem determined to be useless at C support.)


    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not, arrays
    stuff in C had not changed (AFAIK) since C99. So, more like four
    decades. Or 33 years since Fortran got its first standard.


    Yes.


    There are no array descriptors generated automatically by
    compiler. But saying that there is no support is incorrect.

    What happens for mismatched array bounds between caller
    and callee? Nothing, I guess?

    Bad things /might/ happen. But they might not - it's undefined behaviour.


    I don't know. I didn't read this part of the standard. Or any part of
    any C standard past C89.

    Never used them, too. For me, multi-dimensional arrays look mostly like source of confusion rather than useful feature. At least as long as
    there are no automatically generated descriptors. With exception for
    VERY conservative cases like array fields in structure, with all
    dimensions fixed at compile time.

    I don't know, but I can guess. And in case I am wrong Keith Thompson
    will correct me.
    Most likely the standard says that mismatched array bounds between
    caller and callee is UB.

    Yes.

    If you have:

    int x[4][6];

    then the expression "x[i]" is evaluated by converting "x" to a pointer
    to an array of 6 ints. Thus x[0][6] would be an out-of-bounds access to
    the first array of 6 ints in x - it is /not/ defined to work like
    x[1][0], even though you'd get the same bit of memory if you worked out
    the array address by hand.
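
    A short sketch of that point in C - the second access is the undefined
    one, even though the flat addresses coincide:

    void demo(void)
    {
        int x[4][6] = { 0 };

        int ok  = x[1][0];   /* well defined: second row, first element */
        int bad = x[0][6];   /* undefined behaviour: one past the end of row 0,
                                even though &x[1][0] is numerically the same
                                address */
        (void)ok; (void)bad;
    }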

    In practice, it might work fine. When you declare an array type, the
    compiler will believe you - C is a trusting language. But if you have
    given the compiler conflicting information, things can go badly wrong.
    So if you declare an array somewhere with one format that the compiler
    can see, and then access it through an lvalue (such as a pointer) with a different format that the compiler also can see, the compiler might
    generate code that assumes one format or the other, or a mix of them.
    Or it might assume that the pointer can't refer to the declared array
    because they are not the same format, and keep values cached in
    registers that don't match up.

    I expect you'd see problems most often if the compiler is able to make
    use of SIMD or vector registers to handle blocks of the data at a time.
    And you are more likely to see trouble with cross-module optimisations
    (LTO in gcc terms) since it leads to greater sharing of information over
    wider ranges of the code.

    As always, the advice is not to lie to your compiler - it might not bite
    you now, but it may well do in the future when you least expect it.


    And most likely in practice it works as expected. I.e. if caller
    defined the matrix as X[M][N] and caller is treating it as Y[P][Q] then access to Y[i][j] for as long as k=i*Q+j < M*N will go to X[k/N][k%N].


    Remember that in C (and all other programming languages), if you try to
    do something that is not defined behaviour, there isn't any concept of
    "works as expected" as far as the language is concerned. What the
    /programmer/ expected is a different matter - but if the language (or additional information from the compiler) does not define the behaviour,
    then the programmer's expectations are based on a misunderstanding.

    However, you have to pay attention that in practice something like that happening by mistake with variably-modified types is far less likely
    than it is with classic C multi-dimensional arrays.


    I'm not sure why you'd say that.

    The rule for getting array code right is quite simple - don't use arrays without knowing the bounds for each dimension. You can get these by
    passing bounds as parameters, or using fixed constants, or wrapping
    fixed-size arrays in a struct and using sizeof - however you do it, make
    sure you know the bounds and keep them consistent.
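
    One hedged sketch of that advice: wrap a fixed-size array in a struct
    so the bounds travel with the object and sizeof still works; the type
    and function names are illustrative:

    #include <stddef.h>

    struct grid {                        /* bounds travel with the object */
        double cells[8][8];
    };

    static double sum_grid(const struct grid *g)
    {
        size_t rows = sizeof g->cells    / sizeof g->cells[0];
        size_t cols = sizeof g->cells[0] / sizeof g->cells[0][0];
        double s = 0.0;
        for (size_t i = 0; i < rows; i++)
            for (size_t j = 0; j < cols; j++)
                s += g->cells[i][j];
        return s;
    }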

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Thu Jan 16 12:36:45 2025
    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike
    C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that
    dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not, arrays
    stuff in C had not changed (AFAIK) since C99.

    It's not mandatory, so compilers are free to ignore it (and a
    major compiler, from a certain company in Redmond, does
    not support it). That's as good as sayhing that it does not
    exist in the language.

    Not really, no. The world of C programmers can be divided into those
    that work with C compilers and can freely use pretty much anything in C,
    and those that have to contend with limited, non-standard or otherwise problematic compilers and write code that works for them. Such
    compilers include embedded toolchains for very small microcontrollers or
    DSPs, and MS's C compiler.

    Some C code needs to be written in a way that works on MS's C compiler
    as well as other tools, but most is free from such limitations. Even
    for those targeting Windows, it's common to use clang or gcc for serious
    C coding.

    MS used to have a long-term policy of specifically not supporting C well because that might make it easier for people to write cross-platform C
    code for Windows and Linux. Instead, they preferred to push developers
    towards C# and Windows-only programming - or if that failed, C++ which
    was not as commonly used on *nix. Now, I think, they just don't care
    much about C - they don't see many people using their tools for C and
    haven't bothered supporting any feature that needs much effort. They
    know that they can't catch up with other C compilers, so have made it
    easier to integrate clang with their IDE's and development tools.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Thu Jan 16 13:11:56 2025
    On 16/01/2025 10:11, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication
    to get the actual position:

      array[y][x] -> array[y*width + x]


    That does not surprise me. Vec<> in Rust is very similar to
    std::vector<> in C++, as far as I know (correct me if that's wrong). So
    a vector of vectors of int is not contiguous or consistent - each
    subvector can have a different current size and capacity. Doing a
    bounds check for accessing xs[i][j] (or in C++ syntax, xs.at(i).at(j)
    when you want bounds checking) means first reading the current size
    member of the outer vector, and checking "i" against that. Then xs[i]
    is found (by adding "i * sizeof(vector)" to the data pointer stored in
    the outer vector). That is looked up to find the current size of this
    inner vector for bounds checking, then the actual data can be found.

    This is /completely/ different from classic C multi-dimensional arrays.
    It is more akin to a one-dimensional C array of pointers to individually allocated one-dimensional C arrays - but even less efficient due to an
    extra layer of indirection.

    If you know the size of the data at compile time, then in C++ you have std::array<> where the information about size is carried in the type,
    with no run-time cost. A nested std::array<> is a perfectly good and
    efficient multi-dimensional array with runtime bounds checking if you
    want to use it, as well as having value semantics (no decay to pointer
    types in expressions). I would guess there is something equivalent in
    Rust ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Thu Jan 16 13:59:55 2025
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation
    that dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for
    multi-dimensional arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not,
    arrays stuff in C had not changed (AFAIK) since C99.

    It's not mandatory, so compilers are free to ignore it (and a
    major compiler, from a certain company in Redmond, does
    not support it). That's as good as sayhing that it does not
    exist in the language.

    Not really, no. The world of C programmers can be divided into those
    that work with C compilers and can freely use pretty much anything in
    C, and those that have to content with limited, non-standard or
    otherwise problematic compilers and write code that works for them.
    Such compilers include embedded toolchains for very small
    microcontrollers or DSPs, and MS's C compiler.

    Some C code needs to be written in a way that works on MS's C
    compiler as well as other tools, but most is free from such
    limitations. Even for those targeting Windows, it's common to use
    clang or gcc for serious C coding.

    MS used to have a long-term policy of specifically not supporting C
    well because that might make it easier for people to write
    cross-platform C code for Windows and Linux. Instead, they preferred
    to push developers towards C# and Windows-only programming - or if
    that failed, C++ which was not as commonly used on *nix. Now, I
    think, they just don't care much about C - they don't see many people
    using their tools for C and haven't bothered supporting any feature
    that needs much effort. They know that they can't catch up with
    other C compilers, so have made it easier to integrate clang with
    their IDE's and development tools.


    Microsoft does care about C, but only in one specific area - kernel programming.

    OK. That's not an area I have been involved in at all, so I will take
    your word for it. Does that also extend to device drivers?

    The only other language officially allowed for Windows
    kernel programming is C++, but coding kernel drivers in C++ is
    discouraged.

    C++ is absolutely fine for low-level programming, but you need to know
    how to write low-level C++ code. Someone used to writing application
    code in C++ can write really bad low-level C++ code very, very quickly -
    it takes more effort to get things totally wrong in C!

    I suppose that driver written in C++ would have major
    difficulties passing Windows HLK tests and getting WHQL signing.


    I once took a brief look at that process many years ago, and decided it
    was not for me!

    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VMTs are, may
    be, tolerable (I wonder what is current policy of Linux and BSD
    kernels), but hardly desirable.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Thu Jan 16 14:35:32 2025
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's
    a ton of C code out there), but trying to retrofit a safe
    memory model onto C seems a bit awkward - it might have been
    better to target a language which has arrays in the first
    place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation
    that dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for
    multi-dimensional arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not,
    arrays stuff in C had not changed (AFAIK) since C99.

    It's not mandatory, so compilers are free to ignore it (and a
    major compiler, from a certain company in Redmond, does
    not support it). That's as good as sayhing that it does not
    exist in the language.

    Not really, no. The world of C programmers can be divided into those
    that work with C compilers and can freely use pretty much anything in
    C, and those that have to content with limited, non-standard or
    otherwise problematic compilers and write code that works for them.
    Such compilers include embedded toolchains for very small
    microcontrollers or DSPs, and MS's C compiler.

    Some C code needs to be written in a way that works on MS's C
    compiler as well as other tools, but most is free from such
    limitations. Even for those targeting Windows, it's common to use
    clang or gcc for serious C coding.

    MS used to have a long-term policy of specifically not supporting C
    well because that might make it easier for people to write
    cross-platform C code for Windows and Linux. Instead, they preferred
    to push developers towards C# and Windows-only programming - or if
    that failed, C++ which was not as commonly used on *nix. Now, I
    think, they just don't care much about C - they don't see many people
    using their tools for C and haven't bothered supporting any feature
    that needs much effort. They know that they can't catch up with
    other C compilers, so have made it easier to integrate clang with
    their IDE's and development tools.


    Microsoft does care about C, but only in one specific area - kernel programming. The only other language officially allowed for Windows
    kernel programming is C++, but coding kernel drivers in C++ is
    discouraged. I suppose that a driver written in C++ would have major
    difficulties passing Windows HLK tests and getting WHQL signing.

    As you can guess, in kernel drivers VLAs are unwelcome. VMTs are maybe
    tolerable (I wonder what the current policy of the Linux and BSD
    kernels is), but hardly desirable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to David Brown on Thu Jan 16 16:46:17 2025
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLAs normally allocate on the stack, which at first glance looks
    great. But once one realizes how small stacks are in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use a VLA one needs a rather small bound on the maximal
    size of the array. Given such a bound, always allocating the maximal
    size is simpler. Without a _small_ bound on the size, the heap is
    safer, as it is designed to handle big allocations as well.
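
    A sketch of the compromise this suggests, assuming a small known
    bound: take the fixed worst case on the stack for the common path and
    fall back to the heap otherwise (the 512-element threshold is an
    arbitrary illustration, and process is a made-up name):

    #include <stdlib.h>
    #include <string.h>

    #define SMALL 512                    /* assumed small, known bound */

    void process(const double *src, size_t n)
    {
        double small[SMALL];             /* fixed worst case, on the stack */
        double *buf = (n <= SMALL) ? small : malloc(n * sizeof *buf);
        if (buf == NULL)
            return;
        memcpy(buf, src, n * sizeof *buf);
        /* ... work on buf ... */
        if (buf != small)
            free(buf);
    }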

    In the past I was a fan of VLAs and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.

    I do not know about Windows, but IIUC for some period the Linux limit
    for the kernel stack was something like 2 kB (a single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate-size arrays are
    unwelcome on the kernel stack due to size limits.

    VMTs are, may
    be, tolerable (I wonder what is current policy of Linux and BSD
    kernels), but hardly desirable.

    IMO VMT-s are vastly superior to raw pointers, but to fully
    get their advantages one would need better tools. Also, the
    kernel needs to deal with variable-size arrays embedded in
    various data structures. This is possible using pointers,
    but current VMT-s are too weak for many such uses.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Jan 16 17:24:58 2025
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header
    which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication
    to get the actual position:

      array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Terje Mathisen on Thu Jan 16 09:15:55 2025
    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication
    to get the actual position:

      array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Thu Jan 16 09:55:50 2025
    On 1/16/2025 9:24 AM, MitchAlsup1 wrote:
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication to get the actual position:

       array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    OK, so Terje's observation of it being faster doing the calculation
    himself is due to him not doing these additional checks?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Waldek Hebisch on Thu Jan 16 18:12:46 2025
    Waldek Hebisch <antispam@fricas.org> schrieb:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack.

    You can pass them as VLAs (which Fortran has had since 1958)
    or you can declare them. It is the latter which would need
    to allocate on the stack.

    But allocating them on the stack is an implementation detail.
    Since Fortran 90, you can also do

    subroutine foo(n,m)
    integer, intent(in) :: n,m
    real, dimension(n,m) :: a

    which will declare the array a with the bounds of n and m.
    (Fortran can also do dynamic memory allocation, so

    subroutine foo(n,m)
    integer, intent(in) :: n,m
    real, dimension(:,:), allocatable :: c
    allocate (c(n,m))

    would also work, and also automatically release the memory).

    Because Fortran users are used to large arrays, any good Fortran
    compiler will also allocate the array a on the heap if it is too large.


    Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Basically, to use VLA one needs rather small bound on maximal
    size of array. Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is desined to handle also big allocations.

    Allocating data on the stack promotes cache locality, which can
    increase performance by quite a lot.

    If you have a memory allocation pattern like

    p1 = malloc(chunk_1); /* Fill it */
    p2 = malloc(chunk_2);
    /* Use it */
    free (p2);
    p3 = malloc(chunk_3);
    /* Use it */
    free (p3);
    /* Use p1 */

    There is a chance that p2 still pollutes the cache and parts of
    p1 may have been removed unnecessarily. This would not have been
    the case p2 and p3 had been allocated on the stack.
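
    For contrast, a sketch of the same pattern with the scratch buffers as
    automatic (stack) arrays; sizes are illustrative and assumed small:

    void work(void)
    {
        double p1[256];                  /* long-lived data, fill it */
        {
            double p2[512];              /* scratch buffer on the stack */
            /* use p2 ... */
            (void)p2;
        }
        {
            double p3[512];              /* typically lands on the same stack
                                            addresses p2 just vacated, so it
                                            reuses those cache lines instead
                                            of evicting parts of p1 */
            /* use p3 ... */
            (void)p3;
        }
        /* use p1 ... */
        (void)p1;
    }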

    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.

    Stack limits are artificial, but


    I do not know about Windows, but IIUC in some period Linux limit
    for kernel stack was something like 2 kB (single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate size arrays are
    unwelcame on kernel stack due to size limits.

    ... for kernels maybe less so.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Jan 16 18:23:28 2025
    On Thu, 16 Jan 2025 17:55:50 +0000, Stephen Fuld wrote:

    On 1/16/2025 9:24 AM, MitchAlsup1 wrote:
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication to get the actual position:

       array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    OK, so Terje's observation of it being faster doing the calculation
    himself is due to him not doing these additional checks?

    Most likely.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Thu Jan 16 18:30:35 2025
    On Thu, 16 Jan 2025 18:12:46 +0000, Thomas Koenig wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    David Brown <david.brown@hesbynett.no> wrote:


    Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Basically, to use VLA one needs rather small bound on maximal
    size of array. Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is desined to handle also big allocations.

    Allocating data on the stack promotes cache locality, which can
    increase performance by quite a lot.

    If/when HW can track the deallocation of stack storage, the core
    does not have to write back modified lines to memory at cache-line
    replacement. {{Look, once the SP moves out of that part of
    the stack, nobody is allowed to dereference it anymore, so
    nobody cares about the values in those containers.}}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Thu Jan 16 20:46:04 2025
    On Thu, 16 Jan 2025 18:12:46 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why.
    I've never understood why people think there is something
    "dangerous" about VLAs, or why they think using heap allocations
    is somehow "safer".

    VLA normally allocate on the stack.

    You can pass them as VLAs (which Fortran has had since 1958)
    or you can declare them. It is the latter which would need
    to allocate on the stack.


    The part about passing, including dynamic allocation, is what in C
    is called VM types.

    But allocating them on the stack is an implementation detail.
    Since Fortran 90, you can also do

    subroutine foo(n,m)
    integer, intent(in) :: n,m
    real, dimension(n,m) :: a

    which will delcare the array a with the bounds of n and m.
    (Fortran can also do dynamic memory allocation, so

    subroutine foo(n,m)
    integer, intent(in) :: n,m
    real, dimension(:,:), allocatable :: c
    allocate (c(n,m))

    would also work, and also automatically release the memory).

    Because Fortran users are used to large arrays, any good Fortran
    compiler will also allocate a on the heap if it is too large.


    Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.

    In user space it is just unfortunate tradition. Not in all languages,
    BTW. In Go, for example, the default stack limit is 1 GB, which is
    still small, but not as ridiculously small as the 1 to 8 MB that are
    typical in C, C++, Rust and, I suppose, Fortran.
    However, the original point of discussion was kernel programming. In
    the kernel there are pretty good reasons why the default stack is very
    small - 8-32 KB, I think; maybe a few times bigger on Apple, I didn't
    check. The reason is that in many kernel contexts page faults are not
    allowed, so you have to allocate physical memory rather than just
    reserve address space.

    "To avoid infinite recursion" is not a valid reason, IMHO.

    Basically, to use VLA one needs rather small bound on maximal
    size of array. Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is desined to handle also big allocations.

    Allocating data on the stack promotes cache locality, which can
    increase performance by quite a lot.

    If you have a memory allocation pattern like

    p1 = malloc(chunk_1); /* Fill it */
    p2 = malloc(chunk_2);
    /* Use it */
    free (p2);
    p3 = malloc(chunk_3);
    /* Use it */
    free (p3)
    /* Use p1 */

    There is a chance that p2 still pollutes the cache and parts of
    p1 may have been removed unnecessarily. This would not have been
    the case p2 and p3 had been allocated on the stack.

    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.

    Stack limits are artificial, but


    I do not know about Windows, but IIUC in some period Linux limit
    for kernel stack was something like 2 kB (single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate size arrays are
    unwelcame on kernel stack due to size limits.

    ... for kernels maybe less so.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Thu Jan 16 20:12:03 2025
    Stephen Fuld wrote:
    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a
    header which contains the starting point and current length, along
    with allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication to get the actual position:

    array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?


    Because Rust really doesn't have multi-dim vectors, instead using
    vectors of pointers to vectors?

    OTOH, it is perfectly OK to create your own multi-dim data structure,
    and using macros you can probably get the compiler to generate
    near-optimal code as well, but afaik, nothing like that is part of the
    core language.

    I do know that several people have created fast string libraries, where
    any string that is short enough ends up entirely inside the dope vector,
    so no heap allocation.
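
    A rough C sketch of that "short string lives inside the header" idea
    (small-string optimisation); the field names and the inline-buffer
    size are illustrative, not any particular library's layout:

    #include <stddef.h>

    struct sso_string {
        size_t len;
        union {
            char inline_buf[24];         /* short strings live right here  */
            struct {
                char  *ptr;              /* long strings spill to the heap */
                size_t cap;
            } heap;
        } u;
    };

    static const char *sso_cstr(const struct sso_string *s)
    {
        return s->len < sizeof s->u.inline_buf ? s->u.inline_buf
                                               : s->u.heap.ptr;
    }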

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Thu Jan 16 19:14:08 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication to get the actual position:

      array[y][x] -> array[y*width + x]


    That is what any Fortran compiler does under the hood, of course.

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    It should.


    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    Depends on the relevant flag for bounds checking (at least for Fortran).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Thu Jan 16 20:22:35 2025
    Stephen Fuld wrote:
    On 1/16/2025 9:24 AM, MitchAlsup1 wrote:
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a
    header
    which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication
    to get the actual position:

       array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    OK, so Terje's observation of it being faster doing the calculation
    himself is due to him not doing these additional checks?

    No, more like one of the Advent of Code problems that naively looked
    like a nice little hash table problem, with strings as the keys:

    "-4,0,3,6"

    I.e. 4 integers, all in the -9 to 9 range, used to verify that this was
    the first time this particular combination was seen.

    The first speedup (compared to my original Perl code) was from
    converting this to 4 signed byte values all packed into a u32 variable,
    then on each iteration I would shift the key up by 8 (getting rid of the
    oldest delta) and add in the new delta as the new bottom byte, then use
    that u32 as the hash table key.

    My code became an order of magnitude faster when I instead allocated a
    single vector with 19*19*19*19 elements, then biased each of those four
    delta values by +9 so that they would all be in the [0..18] range
    instead of [-9..9], and do the addressing as ((d3*19+d2)*19+d1)*19+d0.

    Rust would still verify that the final value was in range, but this
    becomes a single (never taken) CMP/JA combination.
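
    (A rough C sketch of that flattening idea, for illustration only - the
    actual code is Rust, and the puzzle logic around it is omitted:)

    #include <stdbool.h>
    #include <stddef.h>

    #define RANGE 19                    /* deltas biased from [-9..9] to [0..18] */

    static bool seen[RANGE * RANGE * RANGE * RANGE];   /* flat "hash table" */

    /* Returns true the first time this 4-delta combination is encountered. */
    static bool first_time(int d3, int d2, int d1, int d0)
    {
        size_t idx = (((size_t)(d3 + 9) * RANGE + (d2 + 9)) * RANGE
                      + (d1 + 9)) * RANGE + (d0 + 9);
        if (seen[idx])
            return false;
        seen[idx] = true;
        return true;
    }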

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Waldek Hebisch on Thu Jan 16 21:02:28 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    IIUC popular current processors are still quite far from having
    64-bit virtual address space, so there is still reason to limit
    stack size, simply limit can be much bigger than on 32-bit
    systems.

    The ARMv8/ARMv9 architecture supports up to 52 bits of
    VA space (and up to 52-bits of PA space). Most implementations
    typically provide 48/48; I know of one that does 52/52
    and another that supports 48/52.

    Going larger would require more levels of translation table.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Thomas Koenig on Thu Jan 16 20:34:51 2025
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack.

    You can pass them as VLAs (which Fortran has had since 1958)
    or you can declare them.

    As explained in another post, in C a VLA means an allocation; passing
    is done via VMTs (variably modified types).

    Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    On a multiuser machine there is some point to it: you do not
    want a buggy student program to cause thrashing. In other
    words, you need a stack limit that is some smallish fraction
    of real memory. With virtual memory, heap allocations bigger
    than RAM work fine.

    There is a good reason for small kernel stacks: they are used to
    handle interrupts, including page faults, so they must be real
    memory. Since each thread needs its own kernel stack, a bigger
    stack would mean quite a lot of memory use.

    In the 32-bit era there was also a valid reason for small user stacks:
    one needs to pre-allocate address space for each stack, and
    with lots of threads there is not enough address space to give a
    sizeable stack to each thread.

    IIUC popular current processors are still quite far from having a
    full 64-bit virtual address space, so there is still reason to limit
    stack size; the limit can simply be much bigger than on 32-bit
    systems.

    There is also another issue: stack allocations become invalid
    when the routine doing the allocation returns, which, depending
    on the application, may be unacceptable. So reuse of code doing
    stack allocation is tricky, while for heap allocation a simple
    reference count may be enough to ensure that the allocation is
    freed when nobody uses the given area. Consequently, there is a
    tendency to use heap allocation to allow more flexible use
    patterns. With more use of heap allocation there is less
    use of stack allocation, and big stacks are considered
    unnecessary.

    Basically, to use VLA one needs rather small bound on maximal
    size of array. Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    Allocating data on the stack promotes cache locality, which can
    increase performance by quite a lot.

    Sure.

    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.

    Stack limits are artificial, but


    I do not know about Windows, but IIUC for some period the Linux limit
    for a kernel stack was something like 2 kB (a single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate-size arrays are
    unwelcome on the kernel stack due to size limits.

    ... for kernels maybe less so.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Thu Jan 16 22:23:38 2025
    On 16/01/2025 22:10, Keith Thompson wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 16/01/2025 10:11, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERI targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:
    It has Vec<> which is always implemented as a dope vector, i.e. a
    header which contains the starting point and current length, along
    with allocated size. For multidimensional work, the natural mapping
    is Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.
    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication to get the actual position:
      array[y][x] -> array[y*width + x]

    Note that this will inhibit bounds checking on the inner dimension.
    That might be part of the reason for the improvement in speed.

    For example, given int array[10][10], array[0][11] is out of bounds,
    even if it logically refers to the same location as array[1][1]. This results in undefined behavior in C, and perhaps some kind of exception
    in a language that requires bounds checking. If you do this manually by defining a 1d array, any checking applies only to the entire array.

    That does not surprise me. Vec<> in Rust is very similar to
    std::vector<> in C++, as far as I know (correct me if that's wrong).
    So a vector of vectors of int is not contiguous or consistent - each
    subvector can have a different current size and capacity. Doing a
    bounds check for accessing xs[i][j] (or in C++ syntax, xs.at(i).at(j)
    when you want bounds checking) means first reading the current size
    member of the outer vector, and checking "i" against that. Then xs[i]
    is found (by adding "i * sizeof(vector)" to the data pointer stored in
    the outer vector). That is looked up to find the current size of this
    inner vector for bounds checking, then the actual data can be found.

    I'm not familiar with Rust's Vec<>, but C++'s std::vector<> guarantees
    that the elements are stored contiguously. But the std::vector<> object itself doesn't contain those elements; it's a fixed-size chunk of data (basically a struct in C terms) whose size doesn't change regardless of
    the number of elements (and typically regardless of the element type).
    So a std::vector<std::vector<int>> will result in the data for each row
    being stored contiguously, but the rows themselves will be allocated dynamically.


    Yes, exactly.

    Of course you could do as Terje did in Rust - make a std::vector<> of
    size N x M and do the "i * N + j" calculation manually. Now that C++23
    has a multi-parameter subscript operator, you can do that quite neatly
    in a little wrapper class around a std::vector<> with a nice access
    operator. But it's still more efficient to use a std::array<> if you
    know the sizes at compile time.

    This is /completely/ different from classic C multi-dimensional
    arrays. It is more akin to a one-dimensional C array of pointers to
    individually allocated one-dimensional C arrays - but even less
    efficient due to an extra layer of indirection.
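
    (A rough C illustration of the two layouts being contrasted here -
    hypothetical helper names, error handling omitted:)

    #include <stdlib.h>

    /* "Jagged" layout: an array of pointers, each row allocated separately.
       Every access xs[i][j] costs an extra pointer load. */
    int **make_jagged(size_t rows, size_t cols) {
        int **xs = malloc(rows * sizeof *xs);
        for (size_t i = 0; i < rows; i++)
            xs[i] = calloc(cols, sizeof **xs);
        return xs;
    }

    /* Flat layout: one contiguous block; element (i, j) is flat[i*cols + j]. */
    int *make_flat(size_t rows, size_t cols) {
        return calloc(rows * cols, sizeof(int));
    }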

    If you know the size of the data at compile time, then in C++ you have
    std::array<> where the information about size is carried in the type,
    with no run-time cost. A nested std::array<> is a perfectly good and
    efficient multi-dimensional array with runtime bounds checking if you
    want to use it, as well as having value semantics (no decay to pointer
    types in expressions). I would guess there is something equivalent in
    Rust ?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Waldek Hebisch on Thu Jan 16 22:16:43 2025
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows.
    That's a /lot/ if you are working with fairly small but non-constant
    sizes. So if you are working with a selection of short-lived
    medium-sized bits of data - say, parts of strings for some formatting
    work - putting them on the stack is safe and can be significantly faster
    than using the heap.

    Using VLAs (or the older but related technique, alloca) means you don't
    waste space. Maybe you are working with file paths, and want to support
    up to 4096 characters per path - but in reality most paths are less than
    100 characters. With fixed size arrays, allocating 16 of these and initialising them will use up your entire level 1 cache - with VLAs, it
    will use only a tiny fraction. These things can make a big difference
    to code that aims to be fast.
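
    (A hedged C sketch of that kind of use - the helper name and the 4096
    limit are illustrative, not taken from any particular code base:)

    #include <stdio.h>
    #include <string.h>

    #define PATH_LIMIT 4096

    /* Builds "<dir>/<name>" in a stack buffer sized to fit, assuming the
       inputs have already been sanity-checked against PATH_LIMIT. */
    void with_joined_path(const char *dir, const char *name,
                          void (*use)(const char *path)) {
        size_t need = strlen(dir) + 1 + strlen(name) + 1;
        if (need > PATH_LIMIT)
            return;                   /* refuse oversized input */
        char path[need];              /* VLA: typically ~100 bytes, not 4096 */
        snprintf(path, need, "%s/%s", dir, name);
        use(path);
    }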

    Fixed size arrays are certainly easier to analyse and are often a good
    choice, but VLAs definitely have their advantages in some situations,
    and they are perfectly safe and reliable if you use them appropriately
    and correctly.


    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.


    Other people might have bad uses of VLAs - it doesn't mean /you/ have to
    use them badly too!

    I do not know about Windows, but IIUC in some period Linux limit
    for kernel stack was something like 2 kB (single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate size arrays are
    unwelcome on kernel stack due to size limits.

    If a kernel stack is that small (or you are working on an embedded
    system with very small stacks), then clearly you have to take that into account. I've used them a couple of times in embedded systems with
    small stacks - obviously the size of the VLA was also small. (On such
    systems, heap allocations are very much unwelcome - though not quite as unwelcome as overflowing the stack :-) )


    Far and away my most common use of VLAs is, however, not variable length
    at all. It's more like :

    const int no_of_whatsits = 20;
    const int size_of_whatsit = 4;

    uint8_t whatsits_data[no_of_whatsits * size_of_whatsit];

    Technically in C, that is a VLA because the size expression is not a
    constant expression according to the rules of the language. But of
    course it is a size that is known at compile-time, and the compiler
    generates exactly the same code as if it was a constant expression. It
    is equally amenable to analysis and testing. (In C++, it is considered
    a normal array - C++ does not support VLAs, but is happy with code like
    that.) With C23, these const variables can now be constexpr, and the
    array will then be a normal array and not a VLA - without that making
    the slightest difference to the actual generated code.
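
    (For illustration, the C23 spelling of the same thing - the array is
    then an ordinary array rather than a technical VLA:)

    #include <stdint.h>

    void example(void) {
        constexpr int no_of_whatsits  = 20;
        constexpr int size_of_whatsit = 4;
        uint8_t whatsits_data[no_of_whatsits * size_of_whatsit];  /* not a VLA in C23 */
        (void)whatsits_data;   /* placeholder use */
    }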



    VMTs are maybe
    tolerable (I wonder what the current policy of Linux and BSD
    kernels is), but hardly desirable.

    IMO VMT-s are vastly superior to raw pointers, but to fully
    get their advantages one would need better tools. Also, the
    kernel needs to deal with variable-size arrays embedded in
    various data structures. This is possible using pointers,
    but current VMT-s are too weak for many such uses.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Thu Jan 16 21:40:39 2025
    David Brown <david.brown@hesbynett.no> writes:
    On 16/01/2025 17:46, Waldek Hebisch wrote:

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows.

    On all the linux systems I use, the stack limit defaults to 8192KB.

    That includes RHEL, Fedora, CentOS, Scientific Linux and Ubuntu.

    Now, that's for the primary thread stack, for which the OS
    manages the growth. For other threads in the process,
    the size varies based on the threads library in use
    and whether the application is compiled for 32-bit or
    64-bit systems.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Keith Thompson on Thu Jan 16 23:39:13 2025
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to David Brown on Fri Jan 17 02:22:54 2025
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate anything on the heap without knowing the bounds and being sure it is appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    Well, AFAICS VLAs may get allocated on function entry. In such a
    case the caller has to check the allocation size, which spreads
    allocation-related code between the caller and the called function.
    In the case of 'malloc' one can simply check the return value. In fact,
    in many programs a simple wrapper that exits in case of allocation
    failure is enough (if the application cannot do its work without
    memory and there is no memory, then there is no point in continuing
    execution).
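
    (A minimal sketch of such a wrapper - the name xmalloc is a common
    convention, not a reference to any specific program here:)

    #include <stdio.h>
    #include <stdlib.h>

    /* Allocate or exit: callers never see a null pointer. */
    void *xmalloc(size_t size) {
        void *p = malloc(size);
        if (p == NULL) {
            fprintf(stderr, "out of memory (requested %zu bytes)\n", size);
            exit(EXIT_FAILURE);
        }
        return p;
    }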

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows. That's a /lot/ if you are working with fairly small but non-constant
    sizes. So if you are working with a selection of short-lived
    medium-sized bits of data - say, parts of strings for some formatting
    work - putting them on the stack is safe and can be significantly faster
    than using the heap.

    IME this is a relatively rare case. For formatting, a single result
    buffer (possibly expanded when needed), with the other pieces of
    data appended to it, frequently gave me good performance. Intermediate
    strings appeared as return values of called functions. Without
    reorganizing the code this does not respect stack discipline. Once
    reorganized, I get the best results without materializing intermediate
    strings.

    Using VLAs (or the older but related technique, alloca) means you don't
    waste space. Maybe you are working with file paths, and want to support
    up to 4096 characters per path - but in reality most paths are less than
    100 characters. With fixed size arrays, allocating 16 of these and initialising them will use up your entire level 1 cache - with VLAs, it
    will use only a tiny fraction.

    One can initialize only the used part, or simply use uninitialized
    arrays (that is what I do normally). It is rather hard to give a
    meaningful initialization when the size of the payload varies.

    These things can make a big difference
    to code that aims to be fast.

    Fixed size arrays are certainly easier to analyse and are often a good choice, but VLA's definitely have their advantages in some situations,
    and they are perfectly safe and reliable if you use them appropriately
    and correctly.


    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.


    Other people might have bad uses of VLAs - it doesn't mean /you/ have to
    use them badly too!

    Well, for me the typical case is work arrays where the needed size
    may vary widely. Using 'malloc' is simpler in such use given a
    small stack limit. With a small stack limit a VLA would be a
    micro-optimization, not worth the extra effort.

    <snip>

    Far and away my most common use of VLAs is, however, not variable length
    at all. It's more like :

    const int no_of_whatsits = 20;
    const int size_of_whatsit = 4;

    uint8_t whatsits_data[no_of_whatsits * size_of_whatsit];

    Technically in C, that is a VLA because the size expression is not a
    constant expression according to the rules of the language. But of
    course it is a size that is known at compile-time, and the compiler
    generates exactly the same code as if it was a constant expression.

    OK, that is a useful case (but in spirit this is not a VLA).

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Keith Thompson on Fri Jan 17 02:10:52 2025
    On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU
    implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!

    I don't understand. What pointer are you referring to?

    The pointer which would have had to point elsewhere had the string
    not been contained within.

    In the implementation I'm referring to, std::string happens to be 32
    bytes in size. If the string has a length of 15 or less, the string
    data is stored directly in the std::string object (in the last 16 bytes
    as it happens). If the string is longer than that it's stored
    elsewhere, and those 16 bytes are presumably used to manage the
    heap-allocated data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Fri Jan 17 10:20:43 2025
    On 16/01/2025 22:40, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 16/01/2025 17:46, Waldek Hebisch wrote:

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows.

    On all the linux systems I use, the stack limit defaults to 8192KB.

    OK. The details don't matter much here. (Of course, if you are
    intending to put large objects on the stack, then the details /do/
    matter, and you probably want to specify a minimum stack size explicitly.)


    That includes RHEL, Fedora, CentOS, Scientific Linux and Ubuntu.

    Now, that's for the primary thread stack, for which the OS
    manages the growth. For other threads in the process,
    the size varies based on the threads library in use
    and whether the application is compiled for 32-bit or
    64-bit systems.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Fri Jan 17 15:52:48 2025
    On 17/01/2025 04:52, Keith Thompson wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    [...]
    Well, AFAICS VLA-s may get allocated on function entry.
    [...]

    That would rarely be possible for objects with automatic storage
    duration (local variables). For example:

    void func(void) {
        do_this();
        do_that();
        int vla[rand() % 10 + 1];
    }

    Memory for `vla` can't be allocated until its size is known,
    and it can't be known until the definition is reached. For most automatically allocated objects, the lifetime begins when execution
    reaches the `{` of the enclosing block; the lifetime of `vla`
    begins at its definition.

    Or did you have something else in mind?

    I'm guessing he was thinking of something like :

    void func(int n) {
        if (n < 1000) {
            int vla[n];
            do_stuff(vla);
        } else {
            int * p = malloc(n * sizeof(int));
            do_stuff(p);
            free(p);
        }
    }

    Although the lifetime of vla[n] is limited to the block that is in that
    one branch, the compiler could certainly handle the allocation with a
    single stack-pointer change at the entry to the function. It is common
    for optimised code to try to have just one stack frame allocation at
    code entry, and a deallocation at exit, rather than re-arranging the
    stack within blocks of code. But it is not common to do so when the
    sizes are not known at compile time and the VLA (or alloca) is not on
    all paths - precisely because the programmer might be doing such checks.


    (Should this part of the discussion migrate to comp.lang.c, or is it
    still sufficiently relevant to computer architecture?)


    Some of the "arch" folks here have compared to other languages, which is
    nice. But if regulars here think the thread branch has become too
    bogged down in details of C, we can stop.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Waldek Hebisch on Fri Jan 17 15:30:24 2025
    On 17/01/2025 03:22, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    Well, AFAICS VLA-s may get allocated on function entry.

    It could be allocated as soon as the size is known, yes.

    In such
    case caller have to check for allocation size, which spreads
    allocation related code between caller and called function.

    It is not about allocation sizes - it's about knowing the data you are
    dealing with, and sanitising unknown data.

    In the very rough example I gave of string formatting or manipulation,
    you might be getting the strings in from outside - command line
    parameters, database entries, wildcard directory searches, etc. You
    sanity check the data when it comes in - regardless of whether or not
    you plan to allocate memory (stack or heap) for copying them. Now you
    know that the sizes are reasonable, you can allocate VLAs (or use
    alloca, or use malloc) without extra worries.

    I am not suggesting you should have some kind of rule to check sizes
    just before every VLA declaration - I am suggesting that when you know
    the size is reasonable and safe, then using a VLA is reasonable and safe.


    In case of 'malloc' one can simply check return value.

    Drivel.

    That's a myth that originated in the days of K&R C.

    It is certainly true that if malloc returns 0, your allocation has
    failed. There are a few - but only a very few - circumstances where
    that is something that can realistically happen in code that is doing
    its job properly. Typically that would be in resource-constrained
    systems where you might have some unusual circumstances causing overload.

    But generally (and this means there will be exceptions), checking for
    null returns from malloc is :

    a) Never properly tested, and often results in leaked resources or other problems;

    b) Totally unrealistic in any real-world use of the code;

    c) Treated as though it is a divine duty that must always be done
    ritually and religiously;

    d) Treated as though it magically makes the code safe, correct and reliable.

    Hopefully you can see that these points are self-contradictory.


    If you try to call malloc with a size that is unreasonable for the circumstances, all kinds of bad things can happen /despite/ a non-null
    return value. What goes wrong can depend on many factors, including the
    OS, the malloc library, the size, the system setup, and what you do with
    the returned pointer. Simply /trying/ to run malloc with a bad size
    may, on some systems, lead to the OS trying to free up as much memory as
    it can in order to accommodate your request - whether malloc ends up
    returning null or not. Or maybe the request is done with lazy
    allocations - you asked for 100 TB of memory and you got a pointer back,
    and things will only go wrong when you start using the virtual space.

    Remember, from the point of view of people using the computer, having
    the OS push lots of stuff out of memory is tantamount to a broken
    system. A program that has runaway memory usage causes great
    frustration, and often leads to users doing a hard reset. And all the
    time, the malloc() calls have returned a non-null value.


    So what does all that mean? It means you do /not/ blindly call
    malloc(), check for a null result, and think that's all good. It means
    you be sure you know what sizes you are asking for /before/ you call
    malloc - probably long before you get to the bit of code that actually
    calls malloc(). It means you look /before/ you leap - you don't "just
    go for it" and hope that you can figure out what went wrong from the
    debris left at the crash site.

    And if you are in doubt - maybe you are pushing the target system to the limits, or have a program that demands more memory than many systems
    might have - you check in advance to see if the memory will be easily available. Such checks will be OS specific, of course.

    (I'm sure some people will now be thinking "you should have used
    ulimit", or "don't enable swap", or "that's the fault of over-commit".
    That would all be missing the point. You can of course use such tools
    as a way of making sure your sizes are reasonable - it's up to the
    developer to decide how to handle such checks and controls. But
    checking the return of malloc is so far from being sufficient that it is basically useless in most circumstances.)


    It is /exactly/ the same for VLAs (or alloca).


    The limits for what sizes are "reasonable" will, of course, be smaller
    for stack allocations than for heap allocations. But that's all target dependent anyway - for the systems I typically work with, the limit for "reasonable" heap allocations is orders of magnitude smaller than
    "reasonable" stack allocations on desktops.

    In fact,
    in many programs simple wrapper that exits in case of allocation
    failure is enough (if application can not do its work without
    memory and there is no memory, then there is no point in continuing execution).

    Have you ever seen that happening in real life? Have you ever even
    known such code to be properly tested?

    Don't get me wrong - a wrapper like this can be a good idea. But it's
    like an electrical fuse - it's a last resort, and only triggers if
    something has gone badly wrong. When you see a great music system with
    a 10 kW amplifier, you check if your house electrical system can handle
    that /before/ you buy it. You don't buy it, plug it in and rely on the
    fusebox to keep your house from burning down - even though you want the
    fuse there as a failsafe. For the most part, if malloc ever returns 0,
    the problem lies before malloc is called.


    (Sorry for the rant - "my code is safe because I check the result of
    malloc" is one of these misconceptions that really annoy me.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Brian G. Lucas on Fri Jan 17 15:17:29 2025
    "Brian G. Lucas" <bagel99@gmail.com> writes:
    On 1/17/25 4:20 AM, David Brown wrote:
    On 16/01/2025 22:40, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 16/01/2025 17:46, Waldek Hebisch wrote:

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows.
    On all the linux systems I use, the stack limit defaults to 8192KB.

    On Linux, one can call the routine setrlimit(RLIMIT_STACK, ...) to change
    the stack size.

    Yes, as a unix/linux kernel engineer, I've implemented that system call
    and the supporting kernel infrastructure in a version of unix a few
    decades ago.

    I'll point out that the implementation provides both HARD and SOFT
    limits for the stack (and all other resources), and the user can
    only affect the SOFT limit, and the user may not raise the SOFT
    limit above the HARD limit, unless running with the appropriate
    capabilities (e.g. UID == 0).
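
    (A small C sketch of that interface - raising the soft stack limit as
    far as the hard limit allows; error handling kept minimal:)

    #include <sys/resource.h>

    /* Returns 0 on success, -1 on failure. */
    int raise_stack_soft_limit(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_STACK, &rl) != 0)
            return -1;
        rl.rlim_cur = rl.rlim_max;   /* soft limit may not exceed the hard limit */
        return setrlimit(RLIMIT_STACK, &rl);
    }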

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Fri Jan 17 16:15:36 2025
    On 17/01/2025 03:10, MitchAlsup1 wrote:
    On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU
    implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!

    I don't understand.  What pointer are you referring to?

    The pointer which would have had to point elsewhere had the string
    not been contained within.


    There are a couple of ways you can do "small string optimisation". One
    would be to have a structure something like this :

    struct String1 {
        size_t capacity;
        char * data;
        char small_string[16];
    };

    Then "data" would point to "small_string" for a capacity of 16, and if
    that's not enough, use malloc to allocate more space.


    An alternative would be to have something like this (I'm being /really/
    sloppy with alignments, rules for unions, and so on - this is
    illustrative only, not real code!) :

    struct String2 {
        bool is_small;
        union {
            char small_string[31];
            struct {
                size_t capacity;
                char * data;
            };
        };
    };

    This second version lets you put more characters in the local
    small_string area, reusing space that would otherwise be used for the
    pointer and capacity. But it has more runtime overhead when using the
    string :

    void print1(String1 s) {
        printf(s.data);
    }

    void print2(String2 s) {
        if (s.is_small) {
            printf(s.small_string);
        } else {
            printf(s.data);
        }
    }

    There are, of course, many other ways to make string types (such as
    supporting copy-on-write), but I suspect that Mitch is thinking of style String2 while Keith is thinking of style String1.



    In the implementation I'm referring to, std::string happens to be 32
    bytes in size.  If the string has a length of 15 or less, the string
    data is stored directly in the std::string object (in the last 16 bytes
    as it happens).  If the string is longer than that it's stored
    elsewhere, and those 16 bytes are presumably used to manage the
    heap-allocated data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Fri Jan 17 16:42:17 2025
    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    These may be points that you are looking at for your embedded work,
    but the average programmer does not.

    An example, Fortran-specific: Fortran 2018 made all procedures
    recursive by default. This means that some Fortran codes will start
    crashing because of stack overruns when this is implemented :-(

    You don't allocate
    anything on the heap without knowing the bounds and being sure it is appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    What would you recommend as a limit? (See -fmax-stack-var-size=N
    in gfortran).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Fri Jan 17 18:02:28 2025
    David Brown wrote:
    On 17/01/2025 03:10, MitchAlsup1 wrote:
    On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the
    dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU
    implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!
    I don't understand.  What pointer are you referring to?

    The pointer which would have had to point elsewhere had the string
    not been contained within.


    There are a couple of ways you can do "small string optimisation".  One would be to have a structure something like this :

    struct String1 {
        size_t capacity;
        char * data;
        char small_string[16];
    }

    Then "data" would point to "small_string" for a capacity of 16, and if
    that's not enough, use malloc to allocate more space.


    An alternative would be to have something like this (I'm being /really/ sloppy with alignments, rules for unions, and so on - this is
    illustrative only, not real code!) :

    struct String2 {
        bool is_small;
        union {
            char small_string[31];
            struct {
                size_t capacity;
                char * data;
            }
        }
    }

    This second version lets you put more characters in the local
    small_string area, reusing space that would otherwise be used for the pointer and capacity.  But it has more runtime overhead when using the string :

        void print1(String1 s) {
            printf(s.data);
        }

        void print2(String2 s) {
            if (s.is_small) {
                printf(s.small_string);
            } else {
                printf(s.data);
            }
        }

    There are, of course, many other ways to make string types (such as supporting copy-on-write), but I suspect that Mitch is thinking of style String2 while Keith is thinking of style String1.

    All Vec<> types have a 3-word descriptor, with the first and second word
    being a pointer to the data and the current length, while the allocated
    size is stored in the third word.

    This is a total of 24 bytes, so quite a bit of overhead if you just need
    a few bytes.

    In the Rust Fast/Small vector type, they could use the top bit of the
    size field (no Vec<> object can be larger than or equal to 2^63 bytes), then
    they need 4 or 5 bits for the actual length (but using 7 is easier),
    leaving 23 bytes for the embedded data. With little-endian storage this corresponds to the last byte of the 24-byte block.
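
    (A C sketch of that layout idea - 24 bytes total, with the tag kept in
    the last byte so the inline form leaves 23 bytes of payload; this is an
    illustration of the scheme, not the actual Rust implementation:)

    #include <stdint.h>

    union small_vec {                 /* 24 bytes on a typical 64-bit target */
        struct {                      /* heap form */
            uint8_t  *data;
            uint64_t  len;
            uint64_t  cap;            /* top bit clear: cap < 2^63 */
        } heap;
        struct {                      /* inline form */
            uint8_t bytes[23];        /* embedded data */
            uint8_t tag_and_len;      /* top bit set, low bits = inline length */
        } small;
    };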

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Fri Jan 17 18:21:26 2025
    On 17/01/2025 17:42, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.


    That would be another way of saying you have no idea when your program
    is going to blow up from lack of stack space. You don't need VLAs to
    cause such problems.

    In reality, you /do/ know a fair amount. Often your knowledge is
    approximate - you know you are not going to need anything like as much
    stack as the system provides, and you don't worry about it. In other situations (such as in small embedded systems), you think about it all
    the time - again, regardless of any VLAs.

    If you are in a position where you suspect you might be pushing close to
    the limits of your stack, "standard" doesn't come into it - you are
    dealing with a particular target, and you can use whatever functions or
    support that target provides.

    These may be points that you are looking at for your embedded work,
    but the average programmer does not.


    The average programmer can think "I've got megabytes of stack. There's
    no problem with VLAs of several KB." That's often fine - all you need
    to do is be sure that your VLAs are no more than a few KB in size.
    Your code is as safe (in this aspect) as pretty much any other piece of
    code on the platform.

    An example, Fortran-specific: Fortran 2018 made all procedures
    recursive by default. This means that some Fortran codes will start
    crashing because of stack overruns when this is implemented :-(

    You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    What would you recommend as a limit? (See -fmax-stack-var-size=N
    in gfortran).

    Maybe 50 KB ? It's going to depend highly on the code. Obviously if
    you have a recursive function, you are not going to want a big stack
    frame. But for occasional one-off use, big stack frames are fine -
    VLA's, fixed arrays, or anything else. Once you are getting bigger than
    that, the overhead of malloc is probably negligible in measurable
    performance. (There are others here who could do a far better job than
    I at estimating that accurately - I am more concerned with what
    influences code reliability.)

    (gcc has stack frame limit flags for C and C++ too. And it can generate reports on function stack usage. In all cases, it can only know about
    limits if they are fixed, or at least limited, at compile time.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Fri Jan 17 20:08:30 2025
    David Brown <david.brown@hesbynett.no> schrieb:
    On 17/01/2025 17:42, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.


    That would be another way of saying you have no idea when your program
    is going to blow up from lack of stack space. You don't need VLAs to
    cause such problems.

    I may know a few things (or I can find out), but the general user,
    especially somebody who writes scientific software in Fortran,
    in general does not.

    In reality, you /do/ know a fair amount. Often your knowledge is
    approximate - you know you are not going to need anything like as much
    stack as the system provides, and you don't worry about it. In other situations (such as in small embedded systems), you think about it all
    the time - again, regardless of any VLAs.

    If you are in a position where you suspect you might be pushing close to
    the limits of your stack, "standard" doesn't come into it - you are
    dealing with a particular target, and you can use whatever functions or support that target provides.

    Again, try look at it from the viewpoint of somebody who writes
    scientific or technical code, and for whom such code should
    "just work". Also look at it from the viewpoint of somebody who
    co-maintains a compiler for such people.

    gfortran has the -fstack-arrays option, which can bring very
    large performance improvements - 50% in some real-world code.
    Do I know what code users are writing? Not in the least,
    unless they provide bug reports.

    And a stack overflow has the most unfriendly user interface of all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Keith Thompson on Fri Jan 17 19:27:25 2025
    On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU
    implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!

    I don't understand. What pointer are you referring to?

    In the implementation I'm referring to, std::string happens to be 32
    bytes in size. If the string has a length of 15 or less, the string
    data is stored directly in the std::string object (in the last 16 bytes
    as it happens). If the string is longer than that it's stored
    elsewhere, and those 16 bytes are presumably used to manage the
    heap-allocated data.

    So, when it is stored elsewhere, how do you get from the struct to the
    string ??
    You use a pointer !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Lurndal on Sun Jan 19 18:49:00 2025
    In article <r1fiP.189541$FOb4.58758@fx15.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    On all the linux systems I use, the stack limit defaults to 8192KB.

    That includes RHEL, Fedora, CentOS, Scientific Linux and Ubuntu.

    Now, that's for the primary thread stack, for which the OS
    manages the growth. For other threads in the process,
    the size varies based on the threads library in use
    and whether the application is compiled for 32-bit or
    64-bit systems.

    The library I work on documents the required stack sizes for threads that
    enter it, and for the threads it creates. Just another of the details one
    has to take care of. We didn't think of it when the project was started,
    but that was forty years ago this year.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Terje Mathisen on Mon Jan 20 12:29:13 2025
    On 1/16/2025 11:12 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERI targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a
    header which contains the starting point and current length, along
    with allocated size. For multidimensional work, the natural mapping
    is Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication to get the actual position:

       array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler
    generate code for that itself?


    Because Rust really doesn't have multi-dim vectors, instead using
    vectors of pointers to vectors?

    OTOH, it is perfectly OK to create your own multi-dim data structure,
    and using macros you can probably get the compiler to generate near-
    optimal code as well, but afaik, nothing like that is part of the core language.

    That surprised me. So I did a search for "Rust Multi dimensional
    arrays", and got several hits. It seems there are various ways to do
    this depending upon whether you want an array of arrays or a
    "traditional" multi-dimensional array. There is a crate for the latter.

    I don't know enough Rust to get all the details in the various search
    results, but it seems there are options.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tkoenig@netcologne.de on Tue Jan 21 20:30:47 2025
    On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle also big allocations.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    Not standard compliant for sure, but you certainly can approximate
    stack use in C: just store (as byte*) the address of a local in your
    top level function, and check the (absolute value of) the difference
    to the address of a local in the current function.

    The bigger problem is knowing how much stack is available to use -
    there may be no way (or no easy way) to find the actual size ... or
    the limit if the stack expands ... and circumstances beyond the
    program may have limited it to be smaller than the program requested.
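
    A minimal C sketch of that approximation, assuming a downward-growing
    stack (it is not strictly conforming, since it compares pointers to
    different objects): record a reference address in the top-level
    function, then compare it against a local further down the call chain:

    #include <stdio.h>
    #include <stddef.h>

    static char *stack_ref;                 /* address of a local in the top-level function */

    static size_t approx_stack_used(void)
    {
        char probe;
        ptrdiff_t d = stack_ref - &probe;   /* positive if the stack grows downward */
        return (size_t)(d < 0 ? -d : d);
    }

    static void work(int depth)
    {
        char pad[256];                      /* burn some stack in each frame */
        pad[0] = (char)depth;
        if (depth > 0)
            work(depth - 1);
        else
            printf("approx. stack in use: %zu bytes (%d)\n",
                   approx_stack_used(), pad[0]);
    }

    int main(void)
    {
        char top;
        stack_ref = &top;
        work(100);
        return 0;
    }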


    These may be points that you are looking at for your embedded work,
    but the average programmer does not.

    An example, Fortran-specific: Fortran 2018 made all procedures
    recursive by default. This means that some Fortran codes will start
    crashing because of stack overruns when this is implemented :-(

    You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    What would you recommend as a limit? (See fmax-stack-var-size=N
    in gfortran).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to George Neuner on Wed Jan 22 02:19:57 2025
    On Wed, 22 Jan 2025 1:30:47 +0000, George Neuner wrote:

    On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle also big allocations.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    Not standard compliant for sure, but you certainly can approximate
    stack use in C: just store (as byte*) the address of a local in your
    top level function, and check the (absolute value of) the difference
    to the address of a local in the current function.

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.
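
    A rough, Linux-specific sketch of that idea (illustrative only; it is
    an approximation, since other process-startup data also sits above the
    last envp entry):

    #include <stdio.h>
    #include <stddef.h>

    int main(int argc, char **argv, char **envp)
    {
        char **p = envp;
        while (*p != NULL)          /* walk to the last envp[] entry */
            p++;

        char local;                 /* stand-in for the current stack pointer */
        ptrdiff_t span = (char *)p - &local;

        printf("argc=%d, approx. stack span: %td bytes\n", argc, span);
        (void)argv;
        return 0;
    }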

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Wed Jan 22 14:15:56 2025
    Stephen Fuld wrote:
    On 1/16/2025 11:12 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERI targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a
    header which contains the starting point and current length, along
    with allocated size. For multidimensional work, the natural mapping
    is Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication to get the actual position:

        array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler
    generate code for that itself?


    Because Rust really doesn't have multi-dim vectors, instead using
    vectors of pointers to vectors?

    OTOH, it is perfectly OK to create your own multi-dim data structure,
    and using macros you can probably get the compiler to generate near-
    optimal code as well, but afaik, nothing like that is part of the core
    language.

    That surprised me.  So I did a search for "Rust Multi dimensional
    arrays", and got several hits.  It seems there are various ways to do
    this depending upon whether you want an array of arrays or a
    "traditional" multi-dimensional array. There is a crate for the latter.

    I don't know enough Rust to get all the details in the various search results, but it seems there are options.

    Notice what I wrote above, Rust allows for compile-time code generation
    in the form of macros which are in some ways even more powerful than
    C++ templates, so I'm not surprised to learn that there already exist
    public crate(s) to handle this. :-)

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jan 22 14:58:04 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 1:30:47 +0000, George Neuner wrote:

    On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle also big allocations.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    Not standard compliant for sure, but you certainly can approximate
    stack use in C: just store (as byte*) the address of a local in your
    top level function, and check the (absolute value of) the difference
    to the address of a local in the current function.

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jan 22 17:45:47 2025
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 1:30:47 +0000, George Neuner wrote:

    On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".
    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle also big allocations.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    Not standard compliant for sure, but you certainly can approximate
    stack use in C: just store (as byte*) the address of a local in your
    top level function, and check the (absolute value of) the difference
    to the address of a local in the current function.

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Wed Jan 22 18:44:14 2025
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Notice what I wrote above, Rust allows for compile-time code generation
    in the form of macros which are in some ways even more powerful than
    C++ templates, so I'm not surprised to learn that there already exist
    public crate(s) to handle this. :-)

    That sounds scary; C++ templates are already Turing-complete...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jan 22 20:00:30 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.
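
    On Linux with glibc there is also a more direct route, sketched below:
    pthread_getattr_np() (a GNU extension, not POSIX) reports the current
    thread's stack region, including the main thread's:

    /* Build with -pthread.  Illustrative only. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <pthread.h>

    int main(void)
    {
        pthread_attr_t attr;
        void *stack_addr;
        size_t stack_size;

        pthread_getattr_np(pthread_self(), &attr);
        pthread_attr_getstack(&attr, &stack_addr, &stack_size);
        pthread_attr_destroy(&attr);

        printf("stack: %p .. %p (%zu bytes)\n",
               stack_addr, (char *)stack_addr + stack_size, stack_size);
        return 0;
    }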

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jan 22 22:25:33 2025
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jan 22 22:44:45 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.
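
    A POSIX-only sketch of that probing approach, assuming a
    downward-growing stack and recovery via sigsetjmp()/siglongjmp();
    illustrative, not portable:

    #include <stdio.h>
    #include <signal.h>
    #include <setjmp.h>
    #include <unistd.h>
    #include <stdint.h>

    static sigjmp_buf jb;

    static void on_fault(int sig)
    {
        (void)sig;
        siglongjmp(jb, 1);          /* unwind out of the faulting access */
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = on_fault;
        sa.sa_flags = 0;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGBUS, &sa, NULL);       /* some systems fault with SIGBUS */

        long pagesz = sysconf(_SC_PAGESIZE);
        char local;
        /* first page boundary above a current local (downward-growing stack) */
        volatile char *probe =
            (volatile char *)(((uintptr_t)&local + (uintptr_t)pagesz)
                              & ~((uintptr_t)pagesz - 1));

        if (sigsetjmp(jb, 1) == 0) {
            for (;;) {
                (void)*probe;               /* read; eventually hits the guard page */
                probe += pagesz;
            }
        }
        printf("approx. stack base just below %p\n", (void *)probe);
        return 0;
    }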

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Thu Jan 23 01:39:29 2025
    On Wed, 22 Jan 2025 22:44:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and
    subtract SP from it.

    I would discourage programmers from relying on that for any
    reason whatsoever. The aux vectors are pushed before the envp
    entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.


    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.
    Of those that do support signals, not every one supports catching
    SIGSEGV.
    Of those that do support catching SIGSEGV, not every one can recover
    after that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Thu Jan 23 01:00:49 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 22 Jan 2025 22:44:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and
    subtract SP from it.

    I would discourage programmers from relying on that for any
    reason whatsoever. The aux vectors are pushed before the envp
    entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.


    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    Linux and MAC can't touch windows in terms of volume - but I'd
    argue that in the universe of programmers, they're close to
    if not equal to windows. The vast vast majority of windows
    users don't have a compiler. Those that do are working at
    a higher level where the knowledge of the stack base address
    would not be a useful value to know.

    Unix (bsd/sysv) and linux support the ucontext argument
    on the signal handler which provides processor state so
    the signal handler can recover from the fault in whatever
    fashion makes sense then transfer control to a known
    starting point (either siglongjmp or by manipulating the
    return context provided to the signal handler). This is
    clearly going to be processor and implementation specific.

    Yes, Windows is an aberration. I offered a solution, not
    "the" solution. I haven't seen any valid reason for a program[*]
    to need to know the base address of the process stack; if there
    were a need, the implementation would provide. I believe windows
    does have a functional equivalent to SIGSEGV, no? A quick search
    shows "EXCEPTION_ACCESS_VIOLATION" for windows.

    [*] Leaving aside the rare system utility or diagnostic
    utility or library (e.g. valgrind, et alia may find
    that a useful datum).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Thu Jan 23 08:14:52 2025
    Michael S <already5chosen@yahoo.com> writes:
    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    "man raise" tells me that raise() is C99. "man signal" tells me that
    signal() is C99.


    Of those that do support signals, not every one supports catching
    SIGSEGV.

    "man 7 signal" tells me that SIGSEGV is P1990, i.e., 'the original
    POSIX.1-1990 standard'. I.e., there were even some Windows systems
    that support it.

    Of those that do support catching SIGSEGV, not every one can recover
    after that.

    Gforth catches and recovers from SIGSEGV in order to return to
    Gforth's command line rather than terminating the process; in
    snapshots from recent years that's also used for determining whether
    some number is probably an address (try to read from that address; if
    there's a signal, it's not an address). I tried building Gforth on a
    number of Unix systems, and even the most rudimentary ones (e.g.,
    Ultrix), supported catching SIGSEGV. There is a port to Windows with
    Cygwin done by Bernd Paysan. I don't know if that could catch
    SIGSEGV, but I am sure that it's possible in Windows in some way, even
    if that way is not available through Cygwin.
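
    A minimal sketch of that probe pattern (not Gforth's actual code, just
    the shape of it): install a SIGSEGV handler that siglongjmp()s, attempt
    the read, and report whether it faulted:

    #include <stdio.h>
    #include <signal.h>
    #include <setjmp.h>

    static sigjmp_buf probe_jb;

    static void probe_handler(int sig)
    {
        (void)sig;
        siglongjmp(probe_jb, 1);
    }

    /* Returns 1 if reading one byte from p did not fault, 0 otherwise. */
    static int looks_like_address(const void *p)
    {
        struct sigaction sa, old;
        int ok;

        sa.sa_handler = probe_handler;
        sa.sa_flags = 0;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, &old);

        if (sigsetjmp(probe_jb, 1) == 0) {
            volatile const char *c = p;
            (void)*c;                        /* may fault */
            ok = 1;
        } else {
            ok = 0;                          /* faulted: probably not a valid address */
        }
        sigaction(SIGSEGV, &old, NULL);      /* restore the previous handler */
        return ok;
    }

    int main(void)
    {
        int x = 42;
        printf("&x:           %d\n", looks_like_address(&x));
        printf("(void *)4095: %d\n", looks_like_address((void *)4095));
        return 0;
    }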

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Thu Jan 23 11:52:32 2025
    On Thu, 23 Jan 2025 01:00:49 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 22 Jan 2025 22:44:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and
    subtract SP from it.

    I would discourage programmers from relying on that for any
    reason whatsoever. The aux vectors are pushed before the
    envp entries.

    This brings into question what is "on" the stack ?? to be
    included in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??


    It's not something that a programmer generally would need, or want
    to do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.


    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    Linux and MAC can't touch windows in terms of volume - but I'd
    argue that in the universe of programmers, they're close to
    if not equal to windows. The vast vast majority of windows
    users don't have a compiler. Those that do are working at
    a higher level where the knowledge of the stack base address
    would not be a useful value to know.


    I did not have "big" computers in mind. In fact, if we only look at
    "big" things then Android dwarfs anything else. And while Android is not
    POSIX compliant, it is probably similar enough for your method to work.

    I had in mind smaller things.
    All but one of very many embedded environments that I touched in
    last 3 decades had no signals. The exceptional one was running
    Linux.

    Unix (bsd/sysv) and linux support the ucontext argument
    on the signal handler which provides processor state so
    the signal handler can recover from the fault in whatever
    fashion makes sense then transfer control to a known
    starting point (either siglongjmp or by manipulating the
    return context provided to the signal handler). This is
    clearly going to be processor and implementation specific.

    Yes, Windows is an aberration. I offered a solution, not
    "the" solution. I haven't seen any valid reason for a program[*]
    to need to know the base address of the process stack; if there
    were a need, the implementation would provide. I believe windows
    does have a functional equivalent to SIGSEGV, no? A quick search
    shows "EXCEPTION_ACCESS_VIOLATION" for windows.


    But then one would have to use SEH which is not the same as signals.
    Although a specific case of SIGSEGV is the one where the SEH and
    signals happen to be rather similar.
    I can try it one day, but not today.

    [*] Leaving aside the rare system utility or diagnostic
    utility or library (e.g. valgrind, et alia may find
    that a useful datum).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Thu Jan 23 12:23:37 2025
    On Thu, 23 Jan 2025 08:14:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    "man raise" tells me that raise() is C99. "man signal" tells me that signal() is C99.


    I would guess that it belongs to the part of the standard that defines requirements for hosted implementation. My use of C for "real work"
    (as opposed to hobby) is almost exclusively in freestanding
    implementations.

    Even for hosted implementations, the signal handler is only guaranteed
    to be invoked when the signal is raised by raise(). It is not our case.


    Of those that do support signals, not every one supports catching
    SIGSEGV.

    "man 7 signal" tells me that SIGSEGV is P1990, i.e., 'the original POSIX.1-1990 standard'. I.e., there were even some Windows systems
    that support it.

    Of those that do support catching SIGSEGV, not every one can recover
    after that.

    Gforth catches and recovers from SIGSEGV in order to return to
    Gforth's command line rather than terminating the process; in
    snapshots from recent years that's also used for determining whether
    some number is probably an address (try to read from that address; if
    there's a signal, it's not an address). I tried building Gforth on a
    number of Unix systems, and even the most rudimentary ones (e.g.,
    Ultrix), supported catching SIGSEGV.

    From cppreference: https://en.cppreference.com/w/c/program/signal
    "If the user defined function returns when handling SIGFPE, SIGILL or
    SIGSEGV, the behavior is undefined."

    There is a port to Windows with
    Cygwin done by Bernd Paysan. I don't know if that could catch
    SIGSEGV, but I am sure that it's possible in Windows in some way, even
    if that way is not available through Cygwin.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Thu Jan 23 12:39:14 2025
    Michael S <already5chosen@yahoo.com> writes:
    From cppreference: https://en.cppreference.com/w/c/program/signal
    "If the user defined function returns when handling SIGFPE, SIGILL or
    SIGSEGV, the behavior is undefined."

    As is almost everything else occurring in production code. So such
    references are not particularly relevant for production code; what
    actual (in this case) operating system kernels and libraries do is
    relevant.

    And my experience from three decades across a wide variety of Unix
    systems on a wide variety of hardware is that what we do in our
    SIGSEGV handler works. But our signal handlers don't return, they
    longjmp() (in the cases that do not terminate the process).

    AFAIK returning would usually try to reexecute the segfaulting
    instruction, which would be the right thing if we had eliminated the
    cause for the SIGSEGV in the signal handler, but we don't do that in
    Gforth. Continuing with the next instruction would not be very useful
    for us, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Jan 23 14:04:24 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    From cppreference: https://en.cppreference.com/w/c/program/signal
    "If the user defined function returns when handling SIGFPE, SIGILL or
    SIGSEGV, the behavior is undefined."

    As is almost everything else occurring in production code. So such
    references are not particularly relevant for production code; what
    actual (in this case) operating system kernels and libraries do is
    relevant.

    Indeed. And 'behavior is undefined' applies to the C specification;
    an implementation may certainly "define" that behavior and
    programmers using that implementation may rely on that definition.


    And my experience from three decades across a wide variety of Unix
    systems on a wide variety of hardware is that what we do in our
    SIGSEGV handler works. But our signal handlers don't return, they
    longjmp() (in the cases that do not terminate the process).

    Or siglongjmp(). Or using implementation-defined (or POSIX defined) capabilities (e.g. manipulating the process/thread context supplied to POSIX signal handlers).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Thu Jan 23 14:31:22 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 23 Jan 2025 08:14:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    "man raise" tells me that raise() is C99. "man signal" tells me that
    signal() is C99.

    I would guess that it belongs to the part of the standard that defines
    requirements for hosted implementation. My use of C for "real work"
    (as opposed to hobby) is almost exclusively in freestanding
    implementations.

    In free-standing implementations, you must set the stack
    pointer yourself[*], so you implicitly know the stack start
    and stack bounds. You don't need to use the SIGSEGV
    technique that was described for hosted programs
    to find the base address of the stack (if there is no
    implementation-defined API that will provide the data).

    [*] As well as providing all the other needed state that a hosted
    implementation might provide.

    I've written a fair amount of non-hosted code myself (hypervisors,
    Operating Systems, standalone diagnostics) - the programmer needs
    to initialize the machine state (often in assembler) then establish
    the state required by the code (stack, initial register state,
    establishing protected mode, paging and long-mode on x86/x86_64
    systems, etc). None of this is using facilities defined by the
    C standard. Both hypervisors (at SGI and 3Leaf) were actually
    written in C++ - our platform initialization code also needed
    to ensure that static constructors were invoked prior to invoking
    the C++ code amongst other initializations (such as clearing
    the BSS region).

    Similar work needs to be done (either by you, or by the framework
    provided by the toolset provider such as greenhills et alia).


    Even for hosted implementations, the signal handler is only guaranteed
    to be invoked when the signal is raised by raise(). It is not our case.

    POSIX hosted implementations have guarantees beyond those provided by
    the C standard, including related to signal delivery and handling,
    and any implementation can provide guarantees beyond those described
    in the C standard.

    Standalone code is almost by definition non-portable.

    e.g.
    SS_DATA=0x18

            .text
            .global dvmmstart
    dvmmstart:
            #
            # Get processor into known state.
            #
            cld
            cli
            movl    %eax, %esi
            movl    $SS_DATA, %eax
            movl    %eax, %ds
            movl    %eax, %es
            movl    %eax, %fs
            movl    %eax, %gs

            lss     stack_segdesc, %esp

            #
            # Clear BSS
            #
            xorl    %eax, %eax
            movl    $_edata, %edi
            movl    $_end, %ecx
            subl    %edi, %ecx
            rep     stosb

            #
            # Invoke C++ code. Pass begin and end address of memory map.
            #
            movl    $512, %eax          # Starting with 512 bytes
            subl    %esi, %eax          # Subtract remaining
            addl    $0x90000, %eax      # e820 data map address
            pushl   %eax                # Push arg to main
            pushl   $0x90000            # Push arg to main
            call    dvmm_main           # Invoke main

            #
            # Should never return.
            #
            hlt
    stack:
            .space  4096, 0
    stacktop:

    stack_segdesc:
            .long   stacktop
            .word   SS_DATA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Michael S on Thu Jan 23 17:41:16 2025
    On Thu, 23 Jan 2025 11:52:32 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Thu, 23 Jan 2025 01:00:49 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 22 Jan 2025 22:44:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and
    subtract SP from it.

    I would discourage programmers from relying on that for any
    reason whatsoever. The aux vectors are pushed before the
    envp entries.

    This brings into question what is "on" the stack ?? to be
    included in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??


    It's not something that a programmer generally would need, or
    want to do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.


    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    Linux and MAC can't touch windows in terms of volume - but I'd
    argue that in the universe of programmers, they're close to
    if not equal to windows. The vast vast majority of windows
    users don't have a compiler. Those that do are working at
    a higher level where the knowledge of the stack base address
    would not be a useful value to know.


    I did not have "big" computers in mind. In fact, if we only look at
    "big" things then Android dwarfs anything else. And while Android is
    not POSIX complaint, it is probably similar enough for your method to
    work.

    I had in mind smaller things.
    All but one of very many embedded environments that I touched in
    last 3 decades had no signals. The exceptional one was running
    Linux.

    Unix (bsd/sysv) and linux support the ucontext argument
    on the signal handler which provides processor state so
    the signal handler can recover from the fault in whatever
    fashion makes sense then transfer control to a known
    starting point (either siglongjmp or by manipulating the
    return context provided to the signal handler). This is
    clearly going to be processor and implementation specific.

    Yes, Windows is an aberration. I offered a solution, not
    "the" solution. I haven't seen any valid reason for a program[*]
    to need to know the base address of the process stack; if there
    were a need, the implementation would provide. I believe windows
    does have a functional equivalent to SIGSEGV, no? A quick search
    shows "EXCEPTION_ACCESS_VIOLATION" for windows.


    But then one would have to use SEH which is not the same as signals.
    Although a specific case of SIGSEGV is the one where the SEH and
    signals happen to be rather similar.
    I can try it one day, but not today.

    [*] Leaving aside the rare system utility or diagnostic
    utility or library (e.g. valgrind, et alia may find
    that a useful datum).



    In the end, I could not resist and did it today, wasting an hour
    and a half during which I was supposed to do real work.
    With Microsoft's language extensions it was trivial.
    But I don't know how to do it (on Windows) with gcc.

    Code:

    #include <stdio.h>

    /* Recurse until the stack guard page is hit, recording the address
       of a local variable at each depth reached. */
    static void test(char** res, int depth) {
        *res = (char*)&res;
        if (depth > 0)
            test(res, depth-1);
    }

    int main()
    {
        char* res = (char*)&res;
        __try { test(&res, 1000000); }
        __except(1) { // 1==EXCEPTION_EXECUTE_HANDLER
            printf("SEH __except block\n");
        }
        printf("%p - %p = %zd\n", &res, res, (char*)&res - res);
        return 0;
    }


    It prints:
    SEH __except block
    000000000029F990 - 00000000001A4020 = 1030512
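
    If I recall correctly, there is also a compiler-agnostic route that
    avoids SEH entirely: Windows 8 and later expose
    GetCurrentThreadStackLimits(), which reports the reserved stack range
    of the current thread. A hedged sketch:

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        ULONG_PTR lo, hi;
        char local;

        /* Reports the reserved stack range of the current thread (Win8+). */
        GetCurrentThreadStackLimits(&lo, &hi);

        printf("stack range %p .. %p (%llu bytes reserved)\n",
               (void *)lo, (void *)hi, (unsigned long long)(hi - lo));
        printf("current frame is %llu bytes below the top\n",
               (unsigned long long)(hi - (ULONG_PTR)&local));
        return 0;
    }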

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Thu Jan 23 11:50:22 2025
    On Wed, 22 Jan 2025 22:44:45 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    https://plover.com/~mjd/misc/hbaker-archive/CheneyMTA.html

    or any problem requiring potentially unbounded recursion.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to George Neuner on Thu Jan 23 17:18:32 2025
    George Neuner <gneuner2@comcast.net> writes:
    On Wed, 22 Jan 2025 22:44:45 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:


    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    Note the word "generally".


    https://plover.com/~mjd/misc/hbaker-archive/CheneyMTA.html

    or any problem requiring potentially unbounded recursion.

    For which the standard unix resource limits are usually
    sufficient.
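
    For reference, a minimal POSIX sketch of checking that limit with
    getrlimit(RLIMIT_STACK):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_STACK, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        /* rlim_cur is the soft limit; RLIM_INFINITY means "unlimited". */
        printf("stack soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
        return 0;
    }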

    Henry's 'scheme' is not typical of the vast majority of
    programs. Even Henry notes that the macro for his
    scheme (pun intended) is machine dependent.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Michael S on Thu Jan 23 14:22:24 2025
    Michael S wrote:

    In the end, I could not resist and did it today, wasting an hour
    and a half during which I was supposed to do real work.
    With Microsoft's language extensions it was trivial.
    But I don't know how to do it (on Windows) with gcc.

    Code:

    static void test(char** res, int depth) {
        *res = (char*)&res;
        if (depth > 0)
            test(res, depth-1);
    }

    int main()
    {
        char* res = (char*)&res;
        __try { test(&res, 1000000); }
        __except(1) { // 1==EXCEPTION_EXECUTE_HANDLER
            printf("SEH __except block\n");
        }
        printf("%p - %p = %zd\n", &res, res, (char*)&res - res);
        return 0;
    }


    It prints:
    SEH __except block
    000000000029F990 - 00000000001A4020 = 1030512

    To get the top of stack, just after main() is called you might be able to
    use a varargs routine to read the stack pointer. Round this up to the top
    of a 4KB page and that is likely the top of stack or near it.
    This could be stashed in a TLS variable for later.

    Something like...

    #include <stdio.h>
    #include <stdarg.h>
    #include <inttypes.h>
    #include <threads.h>

    thread_local uintptr_t stkTopPtr;

    /* The va_list set up by va_start refers to storage in or just above
       the current frame, so its numeric value approximates the stack
       pointer at the call site. */
    static uintptr_t GetStackPtr (int junk, ...)
    {
        va_list argptr;
        va_start(argptr, junk);
        uintptr_t sp = (uintptr_t) argptr;
        va_end(argptr);
        return sp;
    }

    int main (int argc, char *argv[])
    {
        uintptr_t stkPtr;

        stkPtr = GetStackPtr (1);
        stkTopPtr = stkPtr | 0xFFF;   /* round up to the last byte of the 4KB page */
        printf ("Stack top: 0x%08" PRIXPTR "\n", stkTopPtr);
        return 0;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Mon Jan 27 17:18:16 2025
    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 23 Jan 2025 08:14:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:

    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    "man raise" tells me that raise() is C99. "man signal" tells me
    that signal() is C99.

    I would guess that it belongs to the part of the standard that
    defines requirements for hosted implementation. [...]

    Right. Almost all of the standard library is not required
    for freestanding implementations, and <signal.h> is not
    among the required set.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)