David Brown <david.brown@hesbynett.no> writes:
I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
it meant small and simple instructions, rather than a small number of
different instructions. The idea was that instructions should, on the
whole, be single-cycle and implemented directly in the hardware, rather
than multi-cycle using sequencers or microcode. ...
Surely then, the PDP-8 can be counted as a RISC processor. There are
only 8 instructions defined by a 3-bit opcode, and due to the
instruction encoding, a single operate instruction can perform multiple
(sequential) operations.
By my logic (such as it is - I don't claim it is in any sense
"correct"), the PDP-8 would definitely be /CISC/. It only has a few >instructions, but that is irrelevant (that was my point) - the
instructions are complex, and therefore it is CISC.
There was a microcontroller that we once considered for a project, which
had only a single instruction - "move". We ended up with a different
chip, so I never got to play with it in practice.
According to David Brown <david.brown@hesbynett.no>:
By my logic (such as it is - I don't claim it is in any sense
"correct"), the PDP-8 would definitely be /CISC/. It only has a few
instructions, but that is irrelevant (that was my point) - the
instructions are complex, and therefore it is CISC.
Having actually programmed a PDP-8 I find this assertion hard to
understand. It's true, it had instructions that both fetched an
operand and did something with it, but with only one register what
else were they going to do?
It was very RISC-like in that you used a sequence of simple
instructions to do what would be one instruction on more complex
machines. For example, you got the effect of a load by clearing the
register (CLA) and then adding the memory word. To do subtraction,
clear, add the second operand, negate, add the first operand, maybe
negate again depending on whether you wanted A-B or B-A. We all knew a
long list of these idioms.
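To make that concrete, here is the subtraction idiom modelled in a
minimal C sketch, with the single 12-bit accumulator made explicit (the
function and helper names are mine, purely illustrative):

    #include <stdint.h>

    /* keep values within the PDP-8's 12-bit word */
    static uint16_t mask12(uint16_t x) { return x & 07777; }

    /* A - B, built from the idiom described above */
    uint16_t pdp8_sub(uint16_t a, uint16_t b)
    {
        uint16_t ac = 0;          /* CLA: clear the accumulator    */
        ac = mask12(ac + b);      /* TAD B: add the second operand */
        ac = mask12(~ac + 1);     /* CIA (CMA then IAC): negate AC */
        ac = mask12(ac + a);      /* TAD A: AC now holds A - B     */
        return ac;
    }

Clearing and then adding is likewise the "load"; swapping which operand
gets negated gives B-A instead of A-B.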
There was a microcontroller that we once considered for a project, which
had only a single instruction - "move". We ended up with a different
chip, so I never got to play with it in practice.
I saw some of those, and the one-instruction thing was a conceit. They
all had plenty of instructions, just with the details in the operand specifiers rather than the op code.
Based solely on the information Scott gave, however, I would suggest
that the "OPR" instruction - "microcoded operation" - and the "IOT"
operation would mean it certainly was not RISC. (This is true even if
it has other attributes commonly associated with RISC architectures.)
The PDP-8 certainly is simple, and it does not have many instructions,
but it certainly is NOT RISC.
Did not have a large GPR register file
Was Not pipelined
Was Not single cycle execution
Did not overlap instruction fetch with execution
Did not rely on compiler for good code performance
On 08/12/2023 16:38, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
(I'm snipping because I pretty much agree with the rest of what you wrote.)
Is Coldfire a load/store architecture? If not, it's not a RISC.
I agree that there's a fairly clear boundary between a "load/store
architecture" and a "non-load/store architecture". And I agree that it
is usually a more important distinction than the number of instructions,
or the complexity of the instructions, or any other distinctions.
But does that mean LSA vs. NLSA should be used to /define/ RISC vs CISC?
Things have changed a lot since the term "RISC" was first coined, and
maybe architectural and ISA features are so mixed that the terms "RISC"
and "CISC" have lost any real meaning.
If that's the case, then we
should simply talk about LSA and NLSA architectures, and stop using
"RISC" and "CISC".
I don't think trying to redefine "RISC" to mean
something different from its original purpose helps.
That would mean that "RISC" has an original definition; what is it?
On the other hand, a lot of RISC architectures - all the
instructions are 32 bits long, the register banks have at
least 32 registers in them, the architecture is load-store
- currently have OoO implementations. Like having hardware
floating-point, this is done to get the best possible speed
given the much larger number of transistors we can put on a
die these days.
Unlike allowing hardware floating-point, though, I think this
change strikes directly at the _raison d'être_ of RISC itself.
If RISC exists because it's designed around making pipelining
fast and efficient . . .
I don't think trying to redefine "RISC" to mean
something different from its original purpose helps.
That would mean that "RISC" has an original definition; what is it?
- anton
Of course, there's CISC-that's-almost-as-good-as-RISC (IBM's
z/Architecture, ColdFire, maybe even Mitch's MY 66000) and
there's CISC than which _anything_ else would be better
(such as the world's most popular ISA, devised by a company
that made MOS memories, and then branched out into making
some of the world's first single-chip microprocessors)...
Given that situation, where "good CISC" is relatively
minor in its market presence compared to bad, bad, very
bad CISC, some architecture designers have chosen to
incorporate as much of Patterson's original description,
if not definition, of RISC into their designs as is
practical in order to distance themselves more convincingly
from x86 and x86-64.
In designing Concertina II, which might well be described
as a half-breed architecture from Hell that hasn't made
up its mind whether to be RISC, CISC, or VLIW, even I have
been affected by that concern.
John Savard
However, Patterson also wrote a popular article about one of the
early RISC processors with which he was connected for _Scientific
American_,
On the other hand, a lot of RISC architectures - all the instructions
are 32 bits long, the register banks have at least 32 registers in
them, the architecture is load-store - currently have OoO
implementations. Like having hardware floating-point, this is done
to get the best possible speed given the much larger number of transistors
we can put on a die these days.
Unlike allowing hardware floating-point, though, I think this
change strikes directly at the _raison d'être_ of RISC itself.
If RISC exists because it's designed around making pipelining
fast and efficient, once you've got an OoO implementation, of
what benefit is RISC?
Maybe the OoO circuitry doesn't have to
work so hard,
or OoO plus 32 registers can delay register
hazards even longer than OoO plus 8 registers.
Of course, there's CISC-that's-almost-as-good-as-RISC (IBM's
z/Architecture, ColdFire,
there's CISC than which _anything_ else would be better
(such as the world's most popular ISA, devised by a company
that made MOS memories, and then branched out into making
some of the world's first single-chip microprocessors)...
some architecture designers have chosen to
incorporate as much of Patterson's original description,
if not definition, of RISC into their designs as is
practical in order to distance themselves more convincingly
from x86 and x86-64.
That would mean that "RISC" has an original definition; what is it?
See the classic "Case for the Reduced Instruction Set Computer"
Katevenis.
According to MitchAlsup <mitchalsup@aol.com>:
That would mean that "RISC" has an original definition; what is it?
See the classic "Case for the Reduced Instruction Set Computer"
Katevenis.
Do you mean this paper by Patterson and Ditzel or something else?
According to MitchAlsup <mitchalsup@aol.com>:
That would mean that "RISC" has an original definition; what is it?
See the classic "Case for the Reduced Instruction Set Computer"
Katevenis.
Do you mean this paper by Patterson and Ditzel or something else?
https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf
Also compare this paper on the 801 which starts in a very different
place but ends up with many of the same conclusions.
https://dl.acm.org/doi/pdf/10.1145/800050.801824
It appears that David Brown <david.brown@hesbynett.no> said:
Based solely on the information Scott gave, however, I would suggest
that the "OPR" instruction - "microcoded operation" - and the "IOT"
operation would mean it certainly was not RISC. (This is true even if
it has other attributes commonly associated with RISC architectures.)
The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
so this is a purely hypothetical argument.
Keep in mind that the PDP-8 was built from 1400 discrete transistors
and 10,000 diodes. It had to be simple.
The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
so this is a purely hypothetical argument.
The underlying ideas were looked at in the late 1970s, but I
believe the name RISC goes back only to the early 1980s.
Keep in mind that the PDP-8 was built from 1400 discrete transistors
and 10,000 diodes. It had to be simple.
The LGP-30, first delivered in 1956, had 113 tubes and 1450
diodes. I think it's fair to say the LGP-30 has a good
claim to being the world's first minicomputer.
The ARM A64 designers seem to have had no qualms about introducing
features like load-pair and store-pair that raise eyebrows among
purists, so if they thought they would gain enough by deviating from
A64 being a
load-store architecture, or from sticking to fixed-width instructions,
or from it having 32 registers, they would have gone there.
Apparently they did not think so, and the IPC and performance per Watt
of Firestorm indicates that they have designed well.
I actually meant "Reduced Instruction Set Computers for VLSI" K.
https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf
My memory ain't what it used to be........
Several quotes from the Ditzel paper::
By a judicious choice of the proper instruction set and the design of a corresponding
architecture, we feel that it should be possible to have a very simple instruction set
that can be very fast. This may lead to a substantial net gain in overall program
execution speed. This is the concept of the Reduced Instruction Set Computer.
Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
the results to propose a better machine, and then repeating the cycle over a dozen times.
Though the initial intent was not specifically to come up with a simple design, the result
was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
VAX [Johnson79].
John Levine <johnl@taugh.com> writes:
It appears that David Brown <david.brown@hesbynett.no> said:
Based solely on the information Scott gave, however, I would suggest
that the "OPR" instruction - "microcoded operation" - and the "IOT"
operation would mean it certainly was not RISC. (This is true even if
it has other attributes commonly associated with RISC architectures.)
The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
so this is a purely hypothetical argument.
The underlying ideas were looked at in the late 1970s, but I
believe the name RISC goes back only to the early 1980s.
Keep in mind that the PDP-8 was built from 1400 discrete transistors
and 10,000 diodes. It had to be simple.
The LGP-30, first delivered in 1956, had 113 tubes and 1450
diodes. I think it's fair to say the LGP-30 has a good
claim to being the world's first minicomputer.
According to Tim Rentsch <tr.17687@z991.linuxsc.com>:
The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
so this is a purely hypothetical argument.
The underlying ideas were looked at in the late 1970s, but I
believe the name RISC goes back only to the early 1980s.
I checked, the 801 project started in 1975. The RISC-I paper was
published in 1981 and I think they came up with the name in 1980, so
close enough.
Keep in mind that the PDP-8 was built from 1400 discrete transistors
and 10,000 diodes. It had to be simple.
The LGP-30, first delivered in 1956, had 113 tubes and 1450
diodes. I think it's fair to say the LGP-30 has a good
claim to being the world's first minicomputer.
I poked around one that was old and dead but just looking at it, you
could see what a very elegant design it was to get useful work out of
so little logic.
The Bendix G-15 had 450 tubes and 3000 diodes so it's the other
contender for the title. Both machines were introduced in 1956,
cost about the same, and were about the same size, 800lb for the LGP-30,
956lb for the G-15.
Yup. If anyone can find that Johnson tech report I'd like to read
it. Some googlage only found references to it.
Actually the 'RISC purity' of the A64 Architecture was not
likely to have ever been a consideration when choosing which
features to add to the architecture. They're in the money
making business, not some idealistic RISC business.
According to MitchAlsup <mitchalsup@aol.com>:
I actually meant "Reduced Instruction Set Computers for VLSI" K.
https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf
My memory ain't what it used to be........
Several quotes from the Ditzel paper::
By a judicious choice of the proper instruction set and the design of a corresponding
architecture, we feel that it should be possible to have a very simple instruction set
that can be very fast. This may lead to a substantial net gain in overall program
execution speed. This is the concept of the Reduced Instruction Set Computer.
Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
the results to propose a better machine, and then repeating the cycle over a dozen times.
Though the initial intent was not specifically to come up with a simple design, the result
was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
VAX [Johnson79].
Yup. If anyone can find that Johnson tech report I'd like to read it.
Some googlage only found references to it.
It seems to me there were two threads to the RISC work. IBM designed
the hardware and compiler together,
a more sophisticated version of
what Johnson did, so they were constantly trading off what they could
do in hardware and what they could do in software, usually finding
that software could do it better, e.g., splitting instruction and data
caches because they knew their compiler's code never modified
instructions.
Berkeley used the old PCC compiler which wasn't terrible but did not
do very sophisticated register allocation, so they invented hardware
register windows. In retrospect, the 801 project was right and windows
albeit clever were a bad idea. Better to use that chip area for a
bigger cache.
In article <ul4pvc$2r22$1@gal.iecc.com>, johnl@taugh.com (John Levine)
wrote:
Yup. If anyone can find that Johnson tech report I'd like to read
it. Some googlage only found references to it.
The Computer History Museum has hardcopy:
<https://www.computerhistory.org/collections/catalog/102773566>
The UK's Centre for Computing History also appears to have a copy:
<https://www.computinghistory.org.uk/det/12205/Bell-Computing-Science-Technical-Report-80-A-32-Bit-Processor-Design/>
Since they're local to me,
I've asked them if they can make me a copy.
If anyone else wants to hunt, the reference is in:
<https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf>
The author was Stephen C Johnson, who worked at Bell Labs in the 1970s,
largely on languages; he was responsible for YACC.
It seems to me there were two threads to the RISC work. IBM designed
the hardware and compiler together, ...
that software could do it better, e.g., splitting instruction and data
caches because they knew their compiler's code never modified
instructions.
Apparently they had no notion of JiT compilation and JiT caches.
According to MitchAlsup <mitchalsup@aol.com>:
It seems to me there were two threads to the RISC work. IBM designed
the hardware and compiler together, ...
that software could do it better, e.g., splitting instruction and data
caches because they knew their compiler's code never modified
instructions.
Apparently they had no notion of JiT compilation and JiT caches.
They certainly knew about JIT code since IBM sort programs have been generating comparison code since the 1960s if not longer. Back in the
olden days, particularly on machines without index registers, you
pretty much had to write code where one instruction would modify
another to do address or length calculations.
By the 1970s nobody did that, programs were loaded and didn't change
once they were loaded. If you want to do JIT, write out the JIT code,
then poke the cache to invalidate the area where you put the JIT code.
It's the same thing it did when loading a program in the first place.
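A hedged sketch in C of that "write the code, then poke the cache" step,
using the GCC/Clang cache-maintenance builtin (the function name and the
mmap flags are illustrative; real JITs also have to respect W^X
policies):

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    /* copy generated machine code into an executable buffer and make
       sure the instruction side of the cache hierarchy sees it */
    void *install_jit_code(const void *code, size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        memcpy(buf, code, len);                 /* write out the JIT code */
        __builtin___clear_cache((char *)buf,    /* invalidate that range  */
                                (char *)buf + len);
        return buf;
    }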
Quadibloc <quadibloc@servername.invalid> writes:
However, Patterson also wrote a popular article about one of the early
RISC processors with which he was connected for _Scientific American_,
A short web search on that came up empty. Do you have a reference?
there's CISC than which _anything_ else would be better (such as the
world's most popular ISA, devised by a company that made MOS memories,
and then branched out into making some of the world's first single-chip
microprocessors)...
AMD made MOS memories and some of the world's first single-chip microprocessors?
My point was that the original statement was: they knew their compiler's
code never modified instructions.
Yet a JiT compiler HAS to modify instructions.
I am not poking fun at 801 {for which I have great admiration.}
I am poking fun at the inspecificity of the statement.
mitchalsup@aol.com (MitchAlsup) writes:
The PDP-8 certainly is simple, and it does not have many instructions,
but it certainly is NOT RISC.
Did not have a large GPR register file
Was Not pipelined
Was Not single cycle execution
Did not overlap instruction fetch with execution
Did not rely on compiler for good code performance
Of course the PDP-8 is a RISC. These properties may have been
common among some RISC processors, but they don't define what
RISC is. RISC is a design philosophy, not any particular set
of architectural features.
The S/360 architecture says that one instruction can modify the one that immediately follows it. This sort of thing was very common before there
were index registers and still somewhat common in the 1960s.
What is that design philosophy supposed to be?
The PDP-8 inspired the PDP-X and eventually the Nova. I have seen a
claim that Nova led to RISC and PDP-11 to CISC, ...
when you look at
Nova and its successors, including single-chip implementations, it's
an accumulator architecture which consumed many cycles for each
instruction, but it invested hardware in fast multiplication and
division. The design seems to have further evolved in the direction
of CISC, and we can read in "The Soul of a New Machine" about the
headaches that the microarchitects of the Eclipse MV/8000 had dealing
with virtual memory etc. in that architecture, and this architecture
was replaced by the RISC 88000 architecture a decade later.
Moreover, the major "philosophy" behind the PDP-8 is probably to make
it as cheap as possible.
Anton Ertl wrote:
I don't think trying to redefine "RISC" to mean something different
from its original purpose helps.
That would mean that "RISC" has an original definition; what is it?
See the classic "Case for the Reduced Instruction Set Computer"
Katevenis.
RISC was defined before CISC was coined as its contrapositive.
On 12/9/2023 10:11 AM, MitchAlsup wrote:
Anton Ertl wrote:
I don't think trying to redefine "RISC" to mean something
different from its original purpose helps.
That would mean that "RISC" has an original definition; what is
it?
See the classic "Case for the Reduced Instruction Set Computer"
Katevenis.
RISC was defined before CISC was coined as its contrapositive.
I agree, but this illustrates some of the semantic confusion we are
having regarding defining RISC. In normal speech, the opposite of
"reduced" is "increased",
and the opposite of "complex" is "simple".
So RISC and CISC are not really on the opposite end of a scale, but
are on different scales!
If we substitute "simple" for "reduced", a lot of nice things sort of
fall out. Some examples
Requiring a single instruction length simplifies decoding, as does
having no "dependent" encoding where you can't decode a later part of
the instruction until you have decoded an earlier part.
Requiring all instructions to be single cycle simplifies the pipeline
design. I think this implies having no memory-operand (mem-op)
instructions.
Requiring no more than one memory reference per instruction (and,
relatedly, prohibiting non-aligned memory accesses) simplifies some of
the internal address-generation (agen) logic.
etc.
Of course, as time went on, both the number of gates on a chip and
our understanding of how to do things more simply increased. So we
were able to add more "complexity" to the design while keeping with
the "spirit" of simplicity. So we got multi-cycle instructions in
the CPU, not a co-processor, and non-aligned memory accesses, etc.
Under this view, the number of instructions is not the key defining
factor, but sort of a side effect of making the design "simple".
So if they had used "simple" instead of "reduced" a lot of confusion
would have been prevented. :-)
On Mon, 11 Dec 2023 12:23:51 -0800
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 12/9/2023 10:11 AM, MitchAlsup wrote:
Anton Ertl wrote:
I don't think trying to redefine "RISC" to mean something
different from its original purpose helps.
That would mean that "RISC" has an original definition; what is
it?
See the classic "Case for the Reduced Instruction Set Computer"
Katevenis.
RISC was defined before CISC was coined as its contrapositive.
I agree, but this illustrates some of the semantic confusion we are
having regarding defining RISC. In normal speech, the opposite of
"reduced" is "increased",
In the context of Reduced Instruction Set, the opposite of "reduced" is
"full" or "complete". IMHO.
On 12/11/2023 2:38 PM, Michael S wrote:
In the context of Reduced Instruction Set, the opposite of "reduced" is
"full" or "complete". IMHO.
OK, but what does "full" or "complete" mean?
There are always
instructions/functionality that could be added, so in that sense, no
instruction set is full or complete.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Looking at the genesis of the RISCs, full means the S/360 and S/370
instruction sets for the 801 project, and VAX for the Berkeley RISC
project. Not sure what full means for Stanford MIPS.
This web page suggests it was more from the other direction, they started from the compiler:
The Stanford research group had a strong background in compilers,
which led them to develop a processor whose architecture would
represent the lowering of the compiler to the hardware level, as
opposed to the raising of hardware to the software level, which had
been a long running design philosophy in the hardware industry.
https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/mips/index.html
I agree that these days RISC doesn't really mean anything beyond "not a Vax or S/360".
R's,
John
Looking at the genesis of the RISCs, full means the S/360 and S/370
instruction sets for the 801 project, and VAX for the Berkeley RISC
project. Not sure what full means for Stanford MIPS.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
What is that design philosophy supposed to be?
We don't have to guess, because DEC published an entire book about the
way they built their computers. The PDP-5 was originally a front end
system for a nuclear reactor in Canada, 12 bits both because the
analog values it was handling needed that much precision, and also
because they used ideas from the LINC lab computer. The PDP-5 is
recognizable as a cut down PDP-4 which was in turn a cheaper redesign
of the PDP-1 which was largely based on the MIT TX-0 computer built to
test core memories in the 1950s. They all had word addresses and a
single register, not surprising since that's what all scientific
computers of the era had.
The PDP-8 reimplemented the PDP-5 using newer components and packaging
so was a lot smaller and cheaper. The book says it was important that
it was the first computer small enough to sit on a lab bench, or in a
rack leaving room for other equipment.
The PDP-8 inspired the PDP-X and eventually the Nova. I have seen a
claim that Nova led to RISC and PDP-11 to CISC, ...
Moreover, the major "philosophy" behind the PDP-8 is probably to make
it as cheap as possible.
Actually, it was to build the best computer they could for the target
price, although those often end up around the same place.
Either one (1. Satisfy the requirements at the lowest cost; 2. Build
the best thing for a given price point) is a general engineering
principle. The VAX architects were certainly convinced that they
designed the best architecture for the target price, too. And VAX is
obviously not a RISC, so there is more to RISC than that philosophy.
- anton
Anton Ertl wrote:
Either one (1. Satisfy the requirements at the lowest cost; 2. Build
the best thing for a given price point) are general engineering
principles. The VAX architects were certainly convinced that they
design the best architecture for the target price, too. And VAX is
obviously not a RISC, so there is more to RISC than that philosophy.
In the days when store was more expensive than gates, VAX makes a lot
of sense--this era corresponds to instruction execution going from 10
cycles per instruction down towards 4 cycles per instruction. This era
could not be extended into the 1-instruction-per-cycle realm with a
VAX-like ISA.
On 12/10/23 11:45 AM, John Levine wrote:
According to MitchAlsup <mitchalsup@aol.com>:
:
Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
the results to propose a better machine, and then repeating the cycle over a dozen times.
Though the initial intent was not specifically to come up with a simple design, the result
was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
VAX [Johnson79].
Yup. If anyone can find that Johnson tech report I'd like to read it.
Some googlage only found references to it.
It looks like one can get a PDF from Semantic Scholar:
https://www.semanticscholar.org/paper/A-32-bit-processor-design-Johnson/5ef2b3e8a755a2c29833eba8ab61117c296d95ac
I have a PDF on my computer that I can email to anyone interested
(paaronclayton is my gmail address).
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
mitchalsup@aol.com (MitchAlsup) writes:
The PDP-8 certainly is simple, and it does not have many instructions,
but it certainly is NOT RISC.
Did not have a large GPR register file
Was Not pipelined
Was Not single cycle execution
Did not overlap instruction fetch with execution
Did not rely on compiler for good code performance
Of course the PDP-8 is a RISC. These properties may have been
common among some RISC processors, but they don't define what
RISC is. RISC is a design philosophy, not any particular set
of architectural features.
What is that design philosophy supposed to be?
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Looking at the genesis of the RISCs, full means the S/360 and S/370
instruction sets for the 801 project, and VAX for the Berkeley RISC
project. Not sure what full means for Stanford MIPS.
This web page suggests it was more from the other direction, they
started from the compiler:
The Stanford research group had a strong background in compilers,
which led them to develop a processor whose architecture would
represent the lowering of the compiler to the hardware level, as
opposed to the raising of hardware to the software level, which
had been a long running design philosophy in the hardware
industry.
https://cs.stanford.edu/people/eroberts/
courses/soco/projects/risc/mips/index.html
I agree that these days RISC doesn't really mean anything beyond
"not a Vax or S/360".
The PDP-8 is just a very small computer, with a very small instruction
set, designed before the RISC design philosophy was even conceived of.
That it was designed before is irrelevant. All that matters is
that the end result is consistent with that philosophy.
Surely people don't view the Itanium as being a RISC.
And what
about the Mill?
Tim Rentsch wrote:
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Looking at the genesis of the RISCs, full means the S/360 and
S/370 instruction sets for the 801 project, and VAX for the
Berkeley RISC project. Not sure what full means for Stanford
MIPS.
This web page suggests it was more from the other direction, they
started from the compiler:
The Stanford research group had a strong background in compilers,
which led them to develop a processor whose architecture would
represent the lowering of the compiler to the hardware level, as
opposed to the raising of hardware to the software level, which
had been a long running design philosophy in the hardware
industry.
https://cs.stanford.edu/people/eroberts/
courses/soco/projects/risc/mips/index.html
I agree that these days RISC doesn't really mean anything beyond
"not a Vax or S/360".
Surely people don't view the Itanium as being a RISC. And what
about the Mill? Is that a RISC or not?
Itanium is VLIW
Mill is Belted
Both are dependent on compiler to perform code scheduling.
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
Surely people don't view the Itanium as being a RISC.
What makes you think so?
It has a lot of RISC characteristics, in particular, it's a load/store architecture with a large number of general-purpose registers (and for
the other registers, there are also many of them, avoiding the register allocation problems that compilers tend to have with unique registers).
In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:
|It's basically a RISC with lots of special features:
On Tue, 02 Jan 2024 10:42:32 +0000, Anton Ertl wrote:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
Surely people don't view the Itanium as being a RISC.
What makes you think so?
It has a lot of RISC characteristics, in particular, it's a
load/store architecture with a large number of general-purpose
registers (and for the other registers, there are also many of
them, avoiding the register allocation problems that compilers tend
to have with unique registers). In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:
|It's basically a RISC with lots of special features:
The TMS320C6000 has individual instructions that are indeed very
similar to those of a RISC machine. But because they're grouped in
blocks of eight instructions, with a bit in each instruction to
indicate whether or not a given instruction can execute in parallel
with those that precede it, it is classed as a VLIW architecture.
Intel didn't use the term VLIW in referring to the Itanium. I guess
they didn't think that 128 bits (unlike 256 bits) was "very" long.
But that's basically what the Itanium was, even if it shared a lot of characteristics with RISC. Three instructions were grouped into a
128-bit block; possible parallelism between them was indicated
explicitly, and each of the three instructions even had a different
format from the two others.
John Savard
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Looking at the genesis of the RISCs, full means the S/360 and S/370
instruction sets for the 801 project, and VAX for the Berkeley RISC
project. Not sure what full means for Stanford MIPS.
This web page suggests it was more from the other direction, they
started from the compiler:
The Stanford research group had a strong background in compilers,
which led them to develop a processor whose architecture would
represent the lowering of the compiler to the hardware level, as
opposed to the raising of hardware to the software level, which
had been a long running design philosophy in the hardware
industry.
https://cs.stanford.edu/people/eroberts/
courses/soco/projects/risc/mips/index.html
I agree that these days RISC doesn't really mean anything beyond
"not a Vax or S/360".
Surely people don't view the Itanium as being a RISC.
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Looking at the genesis of the RISCs, full means the S/360 and S/370
instruction sets for the 801 project, and VAX for the Berkeley RISC
project. Not sure what full means for Stanford MIPS.
This web page suggests it was more from the other direction, they
started from the compiler:
The Stanford research group had a strong background in compilers,
which led them to develop a processor whose architecture would
represent the lowering of the compiler to the hardware level, as
opposed to the raising of hardware to the software level, which
had been a long running design philosophy in the hardware
industry.
https://cs.stanford.edu/people/eroberts/
courses/soco/projects/risc/mips/index.html
I agree that these days RISC doesn't really mean anything beyond
"not a Vax or S/360".
Surely people don't view the Itanium as being a RISC.
It was an Epic Risk.
The Mill is not even a paper design (I have yet to see a paper about
it), so how would I know?
- anton
On Tue, 02 Jan 2024 10:42:32 GMT, anton@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:
The Mill is not even a paper design (I have yet to see a paper about
it), so how would I know?
- anton
List of patents:
https://millcomputing.com/patents/
Ivan has said that inside information might be had with an NDA. If
you really are interested you could ask them.
According to Tim Rentsch <tr.17687@z991.linuxsc.com>:
The PDP-8 is just a very small computer, with a very small instruction
set, designed before the RISC design philosophy was even conceived of.
That it was designed before is irrelevant. All that matters is
that the end result is consistent with that philosophy.
I dunno, indirect addressing and those auto-index locations 10
to 17 don't seem so RISCful. Nor does having only one register you
have to use for everything.
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
Surely people don't view the Itanium as being a RISC.
What makes you think so?
It has a lot of RISC characteristics, in particular, it's a
load/store architecture with a large number of general-purpose
registers (and for the other registers, there are also many of
them, avoiding the register allocation problems that compilers
tend to have with unique registers). In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:
|It's basically a RISC with lots of special features:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
surely people don't view the Itanium as being a RISC?
It has a lot of RISC characteristics, in particular, it's a
load/store architecture with a large number of general-purpose
registers (and for the other registers, there are also many of
them, avoiding the register allocation problems that compilers
tend to have with unique registers). In 1999 I wrote
<http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:
|It's basically a RISC with lots of special features:
I think the same description could be said of the IBM System/360.
I don't think of System/360 as a RISC, even if a subset of it
might be.
IA64 had 3 instructions per 128-bit block, with some bits
indicating how to process the other instructions. Typically, the
instructions in the block would execute in parallel rather than
serial (so, would take a big hit in code density if the code lacked sufficient ILP, as many of these spots would hold NOPs).
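For reference, the 128-bit block ("bundle") described above holds a
5-bit template plus three 41-bit instruction slots. A small sketch of
pulling those fields out in C (the type and helper names are mine):

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } ia64_bundle;  /* low/high 64 bits */

    /* extract len bits (len <= 41 here) starting at bit pos */
    static uint64_t bundle_bits(ia64_bundle b, unsigned pos, unsigned len)
    {
        uint64_t v;
        if (pos < 64) {
            v = b.lo >> pos;
            if (pos + len > 64)
                v |= b.hi << (64 - pos);
        } else {
            v = b.hi >> (pos - 64);
        }
        return v & ((1ULL << len) - 1);
    }

    #define TEMPLATE(b) bundle_bits(b, 0, 5)             /* unit types + stops */
    #define SLOT(b, i)  bundle_bits(b, 5 + 41u*(i), 41)  /* slot 0, 1 or 2     */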
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
surely people don't view the Itanium as being a RISC?
It has a lot of RISC characteristics, in particular, it's a
load/store architecture with a large number of general-purpose
registers (and for the other registers, there are also many of
them, avoiding the register allocation problems that compilers
tend to have with unique registers). In 1999 I wrote
<http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:
|It's basically a RISC with lots of special features:
I think the same description could be said of the IBM System/360.
I don't think of System/360 as a RISC, even if a subset of it
might be.
And I don't think so, either. That's because some of the features
that the IBM 801 left out are non-RISC features, such as the EDIT
instruction. OTOH, none of the special features of IA-64 are
particularly non-RISCy, [...]
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
surely people don't view the Itanium as being a RISC?
It has a lot of RISC characteristics, in particular, it's a
load/store architecture with a large number of general-purpose
registers (and for the other registers, there are also many of
them, avoiding the register allocation problems that compilers
tend to have with unique registers). In 1999 I wrote
<http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:
|It's basically a RISC with lots of special features:
I think the same description could be said of the IBM System/360.
I don't think of System/360 as a RISC, even if a subset of it
might be.
And I don't think so, either. That's because some of the features
that the IBM 801 left out are non-RISC features, such as the EDIT
instruction. OTOH, none of the special features of IA-64 are
particularly non-RISCy, [...]
Begging the question.
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
surely people don't view the Itanium as being a RISC?
It has a lot of RISC characteristics, in particular, it's a
load/store architecture with a large number of general-purpose
registers (and for the other registers, there are also many of
them, avoiding the register allocation problems that compilers
tend to have with unique registers). In 1999 I wrote
<http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:
|It's basically a RISC with lots of special features:
I think the same description could be said of the IBM System/360.
I don't think of System/360 as a RISC, even if a subset of it
might be.
And I don't think so, either. That's because some of the features
that the IBM 801 left out are non-RISC features, such as the EDIT
instruction. OTOH, none of the special features of IA-64 are
particularly non-RISCy, [...]
Begging the question.
Ok, let's make this more precise:
|And I don't think so, either. That's because among the features that
|the IBM 801 left out are instructions that access memory, but are not
|just a load or a store, such as the EDIT instruction. OTOH, none of
|the special features of IA-64 add instructions that access memory, but
|are not just a load or a store, [...]
IA-64 followed early RISC practice in leaving out integer and FP
division (a feature which is interestingly already present in
commercial RISC, and, I think HPPA, as well as the 88000 and Power,
but in the integer case not in the purist Alpha).
- anton
Anton Ertl wrote:...
IA-64 followed early RISC practice in leaving out integer and FP
division (a feature which is interestingly already present in
commercial RISC, and, I think HPPA, as well as the 88000 and Power,
but in the integer case not in the purist Alpha).
- anton
I recall reading that the original HPPA left off MUL as it would take
multiple clocks, violating their principle of 1 clock per instruction
(and multi-cycle floating point was handled by a coprocessor).
Hewlett-Packard Precision Architecture, Aug 1986
https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf
They believed that since most MULs were by smallish values that
shift-add would be just as good. They found out they were wrong and
added MUL back in.
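As an illustration of that shift-and-add substitution (a minimal
sketch; a real compiler picks the decomposition per constant), here is
multiplication by 10 without a MUL instruction:

    #include <stdint.h>

    /* 10*x = 8*x + 2*x, i.e. two shifts and an add */
    static inline uint32_t mul_by_10(uint32_t x)
    {
        return (x << 3) + (x << 1);
    }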
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:...
IA-64 followed early RISC practice in leaving out integer and FP
division (a feature which is interestingly already present in
commercial RISC, and, I think HPPA, as well as the 88000 and Power,
but in the integer case not in the purist Alpha).
- anton
I recall reading that the original HPPA left off MUL as it would take
multiple clocks, violating their principle of 1 clock per instruction
(and multi-cycle floating point was handled by a coprocessor).
Hewlett-Packard Precision Architecture, Aug 1986
https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf
They believed that since most MULs were by smallish values that
shift-add would be just as good. They found out they were wrong and
added MUL back in.
On page 18 I find what I meant:
|The architected assist instruction extensions include integer
|multiply and divide functions for applications requiring higher
|frequencies of multiplication and division.
IIRC early HPPA implementations implemented these instructions by
transfering the integer data to the FPU, using the multiplier or
divider there, and transferring the result back to an integer
register. At least I remember reading one paper that described it
this way.
It seems to me that the integer instruction set was first developed
without considering the existence of an FPU, and then once they
considered the FPU, they added the assist instruction extensions
mentioned above. I wonder if there were any HPPA implementations that
did not have these assist instruction extensions.
- anton
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:...
IA-64 followed early RISC practice in leaving out integer and FP
division (a feature which is interestingly already present in
commercial RISC, and, I think HPPA, as well as the 88000 and Power,
but in the integer case not in the purist Alpha).
- anton
I recall reading that the original HPPA left off MUL as it would take
multiple clocks, violating their principle of 1 clock per instruction
(and multi-cycle floating point was handled by a coprocessor).
Hewlett-Packard Precision Architecture, Aug 1986
https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf
They believed that since most MULs were by smallish values that
shift-add would be just as good. They found out they were wrong and
added MUL back in.
On page 18 I find what I meant:
|The architected assist instruction extensions include integer
|multiply and divide functions for applications requiring higher
|frequencies of multiplication and division.
IIRC early HPPA implementations implemented these instructions by
transfering the integer data to the FPU, using the multiplier or
divider there, and transferring the result back to an integer
register. At least I remember reading one paper that described it
this way.
It seems to me that the integer instruction set was first developed
without considering the existence of an FPU, and then once they
considered the FPU, they added the assist instruction extensions
mentioned above. I wonder if there were any HPPA implementations that
did not have these assist instruction extensions.
- anton
DIV was (thankfully!) separate from FDIV, so it was slow (about 40 clock
cycles for a 32-bit divisor running a bit/iteration sequencer) but correct.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[Pentium]
DIV was (thankfully!) separate from FDIV, so it was slow (about 40 clock
cycles for a 32-bit divisor running a bit/iteration sequencer) but correct.
Maybe they would have noticed the bug in the FP divider much earlier
if they had used the FP divider for integer division.
Anton Ertl wrote:
Maybe they would have noticed the bug in the FP divider much earlier
if they had used the FP divider for integer division.
That is of course possible, but maybe not very likely?
The integer part had 32-bit registers, so an integer DIV would
concatenate EDX:EAX into a 64-bit source before dividing by a 32-bit
source (that had to be larger than the EDX part) and still trigger one
of the 5 missing SRT table entries. When the bug was originally found,
it only generated wrong results in the last 4-5 bits, so most of these
would still have given the same 32-bit result.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
Maybe they would have noticed the bug in the FP divider much earlier
if they had used the FP divider for integer division.
That is of course possible, but maybe not very likely?
The integer part had 32-bit registers, so an integer DIV would
concatenate EDX:EAX into a 64-bit source before dividing by a 32-bit
source (that had to be larger than the EDX part) and still trigger one
of the 5 missing SRT table entries. When the bug was originally found,
it only generated wrong results in the last 4-5 bits, so most of these
would still have given the same 32-bit result.
The reason why I consider it more likely is because programs tend to
crash or give very wrong results if an integer computation is wrong,
because integer computations are used in addressing and for directing
program flow. By contrast, if an FP computation is wrong, very few
people notice (and the late discovery of the Pentium FDIV bug shows
this); Seymour Cray decided to make his machines FP-divide quickly
rather than precisely, because he knew his customers, and they indeed
bought his machines. I don't think he did so with integer division.
Anton Ertl wrote:...
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
Maybe they would have noticed the bug in the FP divider much earlier
if they had used the FP divider for integer division.
The reason why I consider it more likely is because programs tend to
crash or give very wrong results if an integer computation is wrong,
because integer computations are used in addressing and for directing
program flow. By contrast, if an FP computation is wrong, very few
people notice (and the late discovery of the Pentium FDIV bug shows
this); Seymour Cray decided to make his machines FP-divide quickly
rather than precisely, because he knew his customers, and they indeed
bought his machines. I don't think he did so with integer division.
OTOH, where are you using DIV in integer code?
Some modulus operations, a few perspective calculations?
By the Pentium
timeline all compilers were already converting division by constant to
reciprocal MUL,
..., where are you using DIV in integer code?
Some modulus operations, a few perspective calculations? By the Pentium
timeline all compilers were already converting division by constant to
reciprocal MUL, so it was only the few remaining variable divisor DIVs
which remained, and those could only fail if you had one of the very few
patterns (leading bits) in the divisor that we had to check for in the
FDIV workaround.
I.e. it could well have gone undetected for a long time.
Terje
FWIW, I think these arguments remind me of the "No true Scotsman" kind
of argument.
According to Stefan Monnier <monnier@iro.umontreal.ca>:
FWIW, I think these arguments remind me of the "No true Scotsman" kind
of argument.
Yeah. It occurs to me that part of the problem is that RISC is a process,
not just a checklist of what's in an architecture.
The R is key. Each of the projects we think of as RISC (Berkeley RISC,
Stanford MIPS, IBM 801) were familiar with existing architectures and
then started making tradeoffs to try to get better performance at
lower design cost. They had less complexity in the hardware, usually
balanced by more complexity in the compiler, with the less complex
hardware allowing global performance increases like bigger cache,
deeper pipeline, or faster clock rate.
The PDP-8 wasn't like that. They started with the PDP-4, then asked
themselves how much of this 18 bit machine can we cram into a 12 bit
machine that we can afford to build and still be good enough to do
useful work, ending up with the PDP-5. There were tradeoffs but of a
very different kind.
On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
<terje.mathisen@tmsw.no> wrote:
..., where are you using DIV in integer code?
Some modulus operations, a few perspective calculations? By the Pentium
timeline all compilers were already converting division by constant to
reciprocal MUL, so it was only the few remaining variable divisor DIVs
which remained, and those could only fail if you had one of the very few
patterns (leading bits) in the divisor that we had to check for in the
FDIV workaround.
I.e. it could well have gone undetected for a long time.
Terje
But were i486 compilers at that time routinely converting division by constant to reciprocal MUL?
It was fairly well known that compiling integer heavy code for i486
would make it run faster on P5 than if compiled for P5. The speedup necessarily was code specific, but on average was 3-4 percent.
Somehow compiling for i486 allowed more use of the (simple) V
pipeline. This trick worked on the original P5 through at least the
(100Mhz) P54C. [Don't know if it worked on later P5 chips.]
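For readers wondering what the constant-divisor transformation looks
like, here is a minimal sketch for unsigned division by 10; the magic
constant is the textbook ceil(2^35/10) value, not output captured from
any particular compiler:

    #include <assert.h>
    #include <stdint.h>

    /* x / 10 computed as a widening multiply by 0xCCCCCCCD and a shift */
    static inline uint32_t div_by_10(uint32_t x)
    {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDULL) >> 35);
    }

    int main(void)
    {
        /* spot-check the identity over a sample of the 32-bit range */
        for (uint64_t x = 0; x <= 0xFFFFFFFFULL; x += 12345)
            assert(div_by_10((uint32_t)x) == (uint32_t)x / 10);
        return 0;
    }

Only a divide by a value unknown at compile time still has to reach the
DIV hardware, which is the point made above about the few remaining
variable-divisor DIVs.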
Yes, but the results of these three projects had commonalities, and
the other architectures that are commonly identified as RISCs shared
many of these commonalities, while earlier architectures didn't.
|1. Operations are register-to-register, with only LOAD and STORE
| accessing memory. [...]
|
|2. The operations and addressing modes are reduced. ...
|
|3. Instruction formats are simple and do not cross word boundaries. [...]
|
|4. RISC branches avoid pipeline penalties. [... delayed branches ...]
5. register machine
John Mashey has taken the criteria-based approach quite a bit further
in his great postings on the question
<https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. Unfortunately,
John Mashey ignored ARM.
If we apply the criteria of Patterson (and mine) to the PDP-8, we get:
I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.
1. No (AFAIK), in particular not the indirect addressing
3. Don't know, but that criterion has only partially stood the test of
time.
4. No, but that criterion has not stood the test of time.
5. No.
Still, I think that a criteria-based way of classifying something as a
RISC (or not) is more practical, because criteria are generally easier
to determine than the process, and because the properties considered
by the criteria are what the implementors of the architecture have to
deal with and what the programmers play with.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
The PDP-8 had one instruction format for all of the memory reference
instructions, one for the operate and skip group, and one for I/O.
It was pretty simple but it had to be.
All of the standard instructions were a single word but the amazing
680 TTY multiplexer, which could handle 64 TTY lines, scanning
characters in and out a bit at a time, modified the CPU to add a
three-word instruction that did one bit of line scanning. You could do
that kind of stuff when each card in the machine had maybe one flip
flop, and the backplane was wire wrapped.
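To make those formats concrete, here is a minimal C sketch of decoding
one 12-bit word along the lines described (the bit positions are the
standard PDP-8 layout; the function itself is purely illustrative):

    #include <stdint.h>
    #include <stdio.h>

    static const char *mnem[8] =
        { "AND", "TAD", "ISZ", "DCA", "JMS", "JMP", "IOT", "OPR" };

    void decode(uint16_t w)              /* w is one 12-bit PDP-8 word */
    {
        unsigned op = (w >> 9) & 7;      /* 3-bit opcode               */
        if (op <= 5)                     /* memory-reference group     */
            printf("%s%s %03o %s\n", mnem[op],
                   (w & 0400) ? " I" : "",            /* indirect bit */
                   (unsigned)(w & 0177),              /* 7-bit offset */
                   (w & 0200) ? "(current page)" : "(page zero)");
        else                             /* IOT or operate/skip group  */
            printf("%s %04o\n", mnem[op], (unsigned)w);
    }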
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
If we apply the criteria of Patterson (and mine) to the PDP-8, we get:
I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.
I'd say that it makes no sense to try to evaluate a 1960s single register
machine to see whether it's a RISC.
The IBM 650 had only one addressing
mode, but it also used a spinning drum as main memory and the word size
was 10 decimal digits. Was that a RISC? The question makes no sense.
1. No (AFAIK), in particular not the indirect addressing
Given that there's only one register, load/store architecture doesn't
work. If you don't have an ADD instruction that references memory, how
are you going to do any arithmetic?
3. Don't know, but that criterion has only partially stood the test of
time.
The PDP-8 had one instruction format for all of the memory reference
instructions, one for the operate and skip group, and one for I/O.
It was pretty simple but it had to be.
All of the standard instructions were a single word but the amazing
680 TTY multiplexer, which could handle 64 TTY lines, scanning
characters in and out a bit at a time, modified the CPU to add a
three-word instruction that did one bit of line scanning. You could do
that kind of stuff when each card in the machine had maybe one flip
flop, and the backplane was wire wrapped.
4. No, but that criterion has not stood the test of time.
Pipeline? What's that?
5. No.
Well, it had one register.
It would be interesting to take John Mashey's criteria and evaluate a
few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
I'll put that on my ToDo list.
It would be interesting to take John Mashey's criteria and evaluate a
few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
I'll put that on my ToDo list.
John Levine <johnl@taugh.com> writes:
According to Stefan Monnier <monnier@iro.umontreal.ca>:
FWIW, I think these arguments remind me of the "No true Scotsman" kind
of argument.
Yeah. It occurs to me that part of the problem is that RISC is a process,
not just a checklist of what's in an architecture.
The R is key. Each of the projects we think of as RISC (Berkeley RISC,
Stanford MIPS, IBM 801) were familiar with existing architectures and
then started making tradeoffs to try to get better performance at
lower design cost. They had less complexity in the hardware, usually
balanced by more complexity in the compiler, with the less complex
hardware allowing global performance increases like bigger cache,
deeper pipeline, or faster clock rate.
Yes, but the results of these three projects had commonalities, and
the other architectures that are commonly identified as RISCs shared
many of these commonalities, while earlier architectures didn't.
In 1980 Patterson and Ditzel ("The Case for the Reduced Instruction
Set Computer") indeed did not give any criteria for what constitutes a
RISC, which supports the process view.
In 1985 Patterson wrote "Reduced instruction set computers" <https://dl.acm.org/doi/pdf/10.1145/2465.214917>, and there he
identified 4 criteria:
|1. Operations are register-to-register, with only LOAD and STORE
| accessing memory. [...]
|
|2. The operations and addressing modes are reduced. Operations
| between registers complete in one cycle, permitting a simpler,
| hardwired control for each RISC, instead of
| microcode. Multiple-cycle instructions such as floating-point
| arithmetic are either executed in software or in a special-purpose
| coprocessor. (Without a coprocessor, RISCs have mediocre
| floating-point performance.) Only two simple addressing modes,
| indexed and PC-relative, are provided. More complicated addressing
| modes can be synthesized from the simple ones.
|
|3. Instruction formats are simple and do not cross word boundaries. [...]
|
|4. RISC branches avoid pipeline penalties. [... delayed branches ...]
which I discussed in <2023Dec9.093314@mips.complang.tuwien.ac.at>,
leaving mainly 1 and a relaxed version of 3. I also added
5. register machine
John Mashey has taken the criteria-based approach quite a bit further
in his great postings on the question <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. Unfortunately,
John Mashey ignored ARM.
If we apply the criteria of Patterson (and mine) to the PDP-8, we get:
1. No (AFAIK), in particular not the indirect addressing
2. No, but that criterion has not stood the test of time.
3. Don't know, but that criterion has only partially stood the test of
time.
4. No, but that criterion has not stood the test of time.
5. No.
I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.
The PDP-8 wasn't like that. They started with the PDP-4, then asked
themselves how much of this 18 bit machine can we cram into a 12 bit
machine that we can afford to build and still be good enough to do
useful work, ending up with the PDP-5. There were tradeoffs but of a
very different kind.
Yes, a very good point in the process view. And looking at the
descendants of the PDP-8 (Nova, 16-bit Eclipse, 32-bit Eclipse), you
also see there that the process was not one that led to RISCs.
Still, I think that a criteria-based way of classifying something as a
RISC (or not) is more practical, because criteria are generally easier
to determine than the process, and because the properties considered
by the criteria are what the implementors of the architecture have to
deal with and what the programmers play with.
It would be interesting to take John Mashey's criteria and evaluate a
few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
I'll put that on my ToDo list.
- anton
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
If we apply the criteria of Patterson (and mine) to the PDP-8, we get:
I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.
I'd say that it makes no sense to try to evaluate a 1960s single register
machine to see whether it's a RISC.
Why not? If the result is what we would arrive at with other methods,
it certainly makes sense. If the result is different, one may wonder
whether the criteria are wrong or the other methods are wrong, and, as
a result, may gain additional insights.
The IBM 650 had only one addressing
mode, but it also used a spinning drum as main memory and the word size
was 10 decimal digits. Was that a RISC? The question makes no sense.
Again, why not. Maybe, like I added the register-machine criterion
that Patterson had not considered in 1985 because the machines that he compared were all register machines, one might want to add criteria
about random-access memory (I expect that the drum memory resulted in
each instruction having a next-instruction field, right?) and binary
data.
1. No (AFAIK), in particular not the indirect addressing
Given that there's only one register, load/store architecture doesn't
work. If you don't have an ADD instruction that references memory, how
are you going to do any arithmetic?
So definitely "No".
3. Don't know, but that criterion has only partially stood the test of
time.
The PDP-8 had one instruction format for all of the memory reference
instructions, one for the operate and skip group, and one for I/O.
It was pretty simple but it had to be.
All of the standard instructions were a single word but the amazing
680 TTY multiplexer, which could handle 64 TTY lines, scanning
characters in and out a bit at a time, modified the CPU to add a
three-word instruction that did one bit of line scanning. You could do
that kind of stuff when each card in the machine had maybe one flip
flop, and the backplane was wire wrapped.
So: Yes for the base instruction set, no with the 680 TTY multiplexer.
In a way like RISC-V, which is "yes" for the base instruction set,
"no" with the C extension, and it has provisions for longer
instruction encodings.
4. No, but that criterion has not stood the test of time.
Pipeline? What's that?
A technique that was used in the ILLIAC II (1962), in the IBM Stretch (1961)
and the CDC 6600 (1964). But that's not an architectural criterion,
except the existence of branch-delay slots.
5. No.
Well, it had one register.
Does that make it a register machine? Ok, the one register has all
the purposes that registers have in that architecture, so one can
argue that it is a general-purpose register. However, as far as the
way to use it is concerned, the point of a register machine is that
you have multiple GPRs so that the programmer or compiler can just use another one when one is occupied, and if you run out, you spill and
refill. So one register is definitely too few to make it a register
machine.
- anton
CPU   Age  3a  3b  3c  3d  4a  4b  5a  5b  6a  6b  #ODD
     (1991)
RULE   <6  =1  =4  <5  =0  =0  =1  <2  =1  >4  >3
A1      4   1   4   1   0   0   1   0   1   8  3+    1  AMD 29K
B1      5   1   4   1   0   0   1   0   1   5   4    -  MIPS R2000
C1      2   1   4   2   0   0   1   0   1   5   4    -  SPARC V7
D1      2   1   4   3   0   0   1   0   1   5  0+    1  MC88000
RISC
E1      5   1   4  10+  0   0   1   0   1   5   4    1  HP PA
F1      5  2+   4   1   0   0   1   0   1  4+  3+    3  IBM RT/PC
G1      1   1   4   4   0   0   1   1   1   5   5    -  IBM RS/6000
H1      2   1   4   4   0   0   1   0   1   5   4    -  Intel i860
J1      5  4+  8+  9+   0   0   1   0   2  4+  3+    5  Clipper
K1      3  2+  8+  9+   0   0   1  2+   -   5  3+    5  Intel 960KB
Q1     27+ 2+   4   1   0   0   1   0   1  3+  3+    4  CDC 6600
L4     26   4   8  2*  0*   1   2   2   4   4   2    2  IBM 3090
M2     12  12  12  15  0*   1   2   2   4   3   3    1  Intel i486
N1     10  21  21  23   1   1   2   2   4   3   3    -  NSC 32016  CISC
O3     11  11  22  44   1   1   2   2   8   4   3    -  MC 68040
P3     13  56  56  22   1   1   6   2  24   4   0    -  VAX
        6+  1   4  7+   0   0   1   0   1  4+   -  3/8  ARM1
      -10   1  5+   1   0   0   1   0   1   7   7 1/10  IA-64
      -12  2+   4  7+   0   0   1   1  2+  4+   5  4/7  ARMv7 T32
      -12 15+ 15+  7+   0  1+  2+   1  4+  4+   4  7/4  AMD64
      -22   1   4 15+   0   0   1   1  2+   5   5  2/9  ARM A64
      -28  2+   4   1   0   0   1   1  2+   5   5  2/9  RV64GC
On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
<terje.mathisen@tmsw.no> wrote:
..., where are you using DIV in integer code?
Some modulus operations, a few perspective calculations? By the Pentium
timeline all compilers were already converting division by constant to
reciprocal MUL, so it was only the few remaining variable divisor DIVs
which remained, and those could only fail if you had one of the very few
patterns (leading bits) in the divisor that we had to check for in the
FDIV workaround.
I.e. it could well have gone undetected for a long time.
Terje
But were i486 compilers at that time routinely converting division by constant to reciprocal MUL?
George Neuner <gneuner2@comcast.net> schrieb:
On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
<terje.mathisen@tmsw.no> wrote:
..., where are you using DIV in integer code?
Some modulus operations, a few perspective calculations? By the Pentium
timeline all compilers were already converting division by constant to
reciprocal MUL, so it was only the few remaining variable divisor DIVs
which remained, and those could only fail if you had one of the very few
patterns (leading bits) in the divisor that we had to check for in the
FDIV workaround.
I.e. it could well have gone undetected for a long time.
Terje
But were i486 compilers at that time routinely converting division by
constant to reciprocal MUL?
I've had a look at the gcc 2.4.5 sources (around when the Pentium came
out), and it seems it didn't do it.
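For concreteness, here is a minimal sketch (my own C, not gcc's or any
compiler's actual code) of the transformation being discussed: unsigned
division by the constant 10 replaced by a multiply-high and a shift, using
the well-known magic number 0xCCCCCCCD (= ceil(2^35/10)).

#include <stdint.h>
#include <stdio.h>

/* n / 10 computed as the high part of a 64-bit product plus a shift;
   this particular magic number and shift are exact for all 32-bit n */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

int main(void)
{
    for (uint32_t n = 0; n < 1000000; n++)
        if (div10(n) != n / 10) {
            printf("mismatch at %u\n", n);
            return 1;
        }
    printf("div10(n) == n/10 for all tested n\n");
    return 0;
}

The same idea works for any constant divisor, with a divisor-specific
magic number and shift; Hacker's Delight gives the general derivation.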
The basic difference between RISC and CISC is that, with some exceptions,
CISC cores are all single monolithic state machines serially executing
multiple states for each instruction, whereas RISC is a bunch of concurrent
state machines with handshakes between them, and the ISAs for each of
these are designed from those different points of view. Now after the fact
some CISCs added limited HW concurrency but the ISAs were designed from
the monolithic point of view. ...
Thomas Koenig wrote:
George Neuner <gneuner2@comcast.net> schrieb:
On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
<terje.mathisen@tmsw.no> wrote:
..., where are you using DIV in integer code?
Some modulus operations, a few perspective calculations? By the Pentium
timeline all compilers were already converting division by constant to
reciprocal MUL, so it was only the few remaining variable divisor DIVs
which remained, and those could only fail if you had one of the very few
patterns (leading bits) in the divisor that we had to check for in the
FDIV workaround.
I.e. it could well have gone undetected for a long time.
Terje
But were i486 compilers at that time routinely converting division by
constant to reciprocal MUL?
I've had a look at the gcc 2.4.5 sources (around when the Pentium came
out), and it seems it didn't do it.
I guess I'm mixing up my own (re-)discovery of the technique, which I
then promptly used in a couple(*) of my most favorite asm algorithms, and
the timing of when it became standard for compiled code.
According to EricP<ThatWouldBeTelling@thevillage.com>:
The basic difference between RISC and CISC is that, with some exceptions, CISC cores are all a single monolithic state machines serially executing multiple states for each instruction, whereas RISC is a bunch of concurrent state machines with handshakes between them, and the ISA's for each of these is designed from those different points of view. Now after the fact some CISC's added limited HW concurrency but the ISA's were designed from the monolithic point of view. ...
That's a great insight. It's easy to see how stuff like multiple registers, and load/store memory references follow from that.
You can also see why a single register machine would be hopeless even
if the instruction set is as simple as a PDP-8, because everything
interlocks on that register.
EricP wrote:
The basic difference between RISC and CISC is that, with some exceptions,
CISC cores are all a single monolithic state machines serially executing
multiple states for each instruction,
I can tell you that 68000, 68010, 68020, 68030 all used 3 microcode
pointers simultaneously, 1 running the address section, 1 running the
Data section, and 1 running the Fetch-Decode section. The 3 pointers
could access different lines in µcode and have the 3 reads ORed together
as they exit µstore.
whereas RISC is a bunch of concurrent
state machines with handshakes between them, and the ISA's for each of
these is designed from those different points of view. Now after the fact
some CISC's added limited HW concurrency but the ISA's were designed from
the monolithic point of view.
On 13 Jan 2024, John Levine wrote
(in article <unuvf0$qko$1@gal.iecc.com>):
According to EricP<ThatWouldBeTelling@thevillage.com>:
That's a great insight. It's easy to see how stuff like multiple registers,
and load/store memory references follow from that.
You can also see why a single register machine would be hopeless even
if the instruction set is as simple as a PDP-8, because everything
interlocks on that register.
While respecting the caveat "with some exceptions",
KDF9 (designed 1960-62) was made of as many as 24
concurrently running state machines.
Division code is often scary,
Long division's downright scary.
On Sun, 14 Jan 2024 11:14:03 +0000, Thomas Koenig wrote:
Division code is often scary,
Long division's downright scary.
Shouldn't one of the occurrences of "scary", most
probably the first, be replaced by "hairy"?
According to Bill Findlay<findlaybill@blueyonder.co.uk>:
On 13 Jan 2024, John Levine wrote
(in article <unuvf0$qko$1@gal.iecc.com>):
According to EricP<ThatWouldBeTelling@thevillage.com>:
That's a great insight. It's easy to see how stuff like multiple registers,
and load/store memory references follow from that.
You can also see why a single register machine would be hopeless even
if the instruction set is as simple as a PDP-8, because everything interlocks on that register.
While respecting the caveat "with some exceptions",
KDF9 (designed 1960-62) was made of as many as 24
concurrently running state machines.
I believe you but from what I can see it had hardware stacks and 16
index registers, so it was hardly a single register machine.
[... division ...]
From Hacker's Delight:
I think I shall never envision
an op unlovely as division.
An op whose answer must be guessed
and then, through multiply, assessed.
An op for which we dearly pay,
In cycles wasted every day.
Division code is often scary,
Long division's downright scary.
The proofs can overtax your brain,
The ceiling and floor may drive you insane.
Good code to divide takes a Knuthian hero,
but even God can't divide by zero!
FWIW, I think these arguments remind me of the "No true Scotsman"
kind of argument. [...]
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
surely people don't view the Itanium as being a RISC?
It has a lot of RISC characteristics, in particular, it's a
load/store architecture with a large number of general-purpose
registers (and for the other registers, there are also many of
them, avoiding the register allocation problems that compilers
tend to have with unique registers). In 1999 I wrote
<http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:
|It's basically a RISC with lots of special features:
I think the same description could be said of the IBM System/360.
I don't think of System/360 as a RISC, even if a subset of it
might be.
And I don't think so, either. That's because some of the features
that the IBM 801 left away are non-RISC features, such as the EDIT
instruction. OTOH, none of the special features of IA-64 are
particularly non-RISCy, [...]
Begging the question.
Ok, let's make this more precise:
|And I don't think so, either. That's because among the features that
|the IBM 801 left away are instructions that access memory, but are not
|just a load or a store, such as the EDIT instruction. OTOH, none of
|the special features of IA-64 add instructions that access memory, but
|are not just a load or a store, [...]
IA-64 followed early RISC practice in leaving out integer and FP
division (a feature which, interestingly, is already present in some
commercial RISCs: I think HPPA, as well as the 88000 and Power, though
in the integer case not in the purist Alpha).
Thomas Koenig <tkoenig@netcologne.de> writes:
[... division ...]
From Hacker's Delight:
I think I shall never envision
an op unlovely as division.
An op whose answer must be guessed
and then, through multiply, assessed.
An op for which we dearly pay,
In cycles wasted every day.
Division code is often scary,
Long division's downright scary.
The proofs can overtax your brain,
The ceiling and floor may drive you insane.
Good code to divide takes a Knuthian hero,
but even God can't divide by zero!
Division in base 2 is quite straightforward.
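To make "straightforward" concrete, here is a minimal shift-and-subtract
(restoring) divider in C -- my own illustration, not anyone's hardware or
library code: one compare and conditional subtract per bit, which is
essentially what a simple binary divide unit does.

#include <stdint.h>
#include <stdio.h>

static uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;                       /* partial remainder */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);    /* bring down the next dividend bit */
        if (r >= d) {                     /* does the divisor fit? */
            r -= d;
            q |= 1u << i;                 /* then this quotient bit is 1 */
        }
    }
    *rem = (uint32_t)r;
    return q;
}

int main(void)
{
    uint32_t rem, q = udiv32(1000003u, 7u, &rem);
    printf("1000003 / 7 = %u remainder %u\n", q, rem);  /* 142857 r 4 */
    return 0;
}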
Thomas Koenig <tkoenig@netcologne.de> writes:
[... division ...]
From Hacker's Delight:
I think I shall never envision
an op unlovely as division.
An op whose answer must be guessed
and then, through multiply, assessed.
An op for which we dearly pay,
In cycles wasted every day.
Division code is often scary,
Long division's downright scary.
On 2024-02-13 2:35 a.m., BGB wrote:
On 2/12/2024 10:19 PM, Tim Rentsch wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
[... division ...]
From Hacker's Delight:
I think I shall never envision
an op unlovely as division.
An op whose answer must be guessed
and then, through multiply, assessed.
An op for which we dearly pay,
In cycles wasted every day.
Division code is often scary,
Long division's downright scary.
The proofs can overtax your brain,
The ceiling and floor may drive you insane.
Good code to divide takes a Knuthian hero,
but even God can't divide by zero!
Division in base 2 is quite straightforward.
One thing I have noted is that for floating-point division via
Newton-Raphson, there is often a need for a first stage that converges
less aggressively.
Say, for most stages in finding the reciprocal, one can do:
y=y*(2.0-x*y);
But, this may not always converge, so one might need to do a first stage
of, say:
y=y*((2.0-x*y)*0.375+0.625);
Where, one may generate the initial guess as, say:
*(uint64_t *)(&y)=0x7FE0000000000000ULL-(*(uint64_t *)(&x));
...
Without something to nudge the value closer than the initial guess, the
iteration might sometimes become unstable and "fly off into space"
instead of converging towards the answer.
...
Could an initial guess come from estimating the reciprocal, then doing a
multiply, then doing the NR iterations?
Thomas Koenig <tkoenig@netcologne.de> writes:
[... division ...]
From Hacker's Delight:
I think I shall never envision
an op unlovely as division.
An op whose answer must be guessed
and then, through multiply, assessed.
An op for which we dearly pay,
In cycles wasted every day.
Division code is often scary,
Long division's downright scary.
The proofs can overtax your brain,
The ceiling and floor may drive you insane.
Good code to divide takes a Knuthian hero,
but even God can't divide by zero!
Division in base 2 is quite straightforward.
They shouldn't. Different people here have expressed different
ideas, but each person has expressed more or less definite ideas.
The essential element of "No true Scotsman" is that whatever the distinguishing property or quality is supposed to be is never
identified, and cannot be, because it is chosen after the fact to
make the "prediction" be correct. That's not what's happening in
the RISC discussions.
On 2/12/2024 10:19 PM, Tim Rentsch wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
[... division ...]
From Hacker's Delight:
I think I shall never envision
an op unlovely as division.
An op whose answer must be guessed
and then, through multiply, assessed.
An op for which we dearly pay,
In cycles wasted every day.
Division code is often scary,
Long division's downright scary.
The proofs can overtax your brain,
The ceiling and floor may drive you insane.
Good code to divide takes a Knuthian hero,
but even God can't divide by zero!
Division in base 2 is quite straightforward.
One thing I have noted is that for floating-point division via Newton-Raphson, there is often a need for a first stage that
converges less aggressively.
Say, for most stages in finding the reciprocal, one can do:
y=y*(2.0-x*y);
But, this may not always converge, so one might need to do a
first stage of, say:
y=y*((2.0-x*y)*0.375+0.625);
Where, one may generate the initial guess as, say:
*(uint64_t *)(&y)=0x7FE0000000000000ULL-(*(uint64_t *)(&x));
Tim Rentsch wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
[... division ...]
From Hacker's Delight:
I think I shall never envision
an op unlovely as division.
An op whose answer must be guessed
and then, through multiply, assessed.
An op for which we dearly pay,
In cycles wasted every day.
Division code is often scary,
Long division's downright scary.
The proofs can overtax your brain,
The ceiling and floor may drive you insane.
Good code to divide takes a Knuthian hero,
but even God can't divide by zero!
Division in base 2 is quite straightforward.
Even in DFP you can keep the mantissa in binary, in which case the
problem is exactly the same (modulo some minor differences in how
to round).
Assuming DPD (BCD/base 1000 more or less) you could still do
division with an approximate reciprocal and a loop, or via a
sufficiently precise reciprocal.
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
They shouldn't. Different people here have expressed different
ideas, but each person has expressed more or less definite ideas.
The essential element of "No true Scotsman" is that whatever the
distinguishing property or quality is supposed to be is never
identified, and cannot be, because it is chosen after the fact to
make the "prediction" be correct. That's not what's happening in
the RISC discussions.
I had impression from John
https://en.wikipedia.org/wiki/John_Cocke
that 801/risc was to do the opposite of the failed Future System
effort
http://www.jfsowa.com/computer/memo125.htm https://people.computing.clemson.edu/~mark/fs.html
but there is also account of some risc work overlapping with FS https://www.ibm.com/history/risc [...]
Lynn Wheeler <lynn@garlic.com> writes:
I had impression from John
https://en.wikipedia.org/wiki/John_Cocke
that 801/risc was to do the opposite of the failed Future System
effort
http://www.jfsowa.com/computer/memo125.htm
https://people.computing.clemson.edu/~mark/fs.html
but there is also account of some risc work overlapping with FS
https://www.ibm.com/history/risc [...]
By coincidence I have recently been reading The Design of Design,
by Fred Brooks. In an early chapter he briefly relates (in just
a few paragraphs) an experience of doing a review of the Future
Systems architecture, and it's clear Brooks was impressed by a
lot of what he heard. It's worth reading. But I can't resist
giving away the punchline, which appears at the start of the
fourth (and last) paragraph:
I knew then that the project was doomed.
In case people are interested, I think the rest of the book is
worth reading also, but to be fair there should be a warning
that much of what is said is more about management than it is
about technical matters. Still I expect y'all will find it
interesting (and it does have some things to say about computer architecture).
On 2/14/2024 9:49 AM, Tim Rentsch wrote:
BGB <cr88192@gmail.com> writes:
On 2/12/2024 10:19 PM, Tim Rentsch wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
[... division ...]
From Hacker's Delight:
I think I shall never envision
an op unlovely as division.
An op whose answer must be guessed
and then, through multiply, assessed.
An op for which we dearly pay,
In cycles wasted every day.
Division code is often scary,
Long division's downright scary.
The proofs can overtax your brain,
The ceiling and floor may drive you insane.
Good code to divide takes a Knuthian hero,
but even God can't divide by zero!
Division in base 2 is quite straightforward.
One thing I have noted is that for floating-point division via
Newton-Raphson, there is often a need for a first stage that
converges less aggressively.
Right. It's important to be inside the radius of convergence
before using a more accelerating form that is also less stable.
Not sure how big the radius is, only that the first-stage
approximation may fall outside of it...
Say, for most stages in finding the reciprocal, one can do:
y=y*(2.0-x*y);
But, this may not always converge, so one might need to do a
first stage of, say:
y=y*((2.0-x*y)*0.375+0.625);
Where, one may generate the initial guess as, say:
*(uint64_t *)(&y)=0x7FE0000000000000ULL-(*(uint64_t *)(&x));
The left hand side of this assignment has undefined behavior,
by virtue of violating effective type rules.
Yeah, but:
Relying on the underlying representation of 'double' is UB;
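For reference, the iteration quoted above converges because, writing
y = (1-e)/x, one step y' = y*(2 - x*y) gives y' = (1-e^2)/x, so the
relative error squares each time; it fails to converge when x*y0 falls
outside (0, 2), which is why the damped first step helps. Below is a
minimal self-contained sketch of the whole scheme -- my own code, not
BGB's implementation -- with memcpy used for the bit-level initial guess
so the type punning stays within defined behavior.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* reciprocal of a normal, positive double; no special-case handling */
static double recip_nr(double x)
{
    uint64_t xi, yi;
    double y;

    memcpy(&xi, &x, sizeof xi);              /* bits of x */
    yi = 0x7FE0000000000000ULL - xi;         /* crude 1/x: flip the exponent */
    memcpy(&y, &yi, sizeof y);

    y = y * ((2.0 - x * y) * 0.375 + 0.625); /* damped first step */
    for (int i = 0; i < 4; i++)              /* ordinary Newton-Raphson steps */
        y = y * (2.0 - x * y);
    return y;
}

int main(void)
{
    double x = 3.0;
    printf("1/%g ~= %.17g  (x * recip = %.17g)\n",
           x, recip_nr(x), x * recip_nr(x));
    return 0;
}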
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
By coincidence I have recently been reading The Design of Design,
by Fred Brooks. In an early chapter he briefly relates (in just
a few paragrahs) an experience of doing a review of the Future
Systems architecture, and it's clear Brooks was impressed by a
lot of what he heard. It's worth reading. But I can't resist
giving away the punchline, which appears at the start of the
fourth (and last) paragraph:
I knew then that the project was doomed.
In case people are interested, I think the rest of the book is
worth reading also, but to be fair there should be a warning
that much of what is said is more about management than it is
about technical matters. Still I expect y'all will find it
interesting (and it does have some things to say about computer
architecture).
one of the final nails in the FS coffin was a study by the IBM Houston
Science Center: if 370/195 apps were redone for an FS machine made out of
the fastest available technology, they would have had the throughput of a
370/145 (about a factor of 30 drop in throughput).
during the FS period, which was completely different than 370 and was
going to completely replace it, internal politics was killing off 370
efforts ... the lack of new 370s during the period is credited with giving
clone 370 makers their market foothold. when FS finally imploded there
was a mad rush getting stuff back into the 370 product pipelines.
trivia: I continued to work on 360&370 stuff all through the FS period, even
periodically ridiculing what they were doing (drawing an analogy with a
long-running cult film playing at a theater down the street in central
sq), which wasn't exactly a career-enhancing activity ... it was as if
there was nobody who bothered to think about how all the "wonderful"
stuff might actually be implemented (or even if it was possible).
Lynn Wheeler wrote:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
By coincidence I have recently been reading The Design of Design,
by Fred Brooks. In an early chapter he briefly relates (in just
a few paragrahs) an experience of doing a review of the Future
Systems architecture, and it's clear Brooks was impressed by a
lot of what he heard. It's worth reading. But I can't resist
giving away the punchline, which appears at the start of the
fourth (and last) paragraph:
I knew then that the project was doomed.
In case people are interested, I think the rest of the book is
worth reading also, but to be fair there should be a warning
that much of what is said is more about management than it is
about technical matters. Still I expect y'all will find it
interesting (and it does have some things to say about computer
architecture).
one of the final nails in the FS coffin was study by the IBM Houston
Science Center if 370/195 apps were redone for FS machine made out of
the fastest available technology, they would have throughput of 370/145
(about fractor of 30 times drop in throughput).
during the FS period, which was completely different than 370 and was
going to completely replace it, internal politics was killing off 370
efforts ... the lack of new 370 during the period is credited with giving
clone 370 makers their market foothold. when FS finally implodes there
as mad rush getting stuff back into the 370 product pipelines.
trivia: I continued to work on 360&370 stuff all through FS period, even
periodically ridiculing what they were doing (drawing analogy with a
long running cult film playing at theater down the street in central
sq), which wasn't exactly career enhancing activity ... it was as if
there was nobody that bothered to think about how all the "wonderful"
stuff might actually be implemented (or even if was possible).
Sounds similar to Intel 432 ...
A page with a bunch of links on IBM future systems:
https://people.computing.clemson.edu/~mark/fs.html
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
By coincidence I have recently been reading The Design of Design,
by Fred Brooks. In an early chapter he briefly relates (in just
a few paragrahs) an experience of doing a review of the Future
Systems architecture, and it's clear Brooks was impressed by a
lot of what he heard. It's worth reading. But I can't resist
giving away the punchline, which appears at the start of the
fourth (and last) paragraph:
I knew then that the project was doomed.
In case people are interested, I think the rest of the book is
worth reading also, but to be fair there should be a warning
that much of what is said is more about management than it is
about technical matters. Still I expect y'all will find it
interesting (and it does have some things to say about computer
architecture).
one of the final nails in the FS coffin was study by the IBM
Houston Science Center if 370/195 apps were redone for FS machine
made out of the fastest available technology, they would have
throughput of 370/145 (about fractor of 30 times drop in
throughput). [...]
Lynn Wheeler <lynn@garlic.com> writes:
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
By coincidence I have recently been reading The Design of Design,
by Fred Brooks. In an early chapter he briefly relates (in just
a few paragrahs) an experience of doing a review of the Future
Systems architecture, and it's clear Brooks was impressed by a
lot of what he heard. It's worth reading. But I can't resist
giving away the punchline, which appears at the start of the
fourth (and last) paragraph:
I knew then that the project was doomed.
In case people are interested, I think the rest of the book is
worth reading also, but to be fair there should be a warning
that much of what is said is more about management than it is
about technical matters. Still I expect y'all will find it
interesting (and it does have some things to say about computer
architecture).
one of the final nails in the FS coffin was study by the IBM
Houston Science Center if 370/195 apps were redone for FS machine
made out of the fastest available technology, they would have
throughput of 370/145 (about fractor of 30 times drop in
throughput). [...]
Looks like they should have called it Back to the Future Systems. :)
In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com (Tim Rentsch) wrote:
By coincidence I have recently been reading The Design of Design,
by Fred Brooks. In an early chapter he briefly relates (in just
a few paragrahs) an experience of doing a review of the Future
Systems architecture, and it's clear Brooks was impressed by a
lot of what he heard. It's worth reading. But I can't resist
giving away the punchline, which appears at the start of the
fourth (and last) paragraph:
I knew then that the project was doomed.
Page 73 of the Addison Wesley paperback. I don't know much about Future Systems, but it seems to have had a problem that I first encountered with IA-64: complexity presented as /completeness/, reassuring many people
that it must be good because it has everything you could want. My doubts started when I was skimming the IA-64 instruction set reference and ran
into instructions that did not seem to make any sense. I went back to
them a few times, but could not figure them out.
In contrast, the most recent weird instructions I ran into were Aarch64's Branch Target Indicator family. They are not well described in the ISA reference, but after a couple of readings, they made sense. AArch64 has annoying complexity in its more obscure corners, but that's better than
the seductive complexity of IA-64.
John
John Dallman wrote:
In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com (Tim
Rentsch) wrote:
By coincidence I have recently been reading The Design of Design,
by Fred Brooks. In an early chapter he briefly relates (in just
a few paragrahs) an experience of doing a review of the Future
Systems architecture, and it's clear Brooks was impressed by a
lot of what he heard. It's worth reading. But I can't resist
giving away the punchline, which appears at the start of the
fourth (and last) paragraph:
I knew then that the project was doomed.
Page 73 of the Addison Wesley paperback. I don't know much about Future
Systems, but it seems to have had a problem that I first encountered with
IA-64: complexity presented as /completeness/, reassuring many people
that it must be good because it has everything you could want. My doubts
started when I was skimming the IA-64 instruction set reference and ran
into instructions that did not seem to make any sense. I went back to
them a few times, but could not figure them out.
In contrast, the most recent weird instructions I ran into were Aarch64's
Branch Target Indicator family. They are not well described in the ISA
reference, but after a couple of readings, they made sense. AArch64 has
annoying complexity in its more obscure corners, but that's better than
the seductive complexity of IA-64.
John
I believe you mean the Branch Target Identification BTI instruction.
Looks like a landing pad for branch/calls to catch
Return Oriented Programming exploits.
I nominate A64 logical immediate instructions for a WTF award.
They split a 12-bit immediate into two 6-bit fields that encode:
"C3.4.2 Logical (immediate)
The Logical (immediate) instructions accept a bitmask immediate value that
is a 32-bit pattern or a 64-bit pattern viewed as a vector of identical >elements of size e = 2, 4, 8, 16, 32 or, 64 bits. Each element contains
the same sub-pattern, that is a single run of 1 to (e - 1) nonzero bits
from bit 0 followed by zero bits, then rotated by 0 to (e - 1) bits.
This mechanism can generate 5334 unique 64-bit patterns as 2667 pairs of >pattern and their bitwise inverse.
John Dallman wrote:
In contrast, the most recent weird instructions I ran into were
Aarch64's Branch Target Indicator family.
I believe you mean the Branch Target Identification BTI instruction.
Looks like a landing pad for branch/calls to catch
Return Oriented Programming exploits.
I nominate A64 logical immediate instructions for a WTF award.
They split a 12-bit immediate into two 6-bit fields that encode:
"C3.4.2 Logical (immediate)
The Logical (immediate) instructions accept a bitmask immediate
value that is a 32-bit pattern or a 64-bit pattern viewed as a
vector of identical elements of size e = 2, 4, 8, 16, 32 or,
64 bits. Each element contains the same sub-pattern, that is a
single run of 1 to (e - 1) nonzero bits from bit 0 followed by
zero bits, then rotated by 0 to (e - 1) bits. This mechanism
can generate 5334 unique 64-bit patterns as 2667 pairs of
pattern and their bitwise inverse.
Note
Values that consist of only zeros or only ones cannot be described
in this way."
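Based only on the description quoted above, a rough C sketch of the
decoding (the function and variable names are mine; this is not the ARM
ARM pseudocode) looks like this. N is the extra 13th bit that selects the
64-bit element size; the function returns 0 for the reserved encodings,
since all-zero is a pattern the format cannot express anyway.

#include <stdint.h>
#include <stdio.h>

static uint64_t decode_bitmask_imm(unsigned N, unsigned immr, unsigned imms)
{
    unsigned field = (N << 6) | (~imms & 0x3F);   /* N : NOT(imms), 7 bits */
    if (field == 0)
        return 0;                                 /* reserved */
    unsigned len = 0;
    while (field >> (len + 1))
        len++;                                    /* index of highest set bit */
    if (len < 1)
        return 0;                                 /* reserved */
    unsigned esize = 1u << len;                   /* element size e: 2..64 bits */

    unsigned S = imms & (esize - 1);              /* run length minus 1 */
    unsigned R = immr & (esize - 1);              /* rotate amount */
    if (S == esize - 1)
        return 0;                                 /* all-ones element: reserved */

    uint64_t emask = (esize == 64) ? ~0ull : (1ull << esize) - 1;
    uint64_t elem = (1ull << (S + 1)) - 1;        /* S+1 one bits from bit 0 */
    if (R)                                        /* rotate right within the element */
        elem = ((elem >> R) | (elem << (esize - R))) & emask;

    uint64_t pattern = 0;                         /* replicate across 64 bits */
    for (unsigned i = 0; i < 64; i += esize)
        pattern |= elem << i;
    return pattern;
}

int main(void)
{
    /* N=0, immr=0, imms=0b111100 selects e=2, a 1-bit run, no rotate:
       the alternating pattern 0x5555555555555555 */
    printf("%016llx\n", (unsigned long long)decode_bitmask_imm(0, 0, 0x3C));
    return 0;
}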
John Dallman wrote:
I nominate A64 logical immediate instructions for a WTF award.
They split a 12-bit immediate into two 6-bit fields that encode:
"C3.4.2 Logical (immediate)
The Logical (immediate) instructions accept a bitmask immediate value that
is a 32-bit pattern or a 64-bit pattern viewed as a vector of identical elements of size e = 2, 4, 8, 16, 32 or, 64 bits. Each element contains
the same sub-pattern, that is a single run of 1 to (e - 1) nonzero bits
from bit 0 followed by zero bits, then rotated by 0 to (e - 1) bits.
This mechanism can generate 5334 unique 64-bit patterns as 2667 pairs of pattern and their bitwise inverse.
Note
Values that consist of only zeros or only ones cannot be described
in this way."
In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com
(Tim Rentsch) wrote:
By coincidence I have recently been reading The Design of Design,
by Fred Brooks. In an early chapter he briefly relates (in just
a few paragrahs) an experience of doing a review of the Future
Systems architecture, and it's clear Brooks was impressed by a
lot of what he heard. It's worth reading. But I can't resist
giving away the punchline, which appears at the start of the
fourth (and last) paragraph:
I knew then that the project was doomed.
Page 73 of the Addison Wesley paperback. I don't know much about
Future Systems, but it seems to have had a problem that I first
encountered with IA-64: complexity presented as /completeness/,
reassuring many people that it must be good because it has everything
you could want. My doubts started when I was skimming the IA-64
instruction set reference and ran into instructions that did not seem
to make any sense. I went back to them a few times, but could not
figure them out.
In contrast, the most recent weird instructions I ran into were
Aarch64's Branch Target Indicator family. They are not well described
in the ISA reference, but after a couple of readings, they made
sense. AArch64 has annoying complexity in its more obscure corners,
but that's better than the seductive complexity of IA-64.
John
If we want to compare FS to Intel products then I'd expect Future
Systems complexity to be similar to i432 complexity or to complexity
of 80286 additions to x86 architecture or to those parts of BiiN
that did not become i960.
Furthermore, the address and data registers and buses are 16 bits and
the high 16-bits are shared ...
On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:
Furthermore, the address and data registers and buses are 16 bits and
the high 16-bits are shared ...
No, in the 68000 family the A- and D- registers are 32 bits.
If you compare the earlier members with the 68020 and later, it becomes
clear that the architecture was designed as full 32-bit from the
beginning, and then implemented in a cut-down form for the initial 16-bit products. Going full 32-bit was just a matter of filling in the gaps.
On 16/04/2024 02:35, Lawrence D'Oliveiro wrote:
On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:
Furthermore, the address and data registers and buses are 16 bits and
the high 16-bits are shared ...
No, in the 68000 family the A- and D- registers are 32 bits.
If you compare the earlier members with the 68020 and later, it becomes
clear that the architecture was designed as full 32-bit from the
beginning, and then implemented in a cut-down form for the initial 16-bit
products. Going full 32-bit was just a matter of filling in the gaps.
Yes, the 68000 was designed to have full support for 32-bit types and a 32-bit future, but (primarily for cost reasons) used a 16-bit ALU and
16-bit buses internally and externally. Some 68000 compilers had 16-bit
int, some had 32-bit int, and some let you choose either, since 16-bit
types could be significantly faster on the 68000 even though the general-purpose registers were 32-bit.
Yes, I was referring to its 16-bit internal bus structure.
This M68000 patent from 1978 shows it in Fig 2:
Patent US4296469 Execution unit for data processor using
segmented bus structure, 1978
https://patents.google.com/patent/US4296469A/en
David Brown wrote:
Yes, the 68000 was designed to have full support for 32-bit types and
a 32-bit future, but (primarily for cost reasons) used a 16-bit ALU
and 16-bit buses internally and externally. Some 68000 compilers had
16-bit int, some had 32-bit int, and some let you choose either, since
16-bit types could be significantly faster on the 68000 even though
the general-purpose registers were 32-bit.
Yes, I was referring to its 16-bit internal bus structure.
This M68000 patent from 1978 shows it in Fig 2:
Patent US4296469 Execution unit for data processor using
segmented bus structure, 1978
https://patents.google.com/patent/US4296469A/en
I found a book on the uArch design of the 68000 and Micro/370
written by their senior designer Nick Tredennick.
Microprocessor Logic Design, Tredennick, 1987 https://archive.org/download/tredennick-microprocessor-logic-design/Tredennick-Microprocessor-logic-Design_text.pdf
On 16/04/2024 02:35, Lawrence D'Oliveiro wrote:
On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:
Furthermore, the address and data registers and buses are 16 bits and
the high 16-bits are shared ...
No, in the 68000 family the A- and D- registers are 32 bits.
If you compare the earlier members with the 68020 and later, it becomes
clear that the architecture was designed as full 32-bit from the
beginning, and then implemented in a cut-down form for the initial
16-bit
products. Going full 32-bit was just a matter of filling in the gaps.
Yes, the 68000 was designed to have full support for 32-bit types and a 32-bit future, but (primarily for cost reasons) used a 16-bit ALU and
16-bit buses internally and externally.
Some 68000 compilers had 16-bit
int, some had 32-bit int, and some let you choose either, since 16-bit
types could be significantly faster on the 68000 even though the general-purpose registers were 32-bit.
A 32-bit bus would have priced the 68K at 30%-50% higher simply
due to the number of pins on available packages. This would have
eliminated any chance at competing for the broad markets at that
time.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
A 32-bit bus would have priced the 68K at 30%-50% higher simply
due to the number of pins on available packages. This would have
eliminated any chance at competing for the broad markets at that
time.
Would have an external 16-bit bus and an internal 32-bit bus have
been advantageous, or would this have blown a likely transistor
budget for little gain?
On Tue, 1 Oct 2024 20:00:28 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
A 32-bit bus would have priced the 68K at 30%-50% higher simply
due to the number of pins on available packages. This would have
eliminated any chance at competing for the broad markets at that
time.
Would have an external 16-bit bus and an internal 32-bit bus have
been advantageous, or would this have blown a likely transistor
budget for little gain?
It was pin limited and internal area limited; and close to being
power limited (NMOS).
MitchAlsup1 <mitchalsup@aol.com> schrieb:
A 32-bit bus would have priced the 68K at 30%-50% higher simply
due to the number of pins on available packages. This would have
eliminated any chance at competing for the broad markets at that
time.
Would have an external 16-bit bus and an internal 32-bit bus have
been advantageous, or would this have blown a likely transistor
budget for little gain?
On 10/1/24 3:00 PM, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
A 32-bit bus would have priced the 68K at 30%-50% higher simply
due to the number of pins on available packages. This would have
eliminated any chance at competing for the broad markets at that
time.
Would have an external 16-bit bus and an internal 32-bit bus have
been advantageous, or would this have blown a likely transistor
budget for little gain?
Saving an extra pass through the 16 bit ALU for a 32 bit operation would
be faster. Assuming that you didn't have to wait for another bus cycle
to get the other half of an operand.
Making it faster for register to register operations and not much else.
A 16 bit barrel roller does not make sense, and Motorola had no idea that shifts would be so important.
On Tue, 1 Oct 2024 20:00:28 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
A 32-bit bus would have priced the 68K at 30%-50% higher simply
due to the number of pins on available packages. This would have
eliminated any chance at competing for the broad markets at that
time.
Would have an external 16-bit bus and an internal 32-bit bus have
been advantageous, or would this have blown a likely transistor
budget for little gain?
It was pin limited and internal area limited; and close to being
power limited (NMOS).
David Schultz <david.schultz@earthlink.net> wrote:
On 10/1/24 3:00 PM, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
A 32-bit bus would have priced the 68K at 30%-50% higher simply
due to the number of pins on available packages. This would have
eliminated any chance at competing for the broad markets at that
time.
Would have an external 16-bit bus and an internal 32-bit bus have
been advantageous, or would this have blown a likely transistor
budget for little gain?
Saving an extra pass through the 16 bit ALU for a 32 bit operation would
be faster. Assuming that you didn't have to wait for another bus cycle
to get the other half of an operand.
Making it faster for register to register operations and not much else.
A 16 bit barrel roller does not make sense, and Motorola had no idea
that shifts would be so important.
In the original 68000, a barrel shifter would have blown the area
budget--it would have been about equal to the d-section; even in
16-bit form. Remember this was a 1 layer metal design before poly
silicon was in the process.
Would have an external 16-bit bus and an internal 32-bit bus have been advantageous, or would this have blown a likely transistor budget for
little gain?
My wish list for the 68k was a barrel roller for fast shifts, which would
have made a HUGE difference for the first Macintosh.
On Tue, 1 Oct 2024 23:38:10 -0000 (UTC), Brett wrote:
My wish list for the 68k was a barrel roller for fast shifts, would have
made a HUGE difference for the first Macintosh.
The 68020 certainly had that. It also added bit-field instructions, on top
of the single-bit instructions of the original 68000.
And in typical big-endian fashion, they added yet another inconsistency to the way the bits were numbered ...
The 68020 certainly had that. It also added bit-field instructions, on top
of the single-bit instructions of the original 68000.
And in typical big-endian fashion, they added yet another inconsistency to
the way the bits were numbered ...
On Tue, 1 Oct 2024 23:38:10 -0000 (UTC), Brett wrote:
My wish list for the 68k was a barrel roller for fast shifts, would have
made a HUGE difference for the first Macintosh.
The 68020 certainly had that. It also added bit-field instructions, on top
of the single-bit instructions of the original 68000.
And in typical big-endian fashion, they added yet another inconsistency to the way the bits were numbered ...
Since you mentioned POWER and PowerPC elsewhere, the bit numbering
challenges of the m68k world are nothing compared to the PowerPC world.
Originally, the PowerPC was 32-bit and numbered bits from 0 as the MSB
down to 31 as the LSB. So your 32-bit address bus had lines from A0
down to A31. Then it got extended to 64-bit (some devices had only
partial 64-bit extensions), and the chips got a wider address bus (you
never need all 64-bit lines physically) - the pins for the higher
address lines were numbered A-1, A-2, and so on. For the internal
registers, some are 64-bit numbered bit 0 (MSB) down to bit 63. Some
were original 32-bit and got extended to 64-bit, and so are numbered bit
-32 down to bit 31 for consistency. Others are 32-bit but numbered from
bit 32 down to bit 63.
David Brown <david.brown@hesbynett.no> writes:
Since you mentioned POWER and PowerPC elsewhere, the bit numbering
challenges of the m68k world are nothing compared to the PowerPC world.
Originally, the PowerPC was 32-bit and numbered bits from 0 as the MSB
down to 31 as the LSB. So your 32-bit address bus had lines from A0
down to A31. Then it got extended to 64-bit (some devices had only
partial 64-bit extensions), and the chips got a wider address bus (you
never need all 64-bit lines physically) - the pins for the higher
address lines were numbered A-1, A-2, and so on. For the internal
registers, some are 64-bit numbered bit 0 (MSB) down to bit 63. Some
were original 32-bit and got extended to 64-bit, and so are numbered bit
-32 down to bit 31 for consistency. Others are 32-bit but numbered from
bit 32 down to bit 63.
Maybe they should have started with the MSB as bit -31 or -63, which
would have allowed them to always use bit 0 for the LSB while having big-endian bit ordering.
For bit ordering big-endian (as in the PowerPC manual) looked more
wrong to me than for byte ordering; I thought that that was just a
matter of getting used to the unfamiliar bit ordering, but maybe the advantage of little-endian becomes more apparent in bit ordering, and
maybe that's why Motorola and Sun chose little-endian bit ordering
despite having big-endian byte ordering.
For both bit and byte ordering, the advantage of little-endian shows
up when there are several widths involved. So why is it more obvious
for bit-ordering?
BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is
a 64-bit architecture, and that the manual describes only the 32-bit
subset. Maybe the original Power was 32-bit.
I expect that the s390x uses the same bit numbering as Power.
Does it have instructions where that matters?
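A tiny illustration (my own, not from any manual) of why the width
matters: with LSB-0 numbering, bit n has value 2^n at any register width,
whereas with MSB-0 numbering the value of a given bit number changes when
the register grows from 32 to 64 bits -- which is exactly the renumbering
headache described above.

#include <stdint.h>
#include <stdio.h>

static uint64_t lsb0_bit(unsigned n)                 { return 1ull << n; }
static uint64_t msb0_bit(unsigned n, unsigned width) { return 1ull << (width - 1 - n); }

int main(void)
{
    printf("LSB-0 bit 0          = %llu\n", (unsigned long long)lsb0_bit(0));
    printf("MSB-0 bit 0 (32-bit) = %llu\n", (unsigned long long)msb0_bit(0, 32));
    printf("MSB-0 bit 0 (64-bit) = %llu\n", (unsigned long long)msb0_bit(0, 64));
    return 0;
}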
BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC
is a 64-bit architecture, and that the manual describes only the
32-bit subset. Maybe the original Power was 32-bit.
BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is a 64-bit architecture, and that the manual describes only the 32-bit
subset. Maybe the original Power was 32-bit.
I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.
The PowerPC 601 was first shown publicly in 1993; I can’t remember when
the fully 64-bit 620 came out, but it can’t have been long after.
Motorola did a similar thing with the 68000 family: if you compare the original 68000 instruction set with the 68020, you will see the latter
only needed to fill in a few gaps to become fully 32-bit.
Compare this with the pain the x86 world went through, over a much longer time, to move to 32-bit.
On Thu, 03 Oct 2024 09:39:03 GMT, Anton Ertl wrote:
BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is a
64-bit architecture, and that the manual describes only the 32-bit
subset. Maybe the original Power was 32-bit.
I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.
The PowerPC 601 was first shown publicly in 1993; I can’t remember when
the fully 64-bit 620 came out, but it can’t have been long after.
Motorola did a similar thing with the 68000 family: if you compare the original 68000 instruction set with the 68020, you will see the latter
only needed to fill in a few gaps to become fully 32-bit.
Compare this with the pain the x86 world went through, over a much longer time, to move to 32-bit.
On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
Compare this with the pain the x86 world went through, over a much longer
time, to move to 32-bit.
The x86 started from 8-bit roots, and increased width over time, which
is a very different path.
And much of the reason for it being a slow development is that the world
was held back by MS's lack of progress in using new features. The 80386
was produced in 1986, but the MS world was firmly at 16-bit until it
gained a bit of 32-bit features with Windows 95. (Windows NT was 32-bit
from 1993, and Win32s was from around the same time, but these were
relatively small in the market.)
On 10/4/2024 12:30 PM, Anton Ertl wrote:
Say, pretty much none of the modern graphics programs (that I am aware
of) really support working with 16-color and 256-color bitmap images
with a manually specified color palette.
Typically, any modern programs are true-color internally, typically only supporting 256-color as an import/export format with an automatically generated "optimized" palette, and often not bothering with 16-color
images at all. Not so useful if one is doing something that does
actually have a need for an explicit color palette (and does not have so
much need for any "photo manipulation" features).
And, most people generally haven't bothered with this stuff since the
Win16 era (even the people doing "pixel art" are still generally doing
so using true-color PNGs or similar).
Blame PowerPoint ... No more evil tool ever existed.
The fact that the 386SX only appeared in 1988 also did not help.
MS PaintBrush became MS Paint and seemingly mostly got dumbed down as
time went on.
Closest modern alternative is Paint.NET, but still doesn't allow manual palette control in the same way as BitEdit.
On Fri, 04 Oct 2024 17:30:07 GMT, Anton Ertl wrote:
The fact that the 386SX only appeared in 1988 also did not help.
As a software guy, I liked the idea of the 386SX, and encouraged friends/ colleagues to choose it over a 286.
Of course, they wanted to compare price/performance, but I saw things in terms of future software compatibility, and the sooner the move away from braindead x86 segmentation towards a nice, flat, expansive, linear address space, the better for everybody.
Sometimes I felt like a voice crying in the wilderness ...
Didn’t it take a decade for the 386 to get a 32 bit OS
Brett <ggtgp@yahoo.com> writes:
Didn’t it take a decade for the 386 to get a 32 bit OS
386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and Windows/386 appeared in 1987.
- anton
David Brown <david.brown@hesbynett.no> writes:
On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
Compare this with the pain the x86 world went through, over a much longer
time, to move to 32-bit.
The x86 started from 8-bit roots, and increased width over time, which
is a very different path.
Still, the question is why they did the 286 (released 1982) with its protected mode instead of adding IA-32 to the architecture, maybe at
the start with a 386SX-like package and with real-mode only, or with
the MMU in a separate chip (like the 68020/68851).
And much of the reason for it being a slow development is that the world
was held back by MS's lack of progress in using new features. The 80386
was produced in 1986, but the MS world was firmly at 16-bit until it
gained a bit of 32-bit features with Windows 95. (Windows NT was 32-bit
from 1993, and Win32s was from around the same time, but these were
relatively small in the market.)
At that time the market was moving much slower than nowadays. Systems
with a 286 (and maybe even the 8088) were sold for a long time after
the 386 was introduced. E.g., the IBM PS/1 Model 2011 was released in
1990 with a 10MHz 286, and the successor Model 2121 with a 386SX was
not introduced until 1992. I think it's hard to blame MS for
targeting the machines that were out there.
And looking at
<https://en.wikipedia.org/wiki/Windows_2.1x>, Windows 2.1 in 1988
already was available in a Windows/386 version (but the programs were
running in virtual 8086 mode, i.e., were still 16-bit programs).
And it was not just MS who was going in that direction. MS and IBM
worked on OS/2, and despite ambitious goals IBM insisted that the
software had to run on a 286.
The fact that the 386SX only appeared in 1988 also did not help.
- anton
On 04/10/2024 19:30, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
Compare this with the pain the x86 world went through, over a much longer
time, to move to 32-bit.
The x86 started from 8-bit roots, and increased width over time, which
is a very different path.
Still, the question is why they did the 286 (released 1982) with its
protected mode instead of adding IA-32 to the architecture, maybe at
the start with a 386SX-like package and with real-mode only, or with
the MMU in a separate chip (like the 68020/68551).
I can only guess the obvious - it is what some big customer(s) were
asking for. Maybe Intel didn't see the need for 32-bit computing in the
markets they were targeting, or at least didn't see it as worth the cost.
It is fair enough to target the existing market, but they were also slow
(IMHO) to take advantage of new opportunities in hardware, reinforcing
the situation.
I think MS and their monopoly on markets caused a
stagnation - lack of real competition meant lack of progress.
IBM were famous for poor (and perhaps cowardly) decisions at the time,
and MS happily screwed them over again and again in regards to OS/2.
I find it hard to believe that many customers would ask Intel
for something like the 80286 protected mode with segments limited
to 64KB, and even if, that Intel would listen to them. This
looks much more like an idee fixe to me that one or more of
the 286 project leaders had, and all customer input was made
to fit into this idea, or was ignored.
Another interpretation is that MS went faithfully into OS/2,
assigning not just their Xenix team to it (although according
to Wikipedia the Xenix abandonment by MS was due to AT&T
entering the Unix market) but reportedly also their best
MS-DOS developers. They tried to stick to OS/2 for
several years, but eventually were fed up with all the bad
decisions coming from IBM, and bowed out.
On Thu, 03 Oct 2024 09:39:03 GMT, Anton Ertl wrote:
BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC
is a 64-bit architecture, and that the manual describes only the
32-bit subset. Maybe the original Power was 32-bit.
I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.
The PowerPC 601 was first shown publicly in 1993; I can’t remember
when the fully 64-bit 620 came out, but it can’t have been long after.
Motorola did a similar thing with the 68000 family: if you compare
the original 68000 instruction set with the 68020, you will see the
latter only needed to fill in a few gaps to become fully 32-bit.
Compare this with the pain the x86 world went through, over a much
longer time, to move to 32-bit.
On Sat, 05 Oct 2024 18:11:55 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Brett <ggtgp@yahoo.com> writes:
Didn’t it take a decade for the 386 to get a 32 bit OS
386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and
Windows/386 appeared in 1987.
- anton
SunOS for i386 in 1988.
Netware 3x in 1990.
The latter sold in very high volumes.
On Sat, 05 Oct 2024 18:11:55 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Brett <ggtgp@yahoo.com> writes:
Didn’t it take a decade for the 386 to get a 32 bit OS
386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and
Windows/386 appeared in 1987.
- anton
SunOS for i386 in 1988.
Netware 3x in 1990.
The later sold in very high volumes.
The first 32 bit windows was Windows 95 ...
Then MS switched emphasis, so that the Windows API was the primary personality of OS/2 3.0, and renamed it Windows NT.
That also had an OS/2 personality at the start, along with a POSIX personality.
Motorola did a similar thing with the 68000 family: if you compare the
original 68000 instruction set with the 68020, you will see the latter
only needed to fill in a few gaps to become fully 32-bit.
Not similar at all.
In article <2024Oct6.150415@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I find it hard to believe that many customers would ask Intel
for something the 80286 protected mode with segments limited
to 64KB, and even if, that Intel would listen to them. This
looks much more like an idee fixe to me that one or more of
the 286 project leaders had, and all customer input was made
to fit into this idea, or was ignored.
Either half-remembered from older architectures, or re-invented and
considered viable a decade after the original inventors had learned
better.
Here's another speculation: The 286 protected mode was what they
already had in mind when they built the 8086, but there were not
enough transistors to do it in the 8086, so they did real mode, and in
the 80286 they finally got around to it. And the idea was (like AFAIK
in the iAPX432) to have one segment per object and per procedure,
i.e., the large memory model. The smaller memory models were
possible, but not really intended. The Huge memory model was
completely alien to protected mode, as was direct hardware access, as
was common on the IBM PC. And computing with segment register
contents was also not intended.
If programmers had used the 8086 in the intended way, porting to
protected mode would have been easy, but the programmers used it in
other ways, and the protected mode flopped.
Would it have been different if the 8086/8088 had already had
protected mode? I think that having one segment per object would have
been too inefficient, and also that 8192 segments is not enough for
that kind of usage, given 640KB of RAM (not to mention the 16MB that
the 286 supported); and with 640KB having the segments limited to 64KB
is too restrictive for a number of applications.
On 2024-10-07, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Here's another speculation: The 286 protected mode was what they
already had in mind when they built the 8086, but there were not
enough transistors to do it in the 8086, so they did real mode, and in
the 80286 they finally got around to it. And the idea was (like AFAIK
in the iAPX432) to have one segment per object and per procedure,
i.e., the large memory model. The smaller memory models were
possible, but not really intended. The Huge memory model was
completely alien to protected mode, as was direct hardware access, as
was common on the IBM PC. And computing with segment register
contents was also not intended.
If programmers had used the 8086 in the intended way, porting to
protected mode would have been easy, but the programmers used it in
other ways, and the protected mode flopped.
Would it have been differently if the 8086/8088 had already had
protected mode? I think that having one segment per object would have
been too inefficient, and also that 8192 segments is not enough for
that kind of usage, given 640KB of RAM (not to mention the 16MB that
the 286 supported); and with 640KB having the segments limited to 64KB
is too restrictive for a number of applications.
I completely agree. Back when the 8086 was designed, 640K seemed like a
lot. They never expected it to grow beyond the mainframes of their time.
Lars Poulsen wrote:
On 2024-10-07, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Here's another speculation: The 286 protected mode was what they
already had in mind when they built the 8086, but there were not
enough transistors to do it in the 8086, so they did real mode,
and in the 80286 they finally got around to it. And the idea was
(like AFAIK in the iAPX432) to have one segment per object and per
procedure, i.e., the large memory model. The smaller memory
models were possible, but not really intended. The Huge memory
model was completely alien to protected mode, as was direct
hardware access, as was common on the IBM PC. And computing with
segment register contents was also not intended.
If programmers had used the 8086 in the intended way, porting to
protected mode would have been easy, but the programmers used it in
other ways, and the protected mode flopped.
Would it have been differently if the 8086/8088 had already had
protected mode? I think that having one segment per object would
have been too inefficient, and also that 8192 segments is not
enough for that kind of usage, given 640KB of RAM (not to mention
the 16MB that the 286 supported); and with 640KB having the
segments limited to 64KB is too restrictive for a number of
applications.
I completely agree. Back when the 8086 was designed, 640K seemed
like a lot. They never expected it to grow beyond the mainframes of
their time.
640K was an artifact of the frame buffer placement selected by the
original IBM PC; it could just as well have been 900+ K.
AFAIR the PC also mishandled interrupt handling?
Terje
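For readers who don't have the real-mode memory map in their head, here is a small sketch of the well-known layout behind that claim. The addresses are the standard IBM PC ones; the 0xE0000 line is only a what-if, my own illustration of the "900+ K" point (896 KB is also what the DEC Rainbow mentioned below managed):

#include <stdio.h>

int main(void)
{
    /* Conventional RAM runs up to the first video aperture at 0xA0000. */
    printf("conventional RAM : 0x00000 - 0x9FFFF = %d KB\n", 0xA0000 / 1024);
    printf("EGA/VGA graphics : 0xA0000 - 0xAFFFF\n");
    printf("MDA text         : 0xB0000 - 0xB7FFF\n");
    printf("CGA text         : 0xB8000 - 0xBFFFF\n");
    /* Had the frame buffer been placed up at 0xE0000 instead, roughly
       896 KB of contiguous RAM would have been possible, not 640 KB. */
    printf("hypothetical cap : 0xE0000         = %d KB\n", 0xE0000 / 1024);
    return 0;
}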
On Sun, 6 Oct 2024 21:53:36 -0000 (UTC), Brett wrote:
The first 32 bit windows was Windows 95 ...
Windows NT 3.1, 1993.
jgd@cix.co.uk (John Dallman) writes:
In article <2024Oct6.150415@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I find it hard to believe that many customers would ask Intel
for something the 80286 protected mode with segments limited
to 64KB, and even if, that Intel would listen to them. This
looks much more like an idee fixe to me that one or more of
the 286 project leaders had, and all customer input was made
to fit into this idea, or was ignored.
Either half-remembered from older architectures, or re-invented and
considered viable a decade after the original inventors had learned
better.
Here's another speculation: The 286 protected mode was what they
already had in mind when they built the 8086, but there were not
enough transistors to do it in the 8086, so they did real mode, and in
the 80286 they finally got around to it. And the idea was (like AFAIK
in the iAPX432) to have one segment per object and per procedure,
i.e., the large memory model. The smaller memory models were
possible, but not really intended. The Huge memory model was
completely alien to protected mode, as was direct hardware access, as
was common on the IBM PC. And computing with segment register
contents was also not intended.
If programmers had used the 8086 in the intended way, porting to
protected mode would have been easy, but the programmers used it in
other ways, and the protected mode flopped.
Would it have been differently if the 8086/8088 had already had
protected mode? I think that having one segment per object would have
been too inefficient, and also that 8192 segments is not enough for
that kind of usage, given 640KB of RAM (not to mention the 16MB that
the 286 supported); and with 640KB having the segments limited to 64KB
is too restrictive for a number of applications.
- anton
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
On Sun, 6 Oct 2024 21:53:36 -0000 (UTC), Brett wrote:
The first 32 bit windows was Windows 95 ...
Windows NT 3.1, 1993.
So 8 years; that PC would already be in the trash can by then.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
jgd@cix.co.uk (John Dallman) writes:
In article <2024Oct6.150415@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I find it hard to believe that many customers would ask Intel
for something the 80286 protected mode with segments limited
to 64KB, and even if, that Intel would listen to them. This
looks much more like an idee fixe to me that one or more of
the 286 project leaders had, and all customer input was made
to fit into this idea, or was ignored.
Either half-remembered from older architectures, or re-invented and
considered viable a decade after the original inventors had learned
better.
Here's another speculation: The 286 protected mode was what they
already had in mind when they built the 8086, but there were not
enough transistors to do it in the 8086, so they did real mode, and
in the 80286 they finally got around to it. And the idea was (like
AFAIK in the iAPX432) to have one segment per object and per
procedure, i.e., the large memory model. The smaller memory models
were possible, but not really intended. The Huge memory model was completely alien to protected mode, as was direct hardware access,
as was common on the IBM PC. And computing with segment register
contents was also not intended.
If programmers had used the 8086 in the intended way, porting to
protected mode would have been easy, but the programmers used it in
other ways, and the protected mode flopped.
Would it have been differently if the 8086/8088 had already had
protected mode? I think that having one segment per object would
have been too inefficient, and also that 8192 segments is not
enough for that kind of usage, given 640KB of RAM (not to mention
the 16MB that the 286 supported); and with 640KB having the
segments limited to 64KB is too restrictive for a number of
applications.
I have for decades pointed out that the four bit offset of 8086
segments was planned obsolescence. An 8 bit offset with 16 megabytes
of address space would have kept the low end alive for too long in
Intels eyes. To control the market you need to drive complexity onto
the users, which weeds out licensed competition.
Everything Intel did drove needless patentable complexity into the
follow on CPUs.
- anton
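For concreteness, a small worked example of the 4-bit shift in question (the specific numbers are mine; the addressing rule is the standard 8086 one): a physical address is segment*16 + offset, giving a 20-bit, 1 MB space, whereas the hypothetical 8-bit shift Brett mentions would have given 16 MB.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t seg = 0x1234, off = 0x0010;

    /* Real 8086: segment shifted left 4 bits, 20-bit physical address. */
    uint32_t phys4 = ((uint32_t)seg << 4) + off;   /* 0x12350  */

    /* Hypothetical 8-bit shift: a 24-bit (16 MB) physical space. */
    uint32_t phys8 = ((uint32_t)seg << 8) + off;   /* 0x123410 */

    printf("4-bit shift: %05lX\n", (unsigned long)phys4);
    printf("8-bit shift: %06lX\n", (unsigned long)phys8);
    return 0;
}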
On Mon, 7 Oct 2024 16:32:34 -0000 (UTC)
Brett <ggtgp@yahoo.com> wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
jgd@cix.co.uk (John Dallman) writes:
In article <2024Oct6.150415@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I find it hard to believe that many customers would ask Intel
for something the 80286 protected mode with segments limited
to 64KB, and even if, that Intel would listen to them. This
looks much more like an idee fixe to me that one or more of
the 286 project leaders had, and all customer input was made
to fit into this idea, or was ignored.
Either half-remembered from older architectures, or re-invented and
considered viable a decade after the original inventors had learned
better.
Here's another speculation: The 286 protected mode was what they
already had in mind when they built the 8086, but there were not
enough transistors to do it in the 8086, so they did real mode, and
in the 80286 they finally got around to it. And the idea was (like
AFAIK in the iAPX432) to have one segment per object and per
procedure, i.e., the large memory model. The smaller memory models
were possible, but not really intended. The Huge memory model was
completely alien to protected mode, as was direct hardware access,
as was common on the IBM PC. And computing with segment register
contents was also not intended.
If programmers had used the 8086 in the intended way, porting to
protected mode would have been easy, but the programmers used it in
other ways, and the protected mode flopped.
Would it have been differently if the 8086/8088 had already had
protected mode? I think that having one segment per object would
have been too inefficient, and also that 8192 segments is not
enough for that kind of usage, given 640KB of RAM (not to mention
the 16MB that the 286 supported); and with 640KB having the
segments limited to 64KB is too restrictive for a number of
applications.
I have for decades pointed out that the four bit offset of 8086
segments was planned obsolescence. An 8 bit offset with 16 megabytes
of address space would have kept the low end alive for too long in
Intels eyes. To control the market you need to drive complexity onto
the users, which weeds out licensed competition.
Everything Intel did drove needless patentable complexity into the
follow on CPUs.
You forget that Intel didn't and couldn't expect that the 8088 would be
such a stunning success. Not just that: according to oral history, they
didn't realize what they had in their hands until 1983.
Not every PC made in those years was crap. Some of them were quite
reliable and lasted long.
Not every PC made in those years was crap. Some of them were quite reliable and lasted long.
But back then, Dennard scaling meant that an 8 year-old PC was so much
slower than a current PC that it was difficult to find people willing
to still use it.
Nowadays, for a large proportion of tasks, you can't really tell the difference between a last-generation CPU and an 8 year-old CPU, so the reliability is much more of a factor.
Stefan
The 80386 was introduced as pre-production samples for software
development workstations in October 1985.[5] Manufacturing of the chips
in significant quantities commenced in June 1986.
Here's another speculation: The 286 protected mode was what they already
had in mind when they built the 8086, but there were not enough
transistors to do it in the 8086, so they did real mode, and in the
80286 they finally got around to it.
640K was an artifact of the frame buffer placement selected by the
original IBM PC, it could just as well have been 900+ K.
jgd@cix.co.uk (John Dallman) writes:
In article <2024Oct6.150415@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
I find it hard to believe that many customers would ask Intel
for something the 80286 protected mode with segments limited
to 64KB, and even if, that Intel would listen to them. This
looks much more like an idee fixe to me that one or more of
the 286 project leaders had, and all customer input was made
to fit into this idea, or was ignored.
Either half-remembered from older architectures, or re-invented and
considered viable a decade after the original inventors had learned
better.
Here's another speculation: The 286 protected mode was what they
already had in mind when they built the 8086, but there were not
enough transistors to do it in the 8086, so they did real mode, and in
the 80286 they finally got around to it. And the idea was (like AFAIK
in the iAPX432) to have one segment per object and per procedure,
i.e., the large memory model. The smaller memory models were
possible, but not really intended. The Huge memory model was
completely alien to protected mode, as was direct hardware access, as
was common on the IBM PC. And computing with segment register
contents was also not intended.
If programmers had used the 8086 in the intended way, porting to
protected mode would have been easy, but the programmers used it in
other ways, and the protected mode flopped.
Would it have been differently if the 8086/8088 had already had
protected mode? I think that having one segment per object would have
been too inefficient, and also that 8192 segments is not enough for
that kind of usage, given 640KB of RAM (not to mention the 16MB that
the 286 supported); and with 640KB having the segments limited to 64KB
is too restrictive for a number of applications.
- anton
Is protected mode not "how Pascal" thinks of memory and objects in
memory ??
Whereas by the time 286 got out, everybody was wanting flat memory ala
C.
Is protected mode not "how Pascal" thinks of memory and objects
in memory ??
Whereas by the time 286 got out, everybody was wanting flat
memory ala C.
On Mon, 7 Oct 2024 19:57:44 +0300, Michael S wrote:
The 80386 was introduced as pre-production samples for software
development workstations in October 1985.[5] Manufacturing of the chips
in significant quantities commenced in June 1986.
And the first vendor to offer a Microsoft-compatible PC product based on
that chip? Compaq, with its “Deskpro 386” that same year, I believe.
On Mon, 7 Oct 2024 15:17:36 +0200, Terje Mathisen wrote:
640K was an artifact of the frame buffer placement selected by the
original IBM PC, it could just as well have been 900+ K.
Another MS-DOS machine, the DEC Rainbow, could be upgraded to 896KiB of
RAM. I know because our Comp Sci department had one.
That was the one with the dual Z80 and 8086 (8088?) chips, so it could run
3 different OSes: CP/M-80, CP/M-86, and MS-DOS. Not more than one at once, though (that would have been some trick).
But it was not fully hardware-compatible with the IBM PC.
Here's another speculation: The 286 protected mode was what they
already had in mind when they built the 8086, but there were not
enough transistors to do it in the 8086, so they did real mode, and in
the 80286 they finally got around to it. And the idea was (like AFAIK
in the iAPX432) to have one segment per object and per procedure,
i.e., the large memory model.
Would it have been differently if the 8086/8088 had already had
protected mode? I think that having one segment per object would have
been too inefficient, and also that 8192 segments is not enough for
that kind of usage, ...
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Here's another speculation: The 286 protected mode was what they
already had in mind when they built the 8086, but there were not
enough transistors to do it in the 8086, so they did real mode, and in
the 80286 they finally got around to it. And the idea was (like AFAIK
in the iAPX432) to have one segment per object and per procedure,
i.e., the large memory model.
If you look at the 8086 manuals, that's clearly what they had in mind.
What I don't get is that the 286's segment stuff was so slow.
On Mon, 7 Oct 2024 23:13:25 +0000, MitchAlsup1 wrote:
Is protected mode not "how Pascal" thinks of memory and objects in
memory ??
How is that different from C?
Whereas by the time 286 got out, everybody was wanting flat memory ala
C.
When did they not want that?
If you look at the 8086 manuals, that's clearly what they had in mind.
What I don't get is that the 286's segment stuff was so slow.
It had to load the whole segment descriptor from RAM and possibly
perform some additional setup.
Dave Cutler came from DEC (where he was one of the resident
Unix-haters) to mastermind the Windows NT project in 1988. When did
the OS/2-NT pivot take place?
Funny, you'd think they would use that same _personality_ system to
implement WSL1, the Linux-emulation layer. But they didn't.
I think the whole _personality_ concept, along with the supposed
portability to non-x86 architectures, had just bit-rotted away by
that point.
On Tue, 8 Oct 2024 6:16:12 +0000, Lawrence D'Oliveiro wrote:
On Mon, 7 Oct 2024 23:13:25 +0000, MitchAlsup1 wrote:
Is protected mode not "how Pascal" thinks of memory and objects in
memory ??
How is that different from C?
In pascal you cannot subtract pointers to different objects,
in C you can.
Whereas by the time 286 got out, everybody was wanting flat memory ala
C.
When did they not want that?
The Algol family of block structure gave the illusion that flat
was less necessary and it could all be done with lexical addressing
and block scoping rules.
Then malloc() and mmap() came along.
mitchalsup@aol.com (MitchAlsup1) writes:
Whereas by the time 286 got out, everybody was wanting flat
memory ala C.
It's interesting that, when C was standardized, the segmentation found
its way into it by disallowing subtracting and comparing between
addresses in different objects.
This disallows performing certain
forms of induction variable elimination by hand. So while flat memory
is C culture so much that you write "flat memory ala C", the
standardized subset of C (what standard C fanatics claim is the only
meaning of "C") actually specifies a segmented memory model.
An interesting case is the Forth standard. It specifies "contiguous regions", which correspond to objects in C, but in Forth each address
is a cell and can be added, subtracted, compared, etc. irrespective of
where it came from. So Forth really has a flat-memory model. It has
had that since its origins in the 1970s. Some of the 8086
implementations had some extensions for dealing with more than 64KB,
but they were never standardized and are now mostly forgotten.
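A small illustration (mine, not from the posts above) of the rule being described: subtraction and relational comparison are only defined for pointers into the same object, while equality comparison is always allowed.

#include <stdio.h>

int main(void)
{
    int a[4], b[4];
    int *p = &a[1], *q = &a[3];

    printf("%td\n", q - p);   /* well defined: same object, prints 2       */
    printf("%d\n", a == b);   /* equality across objects is fine, prints 0 */
    /* q - b;        would be undefined: pointers into different objects   */
    /* if (a < b)    likewise undefined behaviour in standard C            */
    return 0;
}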
On 08/10/2024 09:28, Anton Ertl wrote:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library. (Standard
library implementations don't need to be portable, and can rely on
extensions or other compiler-specific features.)
On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:
On 08/10/2024 09:28, Anton Ertl wrote:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library. (Standard
library implementations don't need to be portable, and can rely on
extensions or other compiler-specific features.)
Somebody has to write memmove() and they want to use C to do it.
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".
In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:
Dave Cutler came from DEC (where he was one of the resident
Unix-haters) to mastermind the Windows NT project in 1988. When did
the OS/2-NT pivot take place?
1990, after the release of Windows 3.0, which was an immediate commercial success. It was the first version that you could get serious work out of. It's been compared to a camel: a vicious brute at times, but capable of
doing a lot of carrying.
<https://en.wikipedia.org/wiki/OS/2#1990:_Breakup>
Funny, you'd think they would use that same _personality_ system to
implement WSL1, the Linux-emulation layer. But they didn't.
They were called subsystems in Windows NT, and ran on top of the NT
kernel. The POSIX one came first, and was very limited, followed by the Interix one that was called Windows Services for Unix. Programs for both
of these were in PE-COFF format, not ELF. There was also the OS/2
subsystem, but it only ran text-mode programs.
The POSIX subsystem was there to meet US government purchasing
requirements, not to be used for anything serious. I can't imagine Dave Cutler was keen on it.
WSL1 seems to have been something odd: rather than a single subsystem, a bunch of mini-subsystems. However, VMS/NT kernels just have different assumptions about programs from Unix-style kernels, so they went to lightweight virtualisation in WSL2.
<https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux#History>
The same problem seems to have messed up all the attempts to provide good Unix emulation on VMS. It's notable that MICA started out trying to
provide both VMS and Unix APIs, but this was dropped in favour of a
separate Unix OS before MICA was cancelled.
<https://en.wikipedia.org/wiki/DEC_MICA#Design_goals>
I think the whole _personality_ concept, along with the supposed
portability to non-x86 architectures, had just bit-rotted away by
that point.
Some combination of that, Microsoft confidence that "of course we can do something better now!" - they are very prone to overconfidence - and the terrible tendency of programmers to ignore the details of the old code.
John
In the VMS/WinNT way, each memory section is defined as either shared
or private when created and cannot be changed. This allows optimizations
in page table and page file handling.
*nix needs to maintain various data structures to support forking
memory just in case it happens.
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
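A hedged sketch of the trick Thomas describes, with invented names (expr, PARSE_ERROR): the address of a private static object can serve as a distinguished "error" pointer, since it can never compare equal to a real node, and pointer equality (unlike ordering) is always defined.

#include <stdio.h>

struct expr { int op; };                    /* stand-in AST node type */

static struct expr parse_error_sentinel;    /* never a real node      */
#define PARSE_ERROR (&parse_error_sentinel)

/* Hypothetical parser that signals failure with the sentinel. */
static struct expr *parse_subexpression(int ok)
{
    static struct expr node = { 42 };
    return ok ? &node : PARSE_ERROR;
}

int main(void)
{
    struct expr *e = parse_subexpression(0);
    if (e == PARSE_ERROR)                   /* equality is always defined */
        puts("parse error");
    return 0;
}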
On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:
On 08/10/2024 09:28, Anton Ertl wrote:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library. (Standard
library implementations don't need to be portable, and can rely on
extensions or other compiler-specific features.)
Somebody has to write memmove() and they want to use C to do it.
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library memmove() function!).
On 09/10/2024 18:28, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:
On 08/10/2024 09:28, Anton Ertl wrote:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library. (Standard
library implementations don't need to be portable, and can rely on
extensions or other compiler-specific features.)
Somebody has to write memmove() and they want to use C to do it.
They don't have to write it in standard, portable C. Standard libraries will, sometimes, use "magic" - they can be in assembly, or use compiler extensions, or target-specific features, or "-fsecret-flag-for-std-lib" compiler flags, or implementation-dependent features, or whatever they
want.
You will find that most implementations of memmove() are done by
converting the pointers to an unsigned integer type and comparing those values. The type chosen may be implementation-dependent, or it may be "uintptr_t" (even if you are using C90 for your code, the library
writers can use C99 for theirs).
Such implementations will not be portable to all systems. They won't
work on a target that has some kind of "fat" pointers or segmented
pointers that can't be translated properly to integers.
That's okay, of course. For targets that have such complications, that standard library function will be written in a different way.
The avrlibc library used by gcc for the AVR has its memmove()
implemented in assembly for speed, as does musl for some architectures.
There are lots of parts of the standard C library that cannot be written completely in portable standard C. (How would you write a function that handles files? You need non-portable OS calls.) That's why these
things are in the standard library in the first place.
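For the record, a minimal sketch (mine, not lifted from any particular libc) of the uintptr_t approach described above; it assumes a flat address space where integer ordering matches pointer ordering, which is exactly the non-portable part being discussed.

#include <stddef.h>
#include <stdint.h>

void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n == 0 || (uintptr_t)d == (uintptr_t)s)
        return dst;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination below source: copy forwards. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* Destination above source: copy backwards to avoid clobbering. */
        for (size_t i = n; i-- > 0; )
            d[i] = s[i];
    }
    return dst;
}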
In the VMS/WinNT way, each memory section is defined as either shared
or private when created and cannot be changed. This allows optimizations
in page table and page file handling.
Interesting. Do you happen to have a pointer for further reading
about it?
*nix needs to maintain various data structures to support forking
memory just in case it happens.
I can't imagine what those data structures would be (which might be just
another way to say that I was brought up on POSIX and can't imagine the
world differently).
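A POSIX-side illustration (mine) of that shared-versus-private distinction: with mmap() the choice is also per mapping and fixed at creation time, via MAP_SHARED or MAP_PRIVATE (MAP_ANONYMOUS is not strictly POSIX but is near-universal).

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;

    /* Private, copy-on-write mapping: the kind of memory that fork()
       must be prepared to duplicate lazily in the child. */
    char *priv = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Shared mapping: writes are visible to any process that maps it. */
    char *shrd = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (priv == MAP_FAILED || shrd == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    strcpy(priv, "private");
    strcpy(shrd, "shared");
    printf("%s / %s\n", priv, shrd);
    munmap(priv, len);
    munmap(shrd, len);
    return 0;
}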
On 10/9/2024 1:20 PM, David Brown wrote:
There are lots of parts of the standard C library that cannot be written
completely in portable standard C. (How would you write a function that
handles files?
You need non-portable OS calls.) That's why these
things are in the standard library in the first place.
On Wed, 9 Oct 2024 21:52:39 +0000, Stephen Fuld wrote:
On 10/9/2024 1:20 PM, David Brown wrote:
There are lots of parts of the standard C library that cannot be written completely in portable standard C. (How would you write a function that handles files?
Do you mean things other than open(), close(), read(), write(), lseek()
??
You need non-portable OS calls.) That's why these
things are in the standard library in the first place.
On 10/9/2024 1:20 PM, David Brown wrote:
There are lots of parts of the standard C library that cannot be
written completely in portable standard C. (How would you write a
function that handles files? You need non-portable OS calls.) That's
why these things are in the standard library in the first place.
I agree with everything you say up until the last sentence. There are several languages, mostly older ones like Fortran and COBOL, where the
file handling/I/O are defined portably within the language proper, not
in a separate library. It just moves the non-portable stuff from the library writer (as in C) to the compiler writer (as in Fortran, COBOL,
etc.)
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
On 09/10/2024 23:37, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C. That's why I said there was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up that
is proportionally more costly for small transfers. Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.
The use of wider register sizes can help to some extent, but not once
you have reached the width of the internal buses or cache bandwidth.
In general, there will be many aspects of a C compiler's code generator,
its run-time support library, and C standard libraries that can work
better if they are optimised for each new generation of processor.
Sometimes you just need to re-compile the library with a newer compiler
and appropriate flags, other times you need to modify the library source code. None of this is specific to memmove().
But it is true that you get an easier and more future-proof memmove()
and memcopy() if you have an ISA that supports scalable vector
processing of some kind, such as ARM and RISC-V have, rather than
explicitly sized SIMD registers.
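To make the set-up-cost point concrete, here is a sketch (mine, far cruder than what a real library does) of a copy routine with a cheap path for tiny sizes and a wider word loop otherwise; the fixed-size memcpy() calls inside the word loop are the usual idiom that compilers turn into plain loads and stores because the size is known at compile time.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *copy_small_fast(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n < 16) {                        /* tiny: plain byte loop, no set-up */
        while (n--)
            *d++ = *s++;
        return dst;
    }
    while (n >= sizeof(uint64_t)) {      /* larger: 8 bytes at a time */
        uint64_t w;
        memcpy(&w, s, sizeof w);         /* fixed-size memcpy avoids alignment UB */
        memcpy(d, &w, sizeof w);
        d += sizeof w; s += sizeof w; n -= sizeof w;
    }
    while (n--)                          /* tail */
        *d++ = *s++;
    return dst;
}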
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
On 09/10/2024 23:37, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
David Brown <david.brown@hesbynett.no> writes:
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in assembly
or using inline assembly, rather than in non-portable C (which is
the common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise directly.
The use of wider register sizes can help to some extent, but not
once you have reached the width of the internal buses or cache
bandwidth.
In general, there will be many aspects of a C compiler's code
generator, its run-time support library, and C standard libraries
that can work better if they are optimised for each new generation
of processor. Sometimes you just need to re-compile the library with
a newer compiler and appropriate flags, other times you need to
modify the library source code. None of this is specific to
memmove().
But it is true that you get an easier and more future-proof
memmove() and memcopy() if you have an ISA that supports scalable
vector processing of some kind, such as ARM and RISC-V have, rather
than explicitly sized SIMD registers.
Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
memcpy and memset.
They're three-instruction sets; prolog/body/epilog. There are
separate sets for forward vs. forward-or-backward copies.
The prolog instruction preconditions the copy and copies
an IMPDEF portion.
The body instruction performs an IMPDEF Portion and
the epilog instruction finalizes the copy.
The three instructions are issued consecutively.
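As a rough illustration (not from Scott's post), the forward-copy flavour of that sequence looks like this when wrapped in GCC-style inline assembly. The CPYF* mnemonics are the Armv8.8-A FEAT_MOPS ones; the C wrapper, the operand constraints and the assumption of a new enough assembler (-march=armv8.8-a or +mops) are mine.

#include <stddef.h>

static void *mops_memcpy(void *dst, const void *src, size_t n)
{
    void *d = dst;
    const void *s = src;

    /* Prologue / main / epilogue must be issued consecutively with the
       same registers; each copies an IMPLEMENTATION DEFINED portion and
       updates the pointer and count registers as it goes. */
    __asm__ volatile(
        "cpyfp [%0]!, [%1]!, %2!\n\t"
        "cpyfm [%0]!, [%1]!, %2!\n\t"
        "cpyfe [%0]!, [%1]!, %2!"
        : "+r"(d), "+r"(s), "+r"(n)
        :
        : "memory", "cc");
    return dst;
}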
On Thu, 10 Oct 2024 20:00:29 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
David Brown <david.brown@hesbynett.no> writes:
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in assembly
or using inline assembly, rather than in non-portable C (which is
the common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.
The use of wider register sizes can help to some extent, but not
once you have reached the width of the internal buses or cache
bandwidth.
In general, there will be many aspects of a C compiler's code
generator, its run-time support library, and C standard libraries
that can work better if they are optimised for each new generation
of processor. Sometimes you just need to re-compile the library with
a newer compiler and appropriate flags, other times you need to
modify the library source code. None of this is specific to
memmove().
But it is true that you get an easier and more future-proof
memmove() and memcopy() if you have an ISA that supports scalable
vector processing of some kind, such as ARM and RISC-V have, rather
than explicitly sized SIMD registers.
Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
memcpy and memset.
They're three-instruction sets; prolog/body/epilog. There are
separate sets for forward vs. forward-or-backward copies.
The prolog instruction preconditions the copy and copies
an IMPDEF portion.
The body instruction performs an IMPDEF Portion and
the epilog instruction finalizes the copy.
The three instructions are issued consecutively.
People who have more clue about Arm Inc's schedule than I do
expect Arm Cortex cores that implement these instructions to be
announced next May and to appear in actual [expensive] phones in 2026.
Which probably means 2027 at best for Neoverse cores.
It's hard to make an educated guess about the schedules of other Arm core
designers.
On 10/10/2024 20:38, MitchAlsup1 wrote:
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C.
That's why I said there was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up that
is proportionally more costly for small transfers.
Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.
On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:
On 10/10/2024 20:38, MitchAlsup1 wrote:
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C.
{
memmove( p, q, size );
}
Where the compiler produces the MM instruction itself. Looks damn
close to standard C to me !!
OR
for( int i = 0; i < size; i++ )
p[i] = q[i];
Which gets compiled to memcpy()--also looks to be standard C.
OR
p_struct = q_struct;
gets compiled to::
memmove( &p_struct, &q_struct, sizeof( q_struct ) );
also looks to be std C.
On 10/10/24 2:21 PM, David Brown wrote:
[ SNIP]
The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.
If the compiler generates the memmove instruction, then one doesn't
have to write memmove() in C - it is never called/used.
For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.
The use of wider register sizes can help to some extent, but not once
you have reached the width of the internal buses or cache bandwidth.
In general, there will be many aspects of a C compiler's code
generator, its run-time support library, and C standard libraries that
can work better if they are optimised for each new generation of
processor. Sometimes you just need to re-compile the library with a
newer compiler and appropriate flags, other times you need to modify
the library source code. None of this is specific to memmove().
But it is true that you get an easier and more future-proof memmove()
and memcopy() if you have an ISA that supports scalable vector
processing of some kind, such as ARM and RISC-V have, rather than
explicitly sized SIMD registers.
Not applicable.
On 10/10/2024 23:19, Brian G. Lucas wrote:
Not applicable.
I don't understand what you mean by that. /What/ is not applicable
to /what/ ?
On Fri, 11 Oct 2024 13:37:03 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 10/10/2024 23:19, Brian G. Lucas wrote:
Not applicable.
I don't understand what you mean by that. /What/ is not applicable
to /what/ ?
Brian probably meant to say that it is not applicable to his my66k
LLVM back end.
But I am pretty sure that what you suggest is applicable, but bad idea
for memcpy/memmove routine that targets Arm+SVE.
Dynamic dispatch based on concrete core features/identification, i.e.
exactly the same mechanism that is done on "non-scalable"
architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
effort.
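A hedged sketch of the dynamic-dispatch idea Michael describes: resolve once, based on a runtime feature probe, then call through a function pointer afterwards. The names here (cpu_has_wide_simd, memcpy_wide, my_memcpy) are placeholders of mine; glibc does the same job with IFUNC resolvers at load time.

#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

static void *memcpy_generic(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);               /* portable fallback            */
}

static void *memcpy_wide(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);               /* stand-in for a SIMD version  */
}

static int cpu_has_wide_simd(void)
{
    return 0;                             /* placeholder feature probe    */
}

static void *memcpy_resolve(void *, const void *, size_t);
static memcpy_fn memcpy_impl = memcpy_resolve;

static void *memcpy_resolve(void *d, const void *s, size_t n)
{
    /* First call pays for the probe; later calls go straight through. */
    memcpy_impl = cpu_has_wide_simd() ? memcpy_wide : memcpy_generic;
    return memcpy_impl(d, s, n);
}

void *my_memcpy(void *d, const void *s, size_t n)
{
    return memcpy_impl(d, s, n);
}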
On 10/9/2024 1:20 PM, David Brown wrote:
There are lots of parts of the standard C library that cannot be
written completely in portable standard C. (How would you write
a function that handles files? You need non-portable OS calls.)
That's why these things are in the standard library in the first
place.
I agree with everything you say up until the last sentence. There
are several languages, mostly older ones like Fortran and COBOL,
where the file handling/I/O are defined portably within the
language proper, not in a separate library. It just moves the
non-portable stuff from the library writer (as in C) to the
compiler writer (as in Fortran, COBOL, etc.)
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
Stefan Monnier <monnier@iro.umontreal.ca> writes:
In the VMS/WinNT way, each memory section is defined as either shared
or private when created and cannot be changed. This allows optimizations
in page table and page file handling.
Interesting. Do you happen to have a pointer for further reading
about it?
*nix needs to maintain various data structures to support forking
memory just in case it happens.
I can't imagine what those data structures would be (which might be just
another way to say that I was brought up on POSIX and can't imagine the
world differently).
http://bitsavers.org/pdf/dec/vax/vms/training/EY-8264E-DP_VMS_Internals_and_Data_Structures_4.4_1988.pdf
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
On 11/10/2024 20:55, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
You are either totally clueless, or you are trolling. And I know you
are not clueless.
This discussion has become pointless.
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different
objects? For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they
can implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard
library memmove() function!).
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
Scott Lurndal wrote:
Stefan Monnier <monnier@iro.umontreal.ca> writes:
In the VMS/WinNT way, each memory section is defined as either shared
or private when created and cannot be changed. This allows optimizations
in page table and page file handling.
Interesting. Do you happen to have a pointer for further reading
about it?
*nix needs to maintain various data structures to support forking
memory just in case it happens.
I can't imagine what those data structures would be (which might be just
another way to say that I was brought up on POSIX and can't imagine the
world differently).
http://bitsavers.org/pdf/dec/vax/vms/training/EY-8264E-DP_VMS_Internals_and_Data_Structures_4.4_1988.pdf
Yeah, that's a great book on how VMS works in detail.
My copy is v1.0 from 1981.
On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:
On 11/10/2024 20:55, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
You are either totally clueless, or you are trolling. And I know you
are not clueless.
This discussion has become pointless.
The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".
One good reason to put them in ISA is to preserve the programmers
efforts over decades, so they don't have to re-write libc every-
time a new set of instructions come out.
Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.
On 10/12/24 12:06 AM, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
Can R3 be a const, that causes issues for restartability, but branch
prediction is easier and the code is shorter.
Yes.
#include <string.h>
void memmoverr(char to[], char fm[], size_t cnt)
{
memmove(to, fm, cnt);
}
void memmoverd(char to[], char fm[])
{
memmove(to, fm, 0x100000000);
}
Yields:
memmoverr: ; @memmoverr
mm r1,r2,r3
ret
memmoverd: ; @memmoverd
mm r1,r2,#4294967296
ret
Though I guess forwarding a const is probably a thing today to improve
branch prediction, which is normally HORRIBLE for short branch counts.
On 12/10/2024 01:32, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:
On 11/10/2024 20:55, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
You are either totally clueless, or you are trolling. And I know you
are not clueless.
This discussion has become pointless.
The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".
One good reason to put them in ISA is to preserve the programmers
efforts over decades, so they don't have to re-write libc every-
time a new set of instructions come out.
Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.
Again, I have to ask - do you bother to read the posts you reply to?
Are you interested in replying, and engaging in the discussion? Or are
you just looking for a chance to promote your own architecture, no
matter how tenuous the connection might be to other posts?
Again, let me say that I agree with what you are saying - I agree that
an ISA should have instructions that are efficient for what people
actually want to do. I agree that it is a good thing to have
instructions that let performance scale with advances in hardware
ideally without needing changes in compiled binaries, and at least
without needing changes in source code.
I believe there is an interesting discussion to be had here, and I would enjoy hearing about comparisons of different ways that functions like memcpy() and memset() can be implemented in different architectures and optimised for different sizes, or how scalable vector instructions can
work in comparison to fixed-size SIMD instructions.
But at the moment, this potential is lost because you are posting total
shite about implementing memmove() in standard C. It is disappointing
that someone with your extensive knowledge and experience cannot see
this. I am finding it all very frustrating.
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
Can R3 be a const, that causes issues for restartability, but branch prediction is easier and the code is shorter.
Though I guess forwarding a const is probably a thing today to improve
branch prediction, which is normally HORRIBLE for short branch counts.
Brian G. Lucas <bagel99@gmail.com> wrote:
On 10/12/24 12:06 AM, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
Can R3 be a const, that causes issues for restartability, but branch
prediction is easier and the code is shorter.
Yes.
#include <string.h>
void memmoverr(char to[], char fm[], size_t cnt)
{
memmove(to, fm, cnt);
}
void memmoverd(char to[], char fm[])
{
memmove(to, fm, 0x100000000);
}
Yields:
memmoverr: ; @memmoverr
mm r1,r2,r3
ret
memmoverd: ; @memmoverd
mm r1,r2,#4294967296
ret
Excellent!
Though I guess forwarding a const is probably a thing today to improve
branch prediction, which is normally HORRIBLE for short branch counts.
What is the default virtual loop count if the register count is not available?
Worst case, the source and dest are in cache, and the count is 150 cycles
away in memory. So hundreds of chars could be copied until the value is
loaded, and that count value could be, say, 5.
Lots of work and time discarded, so you play the odds, perhaps to the
low side, and over-prefetch to cover being wrong.
On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:
Brian G. Lucas <bagel99@gmail.com> wrote:
On 10/12/24 12:06 AM, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
Can R3 be a const, that causes issues for restartability, but branch
prediction is easier and the code is shorter.
Yes.
#include <string.h>
void memmoverr(char to[], char fm[], size_t cnt)
{
memmove(to, fm, cnt);
}
void memmoverd(char to[], char fm[])
{
memmove(to, fm, 0x100000000);
}
Yields:
memmoverr: ; @memmoverr
mm r1,r2,r3
ret
memmoverd: ; @memmoverd
mm r1,r2,#4294967296
ret
Excellent!
Though I guess forwarding a const is probably a thing today to improve branch prediction, which is normally HORRIBLE for short branch counts.
What is the default virtual loop count if the register count is not
available?
There is always a count available; it can come from a register or an immediate.
Worst case the source and dest are in cache, and the count is 150 cycles
away in memory. So hundreds of chars could be copied until the value is
loaded and that count value could be say 5.
The instruction cannot start until the count is known. You don't start
an FMAC until all 3 operands are ready, either.
Lots of work and time
discarded, so you play the odds, perhaps to the low side and over
prefetch to cover being wrong.
On Sat, 12 Oct 2024 5:06:05 +0000, Brett wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
Can R3 be a const, that causes issues for restartability, but branch prediction is easier and the code is shorter.
The 3rd Operand can, indeed, be a constant.
That causes no restartability problem when you have a place to
store the current count==index, so that when control returns
and you re-execute MM, it sees that x amount has already been
done, and C-X is left.
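As an illustrative sketch only (not the actual My 66000 mechanism; the state layout and names here are invented), restartability just means the copy keeps its progress somewhere that survives an interruption:
#include <stddef.h>
/* Hypothetical illustration: a copy whose progress lives in a saved
   index, so re-executing it after an interrupt resumes with
   count - index bytes still to go, rather than starting over.       */
typedef struct { size_t index; } mm_state;
static void mm_resume(mm_state *st, unsigned char *dst,
                      const unsigned char *src, size_t count)
{
    while (st->index < count) {
        dst[st->index] = src[st->index];
        st->index++;               /* progress survives a trap here */
    }
}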
David Brown <david.brown@hesbynett.no> wrote:
On 12/10/2024 01:32, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:
On 11/10/2024 20:55, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
You are either totally clueless, or you are trolling. And I know you
are not clueless.
This discussion has become pointless.
The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".
One good reason to put them in ISA is to preserve the programmers
efforts over decades, so they don't have to re-write libc every-
time a new set of instructions come out.
Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.
Again, I have to ask - do you bother to read the posts you reply to?
Are you interested in replying, and engaging in the discussion? Or are
you just looking for a chance to promote your own architecture, no
matter how tenuous the connection might be to other posts?
Again, let me say that I agree with what you are saying - I agree that
an ISA should have instructions that are efficient for what people
actually want to do. I agree that it is a good thing to have
instructions that let performance scale with advances in hardware
ideally without needing changes in compiled binaries, and at least
without needing changes in source code.
I believe there is an interesting discussion to be had here, and I would
enjoy hearing about comparisons of different ways things functions like
memcpy() and memset() can be implemented in different architectures and
optimised for different sizes, or how scalable vector instructions can
work in comparison to fixed-size SIMD instructions.
But at the moment, this potential is lost because you are posting total
shite about implementing memmove() in standard C. It is disappointing
that someone with your extensive knowledge and experience cannot see
this. I am finding it all very frustrating.
In short your complaints are wrong headed in not understanding what
hardware memcpy can do.
On 11/10/2024 14:13, Michael S wrote:
On Fri, 11 Oct 2024 13:37:03 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 10/10/2024 23:19, Brian G. Lucas wrote:
Not applicable.
I don't understand what you mean by that. /What/ is not applicable
to /what/ ?
Brian probably meant to say that that it is not applicable to his
my66k LLVM back end.
But I am pretty sure that what you suggest is applicable, but bad
idea for memcpy/memmove routine that targets Arm+SVE.
Dynamic dispatch based on concrete core features/identification,
i.e. exactly the same mechanism that is done on "non-scalable" architectures, would provide better performance. And memcpy/memmove
is certainly sufficiently important to justify an additional
development effort.
That explanation helps a little, but only a little. I wasn't
suggesting anything - or if I was, it was several posts ago and the
context has long since been snipped.
Can you be more explicit about
what you think I was suggesting, and why it might not be a good idea
for targeting a "my66k" ISA? (That is not a processor I have heard
of, so you'll have to give a brief summary of any particular features
that are relevant here.)
On 2024-10-12 21:33, Brett wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 12/10/2024 01:32, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:
On 11/10/2024 20:55, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
You are either totally clueless, or you are trolling. And I
know you are not clueless.
This discussion has become pointless.
The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".
One good reason to put them in ISA is to preserve the programmers
efforts over decades, so they don't have to re-write libc every-
time a new set of instructions come out.
Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.
Again, I have to ask - do you bother to read the posts you reply
to? Are you interested in replying, and engaging in the
discussion? Or are you just looking for a chance to promote your
own architecture, no matter how tenuous the connection might be to
other posts?
Again, let me say that I agree with what you are saying - I agree
that an ISA should have instructions that are efficient for what
people actually want to do. I agree that it is a good thing to
have instructions that let performance scale with advances in
hardware ideally without needing changes in compiled binaries, and
at least without needing changes in source code.
I believe there is an interesting discussion to be had here, and I
would enjoy hearing about comparisons of different ways things
functions like memcpy() and memset() can be implemented in
different architectures and optimised for different sizes, or how
scalable vector instructions can work in comparison to fixed-size
SIMD instructions.
But at the moment, this potential is lost because you are posting
total shite about implementing memmove() in standard C. It is
disappointing that someone with your extensive knowledge and
experience cannot see this. I am finding it all very frustrating.
[ snip discussion of HW ]
In short your complaints are wrong headed in not understanding what hardware memcpy can do.
I think your reply proves David's complaint: you did not read, or did
not understand, what David is frustrated about. The true fact that
David is defending is that memmove() cannot be implemented
"efficiently" in /standard/ C source code, on /any/ HW, because it
would require comparing /C pointers/ that point to potentially
different /C objects/, which is not defined behavior in standard C,
whether compiled to machine code, or executed by an interpreter of C
code, or executed by a human programmer performing what was called
"desk testing" in the 1960s.
Obviously memmove() can be implemented efficiently in non-standard C
where such pointers can be compared, or by sequences of ordinary ALU instructions, or by dedicated instructions such as Mitch's MM, and
David is not disputing that. But Mitch seems not to understand or not
to see the issue about standard C vs memmove().
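For concreteness, the usual non-standard implementation looks roughly like the sketch below; the d <= s test is precisely the comparison of pointers into potentially different objects that ISO C leaves undefined (the name is made up to avoid clashing with the library):
#include <stddef.h>
void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d <= s) {                    /* undefined in ISO C when d and s  */
        while (n--) *d++ = *s++;     /* point into different objects     */
    } else {
        d += n; s += n;
        while (n--) *--d = *--s;     /* copy backwards to handle overlap */
    }
    return dst;
}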
On 12.10.24 17:16, David Brown wrote:
[snip rant]
You are aware that this is c.arch, not c.lang.c?
On Sun, 13 Oct 2024 10:31:49 +0300
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
On 2024-10-12 21:33, Brett wrote:
David Brown <david.brown@hesbynett.no> wrote:
But at the moment, this potential is lost because you are posting
total shite about implementing memmove() in standard C. It is
disappointing that someone with your extensive knowledge and
experience cannot see this. I am finding it all very frustrating.
[ snip discussion of HW ]
In short your complaints are wrong headed in not understanding what
hardware memcpy can do.
I think your reply proves David's complaint: you did not read, or did
not understand, what David is frustrated about. The true fact that
David is defending is that memmove() cannot be implemented
"efficiently" in /standard/ C source code, on /any/ HW, because it
would require comparing /C pointers/ that point to potentially
different /C objects/, which is not defined behavior in standard C,
whether compiled to machine code, or executed by an interpreter of C
code, or executed by a human programmer performing what was called
"desk testing" in the 1960s.
Obviously memmove() can be implemented efficiently in non-standard C
where such pointers can be compared, or by sequences of ordinary ALU
instructions, or by dedicated instructions such as Mitch's MM, and
David is not disputing that. But Mitch seems not to understand or not
to see the issue about standard C vs memmove().
A sufficiently advanced compiler can recognize patterns and replace them
with built-in sequences.
In the case of memmove() the most easily recognizable pattern in 100%
standard C99 appears to be:
void *memmove(void *dest, const void *src, size_t count)
{
    if (count > 0) {
        char tmp[count];
        memcpy(tmp, src, count);
        memcpy(dest, tmp, count);
    }
    return dest;
}
I don't suggest that the real implementation in Brian's compiler is like
that. Much more likely his implementation uses non-standard C and looks
approximately like:
void *memmove(void *dest, const void *src, size_t count)
{
    return __builtin_memmove(dest, src, count);
}
However, implementing the first variant efficiently is well within
abilities of good compiler.
On Fri, 11 Oct 2024 16:54:13 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 11/10/2024 14:13, Michael S wrote:
On Fri, 11 Oct 2024 13:37:03 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 10/10/2024 23:19, Brian G. Lucas wrote:
Not applicable.
I don't understand what you mean by that. /What/ is not applicable
to /what/ ?
Brian probably meant to say that that it is not applicable to his
my66k LLVM back end.
But I am pretty sure that what you suggest is applicable, but bad
idea for memcpy/memmove routine that targets Arm+SVE.
Dynamic dispatch based on concrete core features/identification,
i.e. exactly the same mechanism that is done on "non-scalable"
architectures, would provide better performance. And memcpy/memmove
is certainly sufficiently important to justify an additional
development effort.
That explanation helps a little, but only a little. I wasn't
suggesting anything - or if I was, it was several posts ago and the
context has long since been snipped.
You suggested that "scalable" vector extension are preferable for memcpy/memmove implementation over "non-scalable" SIMD.
Can you be more explicit about
what you think I was suggesting, and why it might not be a good idea
for targeting a "my66k" ISA? (That is not a processor I have heard
of, so you'll have to give a brief summary of any particular features
that are relevant here.)
The proper spelling appears to be My 66000.
For starters, My 66000 has no SIMD. It does not even have a dedicated FP register file. Both FP and Int share a common 32x64-bit register space.
More importantly, it has a dedicated instruction with exactly the same
semantics as memmove(). Pretty much the same as ARM64. In both cases the instruction is defined, but not yet implemented in production silicon.
The difference is that in the case of ARM64 we can be reasonably sure that eventually it will be implemented in production silicon. Which means
that in at least several out of a multitude of implementations it will
suck.
David Brown <david.brown@hesbynett.no> wrote:
On 12/10/2024 01:32, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:
On 11/10/2024 20:55, MitchAlsup1 wrote:
On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:
Do you think you can just write this :
void * memmove(void * s1, const void * s2, size_t n)
{
return memmove(s1, s2, n);
}
in your library's source?
.global memmove
memmove:
MM R2,R1,R3
RET
sure !
You are either totally clueless, or you are trolling. And I know you
are not clueless.
This discussion has become pointless.
The point is that there are a few things that may be hard to do
with {decode, pipeline, calculations, specifications...}; but
because they are so universally needed; these, too, should
"get into ISA".
One good reason to put them in ISA is to preserve the programmers
efforts over decades, so they don't have to re-write libc every-
time a new set of instructions come out.
Moving an arbitrary amount of memory from point a to point b
happens to fall into that universal need. Setting an arbitrary
amount of memory to a value also falls into that universal
need.
Again, I have to ask - do you bother to read the posts you reply to?
Are you interested in replying, and engaging in the discussion? Or are
you just looking for a chance to promote your own architecture, no
matter how tenuous the connection might be to other posts?
Again, let me say that I agree with what you are saying - I agree that
an ISA should have instructions that are efficient for what people
actually want to do. I agree that it is a good thing to have
instructions that let performance scale with advances in hardware
ideally without needing changes in compiled binaries, and at least
without needing changes in source code.
I believe there is an interesting discussion to be had here, and I would
enjoy hearing about comparisons of different ways things functions like
memcpy() and memset() can be implemented in different architectures and
optimised for different sizes, or how scalable vector instructions can
work in comparison to fixed-size SIMD instructions.
But at the moment, this potential is lost because you are posting total
shite about implementing memmove() in standard C. It is disappointing
that someone with your extensive knowledge and experience cannot see
this. I am finding it all very frustrating.
There are only two decisions to make in memcpy: are the copies less than copy-size aligned, and do the pointers overlap within the copy size.
For hardware this simplifies down to perhaps two types of copies, easy and hard.
If you make the hard case fast, and you will, then two versions are all you need, not the dozens of choices with 1k of code you need in C.
Often you know which of the two you want at compile time from the pointer type.
In short your complaints are wrong headed in not understanding what
hardware memcpy can do.
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library.
An interesting case is the Forth standard. It specifies "contiguous
regions", which correspond to objects in C, but in Forth each address
is a cell and can be added, subtracted, compared, etc. irrespective of
where it came from. So Forth really has a flat-memory model. It has
had that since its origins in the 1970s. Some of the 8086
implementations had some extensions for dealing with more than 64KB,
but they were never standardized and are now mostly forgotten.
Forth does not require a flat memory model in the hardware, as far as I
am aware, any more than C does. (I appreciate that your knowledge of
Forth is /vastly/ greater than mine.) A Forth implementation could
interpret part of the address value as the segment or other memory block
identifier and part of it as an index into that block, just as a C
implementation can.
On 12/10/2024 19:26, Bernd Linsel wrote:
On 12.10.24 17:16, David Brown wrote:
[snip rant]
You are aware that this is c.arch, not c.lang.c?
Absolutely, yes.
But in a thread branch discussing C, details of C are relevant.
I don't expect any random regular here to know "language lawyer" details
of the C standards. I don't expect people here to care about them.
People in comp.lang.c care about them - for people here, the main
interest in C is for programs to run on the computer architectures that
are the real interest.
But if someone engages in a conversation about C, I /do/ expect them to understand some basics, and I /do/ expect them to read and think about
what other posters write. The point under discussion was that you
cannot implement an efficient "memmove()" function in fully portable
standard C. That's a fact - it is a well-established fact. Another
clear and inarguable fact is that particular ISAs or implementations are completely irrelevant to fully portable standard C - that is both the advantage and the disadvantage of writing code in portable standard C.
All I am asking Mitch to do is to understand this, and to stop saying
silly things (such as implementing memmove() by calling memmove(), or
that the /reason/ you can't implement memmove() efficiently in portable standard C is weaknesses in the x86 ISA), so that we can clear up his misunderstandings and move on to the more interesting computer
architecture discussions.
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
On 09/10/2024 23:37, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you write
an efficient memmove() in standard C. That's why I said there was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up that
is proportionally more costly for small transfers. Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.
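As a small illustration of that last point (a sketch; exact code generation naturally varies by compiler and target), when the size is known at compile time a good compiler reduces the call to a plain load and store, with no library call at all:
#include <stdint.h>
#include <string.h>
/* Typically compiles to a single 8-byte load and store - no call, no loop. */
static uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}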
David Brown <david.brown@hesbynett.no> wrote:
<snip>
All I am asking Mitch to do is to understand this, and to stop saying
silly things (such as implementing memmove() by calling memmove(), or
that the /reason/ you can't implement memmove() efficiently in portable
standard C is weaknesses in the x86 ISA), so that we can clear up his
misunderstandings and move on to the more interesting computer
architecture discussions.
My 66000 only has one MM instruction because when you throw enough hardware
at the problem, one instruction is all you need.
And it also covers MemCopy, and yes there is a backwards copy version.
I detailed the hardware to do this several years ago on Real World Tech.
Brett <ggtgp@yahoo.com> writes:
David Brown <david.brown@hesbynett.no> wrote:
<snip>
All I am asking Mitch to do is to understand this, and to stop
saying silly things (such as implementing memmove() by calling
memmove(), or that the /reason/ you can't implement memmove()
efficiently in portable standard C is weaknesses in the x86 ISA),
so that we can clear up his misunderstandings and move on to the
more interesting computer architecture discussions.
My 66000 only has one MM instruction because when you throw enough
hardware at the problem, one instruction is all you need.
And it also covers MemCopy, and yes there is a backwards copy
version.
I detailed the hardware to do this several years ago on Real World
Tech.
Such hardware (memcpy/memmove/memfill) was available in 1965 on the
Burroughs medium systems mainframes. In the 80s, support was added
for hashing strings as well.
It's not a new concept. In fact, there were some tricks that could
be used with overlapping source and destination buffers that would
replicate chunks of data.
David Brown wrote:
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
On 09/10/2024 23:37, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different
objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.
What you are missing here David is the fact that Mitch's MM is a single instruction which does the entire memmove() operation, and has the
inside knowledge about cache (residency at level x? width in
bytes)/memory ranges/access rights/etc needed to do so in a very close
to optimal manner, for both short and long transfers.
I.e. totally removing the need for compiler tricks or wide register operations.
Also apropos the compiler library issue:
You start by teaching the compiler about the MM instruction, and to
recognize common patterns (just as most compilers already do today), and
then the memmove() calls will usually be inlined.
David Brown <david.brown@hesbynett.no> writes:
When would you ever /need/ to compare pointers to different objects?
For almost all C programmers, the answer is "never". Pretty much the
only example people ever give of needing such comparisons is to
implement memmove() efficiently - but you don't need to implement
memmove(), because it is already in the standard library.
When you implements something like, say
vsum(double *a, double *b, double *c, size_t n);
where a, b, and c may point to arrays in different objects, or may
point to overlapping parts of the same object, and the result vector c
in the overlap case should be the same as in the no-overlap case
(similar to memmove()), being able to compare pointers to possibly
different objects also comes in handy.
Another example is when the programmer uses the address as a key in,
e.g., a binary search tree. And, as you write, casting to intptr_t is
not guaranteed to work by the C standard, either.
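In practice such tree keys are usually ordered through uintptr_t anyway, relying on the common (but not standard-guaranteed) assumption that the conversion preserves a total order over all objects. A sketch, with a made-up name:
#include <stdint.h>
/* Order two node addresses for use as a search-tree key.  Assumes,
   beyond what the C standard promises, that pointer-to-uintptr_t
   conversion gives a consistent total order.                        */
static int addr_cmp(const void *a, const void *b)
{
    uintptr_t ua = (uintptr_t)a, ub = (uintptr_t)b;
    return (ua > ub) - (ua < ub);   /* -1, 0 or +1 */
}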
An example that probably compares pointers to the same object as far
as the C standard is concerned, but feel like different objects to the programmer, is logic variables (in, e.g., a Prolog implementation).
When you have two free variables, and you unify them, in the
implementation one variable points to the other one. Now which should
point to which? The younger variable should point to the older one,
because it will die sooner. How do you know which variable is
younger? You compare the addresses; the variables reside on a stack,
so the younger one is closer to the top.
If that stack is one object as far as the C standard is concerned,
there is no problem with that solution. If the stack is implemented
as several objects (to make it easier growable; I don't know if there
is a Prolog implementation that does that), you first have to check in
which piece it is (maybe with a binary search), and then possibly
compare within the stack piece at hand.
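A minimal sketch of that binding rule, assuming the variable cells live in one C array (one object), so the comparison stays within standard C; the types and names are invented:
typedef struct cell { struct cell *ref; } cell;   /* ref == NULL: free variable */
/* Unify two free variables on a stack that grows towards higher
   addresses: the younger (higher) one points at the older (lower),
   because the younger will be popped first.                         */
static void bind_vars(cell *a, cell *b)
{
    if (a > b)
        a->ref = b;
    else
        b->ref = a;
}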
An interesting case is the Forth standard. It specifies "contiguous
regions", which correspond to objects in C, but in Forth each address
is a cell and can be added, subtracted, compared, etc. irrespective of
where it came from. So Forth really has a flat-memory model. It has
had that since its origins in the 1970s. Some of the 8086
implementations had some extensions for dealing with more than 64KB,
but they were never standardized and are now mostly forgotten.
Forth does not require a flat memory model in the hardware, as far as I
am aware, any more than C does. (I appreciate that your knowledge of
Forth is /vastly/ greater than mine.) A Forth implementation could
interpret part of the address value as the segment or other memory block
identifier and part of it as an index into that block, just as a C
implementation can.
I.e., what you are saying is that one can simulate a flat-memory model
on a segmented memory model.
Certainly. In the case of the 8086 (and
even more so on the 286) the costs of that are so high that no
widely-used Forth system went there.
One can also simulate segmented memory (a natural fit for many
programming languages) on flat memory. In this case the cost is much smaller, plus it gives the maximum flexibility about segment/object
sizes and numbers. That is why flat memory has won.
On 13/10/2024 21:21, Terje Mathisen wrote:
David Brown wrote:
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
On 09/10/2024 23:37, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different
objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard library
memmove() function!).
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said there
was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in assembly or
using inline assembly, rather than in non-portable C (which is the
common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.
What you are missing here David is the fact that Mitch's MM is a
single instruction which does the entire memmove() operation, and has
the inside knowledge about cache (residency at level x? width in
bytes)/memory ranges/access rights/etc needed to do so in a very close
to optimal manner, for both short and long transfers.
I am not missing that at all. And I agree that an advanced hardware MM instruction could be a very efficient way to implement both memcpy and memmove. (For my own kind of work, I'd worry about such looping instructions causing an unbounded increase in interrupt latency, but
that too is solvable given enough hardware effort.)
And I agree that once you have an "MM" (or similar) instruction, you
don't need to re-write the implementation for your memmove() and
memcpy() library functions for every new generation of processors of a
given target family.
What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will /sometimes/ get benefits from doing so, but it is not as simple as Mitch made out.
I.e. totally removing the need for compiler tricks or wide register
operations.
Also apropos the compiler library issue:
You start by teaching the compiler about the MM instruction, and to
recognize common patterns (just as most compilers already do today),
and then the memmove() calls will usually be inlined.
The original compiler library issue was that it is impossible to write an efficient memmove() implementation using pure portable standard C. That
is independent of any ISA, any specialist instructions for memory moves,
and any compiler optimisations. And it is independent of the fact that some good compilers can inline at least some calls to memcpy() and
memmove() today, using whatever instructions are most efficient for the target.
On 14/10/2024 16:40, Terje Mathisen wrote:
David Brown wrote:
On 13/10/2024 21:21, Terje Mathisen wrote:
David Brown wrote:
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
On 09/10/2024 23:37, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to
different objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in
pointers, rather than having only a valid pointer or
NULL. A compiler, for example, might want to store the
fact that an error occurred while parsing a subexpression
as a special pointer constant.
Compilers often have the unfair advantage, though, that
they can rely on what application programmers cannot, their
implementation details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard
C programmer cannot (other than by simply calling the
standard library
memmove() function!).
This is more a symptom of bad ISA design/evolution than of
libc writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let
you write an efficient memmove() in standard C. That's why I
said there was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in
assembly or using inline assembly, rather than in non-portable C
(which is the common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things
up that is proportionally more costly for small transfers.
Often that can be eliminated when the compiler optimises the
functions inline - when the compiler knows the size of the
move/copy, it can optimise directly.
What you are missing here David is the fact that Mitch's MM is a
single instruction which does the entire memmove() operation, and
has the inside knowledge about cache (residency at level x? width
in bytes)/memory ranges/access rights/etc needed to do so in a
very close to optimal manner, for both short and long transfers.
I am not missing that at all. And I agree that an advanced
hardware MM instruction could be a very efficient way to implement
both memcpy and memmove. (For my own kind of work, I'd worry
about such looping instructions causing an unbounded increase in
interrupt latency, but that too is solvable given enough hardware
effort.)
And I agree that once you have an "MM" (or similar) instruction,
you don't need to re-write the implementation for your memmove()
and memcpy() library functions for every new generation of
processors of a given target family.
What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will
/sometimes/ get benefits from doing so, but it is not as simple as
Mitch made out.
I.e. totally removing the need for compiler tricks or wide
register operations.
Also apropos the compiler library issue:
You start by teaching the compiler about the MM instruction, and
to recognize common patterns (just as most compilers already do
today), and then the memmove() calls will usually be inlined.
The original compiler library issue was that it is impossible to
write an efficient memmove() implementation using pure portable
standard C. That is independent of any ISA, any specialist
instructions for memory moves, and any compiler optimisations.
And it is independent of the fact that some good compilers can
inline at least some calls to memcpy() and memmove() today, using
whatever instructions are most efficient for the target.
David, you and Mitch are among my most cherished writers here on
c.arch, I really don't think any of us really disagree, it is just
that we have been discussing two (mostly) orthogonal issues.
I agree. It's a "god dag mann, økseskaft" situation.
I have a huge respect for Mitch, his knowledge and experience, and
his willingness to share that freely with others. That's why I have
found this very frustrating.
a) memmove/memcpy are so important that people have been spending a
lot of time & effort trying to make it faster, with the
complication that in general it cannot be implemented in pure C
(which disallows direct comparison of arbitrary pointers).
Yes.
(Unlike memmove(), memcpy() can be implemented in standard C as a
simple byte-copy loop, without needing to compare pointers. But an implementation that copies in larger blocks than a byte requires implementation dependent behaviour to determine alignments, or it
must rely on unaligned accesses being allowed by the implementation.)
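That byte-copy version, for reference, really is just the obvious loop (fully portable, if not fast; the name is invented to avoid clashing with the library function):
#include <stddef.h>
void *my_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;     /* one byte at a time; no pointer comparison needed */
    return dst;
}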
b) Mitch has, like Andy ("Crazy") Glew many years before, realized
that if a cpu architecture actually has an instruction designed to
do this particular job, it behooves cpu architects to make sure
that it is in fact so fast that it obviates any need for tricky
coding to replace it.
Yes.
Ideally, it should be able to copy a single object, up to a cache
line in size, in the same or less time needed to do so manually
with a SIMD 512-bit load followed by a 512-bit store (both ops
masked to not touch anything it shouldn't)
Yes.
REP MOVSB on x86 does the canonical memcpy() operation, originally
by moving single bytes, and this was so slow that we also had REP
MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
REP MOVSQ on 64-bit cpus.
With a suitable chunk of logic, the basic MOVSB operation could in
fact handle any kinds of alignments and sizes, while doing the
actual transfer at maximum bus speeds, i.e. at least one cache
line/cycle for things already in $L1.
I agree on all of that.
I am quite happy with the argument that suitable hardware can do
these basic operations faster than a software loop or the x86 "rep" instructions.
And I fully agree that these would be useful features
in general-purpose processors.
My only point of contention is that the existence or lack of such instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C.
They would make it easier to write efficient
implementations of these standard library functions for targets that
had such instructions - but that would be implementation-specific
code. And that is one of the reasons that C standard library
implementations are tied to the specific compiler and target, and the
writers of these libraries have "superpowers" and are not limited to
standard C.
David Brown wrote:
On 13/10/2024 21:21, Terje Mathisen wrote:
David Brown wrote:
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
On 09/10/2024 23:37, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to different
objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in pointers,
rather than having only a valid pointer or NULL. A compiler,
for example, might want to store the fact that an error occurred
while parsing a subexpression as a special pointer constant.
Compilers often have the unfair advantage, though, that they can
rely on what application programmers cannot, their implementation
details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard C
programmer cannot (other than by simply calling the standard
library
memmove() function!).
This is more a symptom of bad ISA design/evolution than of libc
writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let you
write an efficient memmove() in standard C. That's why I said
there was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in assembly
or using inline assembly, rather than in non-portable C (which is
the common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things up
that is proportionally more costly for small transfers. Often that
can be eliminated when the compiler optimises the functions inline -
when the compiler knows the size of the move/copy, it can optimise
directly.
What you are missing here David is the fact that Mitch's MM is a
single instruction which does the entire memmove() operation, and has
the inside knowledge about cache (residency at level x? width in
bytes)/memory ranges/access rights/etc needed to do so in a very
close to optimal manner, for both short and long transfers.
I am not missing that at all. And I agree that an advanced hardware
MM instruction could be a very efficient way to implement both memcpy
and memmove. (For my own kind of work, I'd worry about such looping
instructions causing an unbounded increase in interrupt latency, but
that too is solvable given enough hardware effort.)
And I agree that once you have an "MM" (or similar) instruction, you
don't need to re-write the implementation for your memmove() and
memcpy() library functions for every new generation of processors of a
given target family.
What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will /sometimes/
get benefits from doing so, but it is not as simple as Mitch made out.
I.e. totally removing the need for compiler tricks or wide register
operations.
Also apropos the compiler library issue:
You start by teaching the compiler about the MM instruction, and to
recognize common patterns (just as most compilers already do today),
and then the memmove() calls will usually be inlined.
The original compiler library issue was that it is impossible to write
an efficient memmove() implementation using pure portable standard C.
That is independent of any ISA, any specialist instructions for memory
moves, and any compiler optimisations. And it is independent of the
fact that some good compilers can inline at least some calls to
memcpy() and memmove() today, using whatever instructions are most
efficient for the target.
David, you and Mitch are among my most cherished writers here on c.arch,
I really don't think any of us really disagree, it is just that we have
been discussing two (mostly) orthogonal issues.
a) memmove/memcpy are so important that people have been spending a lot
of time & effort trying to make it faster, with the complication that in general it cannot be implemented in pure C (which disallows direct
comparison of arbitrary pointers).
b) Mitch has, like Andy ("Crazy") Glew many years before, realized that
if a cpu architecture actually has an instruction designed to do this particular job, it behooves cpu architects to make sure that it is in
fact so fast that it obviates any need for tricky coding to replace it.
Ideally, it should be able to copy a single object, up to a cache line
in size, in the same or less time needed to do so manually with a SIMD 512-bit load followed by a 512-bit store (both ops masked to not touch anything it shouldn't)
REP MOVSB on x86 does the canonical memcpy() operation, originally by
moving single bytes, and this was so slow that we also had REP MOVSW
(moving 16-bit entities) and then REP MOVSD on the 386 and REP MOVSQ on 64-bit cpus.
With a suitable chunk of logic, the basic MOVSB operation could in fact handle any kinds of alignments and sizes, while doing the actual
transfer at maximum bus speeds, i.e. at least one cache line/cycle for
things already in $L1.
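For reference, reaching REP MOVSB from C takes only a few lines of inline assembly. A sketch for x86-64 with GCC/Clang extended asm (illustrative only, not taken from any library discussed here):
#include <stddef.h>
/* memcpy via REP MOVSB: RDI = dest, RSI = src, RCX = count.
   On cores with "fast rep movsb" (ERMSB) this is competitive with
   hand-written SIMD copy loops for many sizes.                      */
static void *repmovsb_copy(void *dst, const void *src, size_t n)
{
    void *d = dst;
    __asm__ volatile ("rep movsb"
                      : "+D"(d), "+S"(src), "+c"(n)
                      :
                      : "memory");
    return dst;
}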
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard way to compare independent pointers (other than just for equality). Rarely
needing something does not mean /never/ needing it.
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for equality).
Rarely needing something does not mean /never/ needing it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard way to
compare independent pointers (other than just for equality). Rarely
needing something does not mean /never/ needing it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
The Algol family of block structure gave the illusion that flat was less
necessary and it could all be done with lexical addressing and block
scoping rules.
Then malloc() and mmap() came along.
The Posix interface support was there so *MS* could bid on US government
and military contracts which, at that time frame, were making noise
about it being standard for all their contracts.
The Posix DLLs didn't come with WinNT, you had to ask MS for them
specially.
Back then "object oriented" and "micro-kernel" buzzwords were all the
rage.
The same problem seems to have messed up all the attempts to provide
good Unix emulation on VMS.
In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:
I think the whole _personality_ concept, along with the supposed
portability to non-x86 architectures, had just bit-rotted away by that
point.
Some combination of that, Microsoft confidence that "of course we can do something better now!" - they are very prone to overconfidence - and the terrible tendency of programmers to ignore the details of the old code.
On Mon, 14 Oct 2024 19:02:51 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for equality).
Rarely needing something does not mean /never/ needing it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
That's their problem. The rest of the C world shouldn't suffer because
of odd birds.
On Tue, 8 Oct 2024 22:28 +0100 (BST), John Dallman wrote:
The same problem seems to have messed up all the attempts to provide
good Unix emulation on VMS.
Was it the Perl build scripts that, at some point in their compatibility
tests on a *nix system, would announce “Congratulations! You’re not running EUNICE!”?
In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence
D'Oliveiro) wrote:
I think the whole _personality_ concept, along with the supposed
portability to non-x86 architectures, had just bit-rotted away by that
point.
Some combination of that, Microsoft confidence that "of course we can do
something better now!" - they are very prone to overconfidence - and the
terrible tendency of programmers to ignore the details of the old code.
It was the Microsoft management that did it -- the culmination of a whole
sequence of short-term, profit-oriented decisions over many years ...
decades. What may have started out as an “elegant design” finally became
unrecognizable as such.
Compare what was happening to Linux over the same time interval, where the
programmers were (largely) not beholden to managers and bean counters.
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard way to
compare independent pointers (other than just for equality). Rarely
needing something does not mean /never/ needing it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.
Pointers were 32-bits (actually 8 BCD digits)
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
If you look at the 8086 manuals, that's clearly what they had in mind.
What I don't get is that the 286's segment stuff was so slow.
It had to load the whole segment descriptor from RAM and possibly
perform some additional setup.
Right, and they appeared not to care or realize it was a performance
problem.
On Mon, 14 Oct 2024 19:20:42 +0000, Michael S wrote:
On Mon, 14 Oct 2024 19:02:51 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
That's their problem. The rest of the C world shouldn't suffer
because of odd birds.
So, you are saying that 286 in its hey-day was/is odd ?!?
On Wed, 09 Oct 2024 13:37:41 -0400, EricP wrote:
The Posix interface support was there so *MS* could bid on US
government and military contracts which, at that time frame, were
making noise about it being standard for all their contracts.
The Posix DLLs didn't come with WinNT, you had to ask MS for them specially.
And that whole POSIX subsystem was so sadistically, unusably awful,
it just had to be intended for show as a box-ticking exercise,
nothing more.
<https://www.youtube.com/watch?v=BOeku3hDzrM>
Back then "object oriented" and "micro-kernel" buzzwords were all
the rage.
OO still lives on in higher-level languages. Microsoft’s one attempt
to incorporate its OO architecture--Dotnet--into the lower layers of
the OS, in Windows Vista, was an abject, embarrassing failure which
hopefully nobody will try to repeat.
On the other hand, some stubborn holdouts are still fond of
microkernels -- you just have to say the whole idea is pointless, and
they come out of the woodwork in a futile attempt to disagree ...
On Mon, 14 Oct 2024 17:19:40 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 14/10/2024 16:40, Terje Mathisen wrote:
David Brown wrote:
On 13/10/2024 21:21, Terje Mathisen wrote:
David Brown wrote:
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
On 09/10/2024 23:37, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
When would you ever /need/ to compare pointers to
different objects?
For almost all C programmers, the answer is "never".
Sometimes, it is handy to encode certain conditions in
pointers, rather than having only a valid pointer or
NULL. A compiler, for example, might want to store the
fact that an error occurred while parsing a subexpression
as a special pointer constant.
Compilers often have the unfair advantage, though, that
they can rely on what application programmers cannot, their
implementation details. (Some do not, such as f2c).
Standard library authors have the same superpowers, so that they can
implement an efficient memmove() even though a pure standard
C programmer cannot (other than by simply calling the
standard library
memmove() function!).
This is more a symptom of bad ISA design/evolution than of
libc writers needing superpowers.
No, it is not. It has absolutely /nothing/ to do with the ISA.
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
or desired in the libc call.
The existence of a dedicated assembly instruction does not let
you write an efficient memmove() in standard C. That's why I
said there was no connection between the two concepts.
For some targets, it can be helpful to write memmove() in
assembly or using inline assembly, rather than in non-portable C
(which is the common case).
Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
It is not that simple.
There can often be trade-offs between the speed of memmove() and
memcpy() on large transfers, and the overhead in setting things
up that is proportionally more costly for small transfers.
Often that can be eliminated when the compiler optimises the
functions inline - when the compiler knows the size of the
move/copy, it can optimise directly.
What you are missing here David is the fact that Mitch's MM is a
single instruction which does the entire memmove() operation, and
has the inside knowledge about cache (residency at level x? width
in bytes)/memory ranges/access rights/etc needed to do so in a
very close to optimal manner, for both short and long transfers.
I am not missing that at all. And I agree that an advanced
hardware MM instruction could be a very efficient way to implement
both memcpy and memmove. (For my own kind of work, I'd worry
about such looping instructions causing an unbounded increase in
interrupt latency, but that too is solvable given enough hardware
effort.)
And I agree that once you have an "MM" (or similar) instruction,
you don't need to re-write the implementation for your memmove()
and memcpy() library functions for every new generation of
processors of a given target family.
What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will
/sometimes/ get benefits from doing so, but it is not as simple as
Mitch made out.
I.e. totally removing the need for compiler tricks or wide
register operations.
Also apropos the compiler library issue:
You start by teaching the compiler about the MM instruction, and
to recognize common patterns (just as most compilers already do
today), and then the memmove() calls will usually be inlined.
The original compiler library issue was that it is impossible to
write an efficient memmove() implementation using pure portable
standard C. That is independent of any ISA, any specialist
instructions for memory moves, and any compiler optimisations.
And it is independent of the fact that some good compilers can
inline at least some calls to memcpy() and memmove() today, using
whatever instructions are most efficient for the target.
David, you and Mitch are among my most cherished writers here on
c.arch, I really don't think any of us really disagree, it is just
that we have been discussing two (mostly) orthogonal issues.
I agree. It's a "god dag mann, økseskaft" situation.
I have a huge respect for Mitch, his knowledge and experience, and
his willingness to share that freely with others. That's why I have
found this very frustrating.
a) memmove/memcpy are so important that people have been spending a
lot of time & effort trying to make it faster, with the
complication that in general it cannot be implemented in pure C
(which disallows direct comparison of arbitrary pointers).
Yes.
(Unlike memmove(), memcpy() can be implemented in standard C as a
simple byte-copy loop, without needing to compare pointers. But an
implementation that copies in larger blocks than a byte requires
implementation dependent behaviour to determine alignments, or it
must rely on unaligned accesses being allowed by the implementation.)
b) Mitch has, like Andy ("Crazy") Glew many years before, realized
that if a cpu architecture actually has an instruction designed to
do this particular job, it behooves cpu architects to make sure
that it is in fact so fast that it obviates any need for tricky
coding to replace it.
Yes.
Ideally, it should be able to copy a single object, up to a cache
line in size, in the same or less time needed to do so manually
with a SIMD 512-bit load followed by a 512-bit store (both ops
masked to not touch anything it shouldn't)
Yes.
REP MOVSB on x86 does the canonical memcpy() operation, originally
by moving single bytes, and this was so slow that we also had REP
MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
REP MOVSQ on 64-bit cpus.
With a suitable chunk of logic, the basic MOVSB operation could in
fact handle any kinds of alignments and sizes, while doing the
actual transfer at maximum bus speeds, i.e. at least one cache
line/cycle for things already in $L1.
I agree on all of that.
I am quite happy with the argument that suitable hardware can do
these basic operations faster than a software loop or the x86 "rep"
instructions.
No, that's not true. And according to my understanding, that's not what
Terje wrote.
REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
details - fixed registers for src, dest, and len, and the Direction flag in the PSW instead of being part of the opcode).
REP MOVSW/D/Q were introduced because back then processors were small
and stupid. When your processor is big and smart you don't need them
any longer. REP MOVSB is sufficient.
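To make that concrete, a wrapper around REP MOVSB is tiny; here is a sketch using GCC/Clang extended asm on x86-64 (the function name is invented, and this is of course not portable C):

#include <stddef.h>

/* x86-64, GCC/Clang extended asm: hand the whole copy to the CPU's
   fast-string logic (ERMSB and friends on recent parts). */
static void *repmovsb_memcpy(void *dest, const void *src, size_t n)
{
    void *d = dest;
    __asm__ volatile ("rep movsb"
                      : "+D" (d), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return dest;
}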
New Arm64 instructions that are hopefully coming next year are akin to
REP MOVSB rather than to MOVSW/D/Q.
Instructions for memmove, also defined by Arm and by Mitch, are the next logical step. IMHO, the main gain here is not a measurable improvement in performance, but a saving of code size when inlined.
Now, is all that a good idea?
I am not 100% convinced.
One can argue that the streaming alignment hardware necessary for a first-class implementation of these instructions is useful for more than memory copying.
So, maybe, it makes sense to expose this hardware in more generic ways.
Maybe via Load Multiple Register? It was present in Arm's A32/T32,
but didn't make it into ARM64. Or maybe there are even better ways
that I have not thought of.
And I fully agree that these would be useful features
in general-purpose processors.
My only point of contention is that the existence or lack of such
instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C.
You are moving the goalposts.
One does not need a "good implementation" in the sense you have in mind.
All one needs is an implementation that the pattern-matching logic of the
compiler unmistakably recognizes as memmove/memcpy. That is very easily
done in standard C. For memmove, I had shown how to do it in one of the
posts below. For memcpy it's very obvious, so no need to show.
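For example (a sketch only; whether the pattern is recognized depends on the compiler and options, e.g. GCC's -ftree-loop-distribute-patterns), a plain copy loop like this is commonly replaced wholesale by a call to memcpy() or memmove(), depending on what the compiler can prove about overlap:

#include <stddef.h>

/* A plain element-copy loop in fully portable C: optimizing compilers
   often rewrite the whole loop as a library memcpy()/memmove() call and
   then use whatever instructions the target provides. */
void copy_ints(int *dest, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i];
}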
They would make it easier to write efficient
implementations of these standard library functions for targets that
had such instructions - but that would be implementation-specific
code. And that is one of the reasons that C standard library
implementations are tied to the specific compiler and target, and the
writers of these libraries have "superpowers" and are not limited to
standard C.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
If you look at the 8086 manuals, that's clearly what they had in
mind.
What I don't get is why the 286's segment stuff was so slow.
It had to load the whole segment descriptor from RAM and possibly
perform some additional setup.
Right, and they appeared not to care or realize it was a performance
problem.
They didn't even do obvious things like see if you're reloading the
same value into the segment register and skip the rest of the setup.
Sure, you could put checks in your code and skip the segment load but
that would make your code a lot bigger and uglier.
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard way to
compare independent pointers (other than just for equality). Rarely
needing something does not mean /never/ needing it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
On 14/10/2024 21:02, MitchAlsup1 wrote:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
void * p = ...
void * q = ...
uintptr_t pu = (uintptr_t) p;
uintptr_t qu = (uintptr_t) q;
if (pu > qu) {
...
} else if (pu < qu) {
...
} else {
...
}
If your comparison needs to actually match up with the real virtual addresses, then this will not work. But does that actually matter?
Think about using this comparison for memmove().
Consider where these pointers come from. Maybe they are pointers to statically allocated data. Then you would expect the segment to be
the same in each case, and the uintptr_t comparison will be fine for memmove(). Maybe they come from malloc() and are in different
segments. Then the comparison here might not give the same result as
a full virtual address comparison - but that does not matter. If the pointers came from different mallocs, they could not overlap and
memmove() can run either direction.
The same applies to other uses, such as indexing in a binary search
tree or a hash map - the comparison above will be correct when it
matters.
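To make that concrete, here is roughly what such a memmove() does with the comparison (a sketch; my_memmove is an invented name, and the uintptr_t casts are the implementation-dependent part being discussed):

#include <stddef.h>
#include <stdint.h>

/* Pick the copy direction by comparing the pointers as integers.
   On a flat address space this is exact; on a segmented one it is,
   as argued above, still good enough for memmove(). */
void *my_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if ((uintptr_t) d < (uintptr_t) s) {
        for (size_t i = 0; i < n; i++)      /* copy forwards */
            d[i] = s[i];
    } else {
        for (size_t i = n; i-- > 0; )       /* copy backwards */
            d[i] = s[i];
    }
    return dest;
}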
On Tue, 15 Oct 2024 12:38:40 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 14/10/2024 21:02, MitchAlsup1 wrote:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
void * p = ...
void * q = ...
uintptr_t pu = (uintptr_t) p;
uintptr_t qu = (uintptr_t) q;
if (pu > qu) {
...
} else if (pu < qu) {
...
} else {
...
}
If your comparison needs to actually match up with the real virtual
addresses, then this will not work. But does that actually matter?
Think about using this comparison for memmove().
Consider where these pointers come from. Maybe they are pointers to
statically allocated data. Then you would expect the segment to be
the same in each case, and the uintptr_t comparison will be fine for
memmove(). Maybe they come from malloc() and are in different
segments. Then the comparison here might not give the same result as
a full virtual address comparison - but that does not matter. If the
pointers came from different mallocs, they could not overlap and
memmove() can run either direction.
The same applies to other uses, such as indexing in a binary search
tree or a hash map - the comparison above will be correct when it
matters.
It's all fine for as long as there are no objects bigger than 64KB.
But with 16MB of virtual memory and with several* MB of physical memory
one does want objects that are bigger than 64KB!
On the other hand, some stubborn holdouts are still fond of
microkernels -- you just have to say the whole idea is pointless,
and they come out of the woodwork in a futile attempt to disagree
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
OO still lives on in higher-level languages. Microsoft's one
attempt to incorporate its OO architecture--Dotnet--into the
lower layers of the OS, in Windows Vista, was an abject,
embarrassing failure which hopefully nobody will try to repeat.
AFAIK, .net is a hugely successful application development technology
that was never incorporated into lower layers of the OS.
In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:
On the other hand, some stubborn holdouts are still fond of
microkernels -- you just have to say the whole idea is pointless,
and they come out of the woodwork in a futile attempt to disagree
The idea is impractical, not pointless. A hybrid kernel gives most of the advantages of a microkernel to its developers, and avoids the need for
lots of context switches. It doesn't let you easily replace low-level OS components, but not many people actually want that.
On 15/10/2024 13:22, Michael S wrote:
On Tue, 15 Oct 2024 12:38:40 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 14/10/2024 21:02, MitchAlsup1 wrote:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
void * p = ...
void * q = ...
uintptr_t pu = (uintptr_t) p;
uintptr_t qu = (uintptr_t) q;
if (pu > qu) {
...
} else if (pu < qu) {
...
} else {
...
}
If your comparison needs to actually match up with the real virtual
addresses, then this will not work. But does that actually matter?
Think about using this comparison for memmove().
Consider where these pointers come from. Maybe they are pointers to
statically allocated data. Then you would expect the segment to be
the same in each case, and the uintptr_t comparison will be fine for
memmove(). Maybe they come from malloc() and are in different
segments. Then the comparison here might not give the same result as
a full virtual address comparison - but that does not matter. If the
pointers came from different mallocs, they could not overlap and
memmove() can run either direction.
The same applies to other uses, such as indexing in a binary search
tree or a hash map - the comparison above will be correct when it
matters.
It's all fine for as long as there are no objects bigger than 64KB.
But with 16MB of virtual memory and with several* MB of physical memory
one does want objects that are bigger than 64KB!
I don't know how such objects would be allocated and addressed in such a system. (I didn't do much DOS/Win16 programming, and on the few
occasions when I needed structures bigger than 64KB in total, they were structured in multiple levels.)
But I would expect that in almost any practical system where you can use "p++" to step through big arrays, you can also convert the pointer to a uintptr_t and compare as shown above.
The exceptions would be systems where pointers hold more than just
addresses, such as access control information or bounds that mean they
are larger than the largest integer type on the target.
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
In an ideal world, it would be better if we could define `malloc` and `memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
Stefan
On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
malloc() used to be std. K&R C--what dropped it from the std ??
In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:
On the other hand, some stubborn holdouts are still fond of
microkernels -- you just have to say the whole idea is pointless,
and they come out of the woodwork in a futile attempt to disagree
The idea is impractical, not pointless. A hybrid kernel gives most of the advantages of a microkernel to its developers, and avoids the need for
lots of context switches. It doesn't let you easily replace low-level OS components, but not many people actually want that.
<https://en.wikipedia.org/wiki/Hybrid_kernel>
Windows NT and Apple's XNU, used in all their operating systems, are both hybrid kernels, so the idea is somewhat practical.
John
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
malloc() used to be std. K&R C--what dropped it from the std ??
It still is part of the ISO C standard.
https://pubs.opengroup.org/onlinepubs/9799919799/functions/malloc.html
POSIX adds some extensions (marked 'CX').
On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:
In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
D'Oliveiro) wrote:
On the other hand, some stubborn holdouts are still fond of
microkernels -- you just have to say the whole idea is pointless,
and they come out of the woodwork in a futile attempt to disagree
The idea is impractical, not pointless. A hybrid kernel gives most of the
advantages of a microkernel to its developers, and avoids the need for
lots of context switches. It doesn't let you easily replace low-level OS
components, but not many people actually want that.
Actually, I think there are a whole lot of people who can't afford
non-stop server hardware but would greatly appreciate not having to
waste time with a shutdown/reboot every time some OS component gets
updated.
YMMV.
George Neuner wrote:
On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:
In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
D'Oliveiro) wrote:
On the other hand, some stubborn holdouts are still fond of
microkernels -- you just have to say the whole idea is pointless,
and they come out of the woodwork in a futile attempt to disagree
The idea is impractical, not pointless. A hybrid kernel gives most of
the
advantages of a microkernel to its developers, and avoids the need for
lots of context switches. It doesn't let you easily replace low-level OS components, but not many people actually want that.
Actually, I think there are a whole lot of people who can't afford
non-stop server hardware but would greatly appreciate not having to
waste time with a shutdown/reboot every time some OS component gets
updated.
YMMV.
This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
As soon as you have more than a single instance of a particular server/service, then you replace them in groups so that the service sees
zero downtime even though all the servers have been updated/replaced.
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
In an ideal world, it would be better if we could define `malloc` and `memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
malloc() used to be std. K&R C--what dropped it from the std ??
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
I don't see an advantage in being able to implement them in standard C.
The reason why you might want your own special memmove, or your own special malloc, is that you are doing niche and specialised software.
On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it entirely built into the language.
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
malloc() used to be std. K&R C--what dropped it from the std ??
It still is part of the ISO C standard.
The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C.
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
I don't see an advantage in being able to implement them in standard C.
It means you can likely also implement a related yet different API
without having your code "demoted" to non-standard.
E.g., say your application wants to use region/pool/zone-based
memory management.
The fact that malloc can't be implemented in standard C is evidence
that standard C may not be general-purpose enough to accommodate an application that wants to use a custom-designed allocator.
I don't disagree with you, from a practical perspective:
- in practice, C serves us well for Emacs's GC, even though that can't
be written in standard C.
- it's not like there are lots of other languages out there that offer
you portability together with the ability to define your own `malloc`.
But it's still a weakness, just a fairly minor one.
The reason why you might want your own special memmove, or your own special malloc, is that you are doing niche and specialised software.
Region/pool/zone-based memory management is common enough that I would
not call it "niche", FWIW, and it's also used in applications that do want portability (GCC and Apache come to mind).
Can't think of a practical reason to implement my own `memmove`, OTOH.
Stefan
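As an illustration of the region/pool/zone idea mentioned above (a toy sketch; all names and the alignment policy are invented), such an allocator is easy to write precisely because it only hands out pieces of one buffer it already owns:

#include <stddef.h>

/* Toy arena: allocation is a pointer bump, deallocation is resetting
   the whole region at once - no per-object free() and no comparisons
   between pointers into different objects.  Assumes 'base' is suitably
   aligned when the arena is set up. */
typedef struct {
    unsigned char *base;
    size_t         size;
    size_t         used;
} arena_t;

void *arena_alloc(arena_t *a, size_t n)
{
    size_t rounded = (n + 15u) & ~(size_t) 15u;   /* keep 16-byte alignment */
    if (rounded > a->size - a->used)
        return NULL;
    void *p = a->base + a->used;
    a->used += rounded;
    return p;
}

void arena_reset(arena_t *a)
{
    a->used = 0;
}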
The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C. I am not asking
if it is still in the std libraries, I am asking what happened
to make it impossible to write malloc() in std. C ?!?
MitchAlsup1 <mitchalsup@aol.com> schrieb:
The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C. I am not asking
if it is still in the std libraries, I am asking what happened
to make it impossible to write malloc() in std. C ?!?
You need to reserve memory by some way from the operating system,
which is, by necessity, outside of the scope of C (via brk(),
GETMAIN, mmap() or whatever).
But more problematic is the implementation of free() without
knowing how to compare pointers.
On 16/10/2024 07:36, Terje Mathisen wrote:
George Neuner wrote:
On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:
In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
D'Oliveiro) wrote:
On the other hand, some stubborn holdouts are still fond of
microkernels -- you just have to say the whole idea is pointless,
and they come out of the woodwork in a futile attempt to disagree
The idea is impractical, not pointless. A hybrid kernel gives most of
the
advantages of a microkernel to its developers, and avoids the need for lots of context switches. It doesn't let you easily replace low-level OS components, but not many people actually want that.
Actually, I think there are a whole lot of people who can't afford
non-stop server hardware but would greatly appreciate not having to
waste time with a shutdown/reboot every time some OS component gets
updated.
YMMV.
This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
As soon as you have more than a single instance of a particular
server/service, then you replace them in groups so that the service sees
zero downtime even though all the servers have been updated/replaced.
That's fine - /if/ you have a service that can easily be spread across multiple systems, and you can justify the cost of that. Setting up a database server is simple enough.
Setting up a database server along with a couple of read-only
replications is harder. Adding a writeable failover secondary is harder still. Making sure that everything works /perfectly/ when the primary
goes down for maintenance, and that everything is consistent afterwards,
is even harder. Being sure it still all works even while the different
parts have different versions during updates typically means you have to duplicate the whole thing so you can do test runs. And if the database server is not open source, your license costs will be absurd, compared
to what you actually need to provide the service - usually just one
server instance.
Clouds do nothing to help any of that.
But clouds /do/ mean that your virtual machine can be migrated (with
zero, or almost zero, downtime) to another physical server if there are hardware problems or during hardware maintenance. And if you can do
easy snapshots with your cloud / VM infrastructure, then you can roll
back if things go badly wrong. So you have a single server instance,
you plan a short period of downtime, take a snapshot, stop the service, upgrade, restart. That's what almost everyone does, other than the
/really/ big or /really/ critical service providers.
Setting up a database server along with a couple of read-only
replications is harder. Adding a writeable failover secondary is harder still. Making sure that everything works /perfectly/ when the primary
goes down for maintenance, and that everything is consistent afterwards,
is even harder. Being sure it still all works even while the different
parts have different versions during updates typically means you have to duplicate the whole thing so you can do test runs. And if the database server is not open source, your license costs will be absurd, compared
to what you actually need to provide the service - usually just one
server instance.
Clouds do nothing to help any of that.
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it entirely built into the language.
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
malloc() used to be std. K&R C--what dropped it from the std ??
It still is part of the ISO C standard.
The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C.
K&R may have been 'de facto' standard C, but not 'de jure'.
Unix V6 malloc used the 'brk' system call to allocate space
for the heap. Later versions used 'sbrk'.
Those are both kernel system calls.
It's a very good philosophy in programming language design that the core language should only contain what it has to contain - if a desired
feature can be put in a library and be equally efficient and convenient
to use, then it should be in the standard library, not the core
language. It is much easier to develop, implement, enhance, adapt, and otherwise change things in libraries than the core language.
And it is also fine, IMHO, that some things in the standard library need non-standard C - the standard library is part of the implementation.
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
malloc() used to be std. K&R C--what dropped it from the std ??
The function has always been available in C since the language was standardised, and AFAIK it was in K&R C. But no one (in authority) ever claimed it could be implemented purely in standard C. What do you think
has changed?
MitchAlsup1 <mitchalsup@aol.com> schrieb:
The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C. I am not asking
if it is still in the std libraries, I am asking what happened
to make it impossible to write malloc() in std. C ?!?
You need to reserve memory by some way from the operating system,
which is, by necessity, outside of the scope of C (via brk(),
GETMAIN, mmap() or whatever).
But more problematic is the implementation of free() without knowing
how to compare pointers.
On Wed, 16 Oct 2024 20:00:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
The paragraph with 3 >'s indicates malloc() cannot be written in
standard C. It used to be written in standard K&R C. I am not
asking if it is still in the std libraries, I am asking what
happened to make it impossible to write malloc() in standard C ?!?
You need to reserve memory by some way from the operating system,
which is, by necessity, outside of the scope of C (via brk(),
GETMAIN, mmap() or whatever).
Agreed, but once you HAVE a way of getting memory (by whatever name)
you can write malloc in standard C.
On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any
existing functionality that cannot be written using the language
is a sign of a weakness because it shows that despite being
"general purpose" it fails to cover this specific "purpose".
One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.
In an ideal world, it would be better if we could define `malloc`
and `memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
malloc() used to be standard K&R C--what dropped it from the
standard??
It still is part of the ISO C standard.
The paragraph with 3 >'s indicates malloc() cannot be written in
standard C. It used to be written in standard K&R C.
I am not
asking if it is still in the standard libraries, I am asking what
happened to make it impossible to write malloc() in standard C ?!?
On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
malloc() used to be standard K&R C--what dropped it from the
standard ??
It still is part of the ISO C standard.
The paragraph with 3 >'s indicates malloc() cannot be written in
standard C. It used to be written in standard K&R C.
K&R may have been 'de facto' standard C, but not 'de jure'.
Unix V6 malloc used the 'brk' system call to allocate space
for the heap. Later versions used 'sbrk'.
Those are both kernel system calls.
Yes, but malloc() subdivides an already provided space.
Because that space can be treated as a single array of char,
On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing
functionality that cannot be written using the language is a sign of
a weakness because it shows that despite being "general purpose" it
fails to cover this specific "purpose".
According to David Brown <david.brown@hesbynett.no>:
Setting up a database server along with a couple of read-only
replications is harder. Adding a writeable failover secondary is harder
still. Making sure that everything works /perfectly/ when the primary
goes down for maintenance, and that everything is consistent afterwards,
is even harder. Being sure it still all works even while the different
parts have different versions during updates typically means you have to
duplicate the whole thing so you can do test runs. And if the database
server is not open source, your license costs will be absurd, compared
to what you actually need to provide the service - usually just one
server instance.
Clouds do nothing to help any of that.
AWS provides a database service that does most of that. You can spin
up databases, read-only mirrors, failover from one region to another,
staging environments to test upgrades. They offer MySQL and
PostgreSQL, as well as Oracle and DB2.
It's still a fair amount of work, but way less than doing it all yourself.
On Wed, 16 Oct 2024 09:17:03 +0200, David Brown
<david.brown@hesbynett.no> wrote:
On 16/10/2024 07:36, Terje Mathisen wrote:
George Neuner wrote:
On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:
In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
D'Oliveiro) wrote:
On the other hand, some stubborn holdouts are still fond of
microkernels -- you just have to say the whole idea is pointless,
and they come out of the woodwork in a futile attempt to disagree
The idea is impractical, not pointless. A hybrid kernel gives most of the
advantages of a microkernel to its developers, and avoids the need for lots of context switches. It doesn't let you easily replace low-level OS components, but not many people actually want that.
Actually, I think there are a whole lot of people who can't afford
non-stop server hardware but would greatly appreciate not having to
waste time with a shutdown/reboot every time some OS component gets
updated.
YMMV.
This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
As soon as you have more than a single instance of a particular
server/service, then you replace them in groups so that the service sees zero downtime even though all the servers have been updated/replaced.
That's fine - /if/ you have a service that can easily be spread across
multiple systems, and you can justify the cost of that. Setting up a
database server is simple enough.
Setting up a database server along with a couple of read-only
replications is harder. Adding a writeable failover secondary is harder
still. Making sure that everything works /perfectly/ when the primary
goes down for maintenance, and that everything is consistent afterwards,
is even harder. Being sure it still all works even while the different
parts have different versions during updates typically means you have to
duplicate the whole thing so you can do test runs. And if the database
server is not open source, your license costs will be absurd, compared
to what you actually need to provide the service - usually just one
server instance.
Clouds do nothing to help any of that.
But clouds /do/ mean that your virtual machine can be migrated (with
zero, or almost zero, downtime) to another physical server if there are
hardware problems or during hardware maintenance. And if you can do
easy snapshots with your cloud / VM infrastructure, then you can roll
back if things go badly wrong. So you have a single server instance,
you plan a short period of downtime, take a snapshot, stop the service,
upgrade, restart. That's what almost everyone does, other than the
/really/ big or /really/ critical service providers.
For various definitions of "short period of downtime". 8-)
Fortunately, Linux installs updates - or stages updates for restart -
much faster than Windoze. But rebooting to the point that all the
services are running still can take several minutes.
That can feel like an eternity when it's the only <whatever> server in
a small business.
On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:
There is an advantage to the C approach of separating out some
facilities and supplying them only in the standard library.
It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of a weakness because it shows that despite being "general purpose" it fails to cover this specific "purpose".
One of the key ways C got into the minds of programmers was that
one could write stuff like printf() in C and NOT need to have it
entirely built-into the language.
In an ideal world, it would be better if we could define `malloc` and `memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
malloc() used to be std. K&R C--what dropped it from the std ??
It still is part of the ISO C standard.
The paragraph with 3 >'s indicates malloc() cannot be written
in std. C. It used to be written in std. K&R C.
K&R may have been 'de facto' standard C, but not 'de jure'.
Unix V6 malloc used the 'brk' system call to allocate space
for the heap. Later versions used 'sbrk'.
Those are both kernel system calls.
Yes, but malloc() subdivides an already provided space. Because that
space can be treated as a single array of char, and comparing pointers
to elements of the same array is legal, the only thing I can see that prevents writing malloc() in standard C would be the need to somehow
define the array from the /language's/ POV (not the compiler's) prior
to using it.
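A minimal sketch along those lines (all names invented; no coalescing, no splitting, and it assumes sizeof(block_t) is a multiple of the maximum alignment) shows how far plain standard C gets you once the backing array exists:

#include <stddef.h>

#define HEAP_SIZE (64u * 1024u)

/* The "already provided space": one big array of char. */
static _Alignas(max_align_t) unsigned char heap[HEAP_SIZE];
static size_t heap_used;

typedef struct block {          /* header stored just before each allocation */
    size_t        size;         /* total size of the block, header included */
    struct block *next_free;    /* only meaningful while on the free list */
} block_t;

static block_t *free_list;

void *toy_malloc(size_t n)
{
    size_t align = _Alignof(max_align_t);
    size_t need  = sizeof(block_t) + ((n + align - 1) & ~(align - 1));

    /* First-fit reuse of previously freed blocks. */
    for (block_t **p = &free_list; *p != NULL; p = &(*p)->next_free) {
        if ((*p)->size >= need) {
            block_t *b = *p;
            *p = b->next_free;
            return b + 1;
        }
    }

    /* Otherwise carve a fresh block from the unused tail of the array. */
    if (need > HEAP_SIZE - heap_used)
        return NULL;
    block_t *b = (block_t *) &heap[heap_used];
    heap_used += need;
    b->size = need;
    return b + 1;
}

void toy_free(void *p)
{
    if (p == NULL)
        return;
    block_t *b = (block_t *) p - 1;
    b->next_free = free_list;   /* no coalescing - that is exactly where the
                                   pointer-ordering questions would begin */
    free_list = b;
}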
On Wed, 16 Oct 2024 09:38:20 +0200, David Brown
<david.brown@hesbynett.no> wrote:
It's a very good philosophy in programming language design that the core
language should only contain what it has to contain - if a desired
feature can be put in a library and be equally efficient and convenient
to use, then it should be in the standard library, not the core
language. It is much easier to develop, implement, enhance, adapt, and
otherwise change things in libraries than the core language.
And it is also fine, IMHO, that some things in the standard library need
non-standard C - the standard library is part of the implementation.
But it is a problem if the library has to be written using a different compiler. [For this purpose I would consider specifying different
compiler flags to be using a different compiler.]
Why? Because once these things are discovered, many programmers will
see their advantages and lack the discipline to avoid using them for
more general application work.
In an ideal world, it would be better if we could define `malloc` and
`memmove` efficiently in standard C, but at least they can be
implemented in non-standard C.
malloc() used to be std. K&R C--what dropped it from the std ??
The function has always been available in C since the language was
standardised, and AFAIK it was in K&R C. But no one (in authority) ever
claimed it could be implemented purely in standard C. What do you think
has changed?
AFAIK, .net is a hugely successful application development technology that
was never incorporated into lower layers of the OS.
Windows NT and Apple's XNU, used in all their operating systems, are
both hybrid kernels, so the idea is somewhat practical.
The question is how the slowness of 80286 segments compares to
contemporaries that used segment-based protected memory.
Wikipedia lists the following machines as examples of segmentation:
- Burroughs B5000 and following Burroughs Large Systems
- GE 645 -> Honeywell 6080
- Prime 400 and successors
- IBM System/38
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.
Pointers were 32-bits (actually 8 BCD digits)
S s OOOOOO
Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).
A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.
Access to memory areas 8-99 use string instructions
where the pointer was 16 BCD digits:
EEEEEEMM SsOOOOOO
Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.
Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.
On Mon, 14 Oct 2024 17:19:40 +0200
David Brown <david.brown@hesbynett.no> wrote:
My only point of contention is that the existence or lack of such
instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C.
You are moving a goalpost.
One does not need "good implementation" in a sense you have in mind.
All one needs is an implementation that pattern matching logic of
compiler unmistakably recognizes as memove/memcpy. That is very easily
done in standard C. For memmove, I had shown how to do it in one of the
posts below. For memcpy its very obvious, so no need to show.
[...] I really don't think any of us really disagree, it is just
that we have been discussing two (mostly) orthogonal issues.
On Mon, 14 Oct 2024 19:39:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/ needing
it.
OK, take a segmented memory model with 16-bit pointers and a 24-bit
virtual address space. How do you actually compare two segmented
pointers ??
Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.
Pointers were 32-bits (actually 8 BCD digits)
S s OOOOOO
Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).
A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.
Access to memory areas 8-99 use string instructions
where the pointer was 16 BCD digits:
EEEEEEMM SsOOOOOO
Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.
Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.
What was the size of the physical address space ?
I would suppose, more than 1,000,000 words?
Michael S <already5chosen@yahoo.com> writes:
On Mon, 14 Oct 2024 19:39:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/
needing it.
OK, take a segmented memory model with 16-bit pointers and a
24-bit virtual address space. How do you actually compare two
segmented pointers ??
Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.
Pointers were 32-bits (actually 8 BCD digits)
S s OOOOOO
Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).
A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.
Access to memory areas 8-99 use string instructions
where the pointer was 16 BCD digits:
EEEEEEMM SsOOOOOO
Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.
Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.
What was the size of the physical address space ?
I would suppose, more than 1,000,000 words?
It varied based on the generation. In the
1960s, a half megabyte (10^6 digits)
was the limit.
In the 1970s, the architecture supported
10^8 digits, the largest B4800 systems
were shipped with 2 million digits (1MB).
In 1979, the B4900 was introduced supporting
up to 10MB (20 MD), later increased to
20MB/40MD.
In the 1980s, the largest systems (V500)
supported up to 10^9 digits. It
was that generation of machine where the
environment scheme was introduced.
Binaries compiled in 1966 ran on all
generations without recompilation.
There was room in the segmentation structures
for up to 10^18 digit physical addresses
(where the segments were aligned on 10^3
digit boundaries).
Unisys discontinued that line of systems in 1992.
I don't see an advantage in being able to implement them in standard C.
I /do/ see an advantage in being able to do so well in non-standard, implementation-specific C.
The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.
For example, you might be making real-time software and require specific
time constraints on these functions. In such cases, you are not
interested in writing fully portable software - it will already contain
many implementation-specific features or use compiler extensions.
On Fri, 18 Oct 2024 14:06:17 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 14 Oct 2024 19:39:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully standard
way to compare independent pointers (other than just for
equality). Rarely needing something does not mean /never/
needing it.
OK, take a segmented memory model with 16-bit pointers and a
24-bit virtual address space. How do you actually compare two
segmented pointers ??
Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.
Pointers were 32-bits (actually 8 BCD digits)
S s OOOOOO
Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).
A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.
Access to memory areas 8-99 use string instructions
where the pointer was 16 BCD digits:
EEEEEEMM SsOOOOOO
Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.
Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.
What was the size of the physical address space ?
I would suppose, more than 1,000,000 words?
It varied based on the generation. In the
1960s, a half megabyte (10^6 digits)
was the limit.
In the 1970s, the architecture supported
10^8 digits, the largest B4800 systems
were shipped with 2 million digits (1MB).
In 1979, the B4900 was introduced supporting
up to 10MB (20 MD), later increased to
20MB/40MD.
In the 1980s, the largest systems (V500)
supported up to 10^9 digits. It
was that generation of machine where the
environment scheme was introduced.
Binaries compiled in 1966 ran on all
generations without recompilation.
There was room in the segmentation structures
for up to 10^18 digit physical addresses
(where the segments were aligned on 10^3
digit boundaries).
So, can it be said that at least some of the B6500-compatible models
suffered from the same problem as the 80286 - the segment of maximal size
didn't cover all of the linear (or physical) address space?
Or was their index register width increased to accommodate 1e9 digits in
a single segment?
Unisys discontinued that line of systems in 1992.
I thought it lasted longer. My impression was that there were still
hardware implementations (alongside emulation on Xeons) sold up
until 15 years ago.
On 16/10/2024 08:21, David Brown wrote:
I don't see an advantage in being able to implement them in standard
C. I /do/ see an advantage in being able to do so well in
non-standard, implementation-specific C.
The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.
For example, you might be making real-time software and require
specific time constraints on these functions. In such cases, you are
not interested in writing fully portable software - it will already
contain many implementation-specific features or use compiler extensions.
I have a vague feeling that once upon a time I wrote a malloc for an
embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely C is quite
feasible there.
But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.
_That_ does require assembler, or compiler extensions, not standard C.
Michael S <already5chosen@yahoo.com> writes:
On Fri, 18 Oct 2024 14:06:17 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 14 Oct 2024 19:39:41 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:
On 13/10/2024 17:45, Anton Ertl wrote:
I do think it would be convenient if there were a fully
standard way to compare independent pointers (other than
just for equality). Rarely needing something does not mean
/never/ needing it.
OK, take a segmented memory model with 16-bit pointers and a
24-bit virtual address space. How do you actually compare two
segmented pointers ??
Depends. On the Burroughs mainframe there could be eight
active segments and the segment number was part of the pointer.
Pointers were 32-bits (actually 8 BCD digits)
S s OOOOOO
Where 'S' was a sign digit (C or D), 's' was the
segment number (0-7) and OOOOOO was the six digit
offset within the segment (500kB/1000kD each).
A particular task (process) could have up to
one million "environments", each environment
could have up to 100 "memory areas (up to 1000kD)
of which the first eight were loaded into the
processor base/limit registers. Index registers
were 8 digits and were loaded with a pointer as
described above. Operands could optionally select
one of the index registers and the operand address
was treated as an offset to the index register;
there were 7 index registers.
Access to memory areas 8-99 use string instructions
where the pointer was 16 BCD digits:
EEEEEEMM SsOOOOOO
Where EEEEEE was the environment number (0-999999);
environments starting with D00000 were reserved for
the MCP (Operating System). MM was the memory area
number and the remaining eight digits described the
data within the memory area. A subroutine call could
call within a memory area or switch to a new environment.
Memory area 1 was the code region for the segment,
Memory area 0 held the stack and some global variables
and was typically shared by all environments.
Memory areas 2-7 were application dependent and could
be configured to be shared between environments at
link time.
What was the size of the physical address space ?
I would suppose, more than 1,000,000 words?
It varied based on the generation. In the
1960s, a half megabyte (10^6 digits)
was the limit.
In the 1970s, the architecture supported
10^8 digits, the largest B4800 systems
were shipped with 2 million digits (1MB).
In 1979, the B4900 was introduced supporting
up to 10MB (20 MD), later increased to
20MB/40MD.
In the 1980s, the largest systems (V500)
supported up to 10^9 digits. It
was that generation of machine where the
environment scheme was introduced.
Binaries compiled in 1966 ran on all
generations without recompilation.
There was room in the segmentation structures
for up to 10^18 digit physical addresses
(where the segments were aligned on 10^3
digit boundaries).
So, can it be said that at least some of the B6500-compatible models
No. The systems I described above are from the medium
systems family (B2000/B3000/B4000).
The B5000/B6000/B7000
(large) family systems were a completely different stack based
architecture with a 48-bit word size. The Small systems (B1000)
supported task-specific dynamic microcode loading (different
microcode for a cobol app vs. a fortran app).
Medium systems evolved from the Electrodata Datatron and 220 (1954)
through the Burroughs B300 to the Burroughs B3500 by 1965. The B5000
was also developed at the old Electrodata plant in Pasadena
(where I worked in the 80s) - eventually large systems moved
out - the more capable large systems (B7XXX) were designed in
Tredyffrin Pa, the less capable large systems (B5XXX) were designed
in Mission Viejo, Ca.
suffered from the same problem as the 80286 - the segment of maximal size didn't cover all of the linear (or physical) address space?
Or was their index register width increased to accommodate 1e9 digits
in a single segment?
Unisys discontinued that line of systems in 1992.
I thought it lasted longer. My impression was that there were still
hardware implementations (alongside emulation on Xeons) sold up
until 15 years ago.
Large systems still exist today in emulation[*], as do the
former Univac (Sperry 2200) systems. The last medium system
(V380) was retired by the City of Santa Ana in 2010 (almost two
decades after Unisys cancelled the product line) and was moved
to the Living Computer Museum.
City of Santa Ana replaced the single 1980 vintage V380 with
29 windows servers.
After the merger of Burroughs and Sperry in '86 there were six
different mainframe architectures - by 1990, all but
two (2200 and large systems) had been terminated.
[*] Clearpath Libra https://www.unisys.com/client-education/clearpath-forward-libra-servers/
On 18/10/2024 18:38, Vir Campestris wrote:
On 16/10/2024 08:21, David Brown wrote:
I don't see an advantage in being able to implement them in standard
C. I /do/ see an advantage in being able to do so well in non-
standard, implementation-specific C.
The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised software.
For example, you might be making real-time software and require
specific time constraints on these functions. In such cases, you are
not interested in writing fully portable software - it will already
contain many implementation-specific features or use compiler
extensions.
I have a vague feeling that once upon a time I wrote a malloc for an
embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely C is quite
feasible there.
Sure - but you are not writing portable standard C. You are relying on implementation details, or writing code that is only suitable for a particular implementation (or set of implementations). It is normal to write this kind of thing in C, but it is non-portable C. (Or at least,
not fully portable C.)
But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.
_That_ does require assembler, or compiler extensions, not standard C.
It would normally be written in C, and the compiler will generate the
"rep" assembly. The bit you can't write in fully portable standard C is
the comparison of the pointers so you know which direction to do the
copying.
On 18/10/2024 20:45, David Brown wrote:
On 18/10/2024 18:38, Vir Campestris wrote:
Ah, I see your point. Because some implementations will require
On 16/10/2024 08:21, David Brown wrote:
I have a vague feeling that once upon a time I wrote a malloc for an
I don't see an advantage in being able to implement them in standard
C. I /do/ see an advantage in being able to do so well in non-
standard, implementation-specific C.
The reason why you might want your own special memmove, or your own
special malloc, is that you are doing niche and specialised
software. For example, you might be making real-time software and
require specific time constraints on these functions. In such
cases, you are not interested in writing fully portable software -
it will already contain many implementation-specific features or use
compiler extensions.
embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely C is quite
feasible there.
Sure - but you are not writing portable standard C. You are relying
on implementation details, or writing code that is only suitable for a
particular implementation (or set of implementations). It is normal
to write this kind of thing in C, but it is non-portable C. (Or at
least, not fully portable C.)
communication with the OS there cannot be a truly portable malloc.
It's a long time since I had to mistrust a compiler so much that I was pulling the assembler apart. It sounds as though they have got smarter
But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.
_That_ does require assembler, or compiler extensions, not standard C.
It would normally be written in C, and the compiler will generate the
"rep" assembly. The bit you can't write in fully portable standard C
is the comparison of the pointers so you know which direction to do
the copying.
in the meantime.
I just checked BTW, and you are correct.
On 20/10/2024 22:51, Vir Campestris wrote:
On 18/10/2024 20:45, David Brown wrote:
On 18/10/2024 18:38, Vir Campestris wrote:
Ah, I see your point. Because some implementations will require
On 16/10/2024 08:21, David Brown wrote:
I have a vague feeling that once upon a time I wrote a malloc for an
I don't see an advantage in being able to implement them in
standard C. I /do/ see an advantage in being able to do so well in
non- standard, implementation-specific C.
The reason why you might want your own special memmove, or your own special malloc, is that you are doing niche and specialised
software. For example, you might be making real-time software and
require specific time constraints on these functions. In such
cases, you are not interested in writing fully portable software -
it will already contain many implementation-specific features or
use compiler extensions.
embedded system. Having only one process it had access to the entire
memory range, and didn't need to talk to the OS. Entirely C is quite
feasible there.
Sure - but you are not writing portable standard C. You are relying on implementation details, or writing code that is only suitable for
a particular implementation (or set of implementations). It is
normal to write this kind of thing in C, but it is non-portable C.
(Or at least, not fully portable C.)
communication with the OS there cannot be a truly portable malloc.
Yes.
I think /every/ implementation will require communication with the OS,
if there is an OS - otherwise it will need support from other parts of
the toolchain (such as symbols created in a linker script to define the
heap area - that's the typical implementation in small embedded systems).
The nearest you could get to a portable implementation would be using a local unsigned char array as the heap, but I don't believe that would be fully correct according to the effective type rules (or the "strict aliasing" or type-based aliasing rules, if you prefer those terms). It would also not be good enough for the needs of many programs.
Of course, a fair amount of the code for malloc/free can be written in
fully portable C - and almost all of it can be written in a somewhat
vaguely defined "widely portable C" where you can mask pointer bits to
handle alignment, and other such conveniences.
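As an illustration of that "nearest you could get" idea, a toy bump allocator over a static unsigned char array might look like the sketch below (names invented; no free(); and, as noted, handing out bytes of the array as storage for objects of other types arguably falls foul of the effective-type rules even though it compiles as plain C11):

#include <stddef.h>
#include <stdalign.h>

#define HEAP_SIZE (64u * 1024u)

static alignas(max_align_t) unsigned char heap[HEAP_SIZE];
static size_t heap_used;

void *bump_alloc(size_t size)
{
    size_t align = alignof(max_align_t);
    size_t start = (heap_used + align - 1) / align * align;   /* round up */

    if (start > HEAP_SIZE || size > HEAP_SIZE - start)
        return NULL;                    /* static heap exhausted */
    heap_used = start + size;
    return &heap[start];
}

Anything more real than this needs free(), and then either a word with the OS or toolchain support such as linker-script symbols for the heap bounds, as described above.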
It's a long time since I had to mistrust a compiler so much that I was
But memmove? On an 80286 it will be using rep movsw, rather than a
software loop, to copy the memory contents to the new location.
_That_ does require assembler, or compiler extensions, not standard C.
It would normally be written in C, and the compiler will generate the
"rep" assembly. The bit you can't write in fully portable standard
C is the comparison of the pointers so you know which direction to do
the copying.
pulling the assembler apart. It sounds as though they have got smarter
in the meantime.
I just checked BTW, and you are correct.
Looking at the generated assembly is usually not a matter of mistrusting
the compiler. One of the reasons I do so is to check that the compiler
can generate efficient object code from my source code, in cases where I need maximal efficiency. I'd rather not write assembly unless I really have to!
That makes no sense to me. We are talking about implementing standard library functions. If you want to implement other functions, go ahead.
I don't see an advantage in being able to implement them in standard C.
It means you can likely also implement a related yet different API
without having your code "demoted" to non-standard.
Because some implementations will require
communication with the OS there cannot be a truly portable malloc.
On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:
Because some implementations will require
communication with the OS there cannot be a truly portable malloc.
There can if you have a portable OS API. The only serious candidate for
that is POSIX.
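For what that OS communication can look like when POSIX is the portable API, a malloc back end typically obtains its arena with mmap(). Only mmap() and munmap() below are POSIX calls; the wrappers are illustrative, and MAP_ANONYMOUS, while universal in practice, postdates the older POSIX issues.

#include <stddef.h>
#include <sys/mman.h>

/* Sketch: ask the kernel for 'size' bytes of zero-filled private memory
   that an allocator could then carve up.  Returns NULL on failure. */
static void *arena_create(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

static void arena_destroy(void *p, size_t size)
{
    munmap(p, size);
}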
On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:
On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:
Because some implementations will require communication with the OS
there cannot be a truly portable malloc.
There can if you have a portable OS API. The only serious candidate for
that is POSIX.
POSIX is an environment not an OS.
For near-light-speed code I used to write it first in C, optimize
that, then I would translate it into (inline) asm and re-optimize
based on having the full cpu architecture available, before in the
final stage I would use the asm experience to tweak the C just
enough to let the compiler generate machine code quite close
(90+%) to my best asm, while still being portable to any cpu with
more or less the same capabilities.
One example: When I won an international competition to write the
fastest Pentomino solver (capable of finding all 2339/1010/368/2
solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
portable C version.
My asm submission was twice as fast as anyone else, while the C
version was still fast enough that a couple of years later I got a
prize in the mail: Someone in France had submitted my C code,
with my name & address, to a similar competition there and it was
still faster than anyone else. :-)
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[C vs assembly]
For near-light-speed code I used to write it first in C, optimize
that, then I would translate it into (inline) asm and re-optimize
based on having the full cpu architecture available, before in the
final stage I would use the asm experience to tweak the C just
enough to let the compiler generate machine code quite close
(90+%) to my best asm, while still being portable to any cpu with
more or less the same capabilities.
One example: When I won an international competition to write the
fastest Pentomino solver (capable of finding all 2339/1010/368/2
solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
portable C version.
My asm submission was twice as fast as anyone else, while the C
version was still fast enough that a couple of years later I got a
prize in the mail: Someone in France had submitted my C code,
with my name & address, to a similar competition there and it was
still faster than anyone else. :-)
I hope you will consider writing a book, "Writing Fast Code" (or
something along those lines). The core of the book could be, oh,
let's say between 8 and 12 case studies, starting with a problem
statement and tracing through the process that you followed, or
would follow, with stops along the way showing the code at each
of the different stages.
If you do write such a book I guarantee I will want to buy one.
On Mon, 21 Oct 2024 23:52:59 +0000, MitchAlsup1 wrote:
On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:
On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:
Because some implementations will require communication with the OS
there cannot be a truly portable malloc.
There can if you have a portable OS API. The only serious candidate for
that is POSIX.
POSIX is an environment not an OS.
Guess what the “OS” part of “POSIX” stands for.
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[C vs assembly]
For near-light-speed code I used to write it first in C, optimize
that, then I would translate it into (inline) asm and re-optimize
based on having the full cpu architecture available, before in the
final stage I would use the asm experience to tweak the C just
enough to let the compiler generate machine code quite close
(90+%) to my best asm, while still being portable to any cpu with
more or less the same capabilities.
One example: When I won an international competition to write the
fastest Pentomino solver (capable of finding all 2339/1010/368/2
solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
portable C version.
My asm submission was twice as fast as anyone else, while the C
version was still fast enough that a couple of years later I got a
prize in the mail: Someone in France had submitted my C code,
with my name & address, to a similar competition there and it was
still faster than anyone else. :-)
I hope you will consider writing a book, "Writing Fast Code" (or
something along those lines). The core of the book could be, oh,
let's say between 8 and 12 case studies, starting with a problem
statement and tracing through the process that you followed, or
would follow, with stops along the way showing the code at each
of the different stages.
If you do write such a book I guarantee I will want to buy one.
Thank you Tim!
Probably not a book but I would consider writing a series of blog
posts similar to that, now that I am about to retire:
My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)
Terje Mathisen <terje.mathisen@tmsw.no> writes:
My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)
I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.
On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)
I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.
Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)
I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.
Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".
And start working for "HER". (Honeydew list).
MitchAlsup1 wrote:
On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)
I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.
Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".
Exactly!
I have unlimited amounts of potential/available mapping work, and I do
want to get back to NTP Hackers.
We recently started (officially) on the 754-2029 revision.
I'm still connected to Mill Computing as well.
Terje
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[C vs assembly]
For near-light-speed code I used to write it first in C, optimize
that, then I would translate it into (inline) asm and re-optimize
based on having the full cpu architecture available, before in the
final stage I would use the asm experience to tweak the C just
enough to let the compiler generate machine code quite close
(90+%) to my best asm, while still being portable to any cpu with
more or less the same capabilities.
One example: When I won an international competition to write the
fastest Pentomino solver (capable of finding all 2339/1010/368/2
solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
portable C version.
My asm submission was twice as fast as anyone else, while the C
version was still fast enough that a couple of years later I got a
prize in the mail: Someone in France had submitted my C code,
with my name & address, to a similar competition there and it was
still faster than anyone else. :-)
I hope you will consider writing a book, "Writing Fast Code" (or
something along those lines). The core of the book could be, oh,
let's say between 8 and 12 case studies, starting with a problem
statement and tracing through the process that you followed, or
would follow, with stops along the way showing the code at each
of the different stages.
If you do write such a book I guarantee I will want to buy one.
Thank you Tim!
I know from past experience you are good at this. I would love
to hear what you have to say.
Probably not a book but I would consider writing a series of blog
posts similar to that, now that I am about to retire:
You could try writing one blog post a month on the subject. By
this time next year you will have plenty of material and be well
on your way to putting a book together. (First drafts are always
the hardest part...)
My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)
I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy
the time.
P.S. Is the email address in your message a good way to reach you?
On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:
MitchAlsup1 wrote:
On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
My wife and I will both go on "permanent vacation" starting a week
before Christmas. :-)
I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy the time.
Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".
Exactly!
I have unlimited amounts of potential/available mapping work, and I do
want to get back to NTP Hackers.
We recently started (officially) on the 754-2029 revision.
Are you going to put in something equivalent to quires ??
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Probably not a book but I would consider writing a series of blog
posts similar to that, now that I am about to retire:
You could try writing one blog post a month on the subject. By
this time next year you will have plenty of material and be well
on your way to putting a book together. (First drafts are always
the hardest part...)
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Probably not a book but I would consider writing a series of blog
posts similar to that, now that I am about to retire:
You could try writing one blog post a month on the subject. By
this time next year you will have plenty of material and be well
on your way to putting a book together. (First drafts are always
the hardest part...)
One thing I have thought of is a wiki of optimization techniques that contains descriptions of the techniques and case studies, but I have
not yet implemented this idea.
On 24/10/2024 08:55, Anton Ertl wrote:
One thing I have thought of is a wiki of optimization techniques that
contains descriptions of the techniques and case studies, but I have
not yet implemented this idea.
Would it make sense to start something under Wikibooks on Wikipedia?
MitchAlsup1 wrote:
On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:
MitchAlsup1 wrote:
On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
My wife and I will both go on "permanent vacation" starting a week before Christmas. :-)
I'm guessing that permanent vacation will be some mixture of actual
vacation and self-chosen "work". In any case I hope you both enjoy the time.
Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".
Exactly!
I have unlimited amounts of potential/available mapping work, and I do
want to get back to NTP Hackers.
We recently started (officially) on the 754-2029 revision.
Are you going to put in something equivalent to quires ??
I don't know that usage, I thought quires was a typesetting/printing
measure?
Terje
On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:
Because some implementations will require
communication with the OS there cannot be a truly portable malloc.
There can if you have a portable OS API. The only serious candidate for
that is POSIX.
My wife does have a small list of things that we (i.e. I) could do when we retire...
On 22/10/2024 00:17, Lawrence D'Oliveiro wrote:
On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:
Because some implementations will require
communication with the OS there cannot be a truly portable malloc.
There can if you have a portable OS API. The only serious candidate for
that is POSIX.
One of the other groups I'm following just for the hell of it is
comp.os.cpm. I'm pretty sure you don't get POSIX in your 64kb (max).
On Thu, 24 Oct 2024 5:39:52 +0000, Terje Mathisen wrote:
MitchAlsup1 wrote:
On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:
MitchAlsup1 wrote:
On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
My wife and I will both go on "permanent vacation" starting a week before Christmas. :-)
I'm guessing that permanent vacation will be some mixture of actual vacation and self-chosen "work". In any case I hope you both enjoy
the time.
Just remember, retirement does not mean you "stop working"
it means you "stop working for HIM".
Exactly!
I have unlimited amounts of potential/available mapping work, and I do want to get back to NTP Hackers.
We recently started (officially) on the 754-2029 revision.
Are you going to put in something equivalent to quires ??
In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.
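For anyone who has not met the idea, here is a toy superaccumulator in C along those lines. It is only a sketch of the concept, with invented names and sizes, not how any standard or piece of hardware actually specifies a quire: every finite double is an integer mantissa times a power of two, so it can be added exactly into one very wide fixed-point accumulator, and only the final conversion back to double has to round.

#include <stdint.h>
#include <math.h>

#define QLIMBS 72    /* 72 * 32 = 2304 bits: the full double range (~2098 bits) plus headroom */
#define QBIAS  1152  /* bit position of 2^0 inside the accumulator */

typedef struct { int64_t limb[QLIMBS]; } quire;   /* radix-2^32 digits kept in int64_t */

/* Add one finite double exactly (no rounding).  Roughly 2^30 additions fit
   between conversions before a limb could overflow its int64_t. */
void quire_add(quire *q, double x)
{
    int e;
    double m = frexp(x, &e);              /* x = m * 2^e, 0.5 <= |m| < 1    */
    int64_t im = (int64_t)ldexp(m, 53);   /* exact integer mantissa, < 2^53 */
    int64_t s = (im < 0) ? -1 : 1;
    uint64_t v = (uint64_t)(s * im);
    int pos = e - 53 + QBIAS;             /* bit position of v's lowest bit */

    if (v == 0)
        return;
    for (int half = 0; half < 2; half++) {   /* low and high 32-bit pieces  */
        uint64_t part = (half == 0) ? (v & 0xFFFFFFFFu) : (v >> 32);
        int p = pos + 32 * half, idx = p / 32, sh = p % 32;
        uint64_t shifted = part << sh;       /* at most 63 bits wide        */
        q->limb[idx]     += s * (int64_t)(shifted & 0xFFFFFFFFu);
        q->limb[idx + 1] += s * (int64_t)(shifted >> 32);
    }
}

/* Convert back to double.  A real quire rounds exactly once here; this
   sketch may round once per limb, which is close enough to show the idea. */
double quire_to_double(quire *q)
{
    /* Carry-propagate so limbs 0..QLIMBS-2 end up in [0, 2^32).
       (Relies on >> of a negative int64_t being an arithmetic shift.) */
    for (int i = 0; i < QLIMBS - 1; i++) {
        int64_t carry = q->limb[i] >> 32;
        q->limb[i] -= carry * ((int64_t)1 << 32);
        q->limb[i + 1] += carry;
    }
    double sign = 1.0;
    if (q->limb[QLIMBS - 1] < 0) {        /* negative total: negate the magnitude */
        sign = -1.0;
        int64_t borrow = 0;
        for (int i = 0; i < QLIMBS; i++) {
            int64_t t = -q->limb[i] - borrow;
            borrow = 0;
            if (t < 0 && i < QLIMBS - 1) { t += (int64_t)1 << 32; borrow = 1; }
            q->limb[i] = t;
        }
    }
    double r = 0.0;
    for (int i = QLIMBS - 1; i >= 0; i--)
        r += ldexp((double)q->limb[i], 32 * i - QBIAS);
    return sign * r;
}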
MitchAlsup1 <mitchalsup@aol.com> schrieb:
In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.
Not restricted to posits, I believe (but the term may differ).
At university, I had my programming
courses on a Pascal compiler which implemented https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
On 10/28/2024 9:30 AM, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.
Not restricted to posits, I believe (but the term may differ).
At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
Another newer alternative. This came up on my news feed. I haven't
looked at the details at all, so I can't comment on it.
https://arxiv.org/abs/2410.03692
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.
Not restricted to posits, I believe (but the term may differ).
At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
These would be very large registers. You'd need some way to store and load these for register spills, fills and task switch, as well as move
and manage them.
Karlsruhe above has a link to http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf
which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:
"A floating-point accumulator occupies a 168-byte storage area that is aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."
The operands are specified by virtual address of their in-memory accumulator.
Of course, once you have 168-byte registers people are going to
think of new uses for them.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.
Not restricted to posits, I believe (but the term may differ).
At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
These would be very large registers. You'd need some way to store and load these for register spills, fills and task switch, as well as move
and manage them.
Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf
which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:
"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."
The operands are specified by virtual address of their in-memory accumulator.
Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.
But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.
SIMD from hell? Pretend that a CPU is a graphics card? :-)
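For anyone checking that figure: finite doubles run from the 2^-1074 subnormal up to just under 2^1024, and the 53 mantissa bits of any addend already lie inside that span, so an exact fixed-point accumulator needs roughly 1074 + 1024 = 2098 bits, plus a few dozen bits of carry headroom - hence "a bit more than 2048 bits", and five 512-bit registers (2560 bits) cover it with room to spare.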
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
These would be very large registers. You'd need some way to store and load these for register spills, fills and task switch, as well as move
Not restricted to posits, I believe (but the term may differ).
In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.
At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
and manage them.
Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf
which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:
"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."
The operands are specified by virtual address of their in-memory accumulator.
Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.
But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
These would be very large registers. You'd need some way to store and load these for register spills, fills and task switch, as well as move
In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.
Not restricted to posits, I believe (but the term may differ).
At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
and manage them.
Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf
which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:
"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."
The operands are specified by virtual address of their in-memory accumulator.
Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.
At the time, memory was just a few clock cycles away from the CPU, so
not really that problematic. Today, such a super-accumulator would stay
in $L1 most of the time, or at least the central, in-use cache line of
it, would do so.
But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.
With 1312 bits of storage, their fp inputs (hex fp?) must have had a
smaller exponent range than ieee double.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
These would be very large registers. You'd need some way to store and load these for register spills, fills and task switch, as well as move
In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.
Not restricted to posits, I believe (but the term may differ).
At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
and manage them.
Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf
which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:
"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."
The operands are specified by virtual address of their in-memory accumulator.
Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.
At the time, memory was just a few clock cycles away from the CPU, so
not really that problematic. Today, such a super-accumulator would stay
in $L1 most of the time, or at least the central, in-use cache line of
it, would do so.
But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.
With 1312 bits of storage, their fp inputs (hex fp?) must have had a
smaller exponent range than ieee double.
IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.
(Insert fear and loathing for hex float here).
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:
These would be very large registers. You'd need some way to store and
In posits, a quire is an accumulator with as many binary digits
as to cover max-exponent to min-exponent; so one can accumulate
an essentially unbounded number of sums without loss of precision
--to obtain a sum with a single rounding.
Not restricted to posits, I believe (but the term may differ).
At university, I had my programming
courses on a Pascal compiler which implemented
https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
a hardware implementation was on the 4361 as an option
https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
load these for register spills, fills and task switch, as well as move
and manage them.
Karlsruhe above has a link to
http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf
which describes their large accumulators as residing in memory, which
avoids the spill/fill/switch issue but with an obvious performance hit:
"A floating-point accumulator occupies a 168-byte storage area that is
aligned on a 256-byte boundary. An accumulator consists of a four-byte
status area on the left, followed by a 164-byte numeric area."
The operands are specified by virtual address of their in-memory
accumulator.
Makes sense, given the time this was implemented. This was also a
mid-range machine, not a number cruncher. I do not find the
number of cycles that the instructions took.
At the time, memory was just a few clock cycles away from the CPU, so
not really that problematic. Today, such a super-accumulator would stay
in $L1 most of the time, or at least the central, in-use cache line of
it, would do so.
But this was also for hex floating point. A similar scheme for IEEE
double would need a bit more than 2048 bits, so five AVX-512 registers.
With 1312 bits of storage, their fp inputs (hex fp?) must have had a
smaller exponent range than ieee double.
IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.
(Insert fear and loathing for hex float here).
(Insert fear and loathing for hex float here).
Heck, watching Kahan's notes on FP problems leaves one in fear of
binary floating point representations.
David Brown <david.brown@hesbynett.no> writes:
On 04/10/2024 19:30, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
Compare this with the pain the x86 world went through, over a much longer time, to move to 32-bit.
The x86 started from 8-bit roots, and increased width over time, which is a very different path.
Still, the question is why they did the 286 (released 1982) with its
protected mode instead of adding IA-32 to the architecture, maybe at
the start with a 386SX-like package and with real-mode only, or with
the MMU in a separate chip (like the 68020/68551).
I can only guess the obvious - it is what some big customer(s) were
asking for. Maybe Intel didn't see the need for 32-bit computing in the markets they were targeting, or at least didn't see it as worth the cost.
Anyone could see the problems that the PDP-11 had with its 16-bit
limitation. Intel saw it in the iAPX 432 starting in 1975. It is
obvious that, as soon as memory grows beyond 64KB (and already the
8086 catered for that), the protected mode of the 80286 would be more
of a hindrance than even the real mode of the 8086. I find it hard to believe that many customers would ask Intel for something like the 80286
protected mode with segments limited to 64KB, and even if they did, that Intel
would listen to them. This looks much more like an idee fixe to me
that one or more of the 286 project leaders had, and all customer
input was made to fit into this idea, or was ignored.
Concerning the cost, the 80286 has 134,000 transistors, compared to supposedly 68,000 for the 68000, and the 190,000 of the 68020. I am
sure that Intel could have managed a 32-bit 8086 (maybe even with the
nice addressing modes that the 386 has in 32-bit mode) with those
134,000 transistors if Motorola could build the 68000 with half of
that.
From my point of view main drawbacks of 286 is poor support for
large arrays and problem for Lisp-like system which have a lot
of small data structures and traverse them via pointers.
Concerning code "model", I think that Intel assumend that
most procedures would fit in a single segment and that
average procedure will be of order of single kilobytes.
Using 16-bit offsets for jumps inside procedure and
segment-offset pair for calls is likely to lead to better
or similar performance as purely 32-bit machine.
What went wrong? IIUC there were several control systems
using 286 features, so there was some success. But PC-s
became main user of x86 chips and significant fraction
of PC-s was used for gaming. Game authors wanted direct
access to hardware which in case of 286 forced real mode.
But IIUC first paging Unix appeared _after_ release of 286.
In 286 time Multics was highly regarded and it heavily depended
on segmentation. MVS was using paging hardware, but was
talking about segments, except for that MVS segmentation
was flawed because some addresses far outside a segment were
considered as part of different segment. I think that also
in VMS there was some talking about segments. So creators
of 286 could believe that they are providing "right thing"
and not a fake possible with paging hardware.
And I do not think they could make
32-bit processor with segmentation in available transistor
budget,
and even if they managed it would be slowed down by too
long addresses (segment + 32-bit offset).
antispam@fricas.org (Waldek Hebisch) writes:
From my point of view main drawbacks of 286 is poor support for
large arrays and problem for Lisp-like system which have a lot
of small data structures and traverse them via pointers.
But IIUC first paging Unix appeared _after_ release of 286.
From <https://en.wikipedia.org/wiki/History_of_the_Berkeley_Software_Distribution#3BSD>:
|The kernel of 32V was largely rewritten by Berkeley graduate student
|Özalp Babaoğlu to include a virtual memory implementation, and a
|complete operating system including the new kernel, ports of the 2BSD
|utilities to the VAX, and the utilities from 32V was released as 3BSD
|at the end of 1979.
antispam@fricas.org (Waldek Hebisch) writes:
From my point of view main drawbacks of 286 is poor support for
large arrays and problem for Lisp-like system which have a lot
of small data structures and traverse them via pointers.
Yes. In the first case the segments are too small, in the latter case
there are too few segments (if you have one segment per object).
Concerning code "model", I think that Intel assumend that
most procedures would fit in a single segment and that
average procedure will be of order of single kilobytes.
Using 16-bit offsets for jumps inside procedure and
segment-offset pair for calls is likely to lead to better
or similar performance as purely 32-bit machine.
With the 80286's segments and their slowness, that is very doubtful.
The 8086 has branches with 8-bit offsets and branches and calls with
16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
offsets and branches and calls with 32-bit offsets; if 16-bit offsets
for branches would be useful enough for performance, they could
instead have designed the longer branch length to be 16 bits, and
maybe a prefix for 32-bit branch offsets.
That would be faster than
what you outline, as soon as one call happens. But apparently 16-bit branches are not that beneficial, or they would have gone that way on
the 386.
Another usage of segments for code would be to put the code segment of
a shared object (known as DLL among Windowsheads) in a segment, and
use far calls to call functions in other shared objects, while using
near calls within a shared object. This allows to share the code
segments between different programs, and to locate them anywhere in
physical memory. However, AFAIK shared objects were not a thing in
the 80286 timeframe; Unix only got them in the late 1980s.
I used Xenix on a 286 in 1986 or 1987; my impression is that programs
were limited to 64KB code and 64KB data size, exactly the PDP-11 model
you denounce.
What went wrong? IIUC there were several control systems
using 286 features, so there was some success. But PC-s
became main user of x86 chips and significant fraction
of PC-s was used for gaming. Game authors wanted direct
access to hardware which in case of 286 forced real mode.
Every successful piece of software used direct access to hardware because of performance; the rest waned. Using BIOS calls was just too slow.
Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster from
writing directly to video.
But IIUC first paging Unix appeared _after_ release of 286.
From <https://en.wikipedia.org/wiki/History_of_the_Berkeley_Software_Distribution#3BSD>:
|The kernel of 32V was largely rewritten by Berkeley graduate student
|Özalp Babaoğlu to include a virtual memory implementation, and a
|complete operating system including the new kernel, ports of the 2BSD
|utilities to the VAX, and the utilities from 32V was released as 3BSD
|at the end of 1979.
The 80286 was introduced on February 1, 1982.
In 286 time Multics was highly regarded and it heavily depended
on segmentation. MVS was using paging hardware, but was
talking about segments, except for that MVS segmentation
was flawed because some addresses far outside a segment were
considered as part of different segment. I think that also
in VMS there was some talking about segments. So creators
of 286 could believe that they are providing "right thing"
and not a fake possible with paging hardware.
There was various segmented hardware around, first and foremost (for
the designers of the 80286), the iAPX432. And as you write, all the
good reasons that resulted in segments on the iAPX432 also persisted
in the 80286. However, given the slowness of segmentation, only the
tiny (all in one segment), small (one segment for code and one for
data), and maybe medium memory models (one data segment) are
competitive in protected mode compared to real mode.
So if they really had wanted protected mode to succeed, they should
have designed in 32-bit data segments (and maybe also 32-bit code
segments). Alternatively, if protected mode and the 32-bit addresses
do not fit in the 286 transistor budget, a CPU that implements the
32-bit feature and leaves away protected mode would have been more
popular than the 80286; and (depending on how the 32-bit extension was implemented) it might have been a better stepping stone towards the
kind of CPU with protected mode that they imagined; but the alt-386
designers probably would not have designed in this kind of protected
mode that they did.
Concerning paging, all these scenarios are without paging. Paging was primarily a virtual-memory feature, not a memory-protection feature.
It acquired memory protection only as far as it was easy with pages
(i.e., at page granularity). So paging was not designed as a
competition to segments as far as protection was concerned. If
computer architects had managed to design segmentation with
competitive performance, we would be seeing hardware with both paging
and segmentation nowadays. Or maybe even without paging, now that
memories tend to be big enough to make virtual memory mostly
unnecessary.
And I do not think they could make
32-bit processor with segmentation in available transistor
budget,
Maybe not.
and even if they managed it would be slowed down by too
long addresses (segment + 32-bit offset).
On the contrary, every program that does not fit in the medium memory
model on the 80286 would run at least as fast on such a CPU in real
mode and significantly faster in protected mode.
From my point of view main drawbacks of 286 is poor support for
large arrays and problem for Lisp-like system which have a lot
of small data structures and traverse them via pointers.
Yes. In the first case the segments are too small, in the latter case
there are too few segments (if you have one segment per object).
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
antispam@fricas.org (Waldek Hebisch) writes:
From my point of view main drawbacks of 286 is poor support for
large arrays and problem for Lisp-like system which have a lot
of small data structures and traverse them via pointers.
Yes.  In the first case the segments are too small, in the latter case there are too few segments (if you have one segment per object).
Intel clearly had some strong opinions about how people would program
the 286, which turned out to bear almost no relation to the way we
actually wanted to program it.
Some of the stuff they did was just perverse, like putting flag
bits in the low part of the segment number rather than the high
bit. If you had objects bigger than 64K, you had to shift
the segment number three bits to the left when computing
addresses.
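To picture that three-bit shift, here is a sketch of the "huge" pointer arithmetic being described (types and names invented for illustration, not taken from any real compiler runtime): the descriptor index lives in selector bits 3..15, with the table indicator and RPL in bits 0..2, so stepping to the next 64K chunk of an object means adding 8 to the selector rather than 1.

#include <stdint.h>

typedef struct {
    uint16_t selector;   /* descriptor index in bits 3..15, TI + RPL in bits 0..2 */
    uint16_t offset;     /* 16-bit offset within the current 64K segment          */
} far_ptr;

/* Advance a "huge" pointer by delta bytes, assuming consecutive descriptors
   map consecutive 64K chunks of the object (the usual huge-model setup). */
far_ptr huge_add(far_ptr p, uint32_t delta)
{
    uint32_t linear = (uint32_t)p.offset + delta;
    p.selector += (uint16_t)((linear >> 16) << 3);   /* << 3 steps over TI and RPL */
    p.offset = (uint16_t)(linear & 0xFFFFu);
    return p;
}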
They also apparently didn't expect people to switch segments much.
If you loaded a segment register with the value it already contained,
it still fetched all of the stuff from memory.
How many gates would
it have taken to check for the same value and bypass the loads?
If
they had done that, we could have used large model calls everywhere
since long and short calls would be about the same speed, and not
had to screw around deciding what was a long call and what was short
and writing glue codes to allow both kinds.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
There was various segmented hardware around, first and foremost (for
the designers of the 80286), the iAPX432. And as you write, all the
good reasons that resulted in segments on the iAPX432 also persisted
in the 80286. However, given the slowness of segmentation, only the
tiny (all in one segment), small (one segment for code and one for
data), and maybe medium memory models (one data segment) are
competitive in protected mode compared to real mode.
AFAICS that covered the vast majority of programs during the eighties.
Turbo Pascal offered only medium memory model and was quite
popular. Its code generator produced mediocre output, but
real Turbo Pascal programs used a lot of inline assembly
and performance was not bad.
Intel apparently assumed that programmers are willing to spend
extra work to get good performance and IMO this was right
as a general statement. Intel probably did not realize that
programmers would be very reluctant to spend work on security
features and in particular to spend work on making programs
fast in 286 protected mode.
The 8086 has branches with 8-bit offsets and branches and calls
with 16-bit offsets. The 386 in 32-bit mode has branches with
8-bit offsets and branches and calls with 32-bit offsets; if
16-bit offsets for branches would be useful enough for performance,
they could instead have designed the longer branch length to be
16 bits, and maybe a prefix for 32-bit branch offsets. That would
be faster than what you outline, as soon as one call happens.
But apparently 16-bit branches are not that beneficial, or they
would have gone that way on the 386.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
From my point of view main drawbacks of 286 is poor support for
large arrays and problem for Lisp-like system which have a lot
of small data structures and traverse them via pointers.
Yes. In the first case the segments are too small, in the latter case
there are too few segments (if you have one segment per object).
In the second case one can pack several objects into a single
segment, so except for lost security properties this is not
a big problem.
But there is a lot of loading segment registers
and slow loading is a problem.
Using 16-bit offsets for jumps inside procedure and
segment-offset pair for calls is likely to lead to better
or similar performance as purely 32-bit machine.
With the 80286's segments and their slowness, that is very doubtful.
The 8086 has branches with 8-bit offsets and branches and calls with
16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
offsets and branches and calls with 32-bit offsets; if 16-bit offsets
for branches would be useful enough for performance, they could
instead have designed the longer branch length to be 16 bits, and
maybe a prefix for 32-bit branch offsets.
At that time Intel apparently wanted to avoid having too many
instructions.
I used Xenix on a 286 in 1986 or 1987; my impression is that programs
were limited to 64KB code and 64KB data size, exactly the PDP-11 model
you denounce.
Maybe.  I have seen many cases where software essentially "wastes"
good things offered by hardware.
Every successful piece of software used direct access to hardware because of
performance; the rest waned. Using BIOS calls was just too slow.
Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster from
writing directly to video.
For most early graphic cards direct screen access could be allowed
just by allocating an appropriate segment. And most non-games
could gain good performance with a better system interface.
I think that the variety of tricks used in games and their
popularity made protected mode systems much less appealing
to vendors. And that discouraged work on better interfaces
for non-games.
More generally, vendors could release separate versions of
programs for 8086 and 286 but few did so.
And users having
only binaries wanted to use 8086 on their new systems which
led to heroic efforts like OS/2 DOS box and later Linux
dosemu. But integration of 8086 programs with protected
mode was solved too late for 286 model to gain traction
(and on 286 "DOS box" had to run in real mode, breaking
normal system protection).
There was various segmented hardware around, first and foremost (for
the designers of the 80286), the iAPX432. And as you write, all the
good reasons that resulted in segments on the iAPX432 also persisted
in the 80286. However, given the slowness of segmentation, only the
tiny (all in one segment), small (one segment for code and one for
data), and maybe medium memory models (one data segment) are
competitive in protected mode compared to real mode.
AFAICS that covered the vast majority of programs during the eighties.
Turbo Pascal offered only medium memory model
Intel apparently assumed that programmers are willing to spend
extra work to get good performance and IMO this was right
as a general statement. Intel probably did not realize that
programmers would be very reluctant to spend work on security
features and in particular to spend work on making programs
fast in 286 protected mode.
Intel probably assumed that 286 would cover most needs, especially
given that most systems had much less memory than the 16 MB theoretically
allowed by 286.
IMO this is partially true: there
is a class of programs which with some work fit into medium
model, but using flat address space is easier. I think that
on 286 (that is with 16 bit bus) those programs (assuming enough
tuning) run faster than flat 32-bit version.
But I think that Intel segmentation had some
attractive features during eighties.
Another thing is 386. I think that designers of 286 thought
that 386 would remove some limitations. And 386 allowed
bigger segments removing one major limitation. OTOH
for 32-bit processor with segmentation it would be natural
to have 32-bit segment registers. It is not clear to
me if 16-bit segment registers in 386 were deemed necessary
for backward compatibility or maybe in the 386 period the flat
faction in Intel won and they kept segmentation mostly
for compatibility.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Another usage of segments for code would be to put the code segment of
a shared object (known as DLL among Windowsheads) in a segment, and
use far calls to call functions in other shared objects, while using
near calls within a shared object. This allows to share the code
segments between different programs, and to locate them anywhere in
physical memory. However, AFAIK shared objects were not a thing in
the 80286 timeframe; Unix only got them in the late 1980s.
IIUC shared segments were widely used on Multics.
In article <6d5fa21e63e14491948ffb6a9d08485a@www.novabbs.org>, mitchalsup@aol.com (MitchAlsup1) wrote:
On Sun, 5 Jan 2025 2:56:08 +0000, John Levine wrote:
They also apparently didn't expect people to switch segments much.
Clearly. They expected segments to be essentially stagnant--unlike
the people trying to do things with x86s...
An idea: The target markets for the 8080 and 8085 were clearly embedded systems. The Z80 and 6502 rapidly became popular in the micro-computer market, but the 808[05] did not. Intel may still have been thinking in
terms of embedded systems when designing the 80286.
On Sun, 5 Jan 2025 2:56:08 +0000, John Levine wrote:
They also apparently didn't expect people to switch segments much.
Clearly. They expected segments to be essentially stagnant--unlike
the people trying to do things with x86s...
Xenix, apart from OS/2 the only other notable protected-mode OS for the
286, was ported to the 386 in 1987, after SCO secured "knowledge from Microsoft insiders that Microsoft was no longer developing Xenix", so
SCO (or Microsoft) might have done it even earlier if the commercial situation had been less muddled; in any case, Xenix jumped the 286 ship
ASAP.
An idea: The target markets for the 8080 and 8085 were clearly embedded systems. The Z80 and 6502 rapidly became popular in the micro-computer market, but the 808[05] did not.
Intel may still have been thinking in
terms of embedded systems when designing the 80286.
The IBM PC was launched in August 1981 and around a year passed before it became clear that this machine was having a huge and lasting effect on
the market. The 80286 was released on February 1st 1982, although it
wasn't used much in PCs until the IBM PC/AT in August 1984.
The 80386 sampled in 1985 and was mass-produced in 1986. That would seem
to have been the first version of x86 where it was obvious at the start
of design that use in general-purpose computers would be important.
Anyway, while Zilog may have taken their sales, I very much believe
that Intel was aware of the general-purpose computing market, and the
iAPX432 clearly showed that they wanted to be dominant there. It's an
irony of history that the 8086/8088 actually went where the action
was.
jgd@cix.co.uk (John Dallman) writes:
An idea: The target markets for the 8080 and 8085 were clearly embedded systems. The Z80 and 6502 rapidly became popular in the micro-computer market, but the 808[05] did not.
The 8080 was used in the first microcomputers, e.g., the 1974 Altair
8800 and the IMSAI 8080. It was important for all the CP/M machines,
because the CP/M software (both the OS and the programs running on it)
were written to use the 8080 instruction set, not the Z80 instruction
set. And CP/M was the professional microcomputer OS before the IBM PC compatible market took off, despite the fact that the most popular microcomputers of the time (such as the Apple II, TRS-80 and PET) did
not use it; there was an add-on card for the Apple II with a Z80 for
running CP/M, though, which shows the importance of CP/M.
Anyway, while Zilog may have taken their sales, I very much believe
that Intel was aware of the general-purpose computing market, and the
iAPX432 clearly showed that they wanted to be dominant there. It's an
irony of history that the 8086/8088 actually went where the action
was.
Intel released the MCS-51 (aka 8051) in 1980 for embedded systems, and
it's very successful there, and before that came the MCS-48 (8048) in
1976.
Intel may still have been thinking in
terms of embedded systems when designing the 80286.
I very much doubt that the segments and the 24 address bits were
designed for embedded systems. The segments look more like an echo of
the iAPX432 than of anything designed for embedded systems.
The idea of some static allocation of memory for which segments might
work may come from old mainframe systems, where programs were (in my impression) more static than PC programs and modern computing. Even
in Unix programs, which were more dynamic than mainframe programs had
quite a bit of static allocation in the early days; this is reflected
in the paragraph in the GNU coding standards:
|Avoid arbitrary limits on the length or number of any data structure,
|including file names, lines, files, and symbols, by allocating all
|data structures dynamically. In most Unix utilities, "long lines are
|silently truncated". This is not acceptable in a GNU utility.
The IBM PC was launched in August 1981 and around a year passed before it became clear that this machine was having a huge and lasting effect on
the market. The 80286 was released on February 1st 1982, although it
wasn't used much in PCs until the IBM PC/AT in August 1984.
The 80286 project was started in 1978, before any use of the 8086. <https://timeline.intel.com/1978/kicking-off-the-80286> claims that
they "spent six months on field research into customers' needs alone"; Judging by the results, maybe the customers were clueless, or maybe
Intel asked the wrong questions.
The 80386 sampled in 1985 and was mass-produced in 1986. That would seem
to have been the first version of x86 where it was obvious at the start
of design that use in general-purpose computers would be important.
Actually, reading the oral history of the 386, at the start the 386
project was just an unimportant follow-on of the 286, while the main
action was expected to be on the BiiN project (from which the i960
came). Only at some point during that project did the IBM PC market
explode, and the 386 became the most important project of the company.
But yes, they were very much aware of the needs of programmers in the
386 project, and would probably have done something with just paging
and no segments if they had not had the 8086 and 80286 legacy.
- anton
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Anyway, while Zilog may have taken their sales, I very much believe
that Intel was aware of the general-purpose computing market, and the
iAPX432 clearly showed that they wanted to be dominant there. It's an
irony of history that the 8086/8088 actually went where the action
was.
I have heard that the IBM PC was originally designed with a Z80, and fairly late
in the process someone decided (not unreasonably) that it wouldn't be different
enough from all the other Z80 boxes to be an interesting product. They wanted a
16 bit processor but for time and money reasons they stayed with the 8 bit bus
they already had. The options were 68008 and 8088. Moto was only shipping samples of the 68008 while Intel could provide 8088 in quantity, so they went with the 8088.
If Moto had been a little farther along, the history of the PC industry
could have been quite different.
On Sun, 5 Jan 2025 20:01:25 +0000, John Levine wrote:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Anyway, while Zilog may have taken their sales, I very much believe
that Intel was aware of the general-purpose computing market, and the
iAPX432 clearly showed that they wanted to be dominant there. It's an
irony of history that the 8086/8088 actually went where the action
was.
I have heard that the IBM PC was originally designed with a Z80, and fairly
late in the process someone decided (not unreasonably) that it wouldn't be
different enough from all the other Z80 boxes to be an interesting product.
They wanted a 16 bit processor but for time and money reasons they stayed
with the 8 bit bus they already had. The options were 68008 and 8088. Moto
was only shipping samples of the 68008 while Intel could provide 8088 in
quantity, so they went with the 8088.
If Moto had been a little farther along, the history of the PC industry
could have been quite different.
If Moto had done 68008 first, it may very well have turned out
differently.
antispam@fricas.org (Waldek Hebisch) writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
From my point of view the main drawbacks of the 286 are poor support for
large arrays and problems for Lisp-like systems which have a lot
of small data structures and traverse them via pointers.
Yes. In the first case the segments are too small, in the latter case
there are too few segments (if you have one segment per object).
In the second case one can pack several objects into a single
segment, so except for lost security properties this is not
a big problem.
If you go that way, you lose all the benefits of segments, and run
into the "segments too small" problem. Which you then want to
circumvent by using segment and offset in your addressing of the small
data structures, which leads to:
But there is a lot of loading segment registers
and slow loading is a problem.
...
Using 16-bit offsets for jumps inside a procedure and a
segment-offset pair for calls is likely to lead to better
or similar performance as a purely 32-bit machine.
With the 80286's segments and their slowness, that is very doubtful.
The 8086 has branches with 8-bit offsets and branches and calls with
16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
offsets and branches and calls with 32-bit offsets; if 16-bit offsets
for branches would be useful enough for performance, they could
instead have designed the longer branch length to be 16 bits, and
maybe a prefix for 32-bit branch offsets.
At that time Intel apparently wanted to avoid having too many
instructions.
Looking in my Pentium manual, the section on CALL has 20 lines for
"call intersegment", "call gate" (with privilege variants) and "call
to task" instructions, 10 of which probably already existed on the 286 (compared to 2 lines for "call near" instructions that existed on the
286), and the "Operation" section (the specification in pseudocode)
consumes about 4 pages, followed by a 1.5 page "Description" section.
9 of these 10 far call variants deal with protected-mode things, so
Intel obviously had no qualms about adding instruction variants. If
they instead had no protected mode, but some 32-bit support, including
the near call with 32-bit offset that I suggest, that would have
reduced the number of instruction variants.
I used Xenix on a 286 in 1986 or 1987; my impression is that programs
were limited to 64KB code and 64KB data size, exactly the PDP-11 model
you denounce.
Maybe. I have seen many cases where software essentially "wastes"
good things offered by hardware.
Which "good things offered by hardware" do you see "wasted" by this
usage in Xenix?
To me this seems to be the only workable way to use
the 286 protected mode. Ok, the medium model (near data, far code)
may also have been somewhat workable, but looking at the cycle counts
for the protected-mode far calls on the Pentium (and on the 286 they
were probably even more costly), which start at 22 cycles for a "call
gate, same privilege" (compared to 1 cycle on the Pentium for a
direct call near), one would strongly prefer the small model.
Every successful piece of software used direct access to hardware because of
performance; the rest waned. Using BIOS calls was just too slow.
Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster thanks to
writing directly to video memory.
For most early graphic cards direct screen access could be allowed
just by allocating appropriate segment. And most non-games
could gain good performance with better system interface.
I think that the variety of tricks used in games and their
popularity made protected-mode systems much less appealing
to vendors. And that discouraged work on better interfaces
for non-games.
Microsoft and IBM invested lots of work in a 286 protected-mode
interface: OS/2 1.x. It was limited to the 286 at the insistence of
IBM, even though work started in August 1985, when they already knew
that the 386 was coming soon. OS/2 1.0 was released in April 1987,
1.5 years after the 386.
OS/2 1.x flopped, and by the time OS/2 was adjusted to the 386, it was
too late, so the 286 killed OS/2; here we have a case of a software
project being death-marched by tying itself to "good things offered by hardware" (except that Microsoft defected from the death march after a
few years).
Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
addition to the base (8086) variant of Windows 2.0, which was released
in December 1987), which used 386 protected mode and virtual 8086 mode
(which was missing in the "brain-damaged" (Bill Gates) 286). So
Windows completely ignored 286 protected mode. Windows eventually
became a big success.
Also, Microsoft started NT OS/2 in November 1988 to target the 386
while IBM was still working on 286 OS/2. Eventually Microsoft and IBM
parted ways, NT OS/2 became Windows NT, which is the starting point of
all remaining Windowses from Windows XP onwards.
Xenix, apart from OS/2 the only other notable protected-mode OS for
the 286, was ported to the 386 in 1987, after SCO secured "knowledge
from Microsoft insiders that Microsoft was no longer developing
Xenix", so SCO (or Microsoft) might have done it even earlier if the commercial situation had been less muddled; in any case, Xenix jumped
the 286 ship ASAP.
The verdict is: The only good use of the 286 is as a faster 8086;
small memory model multi-tasking use is possible, but the 64KB
segments are so limiting that everybody who understood software either
decided to skip this twist (Microsoft, except on their OS/2 death
march), or jumped ship ASAP (SCO).
More generally, vendors could release separate versions of
programs for 8086 and 286 but few did so.
Were there any who released software both as 8086 and as protected-mode
80286 variants? Microsoft/SCO with Xenix, anyone else?
And users having
only binaries wanted to use 8086 on their new systems which
led to heroic efforts like OS/2 DOS box and later Linux
dosemu. But integration of 8086 programs with protected
mode was solved too late for 286 model to gain traction
(and on 286 "DOS box" had to run in real mode, breaking
normal system protection).
Linux never ran on a 80286, and DOSemu uses the virtual 8086 mode,
which does not require heroic efforts AFAIK.
There was various segmented hardware around, first and foremost (for
the designers of the 80286), the iAPX432. And as you write, all the
good reasons that resulted in segments on the iAPX432 also persisted
in the 80286. However, given the slowness of segmentation, only the
tiny (all in one segment), small (one segment for code and one for
data), and maybe medium memory models (one data segment) are
competitive in protected mode compared to real mode.
AFAICS that covered the vast majority of programs during the eighties.
The "vast majority" is not enough; if a key application like Lotus
1-2-3 or Wordperfect did not work on the DOS alternative, the DOS
alternative was not used. And Lotus 1-2-3 and Wordperfect certainly
did not limit themselves to 64KB of data.
Turbo Pascal offered only the medium memory model
According to Terje Mathisen, it also offered the large memory model.
On its Wikipedia page, I find: "Besides allowing applications larger
than 64 KB, Byte in 1988 reported ... for version 4.0". So apparently
Turbo Pascal 4.0 introduced support for the large memory model in
1988.
Intel apparently assumed that programmers are willing to spend
extra work to get good performance and IMO this was right
as a general statement. Intel probably did not realize that
programmers would be very reluctant to spend work on security
features and in particular to spend work on making programs
fast in 286 protected mode.
80286 protected mode is never faster than real mode on the same CPU,
so the way to make programs fast on the 286 is to stick with real
mode; using the small memory model is an alternative, but as
mentioned, the memory limits are too restrictive.
Intel probably assumed that the 286 would cover most needs,
As far as protected mode was concerned, they hardly could have been
more wrong.
especially
given that most systems had much less memory than the 16 MB theoretically
allowed by the 286.
They provided 24 address pins, so they obviously assumed that there
would be 80286 systems with >8MB. 64KB segments are already too
limiting on systems with 1MB (which was supported by the 8086),
probably even for anything beyond 128KB.
IMO this is partially true: there
is a class of programs which with some work fit into the medium
model, but using a flat address space is easier. I think that
on the 286 (that is, with a 16-bit bus) those programs (assuming enough
tuning) run faster than a flat 32-bit version.
Maybe in real mode. Certainly not in protected mode. Just run your
tuned large-model protected-mode program against a 32-bit small-model
program for the same task on a 386SX (which is reported as having a
very similar speed to the 80286 on 16-bit programs).
And even if you
find one case where the protected-mode program wins, nobody found it
worth their time to do this nonsense.
And so OS/2 flopped despite
being backed by IBM and, until 1990, Microsoft.
But I think that Intel segmentation had some
attractive features during eighties.
You are one of a tiny minority. Even Intel finally saw the light, as
did everybody else, and nowadays segments are just a bad memory.
I do believe that IBM did seriously consider the risk of making the
PC too good, so that it would compete directly with their low-end
systems (8100?).
In article <vlervh$174vb$1@dont-email.me>, terje.mathisen@tmsw.no (Terje Mathisen) wrote:
I do believe that IBM did seriously consider the risk of making the
PC too good, so that it would compete directly with their low-end
systems (8100?).
I recall reading back in the 1980s that the PC was intended to be
incapable of competing with the System/36 minis, and the previous
System/34 and /32 machines. It rather failed at that.
John
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
addition to the base (8086) variant of Windows 2.0, which was released
in December 1987), which used 386 protected mode and virtual 8086 mode
(which was missing in the "brain-damaged" (Bill Gates) 286). So
Windows completely ignored 286 protected mode. Windows eventually
became a big success.
What I recall is a bit different. IIRC the first successful version of
Windows, that is Windows 3.0, had 3 modes of operation: 8086 compatible,
286 protected mode and 386 protected mode. Only later did Microsoft
drop the requirement for 8086 compatibility.
I think still later it dropped 286 support.
Windows 95 was supposed to be 32-bit, but contained quite a lot
of 16-bit code.
IIRC the system interface to Windows 3.0 and 3.1 was 16-bit and only
later Microsoft released an extension allowing 32-bit system calls.
I have no information about Windows internals except for some
public statements by Microsoft and other people, but I think
it reasonable to assume that Windows was actually a successful
example of 8086/286/386 compatibility. That is, their 16-bit
code could use real-mode segmentation or protected-mode
segmentation, the latter both on the 286 and the 386. For the 32-bit
version they added a translation layer to transform arguments
between the 16-bit world and the 32-bit world. It is possible
that this translation layer involved a lot of effort.
16 bit dispatching "thunk" DLL to translate calls for every function of every board that we might possibly want to use ...
Anyway, it seems that Windows was at least as tied to the 286
as OS/2 when it became successful, and dropped 286 support
later. And for a long time after dropping 286 support
Windows massively used 16-bit segments.
IIUC Microsoft supported the 8086 in Windows up to 3.0, and probably
so did everybody who wanted to say "supported on Windows". That is,
Windows 3.0 on a 286 almost surely used 286 protected mode and probably
ran "Windows" programs in protected mode. But Windows also supported
the 8086, and Microsoft guidelines insisted that a proper "Windows
program" should run on the 8086.
... Even Intel finally saw the light, as
did everybody else, and nowadays segments are just a bad memory.
Well, 16-bit segments clearly are too limited when one has several
megabytes of memory. And a consistently 32-bit segmented system
increases memory use, which is a nontrivial cost. OTOH there is the
question of how much customers are going to pay for security
features. I think recent times show that security has significant
costs. But lack of security may lead to big losses. So
there is no easy choice.
Now people talk more about capabilities. AFAICS capabilities
offer more than segments, but are going to have a higher cost.
So abstractly, for some systems segments still may look
attractive. OTOH we now understand that the software ecosystem
is much more varied than the prevalent view in the seventies assumed,
so systems that fit segments well are a tiny part.
And considering bad memory, do you remember PAE? That had a
similar spirit to 8086 segmentation. I guess that due
to the bad feeling for segments among programmers (and possibly
more relevant compatibility troubles) Intel did not extend
this to segments, but the spirit was still there.
The bad taste of segments is from exposure to Intel's half-assed
implementation which exposed the segment selector as part of the
address.
Segments /should/ have been implemented similar to the way paging is
done: the program using flat 32-bit addresses and the MMU (SMU?)
consulting some kind of segment "database" [using the term loosely].
Intel had a chance to do it right with the 386, but instead they
doubled down and expanded the existing poor implementation to support
larger segments.
I realize that transistor counts at the time might have made an
on-chip SMU impossible, but ISTM the SMU would have been a very small
component that (if necessary) could have been implemented on-die as a
coprocessor.
How would the addresses be divided into segment and offset in your
model? What would the SMU have to do?
- anton
On Thu, 1 Jan 1970 0:00:00 +0000, John Dallman wrote:
In article <vlervh$174vb$1@dont-email.me>, terje.mathisen@tmsw.no
(Terje Mathisen) wrote:
I do believe that IBM did seriously consider the risk of making the
PC too good, so that it would compete directly with their low-end
systems (8100?).
I recall reading back in the 1980s that the PC was intended to be
incapable of competing with the System/36 minis, and the previous
System/34 and /32 machines. It rather failed at that.
Perhaps IBM should have made them more performant !?!
George Neuner <gneuner2@comcast.net> writes:
The bad taste of segments is from exposure to Intel's half-assed
implementation which exposed the segment selector as part of the
address.
Segments /should/ have been implemented similar to the way paging is
done: the program using flat 32-bit addresses and the MMU (SMU?)
consulting some kind of segment "database" [using the term loosely].
What benefits do you expect from segments? One benefit usually
mentioned is security, in particular, protection against out-of-bounds accesses (e.g., buffer overflow exploits).
Terje Mathisen <terje.mathisen@tmsw.no> writes:
The best idea I have seen to help detect out of bounds accesses, is to
round all requested memory blocks up to the next 4K boundary and mark
the next page as unavailable, then return a skewed pointer back, so that
the end of the requested region coincides with the end of the (last)
allocated page. This does require at least 8kB for every allocation, but
I guess they can all share a single trapping segment?
(This idea does not help locate negative buffer overruns (underruns?)
but they seem to be much less common?)
It also does not help for out-of-bounds accesses that are not just
adjacent to an earlier in-bounds access. That may also be a less
common vulnerability than adjacent positive-stride buffer overflows.
But if we throw hardware on the problem, do we want to spend hardware
on something that does not catch all out-of-bounds accesses?
- anton
Anton Ertl wrote:
George Neuner <gneuner2@comcast.net> writes:
The bad taste of segments is from exposure to Intel's half-assed
implementation which exposed the segment selector as part of the
address.
Segments /should/ have been implemented similar to the way paging is
done: the program using flat 32-bit addresses and the MMU (SMU?)
consulting some kind of segment "database" [using the term loosely].
What benefits do you expect from segments? One benefit usually
mentioned is security, in particular, protection against out-of-bounds
accesses (e.g., buffer overflow exploits).
The best idea I have seen to help detect out of bounds accesses, is to
round all requested memory blocks up to the next 4K boundary and mark
the next page as unavailable, then return a skewed pointer back, so that
the end of the requested region coincides with the end of the (last) allocated page.
This does require at least 8kB for every allocation, but
I guess they can all share a single trapping segment?
(This idea does not help locate negative buffer overruns (underruns?)
but they seem to be much less common?)
Terje
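A minimal sketch of this guard-page trick, assuming POSIX mmap/mprotect
(malloc_guarded() is a made-up name, freeing is omitted, and the page size
is assumed to be a power of two):

/* Sketch of the skewed-pointer + trapping-page idea described above. */
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void *malloc_guarded(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t rounded = (size + page - 1) & ~(page - 1);

    /* data pages plus one extra page that will trap on access */
    uint8_t *base = mmap(NULL, rounded + page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* make the page just past the block inaccessible */
    if (mprotect(base + rounded, page, PROT_NONE) != 0) {
        munmap(base, rounded + page);
        return NULL;
    }

    /* skew the returned pointer so that the end of the requested region
       coincides with the end of the last accessible page (this can break
       alignment when size is not a multiple of the needed alignment) */
    return base + (rounded - size);
}

As noted further down the thread, each allocation also costs a page table
entry for the guard page, but no extra physical memory.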
According to George Neuner <gneuner2@comcast.net>:
The bad taste of segments is from exposure to Intel's half-assed
implementation which exposed the segment selector as part of the
address.
Segments /should/ have been implemented similar to the way paging is
done: the program using flat 32-bit addresses and the MMU (SMU?)
consulting some kind of segment "database" [using the term loosely].
The whole point of a segmented architecture is that the segments are visible
and meaningful. You put a thing (for some definition of thing) in a segment to
control access to the thing. So if it's an array, all of the address
calculations are relative to the segment and out of bounds references fail
because they point to a non-existent part of the segment. Similarly if it's
code, a jump outside the segment's boundaries fails.
Multics and the Burroughs machines had (still have, I suppose, for emulated
Burroughs) visible segments and programmers liked them just fine. The problems
were that the segment sizes were too small as memories got bigger, and that they
weren't byte addressed, which these days is practically mandatory. The 286 added
the additional flaws that there weren't enough segment registers and segment
loads were very slow.
What you're describing is multi-level page tables. Every virtual memory system
has them. Sometimes the operating systems make the higher level tables visible
to applications, sometimes they don't. For example, in IBM mainframes the second
level page table entries, which they call segments, can be shared between
applications.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
The best idea I have seen to help detect out of bounds accesses, is to
round all requested memory blocks up to the next 4K boundary and mark
the next page as unavailable, then return a skewed pointer back, so that
the end of the requested region coincides with the end of the (last)
allocated page. This does require at least 8kB for every allocation, but
I guess they can all share a single trapping segment?
(This idea does not help locate negative buffer overruns (underruns?)
but they seem to be much less common?)
It is also problematic to allocate 8K (or more) for a small entity, or
on the stack.
Bounds checking should ideally impart minimum overhead so that it
can be enabled in production code.
Hmm... a beginning of an idea (for which I am ready to be shot
down, this is comp.arch :-)
This would work best for languages which explicitly pass
array bounds or sizes (like Fortran's assumed size arrays,
or, if I read this correctly, Rust's slices).
Assume a class of load and store instructions containing
- One source or destination register
- One base register
- One index register
- One ubound register
Memory access is to base + index, with one additional point:
If index > ubound, then the instruction raises an exception.
This works less well with C's pointers, for which you would have
to pass some sort of fat pointer. Compilers would have to make
sure that the address of the base object is passed.
Comments?
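A rough C model of the proposed semantics, purely for illustration;
checked_load()/checked_store() and trap() are invented names, and real
hardware would do the comparison alongside the address generation:

/* Software model of the proposed instruction: access base[index] and
   raise an exception if index > ubound. Only the upper bound is
   checked, as in the proposal above. */
#include <stddef.h>
#include <stdlib.h>

static void trap(void)        /* stands in for the hardware exception */
{
    abort();
}

static double checked_load(const double *base, size_t index, size_t ubound)
{
    if (index > ubound)
        trap();
    return base[index];
}

static void checked_store(double *base, size_t index, size_t ubound,
                          double value)
{
    if (index > ubound)
        trap();
    base[index] = value;
}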
What you're describing is multi-level page tables. Every virtual
memory system has them. Sometimes the operating systems make the
higher level tables visible to applications, sometimes they don't. For example, in IBM mainframes the second level page table entries, which
they call segments, can be shared between applications.
Assume a class of load and store instructions containing
- One source or destination register
- One base register
- One index register
- One ubound register
See aforementioned CHERI.
On Mon, 6 Jan 2025 15:05:02 +0000, Terje Mathisen wrote:
Anton Ertl wrote:
George Neuner <gneuner2@comcast.net> writes:
The bad taste of segments is from exposure to Intel's half-assed
implementation which exposed the segment selector as part of the
address.
Segments /should/ have been implemented similar to the way paging is
done: the program using flat 32-bit addresses and the MMU (SMU?)
consulting some kind of segment "database" [using the term loosely].
What benefits do you expect from segments? One benefit usually
mentioned is security, in particular, protection against out-of-bounds
accesses (e.g., buffer overflow exploits).
The best idea I have seen to help detect out of bounds accesses, is to
round all requested memory blocks up to the next 4K boundary and mark
the next page as unavailable, then return a skewed pointer back, so that
the end of the requested region coincides with the end of the (last)
allocated page.
This does require at least 8kB for every allocation, but
I guess they can all share a single trapping segment?
You allocate no more actual memory, but you do consume an additional
virtual address PTE on those pages marked no-access. If, later, you
expand that allocated area, you can THEN allocate a page and update
the PTE.
(This idea does not help locate negative buffer overruns (underruns?)
but they seem to be much less common?)
Use an unallocated page prior to the buffer, too.
On Mon, 6 Jan 2025 22:02:30 +0000, Thomas Koenig wrote:
Hmm... a beginning of an idea (for which I am ready to be shot
down, this is comp.arch :-)
This would work best for languages which explicitly pass
array bounds or sizes (like Fortran's assumed size arrays,
or, if I read this correctly, Rust's slices).
Assume a class of load and store instructions containing
- One source or destination register
- One base register
- One index register
- One ubound register
Memory access is to base + index, with one additional point:
If index > ubound, then the instruction raises an exception.
Now, you are only checking the ubound and not the lbound; so,
you only stumble over ½ the bound errors.
Where you should START is with a data structure that defines
the memory region::
First Byte accessible Possibly lbound
Last Byte accessible Possibly ubound
other stuff as needed
Then figure out how to efficiently perform the checks in ISA
of choice (or add to ISA).
This works less well with C's pointers, for which you would have
to pass some sort of fat pointer. Compilers would have to make
sure that the address of the base object is passed.
I blame the programmers for not using FAT pointers (and then
teaching the compilers how to get rid of most of the checks.)
Nothing is preventing C programmers from using FAT pointers,
and thereby avoid all those buffer overruns.
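For what it's worth, a fat-pointer sketch along those lines in C; all the
names are illustrative, and a compiler would be expected to hoist or remove
most of the checks:

/* A fat pointer carrying base and length, so overruns and (via an
   unsigned index that has wrapped around) underruns are both caught. */
#include <stddef.h>
#include <stdlib.h>

struct fat_ptr {
    double *base;    /* first accessible element */
    size_t  len;     /* number of accessible elements */
};

static double *fat_elem(struct fat_ptr p, size_t i)
{
    if (i >= p.len)      /* a negative index converted to size_t fails too */
        abort();         /* stands in for a bounds trap */
    return &p.base[i];
}

static struct fat_ptr fat_slice(struct fat_ptr p, size_t start, size_t n)
{
    /* sub-slices keep the checking, much like Fortran array sections
       or Rust slices */
    if (start > p.len || n > p.len - start)
        abort();
    return (struct fat_ptr){ p.base + start, n };
}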
Thomas Koenig <tkoenig@netcologne.de> writes:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Assume a class of load and store instructions containing
- One source or destination register
- One base register
- One index register
- One ubound register
See aforementioned CHERI.
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
The floating point size is weird (and also does not catch all
errors, such as writing one element past a huge array).
I haven't seen any consideration of how CHERI would integrate
with languages which have multidimensional arrays. How would
do j=1,10
do i=1,11
a(i,j) = 42.
end do
end do
interact with Cheri if a was a 10*10 array ? Would it be
necessary to create a capability for a(:,j)?
A multidimensional array is a single contiguous
blob of memory, the capability would encompass the
entire region of memory containing the array.
Scott Lurndal <scott@slp53.sl.home> schrieb:
Assume a class of load and store instructions containing
- One source or destination register
- One base register
- One index register
- One ubound register
See aforementioned CHERI.
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
The floating point size is weird (and also does not catch all
errors, such as writing one element past a huge array).
I haven't seen any consideration of how CHERI would integrate
with languages which have multidimensional arrays. How would
do j=1,10
do i=1,11
a(i,j) = 42.
end do
end do
interact with Cheri if a was a 10*10 array ? Would it be
necessary to create a capability for a(:,j)?
On Tue, 07 Jan 2025 14:43:02 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Assume a class of load and store instructions containing
- One source or destination register
- One base register
- One index register
- One ubound register
See aforementioned CHERI.
CHERY targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
The floating point size is weird (and also does not catch all
errors, such as writing one element past a huge array).
I haven't seen any consideration of how CHERI would integrate
with languages which have multidimensional arrays. How would
do j=1,10
do i=1,11
a(i,j) = 42.
end do
end do
interact with Cheri if a was a 10*10 array ? Would it be
necessary to create a capability for a(:,j)?
A multidimensional array is a single contiguous
blob of memory, the capability would encompass the
entire region of memory containing the array.
Then out of bound access like a(9,11) would not be caught.
I don't know if it has to be caught by Fortran rules. It is certainly
caught both in Matlab and in Octave. And Matlab has Fortran roots.
Michael S <already5chosen@yahoo.com> writes:
On Tue, 07 Jan 2025 14:43:02 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Assume a class of load and store instructions containing
- One source or destination register
- One base register
- One index register
- One ubound register
See aforementioned CHERI.
CHERY targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
The floating point size is weird (and also does not catch all
errors, such as writing one element past a huge array).
I haven't seen any consideration of how CHERI would integrate
with languages which have multidimensional arrays. How would
do j=1,10
do i=1,11
a(i,j) = 42.
end do
end do
interact with Cheri if a was a 10*10 array ? Would it be
necessary to create a capability for a(:,j)?
A multidimensional array is a single contiguous
blob of memory, the capability would encompass the
entire region of memory containing the array.
Then out of bound access like a(9,11) would not be caught.
I don't know if it has to be caught by Fortran rules. It is certainly >>caught both in Matlab and in Octave. And Matlab has Fortran roots.
A compiler is free to create row or column capabilities for C or
FORTRAN if the goal is more than just memory safety.
On Tue, 7 Jan 2025 15:04:29 +0000, Michael S wrote:
On Tue, 07 Jan 2025 14:43:02 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
Scott Lurndal <scott@slp53.sl.home> schrieb:
Assume a class of load and store instructions containing
- One source or destination register
- One base register
- One index register
- One ubound register
See aforementioned CHERI.
CHERY targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
The floating point size is weird (and also does not catch all
errors, such as writing one element past a huge array).
I haven't seen any consideration of how CHERI would integrate
with languages which have multidimensional arrays. How would
do j=1,10
do i=1,11
a(i,j) = 42.
end do
end do
interact with Cheri if a was a 10*10 array ? Would it be
necessary to create a capability for a(:,j)?
A multidimensional array is a single contiguous
blob of memory, the capability would encompass the
entire region of memory containing the array.
Then out of bound access like a(9,11) would not be caught.
I don't know if it has to be caught by Fortran rules. It is certainly
caught both in Matlab and in Octave. And Matlab has Fortran roots.
WATFIV would catch a(9,11)
MitchAlsup1 <mitchalsup@aol.com> schrieb:-----------------
WATFIV would catch a(9,11)
Every Fortran compiler I know has bounds checking, but it is
optional for all of them. People still leave it off for production
code because it is too slow.
It would really be great if the performance degradation was
small enough so people would simply leave on the checks
for production code. (Side question: What does CHERI
do with SIMD-vectorized code?)
On Tue, 7 Jan 2025 22:01:55 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:-----------------
WATFIV would catch a(9,11)
Every Fortran compiler I know has a bounds checking, but it is
optional for all of them. People still leave it off for production
code because it is too slow.
It would really be great if the performance degradation was
small enough so people would simply leave on the checks
for production code. (Side question: What does CHERI
do with SIMD-vectorized code?)
How long does it take SW to construct the kinds of slices
a FORTRAN subroutine can hand off to another subroutine ??
That is a CHERI capability that allows for access to every
even byte in a structure but no odd byte in the same structure ??
MitchAlsup1 <mitchalsup@aol.com> schrieb:
On Tue, 7 Jan 2025 22:01:55 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> schrieb:-----------------
WATFIV would catch a(9,11)
Every Fortran compiler I know has a bounds checking, but it is
optional for all of them. People still leave it off for production
code because it is too slow.
It would really be great if the performance degradation was
small enough so people would simply leave on the checks
for production code. (Side question: What does CHERI
do with SIMD-vectorized code?)
How long does it take SW to construct the kinds of slices
a FORTRAN subroutine can hand off to another subroutine ??
For the snippet
subroutine bar
interface
subroutine foo(a)
real, intent(in), dimension(:,:) :: a
end subroutine foo
end interface
real, dimension(10,10) :: a
call foo(a)
end
(which calls foo with an assumed-shape array) gfortran hands this
to the middle end:
__attribute__((fn spec (". ")))
void bar ()
{
real(kind=4) a[100];
{
struct array02_real(kind=4) parm.0;
parm.0.span = 4;
parm.0.dtype = {.elem_len=4, .version=0, .rank=2, .type=3};
parm.0.dim[0].lbound = 1;
parm.0.dim[0].ubound = 10;
parm.0.dim[0].stride = 1;
parm.0.dim[1].lbound = 1;
parm.0.dim[1].ubound = 10;
parm.0.dim[1].stride = 10;
parm.0.data = (void *) &a[0];
parm.0.offset = -11;
foo (&parm.0);
}
}
The middle and back ends are then free to optimize.
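Roughly, the descriptor being filled in corresponds to a C struct like the
one below; the field names follow the dump above, but the exact layout is a
gfortran-internal detail, so treat this only as an approximation:

/* Approximate C view of the array descriptor in the dump above;
   the real definition lives inside libgfortran and may differ. */
#include <stddef.h>

struct descriptor_dimension {
    ptrdiff_t lbound;     /* lower bound of this dimension */
    ptrdiff_t ubound;     /* upper bound of this dimension */
    ptrdiff_t stride;     /* stride in elements, not bytes */
};

struct array02_real4 {               /* rank-2 array of real(kind=4) */
    void     *data;                  /* base address of the data */
    ptrdiff_t offset;                /* -11 in the dump, so that a(1,1)
                                        ends up at the start of data */
    ptrdiff_t span;                  /* element size in bytes */
    struct {                         /* the dtype aggregate in the dump */
        size_t      elem_len;
        int         version;
        signed char rank;
        signed char type;
    } dtype;
    struct descriptor_dimension dim[2];
};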
That is a CHERRI capability that allows for access to every
even byte in a structure but no odd byte in the same structure ??
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
[...]
C does have arrays.
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERY targets C, which on the one hand, I understand (there's a[...]
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERY targets C, which on the one hand, I understand (there's a[...]
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
The C language has always had multi-dimensional arrays, with the limitation
that the dimensions have to be known at compile time.
C99 lifted that limitation, making C support for multi-dimensional
arrays comparable to that in old Fortran.
C11 said that lifting is optional.
Now C23 makes part of the lifting (variably-modified types) again
mandatory.
Relative to F90, support for multi-dimensional arrays in C23 is
primitive.
There are no array descriptors generated automatically by the
compiler. But saying that there is no support is incorrect.
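For instance, with a C99-style variably-modified parameter type the callee
can index with run-time dimensions, although nothing checks the bounds (a
small standard-C example, not tied to any particular compiler):

/* The dimensions are passed explicitly and the compiler generates the
   indexing arithmetic; there is no bounds checking. */
#include <stdio.h>

static void fill(size_t m, size_t n, double a[m][n], double value)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            a[i][j] = value;
}

int main(void)
{
    double x[10][10];
    fill(10, 10, x, 42.0);
    printf("%g\n", x[9][9]);   /* prints 42 */
    return 0;
}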
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERY targets C, which on the one hand, I understand (there's a[...]
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike
C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
C language always had multi-dimensional arrays, with limitation that dimensions have to be known in compile time.
C99 lifted that limitation, making C support for multi-dimensional
arrays comparable to that in old Fortran.
C11 said that lifting is optional.
Now C23 makes part of the lifting (variably-modified types) again mandatory.
I'd missed that one.
Relatively to F90, support for multi-dimensional arrays in C23 is primitive.
From what you describe, support for multi-dimensional arrays
in C23 now reached the level of Fortran II, released in
1958. Only a bit more than six decades, can't complain
about that.
There are no array descriptors generated automatically by
compiler. But saying that there is no support is incorrect.
What happens for mismatched array bounds between caller
and callee? Nothing, I guess?
Thomas Koenig <tkoenig@netcologne.de> writes:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERY targets C, which on the one hand, I understand (there's a[...]
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
In most but not all contexts. For example, `sizeof arr` yields the size
of the array, not the size of a pointer.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
In C, multidimensional arrays are nothing more or less than arrays of
arrays. You can also build data structures using pointers that are
accessed using the same a[i][j] syntax as is used for a multidimensional array. And yes, they can be difficult to work with.
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERY targets C, which on the one hand, I understand (there's a[...]
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike
C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
C language always had multi-dimensional arrays, with limitation that
dimensions have to be known in compile time.
C99 lifted that limitation, making C support for multi-dimensional
arrays comparable to that in old Fortran.
C11 said that lifting is optional.
Now C23 makes part of the lifting (variably-modified types) again
mandatory.
I'd missed that one.
Relatively to F90, support for multi-dimensional arrays in C23 is
primitive.
From what you describe, support for multi-dimensional arrays
in C23 now reached the level of Fortran II, released in
1958. Only a bit more than six decades, can't complain
about that.
Well, apart from playing with what is mandatory and what is not, the array
stuff in C has not changed (AFAIK) since C99. So, more like four
decades. Or 33 years since Fortran got its first standard.
There are no array descriptors generated automatically by
compiler. But saying that there is no support is incorrect.
What happens for mismatched array bounds between caller
and callee? Nothing, I guess?
I don't know. I didn't read this part of the standard. Or any part of
any C standard past C89.
Never used them either. For me, multi-dimensional arrays look mostly like a
source of confusion rather than a useful feature, at least as long as
there are no automatically generated descriptors. With the exception of
VERY conservative cases like array fields in a structure, with all
dimensions fixed at compile time.
I don't know, but I can guess. And in case I am wrong Keith Thompson
will correct me.
Most likely the standard says that mismatched array bounds between
caller and callee are UB.
And most likely in practice it works as expected. I.e. if the caller
defined the matrix as X[M][N] and the callee treats it as Y[P][Q], then an
access to Y[i][j] will, as long as k = i*Q+j < M*N, go to X[k/N][k%N].
However, note that in practice something like that happening by mistake
is far less likely with variably-modified types than it is with classic C
multi-dimensional arrays.
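To make that concrete, a small illustration with made-up sizes; it relies
on the usual row-major layout, and the mismatched view is exactly the
undefined behaviour being discussed:

/* Viewing a 10 x 10 array as if its rows had 11 columns: in practice
   the flat row-major layout makes y[i][j] land on x[k/10][k%10] with
   k = i*11 + j, but the standard guarantees nothing here. */
#include <stdio.h>

int main(void)
{
    static double x[10][10];
    double (*y)[11] = (double (*)[11])(void *)x;

    y[2][3] = 1.0;             /* k = 2*11 + 3 = 25 -> x[2][5] */
    printf("%g\n", x[2][5]);   /* prints 1 in practice */
    return 0;
}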
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERY targets C, which on the one hand, I understand (there's a[...]
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike
C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
C language always had multi-dimensional arrays, with limitation that
dimensions have to be known in compile time.
C99 lifted that limitation, making C support for multi-dimensional
arrays comparable to that in old Fortran.
C11 said that lifting is optional.
Now C23 makes part of the lifting (variably-modified types) again
mandatory.
I'd missed that one.
Relatively to F90, support for multi-dimensional arrays in C23 is
primitive.
From what you describe, support for multi-dimensional arrays
in C23 now reached the level of Fortran II, released in
1958. Only a bit more than six decades, can't complain
about that.
Well, apart from playing with what is mandatory and what is not, arrays
stuff in C had not changed (AFAIK) since C99.
It's not mandatory, so compilers are free to ignore it (and a
major compiler, from a certain company in Redmond, does
not support it). That's as good as saying that it does not
exist in the language.
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERY targets C, which on the one hand, I understand (there's a[...]
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
Rust provides an interesting data point here: it has Vec<>, which is always
implemented as a dope vector, i.e. a header which contains the starting point
and current length, along with the allocated size. For multidimensional work,
the natural mapping is Vec<Vec<>>, i.e. similar to classic C arrays of arrays,
but with boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit multiplication
to get the actual position:
array[y][x] -> array[y*width + x]
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERY targets C, which on the one hand, I understand (there's >>>>>>>>> a ton of C code out there), but trying to retrofit a safe[...]
memory model onto C seems a bit awkward - it might have been >>>>>>>>> better to target a language which has arrays in the first
place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
C language always had multi-dimensional arrays, with limitation
that dimensions have to be known in compile time.
C99 lifted that limitation, making C support for
multi-dimensional arrays comparable to that in old Fortran.
C11 said that lifting is optional.
Now C23 makes part of the lifting (variably-modified types) again
mandatory.
I'd missed that one.
Relatively to F90, support for multi-dimensional arrays in C23 is
primitive.
From what you describe, support for multi-dimensional arrays
in C23 now reached the level of Fortran II, released in
1958. Only a bit more than six decades, can't complain
about that.
Well, apart from playing with what is mandatory and what is not,
arrays stuff in C had not changed (AFAIK) since C99.
It's not mandatory, so compilers are free to ignore it (and a
major compiler, from a certain company in Redmond, does
not support it). That's as good as saying that it does not
exist in the language.
Not really, no. The world of C programmers can be divided into those
that work with C compilers and can freely use pretty much anything in
C, and those that have to contend with limited, non-standard or
otherwise problematic compilers and write code that works for them.
Such compilers include embedded toolchains for very small
microcontrollers or DSPs, and MS's C compiler.
Some C code needs to be written in a way that works on MS's C
compiler as well as other tools, but most is free from such
limitations. Even for those targeting Windows, it's common to use
clang or gcc for serious C coding.
MS used to have a long-term policy of specifically not supporting C
well because that might make it easier for people to write
cross-platform C code for Windows and Linux. Instead, they preferred
to push developers towards C# and Windows-only programming - or if
that failed, C++ which was not as commonly used on *nix. Now, I
think, they just don't care much about C - they don't see many people
using their tools for C and haven't bothered supporting any feature
that needs much effort. They know that they can't catch up with
other C compilers, so have made it easier to integrate clang with
their IDE's and development tools.
Microsoft does care about C, but only in one specific area - kernel programming.
The only other language officially allowed for Windows
kernel programming is C++, but coding kernel drivers in C++ is
discouraged.
I suppose that driver written in C++ would have major
difficulties passing Windows HLK tests and getting WHQL signing.
As you can guess, in kernel drivers VLA are unwelcome.
VMTs are maybe tolerable (I wonder what the current policy of the
Linux and BSD kernels is), but hardly desirable.
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's
a ton of C code out there), but trying to retrofit a safe
memory model onto C seems a bit awkward - it might have been
better to target a language which has arrays in the first
place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
The C language has always had multi-dimensional arrays, with the limitation
that the dimensions have to be known at compile time.
C99 lifted that limitation, making C support for
multi-dimensional arrays comparable to that in old Fortran.
C11 said that lifting is optional.
Now C23 makes part of the lifting (variably-modified types) again
mandatory.
I'd missed that one.
Relative to F90, support for multi-dimensional arrays in C23 is
primitive.
From what you describe, support for multi-dimensional arrays
in C23 has now reached the level of Fortran II, released in
1958. Only a bit more than six decades, can't complain
about that.
Well, apart from playing with what is mandatory and what is not,
arrays stuff in C had not changed (AFAIK) since C99.
It's not mandatory, so compilers are free to ignore it (and a
major compiler, from a certain company in Redmond, does
not support it). That's as good as saying that it does not
exist in the language.
Not really, no. The world of C programmers can be divided into those
that work with C compilers and can freely use pretty much anything in
C, and those that have to contend with limited, non-standard or
otherwise problematic compilers and write code that works for them.
Such compilers include embedded toolchains for very small
microcontrollers or DSPs, and MS's C compiler.
Some C code needs to be written in a way that works on MS's C
compiler as well as other tools, but most is free from such
limitations. Even for those targeting Windows, it's common to use
clang or gcc for serious C coding.
MS used to have a long-term policy of specifically not supporting C
well because that might make it easier for people to write
cross-platform C code for Windows and Linux. Instead, they preferred
to push developers towards C# and Windows-only programming - or if
that failed, C++ which was not as commonly used on *nix. Now, I
think, they just don't care much about C - they don't see many people
using their tools for C and haven't bothered supporting any feature
that needs much effort. They know that they can't catch up with
other C compilers, so have made it easier to integrate clang with
their IDE's and development tools.
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VMTs are maybe tolerable (I wonder what the current policy of the
Linux and BSD kernels is), but hardly desirable.
On 1/16/2025 1:11 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Rust provides an interesting data point here:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
It has Vec<> which is always implemented as a dope vector, i.e. a header
which contains the starting point and current length, along with
allocated size. For multidimensional work, the natural mapping is
Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit multiplication
to get the actual position:
array[y][x] -> array[y*width + x]
Terje
I am obviously missing something, but why doesn't the compiler generate
code for that itself?
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Rust provides an interesting data point here:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
It has Vec<> which is always implemented as a dope vector, i.e. a header which contains the starting point and current length, along with
allocated size. For multidimensional work, the natural mapping is
Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit multiplication
to get the actual position:
array[y][x] -> array[y*width + x]
Terje
On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:
On 1/16/2025 1:11 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Rust provides an interesting data point here:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
It has Vec<> which is always implemented as a dope vector, i.e. a header
which contains the starting point and current length, along with
allocated size. For multidimensional work, the natural mapping is
Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit multiplication
to get the actual position:
array[y][x] -> array[y*width + x]
Terje
I am obviously missing something, but why doesn't the compiler generate
code for that itself?
The compiler does; but with a dope-vector in view, the compiler
inserts additional checks on the arithmetic and addressing.
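(Editorial aside: a minimal C sketch of the flattening Terje describes,
with one explicit check standing in for the per-dimension checks a
checked dope-vector access performs; the struct and function names are
invented for the example.)

#include <stdio.h>
#include <stdlib.h>

/* One contiguous block, indexed as y*width + x. */
struct Grid {
    size_t width, height;
    double *data;                     /* width*height elements */
};

static double *at(struct Grid *g, size_t y, size_t x)
{
    /* One combined check instead of an outer and an inner one. */
    if (y >= g->height || x >= g->width) {
        fprintf(stderr, "index out of range\n");
        exit(EXIT_FAILURE);
    }
    return &g->data[y * g->width + x];
}

int main(void)
{
    struct Grid g = { 4, 3, calloc(3 * 4, sizeof(double)) };
    if (!g.data)
        return EXIT_FAILURE;
    *at(&g, 2, 1) = 42.0;             /* the equivalent of array[2][1] = 42.0 */
    printf("%g\n", *at(&g, 2, 1));
    free(g.data);
    return 0;
}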
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLAs normally allocate on the stack.
Which at first glance looks
great. But once one realizes how small stacks are in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use a VLA one needs a rather small bound on the maximal
size of the array. Given such a bound, always allocating the maximal
size is simpler. Without a _small_ bound on the size, the heap is
safer, as it is designed to handle big allocations as well.
In the past I was a fan of VLA and stack allocation in general.
But I saw enough bug reports due to programs exceeding their
stack limits that I changed my view.
I do not know about Windows, but IIUC for some period the Linux limit
for the kernel stack was something like 2 kB (a single page shared
with some other per-process data structures). I think it
was increased later, but even moderate-size arrays are
unwelcome on the kernel stack due to size limits.
On 1/16/2025 9:24 AM, MitchAlsup1 wrote:
On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:
On 1/16/2025 1:11 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Rust provides an interesting data point here:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
It has Vec<> which is always implemented as a dope vector, i.e. a header
which contains the starting point and current length, along with
allocated size. For multidimensional work, the natural mapping is
Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit multiplication
to get the actual position:
array[y][x] -> array[y*width + x]
Terje
I am obviously missing something, but why doesn't the compiler generate
code for that itself?
The compiler does; but with a dope-vector in view, the compiler
inserts additional checks on the arithmetic and addressing.
OK, so Terje's observation of it being faster doing the calculation
himself is due to him not doing these additional checks?
Waldek Hebisch <antispam@fricas.org> schrieb:
David Brown <david.brown@hesbynett.no> wrote:
Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Stacks are small because OS people make them small, not because of
a valid technical reason that has ever been explained to me.
"To avoid infinite recursion" is not a valid reason, IMHO.
Basically, to use VLA one needs rather small bound on maximal
size of array. Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
Allocating data on the stack promotes cache locality, which can
increase performance by quite a lot.
Waldek Hebisch <antispam@fricas.org> schrieb:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why.
I've never understood why people think there is something
"dangerous" about VLAs, or why they think using heap allocations
is somehow "safer".
VLA normally allocate on the stack.
You can pass them as VLAs (which Fortran has had since 1958)
or you can declare them. It is the latter which would need
to allocate on the stack.
But allocating them on the stack is an implementation detail.
Since Fortran 90, you can also do
subroutine foo(n,m)
integer, intent(in) :: n,m
real, dimension(n,m) :: a
which will declare the array a with the bounds of n and m.
(Fortran can also do dynamic memory allocation, so
subroutine foo(n,m)
integer, intent(in) :: n,m
real, dimension(:,:), allocatable :: c
allocate (c(n,m))
would also work, and also automatically release the memory).
Because Fortran users are used to large arrays, any good Fortran
compiler will also allocate a on the heap if it is too large.
Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Stacks are small because OS people make them small, not because of
a valid technical reason that has ever been explained to me.
"To avoid infinite recursion" is not a valid reason, IMHO.
Basically, to use VLA one needs rather small bound on maximal
size of array. Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
Allocating data on the stack promotes cache locality, which can
increase performance by quite a lot.
If you have a memory allocation pattern like
p1 = malloc(chunk_1); /* Fill it */
p2 = malloc(chunk_2);
/* Use it */
free (p2);
p3 = malloc(chunk_3);
/* Use it */
free (p3);
/* Use p1 */
There is a chance that p2 still pollutes the cache and parts of
p1 may have been removed unnecessarily. This would not have been
the case if p2 and p3 had been allocated on the stack.
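(Editorial aside: a minimal sketch of the stack-allocated version of
that pattern, with invented names and sizes: the short-lived buffers
come and go in the same cache-warm stack region instead of being
scattered by the allocator.)

#include <string.h>

void work(size_t chunk_1, size_t chunk_2, size_t chunk_3)
{
    char p1[chunk_1];                  /* VLA: fill it, use it last */
    memset(p1, 0, chunk_1);
    {
        char p2[chunk_2];              /* use it */
        memset(p2, 1, chunk_2);
    }                                  /* p2 released here */
    {
        char p3[chunk_3];              /* use it, reusing p2's stack space */
        memset(p3, 2, chunk_3);
    }                                  /* p3 released here */
    /* use p1 */
    (void)p1;
}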
In the past I was a fan of VLA and stack allocation in general.
But I saw enough bug reports due to programs exceeding their
stack limits that I changed my view.
Stack limits are artificial, but
I do not know about Windows, but IIUC in some period Linux limit
for kernel stack was something like 2 kB (single page shared
with some other per-process data structures). I think it
was increased later, but even moderate size arrays are
unwelcome on kernel stack due to size limits.
... for kernels maybe less so.
On 1/16/2025 1:11 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Rust provides an interesting data point here:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
It has Vec<> which is always implemented as a dope vector, i.e. a
header which contains the starting point and current length, along
with allocated size. For multidimensional work, the natural mapping is
Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit
multiplication to get the actual position:
 array[y][x] -> array[y*width + x]
Terje
I am obviously missing something, but why doesn't the compiler generate
code for that itself?
On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:
On 1/16/2025 1:11 AM, Terje Mathisen wrote:
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit multiplication
to get the actual position:
array[y][x] -> array[y*width + x]
Terje
I am obviously missing something, but why doesn't the compiler generate
code for that itself?
The compiler does; but with a dope-vector in view, the compiler
inserts additional checks on the arithmetic and addressing.
On 1/16/2025 9:24 AM, MitchAlsup1 wrote:
On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:
On 1/16/2025 1:11 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Rust provides an interesting data point here:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
It has Vec<> which is always implemented as a dope vector, i.e. a
header
which contains the starting point and current length, along with
allocated size. For multidimensional work, the natural mapping is
Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit
multiplication
to get the actual position:
  array[y][x] -> array[y*width + x]
Terje
I am obviously missing something, but why doesn't the compiler generate
code for that itself?
The compiler does; but with a dope-vector in view, the compiler
inserts additional checks on the arithmetic and addressing.
OK, so Terje's observation of it being faster doing the calculation
himself is due to him not doing these additional checks?
Thomas Koenig <tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
IIUC popular current processors are still quite far from having
a full 64-bit virtual address space, so there is still reason to limit
stack size; it is simply that the limit can be much bigger than on 32-bit
systems.
Waldek Hebisch <antispam@fricas.org> schrieb:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack.
You can pass them as VLAs (which Fortran has had since 1958)
or you can declare them.
Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Stacks are small because OS people make them small, not because of
a valid technical reason that has ever been explained to me.
"To avoid infinite recursion" is not a valid reason, IMHO.
Basically, to use VLA one needs rather small bound on maximal
size of array. Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
Allocating data on the stack promotes cache locality, which can
increase performance by quite a lot.
In the past I was a fan of VLA and stack allocation in general.
But I saw enough bug reports due to programs exceeding their
stack limits that I changed my view.
Stack limits are artificial, but
I do not know about Windows, but IIUC in some period Linux limit
for kernel stack was something like 2 kB (single page shared
with some other per-process data structures). I think it
was increased later, but even moderate size arrays are
unwelcome on kernel stack due to size limits.
... for kernels maybe less so.
David Brown <david.brown@hesbynett.no> writes:
On 16/01/2025 10:11, Terje Mathisen wrote:
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Rust provides an interesting data point here:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
It has Vec<> which is always implemented as a dope vector, i.e. a
header which contains the starting point and current length, along
with allocated size. For multidimendional work, the natural mapping
is Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit
multiplication to get the actual position:
array[y][x] -> array[y*width + x]
Note that this will inhibit bounds checking on the inner dimension.
That might be part of the reason for the improvement in speed.
For example, given int array[10][10], array[0][11] is out of bounds,
even if it logically refers to the same location as array[1][1]. This results in undefined behavior in C, and perhaps some kind of exception
in a language that requires bounds checking. If you do this manually by defining a 1d array, any checking applies only to the entire array.
That does not surprise me. Vec<> in Rust is very similar to
std::vector<> in C++, as far as I know (correct me if that's wrong).
So a vector of vectors of int is not contiguous or consistent - each
subvector can have a different current size and capacity. Doing a
bounds check for accessing xs[i][j] (or in C++ syntax, xs.at(i).at(j)
when you want bounds checking) means first reading the current size
member of the outer vector, and checking "i" against that. Then xs[i]
is found (by adding "i * sizeof(vector)" to the data pointer stored in
the outer vector). That is looked up to find the current size of this
inner vector for bounds checking, then the actual data can be found.
I'm not familiar with Rust's Vec<>, but C++'s std::vector<> guarantees
that the elements are stored contiguously. But the std::vector<> object itself doesn't contain those elements; it's a fixed-size chunk of data (basically a struct in C terms) whose size doesn't change regardless of
the number of elements (and typically regardless of the element type).
So a std::vector<std::vector<int>> will result in the data for each row
being stored contiguously, but the rows themselves will be allocated dynamically.
This is /completely/ different from classic C multi-dimensional
arrays. It is more akin to a one-dimensional C array of pointers to
individually allocated one-dimensional C arrays - but even less
efficient due to an extra layer of indirection.
If you know the size of the data at compile time, then in C++ you have
std::array<> where the information about size is carried in the type,
with no run-time cost. A nested std::array<> is a perfectly good and
efficient multi-dimensional array with runtime bounds checking if you
want to use it, as well as having value semantics (no decay to pointer
types in expressions). I would guess there is something equivalent in
Rust ?
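(Editorial aside: a small C sketch of the layout difference being
described; the names are invented. An array of pointers to individually
allocated rows shares indexing syntax with a classic contiguous C
two-dimensional array, but not its layout or access cost. Error checks
are omitted for brevity.)

#include <stdlib.h>

enum { H = 3, W = 4 };

/* Vector-of-vectors style: an array of pointers to separately allocated
   rows.  Every access loads rows[y] before it can index into the row. */
double **make_rows(void)
{
    double **rows = malloc(H * sizeof *rows);
    for (size_t y = 0; y < H; y++)
        rows[y] = calloc(W, sizeof **rows);
    return rows;
}

int main(void)
{
    double **a = make_rows();
    double  b[H][W] = {{0}};           /* classic C 2-D array: one block    */

    a[2][1] = 1.0;                     /* pointer load, then index          */
    b[2][1] = 1.0;                     /* pure index arithmetic, same syntax */

    for (size_t y = 0; y < H; y++)
        free(a[y]);
    free(a);
    return 0;
}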
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
In the past I was a fan of VLA and stack allocation in general.
But I saw enough bug reports due to programs exceeding their
stack limits that I changed my view.
I do not know about Windows, but IIUC in some period Linux limit
for kernel stack was something like 2 kB (single page shared
with some other per-process data structures). I think it
was increased later, but even moderate size arrays are
unwelcome on kernel stack due to size limits.
VMTs are maybe tolerable (I wonder what the current policy of the
Linux and BSD kernels is), but hardly desirable.
IMO VMT-s are vastly superior to raw pointers, but to fully
get their advantages one would need better tools. Also,
kernel needs to deal with variable size arrays embedded in
various data structures. This is possible using pointers,
but current VMT-s are too weak for many such uses.
On 16/01/2025 17:46, Waldek Hebisch wrote:
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack. You don't allocate
anything on the heap without knowing the bounds and being sure it is
appropriate. There's no fundamental difference - it's just the cut-off
point that is different.
The stack on Linux is 10 MB by default, and 1 MB by default on Windows.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
I do know that several people have created fast string libraries,
where any string that is short enough ends up entirely inside the dope
vector, so no heap allocation.
Some implementations of C++ std::string do this. For example, the GNU implementation appears to store up to 16 characters (including the
trailing null character) in the std::string object.
On 16/01/2025 17:46, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Sure.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack. You don't allocate anything on the heap without knowing the bounds and being sure it is appropriate. There's no fundamental difference - it's just the cut-off
point that is different.
The stack on Linux is 10 MB by default, and 1 MB by default on Windows. That's a /lot/ if you are working with fairly small but non-constant
sizes. So if you are working with a selection of short-lived
medium-sized bits of data - say, parts of strings for some formatting
work - putting them on the stack is safe and can be significantly faster
than using the heap.
Using VLAs (or the older but related technique, alloca) means you don't
waste space. Maybe you are working with file paths, and want to support
up to 4096 characters per path - but in reality most paths are less than
100 characters. With fixed size arrays, allocating 16 of these and initialising them will use up your entire level 1 cache - with VLAs, it
will use only a tiny fraction.
These things can make a big difference
to code that aims to be fast.
Fixed size arrays are certainly easier to analyse and are often a good choice, but VLA's definitely have their advantages in some situations,
and they are perfectly safe and reliable if you use them appropriately
and correctly.
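(Editorial aside: a minimal sketch of the cut-off idea; the threshold,
helper name and sizes are invented, not from the thread. Small,
short-lived buffers go on the stack; anything above the bound falls
back to the heap.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STACK_CUTOFF 4096u             /* keep well below the stack limit */

void print_path(const char *dir, const char *file)
{
    size_t need = strlen(dir) + 1 + strlen(file) + 1;

    if (need <= STACK_CUTOFF) {
        char buf[need];                /* VLA: no heap traffic, exact size */
        snprintf(buf, need, "%s/%s", dir, file);
        puts(buf);
    } else {
        char *buf = malloc(need);      /* the rare big case goes to the heap */
        if (!buf)
            return;
        snprintf(buf, need, "%s/%s", dir, file);
        puts(buf);
        free(buf);
    }
}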
In the past I was a fan of VLA and stack allocation in general.
But I saw enough bug reports due to programs exceeding their
stack limits that I changed my view.
Other people might have bad uses of VLAs - it doesn't mean /you/ have to
use them badly too!
Far and away my most common use of VLAs is, however, not variable length
at all. It's more like :
const int no_of_whatsits = 20;
const int size_of_whatsit = 4;
uint8_t whatsits_data[no_of_whatsits * size_of_whatsit];
Technically in C, that is a VLA because the size expression is not a
constant expression according to the rules of the language. But of
course it is a size that is known at compile-time, and the compiler
generates exactly the same code as if it was a constant expression.
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
I do know that several people have created fast string libraries,
where any string that is short enough ends up entirely inside the dope
vector, so no heap allocation.
Some implementations of C++ std::string do this. For example, the GNU
implementation appears to store up to 16 characters (including the
trailing null character) in the std::string object.
Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!
I don't understand. What pointer are you referring to?
In the implementation I'm referring to, std::string happens to be 32
bytes in size. If the string has a length of 15 or less, the string
data is stored directly in the std::string object (in the last 16 bytes
as it happens). If the string is longer than that it's stored
elsewhere, and that 16 bytes is presumably used to manage the
heap-allocated data.
David Brown <david.brown@hesbynett.no> writes:
On 16/01/2025 17:46, Waldek Hebisch wrote:
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack. You don't allocate
anything on the heap without knowing the bounds and being sure it is
appropriate. There's no fundamental difference - it's just the cut-off
point that is different.
The stack on Linux is 10 MB by default, and 1 MB by default on Windows.
On all the linux systems I use, the stack limit defaults to 8192KB.
That includes RHEL, Fedora, CentOS, Scientific Linux and Ubuntu.
Now, that's for the primary thread stack, for which the OS
manages the growth. For other threads in the process,
the size varies based on the threads library in use
and whether the application is compiled for 32-bit or
64-bit systems.
antispam@fricas.org (Waldek Hebisch) writes:
[...]
Well, AFAICS VLA-s may get allocated on function entry.[...]
That would rarely be possible for objects with automatic storage
duration (local variables). For example:
#include <stdlib.h>                /* for rand() */
void do_this(void), do_that(void);

void func(void) {
    do_this();
    do_that();
    int vla[rand() % 10 + 1];      /* size not known until this point */
}
Memory for `vla` can't be allocated until its size is known,
and it can't be known until the definition is reached. For most automatically allocated objects, the lifetime begins when execution
reaches the `{` of the enclosing block; the lifetime of `vla`
begins at its definition.
Or did you have something else in mind?
(Should this part of the discussion migrate to comp.lang.c, or is it
still sufficiently relevant to computer architecture?)
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 17:46, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Sure.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack. You don't allocate
anything on the heap without knowing the bounds and being sure it is
appropriate. There's no fundamental difference - it's just the cut-off
point that is different.
Well, AFAICS VLA-s may get allocated on function entry.
In such a case the caller has to check the allocation size, which spreads
allocation-related code between the caller and the called function.
In the case of 'malloc' one can simply check the return value.
In fact, in many programs a simple wrapper that exits in case of allocation
failure is enough (if the application cannot do its work without
memory and there is no memory, then there is no point in continuing
execution).
On 1/17/25 4:20 AM, David Brown wrote:
On 16/01/2025 22:40, Scott Lurndal wrote:
David Brown <david.brown@hesbynett.no> writes:
On 16/01/2025 17:46, Waldek Hebisch wrote:
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack. You don't allocate
anything on the heap without knowing the bounds and being sure it is
appropriate. There's no fundamental difference - it's just the cut-off
point that is different.
The stack on Linux is 10 MB by default, and 1 MB by default on Windows.
On all the linux systems I use, the stack limit defaults to 8192KB.
On linux, one can call the routine setrlimit(RLIMIT_STACK, ...) to change
the stack size.
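(Editorial aside: setrlimit(2) is a standard POSIX interface; a minimal
sketch of raising the soft stack limit, with the 64 MB figure invented
for the example, might look like this.)

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    rl.rlim_cur = 64ull * 1024 * 1024;          /* ask for a 64 MB stack */
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;              /* soft may not exceed hard */
    if (setrlimit(RLIMIT_STACK, &rl) != 0)
        perror("setrlimit");
    /* Note: on Linux the new limit affects the main thread's growth and
       threads created after this point, not stacks already allocated. */
    return 0;
}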
On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
I do know that several people have created fast string libraries,
where any string that is short enough ends up entirely inside the dope
vector, so no heap allocation.
Some implementations of C++ std::string do this. For example, the GNU
implementation appears to store up to 16 characters (including the
trailing null character) in the std::string object.
Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!
I don't understand. What pointer are you referring to?
The pointer which would have had to point elsewhere had the string
not been contained within.
In the implementation I'm referring to, std::string happens to be 32
bytes in size. If the string has a length of 15 or less, the string
data is stored directly in the std::string object (in the last 16 bytes
as it happens). If the string is longer than that it's stored
elsewhere, and that 16 bytes is presumably used to manage the
heap-allocated data.
On 16/01/2025 17:46, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Sure.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack.
You don't allocate
anything on the heap without knowing the bounds and being sure it is appropriate. There's no fundamental difference - it's just the cut-off
point that is different.
On 17/01/2025 03:10, MitchAlsup1 wrote:
On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
I do know that several people have created fast string libraries,
where any string that is short enough ends up entirely inside the
dope
vector, so no heap allocation.
Some implementations of C++ std::string do this. For example, the GNU
implementation appears to store up to 16 characters (including the
trailing null character) in the std::string object.
Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!
I don't understand. What pointer are you referring to?
The pointer which would have had to point elsewhere had the string
not been contained within.
There are a couple of ways you can do "small string optimisation". One would be to have a structure something like this :
struct String1 {
    size_t capacity;              /* bytes available at data */
    char *data;                   /* points at small_string or a malloc'd buffer */
    char small_string[16];
};
Then "data" would point to "small_string" for a capacity of 16, and if
that's not enough, use malloc to allocate more space.
An alternative would be to have something like this (I'm being /really/ sloppy with alignments, rules for unions, and so on - this is
illustrative only, not real code!) :
struct String2 {
    bool is_small;                /* needs <stdbool.h> before C23 */
    union {
        char small_string[31];
        struct {
            size_t capacity;
            char *data;
        };                        /* anonymous members: C11 and later */
    };
};
This second version lets you put more characters in the local
small_string area, reusing space that would otherwise be used for the pointer and capacity. But it has more runtime overhead when using the string :
void print1(struct String1 s) {
    printf("%s", s.data);
}

void print2(struct String2 s) {
    if (s.is_small) {
        printf("%s", s.small_string);
    } else {
        printf("%s", s.data);
    }
}
There are, of course, many other ways to make string types (such as supporting copy-on-write), but I suspect that Mitch is thinking of style String2 while Keith is thinking of style String1.
David Brown <david.brown@hesbynett.no> schrieb:
On 16/01/2025 17:46, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Sure.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack.
In general, that is a hard thing to know - there is no standard
way to enquire the size of the stack, how much you have already
used, how deep you are going to recurse, or how much stack
a function will use.
These may be points that you are looking at for your embedded work,
but the average programmer does not.
An example, Fortran-specific: Fortran 2018 made all procedures
recursive by default. This means that some Fortran codes will start
crashing because of stack overruns when this is implemented :-(
You don't allocate
anything on the heap without knowing the bounds and being sure it is
appropriate. There's no fundamental difference - it's just the cut-off
point that is different.
What would you recommend as a limit? (See fmax-stack-var-size=N
in gfortran).
On 17/01/2025 17:42, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:
On 16/01/2025 17:46, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Sure.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack.
In general, that is a hard thing to know - there is no standard
way to enquire the size of the stack, how much you have already
used, how deep you are going to recurse, or how much stack
a function will use.
That would be another way of saying you have no idea when your program
is going to blow up from lack of stack space. You don't need VLAs to
cause such problems.
In reality, you /do/ know a fair amount. Often your knowledge is
approximate - you know you are not going to need anything like as much
stack as the system provides, and you don't worry about it. In other situations (such as in small embedded systems), you think about it all
the time - again, regardless of any VLAs.
If you are in a position where you suspect you might be pushing close to
the limits of your stack, "standard" doesn't come into it - you are
dealing with a particular target, and you can use whatever functions or support that target provides.
mitchalsup@aol.com (MitchAlsup1) writes:
On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
[...]
I do know that several people have created fast string libraries,
where any string that is short enough ends up entirely inside the dope
vector, so no heap allocation.
Some implementations of C++ std::string do this. For example, the GNU
implementation appears to store up to 16 characters (including the
trailing null character) in the std::string object.
Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!
I don't understand. What pointer are you referring to?
In the implementation I'm referring to, std::string happens to be 32
bytes in size. If the string has a length of 15 or less, the string
data is stored directly in the std::string object (in the last 16 bytes
as it happens). If the string is longer than that it's stored
elsewhere, and that 16 bytes is presumably used to manage the
heap-allocated data.
On all the linux systems I use, the stack limit defaults to 8192KB.
That includes RHEL, Fedora, CentOS, Scientific Linux and Ubuntu.
Now, that's for the primary thread stack, for which the OS
manages the growth. For other threads in the process,
the size varies based on the threads library in use
and whether the application is compiled for 32-bit or
64-bit systems.
Stephen Fuld wrote:
On 1/16/2025 1:11 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Rust provides an interesting data point here:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
It has Vec<> which is always implemented as a dope vector, i.e. a
header which contains the starting point and current length, along
with allocated size. For multidimensional work, the natural mapping
is Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit
multiplication to get the actual position:
 array[y][x] -> array[y*width + x]
Terje
I am obviously missing something, but why doesn't the compiler
generate code for that itself?
Because Rust really doesn't have multi-dim vectors, instead using
vectors of pointers to vectors?
OTOH, it is perfectly OK to create your own multi-dim data structure,
and using macros you can probably get the compiler to generate near-
optimal code as well, but afaik, nothing like that is part of the core language.
David Brown <david.brown@hesbynett.no> schrieb:
On 16/01/2025 17:46, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Sure.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack.
In general, that is a hard thing to know - there is no standard
way to enquire the size of the stack, how much you have already
used, how deep you are going to recurse, or how much stack
a function will use.
These may be points that you are looking at for your embedded work,
but the average programmer does not.
An example, Fortran-specific: Fortran 2018 made all procedures
recursive by default. This means that some Fortran codes will start
crashing because of stack overruns when this is implemented :-(
You don't allocate
anything on the heap without knowing the bounds and being sure it is
appropriate. There's no fundamental difference - it's just the cut-off
point that is different.
What would you recommend as a limit? (See fmax-stack-var-size=N
in gfortran).
On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:
David Brown <david.brown@hesbynett.no> schrieb:
On 16/01/2025 17:46, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Sure.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack.
In general, that is a hard thing to know - there is no standard
way to enquire the size of the stack, how much you have already
used, how deep you are going to recurse, or how much stack
a function will use.
Not standard compliant for sure, but you certainly can approximate
stack use in C: just store (as byte*) the address of a local in your
top level function, and check the (absolute value of) the difference
to the address of a local in the current function.
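(Editorial aside: a minimal sketch of that approximation. As George
says, it is not standard-compliant - comparing addresses of unrelated
objects is not defined - so treat it as a platform-specific diagnostic;
all names are invented.)

#include <stdint.h>
#include <stdio.h>

static uintptr_t stack_base;              /* address of a local in main() */

static size_t approx_stack_used(void)
{
    char marker;                          /* a local in the current frame */
    uintptr_t here = (uintptr_t)&marker;
    return stack_base > here ? stack_base - here : here - stack_base;
}

static void deep(int n)
{
    char pad[512];                        /* burn some stack per level */
    pad[0] = (char)n;
    if (n > 0)
        deep(n - 1);
    else
        printf("approx stack in use: %zu bytes\n", approx_stack_used());
    (void)pad[0];
}

int main(void)
{
    char base_marker;
    stack_base = (uintptr_t)&base_marker; /* record the top-level position */
    deep(10);
    return 0;
}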
On 1/16/2025 11:12 AM, Terje Mathisen wrote:
Stephen Fuld wrote:
On 1/16/2025 1:11 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
Rust provides an interesting data point here:
Thomas Koenig <tkoenig@netcologne.de> writes:
[...]
CHERI targets C, which on the one hand, I understand (there's a
ton of C code out there), but trying to retrofit a safe memory
model onto C seems a bit awkward - it might have been better to
target a language which has arrays in the first place, unlike C.
C does have arrays.
Sort of - they decay into pointers at first sight.
But what I should have written was "multi-dimensional arrays",
with a reasonable way of handling them.
It has Vec<> which is always implemented as a dope vector, i.e. a
header which contains the starting point and current length, along
with allocated size. For multidimensional work, the natural mapping
is Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
boundary checking.
However, in my own testing I have found that it is often faster to
flatten those multi-dim vectors, and instead use explicit
multiplication to get the actual position:
array[y][x] -> array[y*width + x]
Terje
I am obviously missing something, but why doesn't the compiler
generate code for that itself?
Because Rust really doesn't have multi-dim vectors, instead using
vectors of pointers to vectors?
OTOH, it is perfectly OK to create your own multi-dim data structure,
and using macros you can probably get the compiler to generate near-
optimal code as well, but afaik, nothing like that is part of the core
language.
That surprised me. So I did a search for "Rust Multi dimensional
arrays", and got several hits. It seems there are various ways to do
this depending upon whether you want an array of arrays or a
"traditional" multi-dimensional array. There is a crate for the latter.
I don't know enough Rust to get all the details in the various search results, but it seems there are options.
On Wed, 22 Jan 2025 1:30:47 +0000, George Neuner wrote:
On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:
David Brown <david.brown@hesbynett.no> schrieb:
On 16/01/2025 17:46, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Sure.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack.
In general, that is a hard thing to know - there is no standard
way to enquire the size of the stack, how much you have already
used, how deep you are going to recurse, or how much stack
a function will use.
Not standard compliant for sure, but you certainly can approximate
stack use in C: just store (as byte*) the address of a local in your
top level function, and check the (absolute value of) the difference
to the address of a local in the current function.
On a Linux machine, you can find the last envp[*] entry and subtract
SP from it.
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 1:30:47 +0000, George Neuner wrote:
On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:
David Brown <david.brown@hesbynett.no> schrieb:
On 16/01/2025 17:46, Waldek Hebisch wrote:
David Brown <david.brown@hesbynett.no> wrote:
On 16/01/2025 13:35, Michael S wrote:
On Thu, 16 Jan 2025 12:36:45 +0100
David Brown <david.brown@hesbynett.no> wrote:
On 15/01/2025 21:59, Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
As you can guess, in kernel drivers VLA are unwelcome.
I can imagine that they are - but I really don't understand why. I've
never understood why people think there is something "dangerous" about
VLAs, or why they think using heap allocations is somehow "safer".
VLA normally allocate on the stack. Which at first glance look
great. But once one realize how small are stacks in modern
systems (compared to whole memory), this no longer looks good.
Basically, to use VLA one needs rather small bound on maximal
size of array.
Sure.
Given such bound always allocating maximal
size is simpler. Without _small_ bound on size heap is
safer, as it is designed to handle also big allocations.
You don't allocate anything in a VLA without knowing the bounds and
being sure it is appropriate to put on the stack.
In general, that is a hard thing to know - there is no standard
way to enquire the size of the stack, how much you have already
used, how deep you are going to recurse, or how much stack
a function will use.
Not standard compliant for sure, but you certainly can approximate
stack use in C: just store (as byte*) the address of a local in your
top level function, and check the (absolute value of) the difference
to the address of a local in the current function.
On a Linux machine, you can find the last envp[*] entry and subtract
SP from it.
I would discourage programmers from relying on that for any reason whatsoever. The aux vectors are pushed before the envp entries.
Notice what I wrote above: Rust allows for compile-time code generation
in the form of macros, which are in some ways even more powerful than
C++ templates, so I'm not surprised to learn that there already exist
public crate(s) to handle this. :-)
On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
(MitchAlsup1)
On a Linux machine, you can find the last envp[*] entry and subtract
SP from it.
I would discourage programmers from relying on that for any reason
whatsoever. The aux vectors are pushed before the envp entries.
This brings into question what is "on" the stack ?? to be included
in the measurement of stack size.
Only user data ??
Data that is present when control arrives ??
Could <equivalent> CRT0 store SP at arrival ??
I think we have an ill-defined measurement !!
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
(MitchAlsup1)
On a Linux machine, you can find the last envp[*] entry and subtract
SP from it.
I would discourage programmers from relying on that for any reason
whatsoever. The aux vectors are pushed before the envp entries.
This brings into question what is "on" the stack ?? to be included
in the measurement of stack size.
Only user data ??
Data that is present when control arrives ??
Could <equivalent> CRT0 store SP at arrival ??
I think we have an ill-defined measurement !!
Everything between the base address of the stack
and the limit address of the stack. The kernel exec(2)
family system calls will allocate the initial
stack region (with guard pages to handle extension)
and populate it with the AUX, ENVP and ARG vectors
before invoking the CRT in usermode.
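On Linux, that initial region is also visible in /proc/self/maps under
the name "[stack]", so a Linux-specific way to read off the main
thread's current stack base and limit is, roughly:

#include <stdio.h>
#include <string.h>

/* Sketch: find the main-thread stack mapping on Linux by scanning
   /proc/self/maps for the "[stack]" entry. Linux-specific. */
int main(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) { perror("maps"); return 1; }

    char line[512];
    unsigned long lo = 0, hi = 0;
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "[stack]")) {
            sscanf(line, "%lx-%lx", &lo, &hi);
            break;
        }
    }
    fclose(f);

    if (hi)
        printf("stack region: %#lx - %#lx (%lu KiB mapped so far)\n",
               lo, hi, (hi - lo) / 1024);
    return 0;
}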
On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
(MitchAlsup1)
On a Linux machine, you can find the last envp[*] entry and subtract
SP from it.
I would discourage programmers from relying on that for any reason
whatsoever. The aux vectors are pushed before the envp entries.
This brings into question what is "on" the stack ?? to be included
in the measurement of stack size.
Only user data ??
Data that is present when control arrives ??
Could <equivalent> CRT0 store SP at arrival ??
I think we have an ill-defined measurement !!
Everything between the base address of the stack
and the limit address of the stack. The kernel exec(2)
family system calls will allocate the initial
stack region (with guard pages to handle extension)
and populate it with the AUX, ENVP and ARG vectors
before invoking the CRT in usermode.
So, how does one find the base (highest address on the stack) ??
in a way that works on every system capable of running C-code ??
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
(MitchAlsup1)
On a Linux machine, you can find the last envp[*] entry and
subtract SP from it.
I would discourage programmers from relying on that for any
reason whatsoever. The aux vectors are pushed before the envp
entries.
This brings into question what is "on" the stack ?? to be included
in the measurement of stack size.
Only user data ??
Data that is present when control arrives ??
Could <equivalent> CRT0 store SP at arrival ??
I think we have an ill-defined measurement !!
Everything between the base address of the stack
and the limit address of the stack. The kernel exec(2)
family system calls will allocate the initial
stack region (with guard pages to handle extension)
and populate it with the AUX, ENVP and ARG vectors
before invoking the CRT in usermode.
So, how does one find the base (highest address on the stack) ??
in a way that works on every system capable of running C-code ??
It's not something that a programmer generally would need, or want to
do.
However, if the OS they're using has a guard page to prevent
stack underflow, one could write a subroutine which accesses
page-aligned addresses towards the beginning of the stack
region (against the direction of growth) until a
SIGSEGV is delivered.
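A sketch of that probing approach on a POSIX system (it assumes a
downward-growing stack and that nothing readable is mapped directly
above it, and it uses siglongjmp to get back out of the handler rather
than returning, as discussed further down the thread):

#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static sigjmp_buf probe_env;

static void segv_handler(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);         /* never return normally from SIGSEGV */
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = segv_handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);     /* some systems report the fault as SIGBUS */

    long page = sysconf(_SC_PAGESIZE);
    volatile char *probe =
        (volatile char *)((uintptr_t)&sa & ~((uintptr_t)page - 1));
    char *highest = (char *)probe;

    if (sigsetjmp(probe_env, 1) == 0) {
        for (;;) {                    /* walk page by page towards the stack base */
            (void)*probe;             /* faults once we leave the stack mapping */
            highest = (char *)probe;
            probe += page;
        }
    }
    printf("highest readable stack page starts at %p\n", (void *)highest);
    return 0;
}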
On Wed, 22 Jan 2025 22:44:45 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
(MitchAlsup1)
On a Linux machine, you can find the last envp[*] entry and
subtract SP from it.
I would discourage programmers from relying on that for any
reason whatsoever. The aux vectors are pushed before the envp
entries.
This brings into question what is "on" the stack ?? to be included
in the measurement of stack size.
Only user data ??
Data that is present when control arrives ??
Could <equivalent> CRT0 store SP at arrival ??
I think we have an ill-defined measurement !!
Everything between the base address of the stack
and the limit address of the stack. The kernel exec(2)
family system calls will allocate the initial
stack region (with guard pages to handle extension)
and populate it with the AUX, ENVP and ARG vectors
before invoking the CRT in usermode.
So, how does one find the base (highest address on the stack) ??
in a way that works on every system capable of running C-code ??
It's not something that a programmer generally would need, or want to
do.
However, if the OS they're using has a guard page to prevent
stack underflow, one could write a subroutine which accesses
page-aligned addresses towards the beginning of the stack
region (against the direction of growth) until a
SIGSEGV is delivered.
Not every system capable of running C supports signals. I would
think that those that support signals are not even majority.
Of those that do support signals, not every one supports catching
SIGSEGV.
Of those that do support catching SIGSEGV, not every one can recover
after that.
Michael S <already5chosen@yahoo.com> writes:
On Wed, 22 Jan 2025 22:44:45 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
(MitchAlsup1)
On a Linux machine, you can find the last envp[*] entry and
subtract SP from it.
I would discourage programmers from relying on that for any
reason whatsoever. The aux vectors are pushed before the
envp entries.
This brings into question what is "on" the stack ?? to be
included in the measurement of stack size.
Only user data ??
Data that is present when control arrives ??
Could <equivalent> CRT0 store SP at arrival ??
I think we have an ill-defined measurement !!
Everything between the base address of the stack
and the limit address of the stack. The kernel exec(2)
family system calls will allocate the initial
stack region (with guard pages to handle extension)
and populate it with the AUX, ENVP and ARG vectors
before invoking the CRT in usermode.
So, how does one find the base (highest address on the stack) ??
in a way that works on every system capable of running C-code ??
It's not something that a programmer generally would need, or want
to do.
However, if the OS they're using has a guard page to prevent
stack underflow, one could write a subroutine which accesses
page-aligned addresses towards the beginning of the stack
region (against the direction of growth) until a
SIGSEGV is delivered.
Not every system capable of running C supports signals. I would
think that those that support signals are not even majority.
Linux and Mac can't touch Windows in terms of volume - but I'd
argue that in the universe of programmers, they're close to,
if not equal to, Windows. The vast majority of Windows
users don't have a compiler. Those that do are working at
a higher level where knowledge of the stack base address
would not be a useful thing to have.
Unix (BSD/SysV) and Linux support the ucontext argument
on the signal handler which provides processor state so
the signal handler can recover from the fault in whatever
fashion makes sense then transfer control to a known
starting point (either siglongjmp or by manipulating the
return context provided to the signal handler). This is
clearly going to be processor and implementation specific.
Yes, Windows is an aberration. I offered a solution, not
"the" solution. I haven't seen any valid reason for a program[*]
to need to know the base address of the process stack; if there
were a need, the implementation would provide. I believe windows
does have a functional equivalent to SIGSEGV, no? A quick search
shows "EXCEPTION_ACCESS_VIOLATION" for windows.
[*] Leaving aside the rare system utility or diagnostic
utility or library (e.g. valgrind, et alia may find
that a useful datum).
Michael S <already5chosen@yahoo.com> writes:
Not every system capable of running C supports signals. I would
think that those that support signals are not even majority.
"man raise" tells me that raise() is C99. "man signal" tells me that signal() is C99.
Of those that do support signals, not every one supports catching
SIGSEGV.
"man 7 signal" tells me that SIGSEGV is P1990, i.e., 'the original POSIX.1-1990 standard'. I.e., there were even some Windows systems
that support it.
Of those that do support catching SIGSEGV, not every one can recover
after that.
Gforth catches and recovers from SIGSEGV in order to return to
Gforth's command line rather than terminating the process; in
snapshots from recent years that's also used for determining whether
some number is probably an address (try to read from that address; if
there's a signal, it's not an address). I tried building Gforth on a
number of Unix systems, and even the most rudimentary ones (e.g.,
Ultrix) supported catching SIGSEGV.
There is a port to Windows with
Cygwin done by Bernd Paysan. I don't know if that could catch
SIGSEGV, but I am sure that it's possible in Windows in some way, even
if that way is not available through Cygwin.
- anton
From cppreference: https://en.cppreference.com/w/c/program/signal
"If the user defined function returns when handling SIGFPE, SIGILL or
SIGSEGV, the behavior is undefined."
Michael S <already5chosen@yahoo.com> writes:
From cppreference: https://en.cppreference.com/w/c/program/signal
"If the user defined function returns when handling SIGFPE, SIGILL or
SIGSEGV, the behavior is undefined."
As is almost everything else occurring in production code. So such
references are not particularly relevant for production code; what
actual (in this case) operating system kernels and libraries do is
relevant.
And my experience from three decades across a wide variety of Unix
systems on a wide variety of hardware is that what we do in our
SIGSEGV handler works. But our signal handlers don't return, they
longjmp() (in the cases that do not terminate the process).
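The usual shape of that pattern, as a sketch (not Gforth's actual
code): install an alternate signal stack so that even a stack-overflow
SIGSEGV can be handled, and siglongjmp back to a known point instead of
returning:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf recover_point;

static void segv_handler(int sig)
{
    (void)sig;
    siglongjmp(recover_point, 1);     /* recover by longjmp, never by returning */
}

int main(void)
{
    /* A fixed-size alternate stack (SIGSTKSZ is no longer a constant
       expression on recent glibc, hence the hard-coded size). */
    static char altstack[64 * 1024];
    stack_t ss;
    ss.ss_sp = altstack;
    ss.ss_size = sizeof altstack;
    ss.ss_flags = 0;
    sigaltstack(&ss, NULL);

    struct sigaction sa;
    sa.sa_handler = segv_handler;
    sa.sa_flags = SA_ONSTACK;         /* run the handler on the alternate stack */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    if (sigsetjmp(recover_point, 1) == 0) {
        *(volatile int *)0 = 42;      /* deliberately fault */
        puts("not reached");
    } else {
        puts("recovered from SIGSEGV, back at the 'command line'");
    }
    return 0;
}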
On Thu, 23 Jan 2025 08:14:52 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
Not every system capable of running C supports signals. I would
think that those that support signals are not even majority.
"man raise" tells me that raise() is C99. "man signal" tells me that
signal() is C99.
I would guess that it belongs to the part of the standard that defines
requirements for a hosted implementation. My use of C for "real work"
(as opposed to hobby) is almost exclusively in freestanding
implementations.
Even for hosted implementations, the signal handler is only guaranteed
to be invoked when the signal is raised by raise(). That is not our
case here.
On Thu, 23 Jan 2025 01:00:49 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 22 Jan 2025 22:44:45 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
(MitchAlsup1)
On a Linux machine, you can find the last envp[*] entry and
subtract SP from it.
I would discourage programmers from relying on that for any
reason whatsoever. The aux vectors are pushed before the
envp entries.
This brings into question what is "on" the stack ?? to be
included in the measurement of stack size.
Only user data ??
Data that is present when control arrives ??
Could <equivalent> CRT0 store SP at arrival ??
I think we have an ill-defined measurement !!
Everything between the base address of the stack
and the limit address of the stack. The kernel exec(2)
family system calls will allocate the initial
stack region (with guard pages to handle extension)
and populate it with the AUX, ENVP and ARG vectors
before invoking the CRT in usermode.
So, how does one find the base (highest address on the stack) ??
in a way that works on every system capable of running C-code ??
It's not something that a programmer generally would need, or
want to do.
However, if the OS they're using has a guard page to prevent
stack underflow, one could write a subroutine which accesses
page-aligned addresses towards the beginning of the stack
region (against the direction of growth) until a
SIGSEGV is delivered.
Not every system capable of running C supports signals. I would
think that those that support signals are not even majority.
Linux and Mac can't touch Windows in terms of volume - but I'd
argue that in the universe of programmers, they're close to,
if not equal to, Windows. The vast majority of Windows
users don't have a compiler. Those that do are working at
a higher level where knowledge of the stack base address
would not be a useful thing to have.
I did not have "big" computers in mind. In fact, if we only look at
"big" things then Android dwarfs anything else. And while Android is
not POSIX compliant, it is probably similar enough for your method to
work.
I had in mind smaller things.
All but one of the very many embedded environments that I have touched
in the last 3 decades had no signals. The exceptional one was running
Linux.
Unix (BSD/SysV) and Linux support the ucontext argument
on the signal handler which provides processor state so
the signal handler can recover from the fault in whatever
fashion makes sense then transfer control to a known
starting point (either siglongjmp or by manipulating the
return context provided to the signal handler). This is
clearly going to be processor and implementation specific.
Yes, Windows is an aberration. I offered a solution, not
"the" solution. I haven't seen any valid reason for a program[*]
to need to know the base address of the process stack; if there
were a need, the implementation would provide. I believe windows
does have a functional equivalent to SIGSEGV, no? A quick search
shows "EXCEPTION_ACCESS_VIOLATION" for windows.
But then one would have to use SEH, which is not the same as signals.
Although the specific case of SIGSEGV is one where SEH and
signals happen to be rather similar.
I can try it one day, but not today.
[*] Leaving aside the rare system utility or diagnostic
utility or library (e.g. valgrind, et alia may find
that a useful datum).
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
(MitchAlsup1)
On a Linux machine, you can find the last envp[*] entry and subtract
SP from it.
I would discourage programmers from relying on that for any reason
whatsoever. The aux vectors are pushed before the envp entries.
This brings into question what is "on" the stack ?? to be included
in the measurement of stack size.
Only user data ??
Data that is present when control arrives ??
Could <equivalent> CRT0 store SP at arrival ??
I think we have an ill-defined measurement !!
Everything between the base address of the stack
and the limit address of the stack. The kernel exec(2)
family system calls will allocate the initial
stack region (with guard pages to handle extension)
and populate it with the AUX, ENVP and ARG vectors
before invoking the CRT in usermode.
So, how does one find the base (highest address on the stack) ??
in a way that works on every system capable of running C-code ??
It's not something that a programmer generally would need, or want to
do.
However, if the OS they're using has a guard page to prevent
stack underflow, one could write a subroutine which accesses
page-aligned addresses towards the beginning of the stack
region (against the direction of growth) until a
SIGSEGV is delivered.
On Wed, 22 Jan 2025 22:44:45 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:
So, how does one find the base (highest address on the stack) ??
in a way that works on every system capable of running C-code ??
It's not something that a programmer generally would need, or want to
do.
https://plover.com/~mjd/misc/hbaker-archive/CheneyMTA.html
or any problem requiring potentially unbounded recursion.
In the end, I could not resist, and did it today, wasting an hour
and a half during which I was supposed to do real work.
With Microsoft's language extensions it was trivial.
But I don't know how to do it (on Windows) with gcc.
Code:
#include <stdio.h>

/* Each call stores the address of its own stack frame into *res,
   then recurses until the stack overflows. */
static void test(char** res, int depth) {
    *res = (char*)&res;
    if (depth > 0)
        test(res, depth-1);
}

int main()
{
    char* res = (char*)&res;
    __try {                       /* MSVC structured exception handling */
        test(&res, 1000000);
    }
    __except(1) {                 /* 1 == EXCEPTION_EXECUTE_HANDLER */
        printf("SEH __except block\n");
    }
    /* res holds the deepest frame address reached before the overflow,
       so the difference approximates the usable stack size. */
    printf("%p - %p = %zd\n", &res, res, (char*)&res - res);
    return 0;
}
It prints:
SEH __except block
000000000029F990 - 00000000001A4020 = 1030512
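For completeness, Windows 8 and later also have a documented call that
reports both ends of the current thread's stack without provoking a
fault; it should be usable from gcc/MinGW as well (possibly after
defining _WIN32_WINNT as 0x0602 or newer). A sketch:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG_PTR low, high;
    /* Reports the reserved stack region of the current thread. */
    GetCurrentThreadStackLimits(&low, &high);
    printf("stack region: %p - %p (%lu KiB reserved)\n",
           (void *)low, (void *)high,
           (unsigned long)((high - low) / 1024));
    return 0;
}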
On Thu, 23 Jan 2025 08:14:52 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Michael S <already5chosen@yahoo.com> writes:
Not every system capable of running C supports signals. I would
think that those that support signals are not even majority.
"man raise" tells me that raise() is C99. "man signal" tells me
that signal() is C99.
I would guess that it belongs to the part of the standard that
defines requirements for a hosted implementation. [...]