• Re: Reduced compared to what, Concertina II Progress

    From John Levine@21:1/5 to All on Fri Dec 8 19:35:43 2023
    According to Scott Lurndal <slp53@pacbell.net>:
    David Brown <david.brown@hesbynett.no> writes:
    I've always considered RISC to be parsed as (RI)SC rather than R(ISC) -
    it meant small and simple instructions, rather than a small number of different instructions. The idea was that instructions should, on the whole, be single-cycle and implemented directly in the hardware, rather than multi-cycle using sequencers or microcode. ...

    Surely then, the PDP-8 can be counted as a RISC processor. There are
    only 8 instructions defined by a 3-bit opcode, and due to the
    instruction encoding, a single operate instruction can perform multiple (sequential) operations.

    The PDP-8 was a PDP-5 reimplemented with newer transistors. The 12-bit
    PDP-5 was a cut-down version of the 18-bit PDP-4, which was later reimplemented as the PDP-7, -9, and somewhat fancier -15. The PDP-4 was
    a redesign of the PDP-1 to make it a lot simpler and cheaper and
    modestly slower.

    All of them were single-accumulator, single-address machines,
    single-cycle everything, where the cycles were based on the memory speed. The
    PDP-8 did one cycle to fetch and decode an instruction, a second cycle
    to fetch the indirect address if it was a memory reference and the
    indirect bit was set, and a third cycle for memory refs to fetch or
    store the operand. I think that I/O instructions might sometimes have
    been a little slower to allow for the time it took signals to
    propagate on the I/O bus. All of the others had the same general
    design. The -15, the last in the line, was tarted up with an index
    register but by that time it was clear that the PDP-11 was the winner.
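
    [To make that cycle accounting concrete, here is a rough sketch of my own
    (an illustration, not DEC documentation or a real PDP-8 model) of how the
    per-instruction memory-cycle count falls out of the description above:]

    /* Rough illustration of the cycle counts described above; not a PDP-8 model.
     * One memory cycle to fetch and decode, one more to chase an indirect
     * address, and one more to fetch or store the operand of a memory reference. */
    #include <stdio.h>

    struct insn { int memref; int indirect; };

    static int memory_cycles(struct insn i)
    {
        int cycles = 1;                         /* fetch and decode */
        if (i.memref && i.indirect) cycles++;   /* fetch the indirect address */
        if (i.memref) cycles++;                 /* fetch or store the operand */
        return cycles;
    }

    int main(void)
    {
        printf("operate (OPR):        %d cycle(s)\n", memory_cycles((struct insn){0, 0}));
        printf("memory ref, direct:   %d cycle(s)\n", memory_cycles((struct insn){1, 0}));
        printf("memory ref, indirect: %d cycle(s)\n", memory_cycles((struct insn){1, 1}));
        return 0;
    }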

    I suppose you could say RISC but they weren't really reduced from
    anything, they were born simple. The PDP-8 did a fantastic job of
    hitting a sweet spot that could be implemented cheaply using late
    1960s and 1970s technology while still being capable enough to do
    significant work.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Fri Dec 8 19:44:43 2023
    According to David Brown <david.brown@hesbynett.no>:
    By my logic (such as it is - I don't claim it is in any sense
    "correct"), the PDP-8 would definitely be /CISC/. It only has a few >instructions, but that is irrelevant (that was my point) - the
    instructions are complex, and therefore it is CISC.

    Having actually programmed a PDP-8 I find this assertion hard to
    understand. It's true, it had instructions that both fetched an
    operand and did something with it, but with only one register what
    else were they going to do?

    It was very RISC-like in that you used a sequence of simple
    instructions to do what would be one instruction on more complex machines. For example, you got the effect of a load by clearing the
    register (CLA) and then adding the memory word. To do subtraction,
    clear, add the second operand, negate, add the first operand, maybe
    negate again depending on whether you wanted A-B or B-A. We all knew a
    long list of these idioms.
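
    [For readers who never touched a PDP-8, a small illustrative sketch of
    the two idioms just described - my own toy 12-bit accumulator in C, using
    the usual mnemonics as helper names, not period code:]

    /* Toy model of the idioms above: "load" as clear-then-add, and A-B as
     * clear, add B, negate, add A.  A 12-bit accumulator, nothing more. */
    #include <stdio.h>

    static unsigned ac;                                              /* the accumulator */

    static void cla(void)       { ac = 0; }                          /* clear AC */
    static void tad(unsigned m) { ac = (ac + m) & 07777; }           /* two's complement add */
    static void cia(void)       { ac = (07777 - ac + 1) & 07777; }   /* negate (complement, then increment) */

    int main(void)
    {
        unsigned A = 00123, B = 00045;

        cla(); tad(A);                         /* effect of "load A" */
        printf("load A gives %04o\n", ac);

        cla(); tad(B); cia(); tad(A);          /* A - B */
        printf("A - B  gives %04o\n", ac);     /* 0123 - 0045 = 0056 */
        return 0;
    }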

    There was a microcontroller that we once considered for a project, which
    had only a single instruction - "move". We ended up with a different
    chip, so I never got to play with it in practice.

    I saw some of those, and the one-instruction thing was a conceit. They
    all had plenty of instructions, just with the details in the operand
    specifiers rather than the op code.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From David Brown@21:1/5 to John Levine on Fri Dec 8 21:14:05 2023
    On 08/12/2023 20:44, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    By my logic (such as it is - I don't claim it is in any sense
    "correct"), the PDP-8 would definitely be /CISC/. It only has a few
    instructions, but that is irrelevant (that was my point) - the
    instructions are complex, and therefore it is CISC.

    Having actually programmed a PDP-8 I find this assertion hard to
    understand. It's true, it had instructions that both fetched an
    operand and did something with it, but with only one register what
    else were they going to do?

    I have never used one, and don't know about the PDP-8 in any kind of
    detail. I am just responding to Scott's post. I wrote that my
    understanding of "RISC" was it meant simpler instructions, not fewer instructions, and then Scott suggested that meant the PDP-8 was "RISC"
    because it had few instructions. I have no idea if the PDP-8 is/was
    generally considered "RISC", but I /do/ know that Scott appeared to have
    got my post completely backwards.

    Based solely on the information Scott gave, however, I would suggest
    that the "OPR" instruction - "microcoded operation" - and the "IOT"
    operation would mean it certainly was not RISC. (This is true even if
    it has other attributes commonly associated with RISC architectures.)


    It was very RISC-like in that you used a sequence of simple
    instructions to do what would be one instruction on more complex machines. For example, you got the effect of a load by clearing the
    register (CLA) and then adding the memory word. To do subtraction,
    clear, add the second operand, negate, add the first operand, maybe
    negate again depending on whether you wanted A-B or B-A. We all knew a
    long list of these idioms.

    There was a microcontroller that we once considered for a project, which
    had only a single instruction - "move". We ended up with a different
    chip, so I never got to play with it in practice.

    I saw some of those, and the one-instruction thing was a conceit. They
    all had plenty of instructions, just with the details in the operand specifiers rather than the op code.


    The one I am thinking of was the MAXQ. No, it is/was not a conceit - it
    was a real transfer-triggered architecture.

  • From John Levine@21:1/5 to david.brown@hesbynett.no on Fri Dec 8 22:16:26 2023
    It appears that David Brown <david.brown@hesbynett.no> said:
    Based solely on the information Scott gave, however, I would suggest
    that the "OPR" instruction - "microcoded operation" - and the "IOT"
    operation would mean it certainly was not RISC. (This is true even if
    it has other attributes commonly associated with RISC architectures.)

    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The OPR opcode had a bunch of bits that did various things to the
    accumulator and the link (approximately the carry bit.) It wasn't
    microcoded in the modern sense, it was that you could set more than
    one bit to get more than one operation: e.g., octal 7040 complemented
    the accumulator and 7001 incremented it, so 7041 negated it.
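
    [A minimal sketch of that bit-combining, purely for illustration; the
    defines below are the standard group-1 OPR microinstruction bits, with
    the octal 7000 opcode bits left out:]

    /* Group-1 OPR as a bundle of independently settable bits: 7041 is just
     * 7040 (complement) plus 7001 (increment), i.e. negate.  Sketch only. */
    #include <stdio.h>

    #define CLA 00200   /* clear the accumulator */
    #define CMA 00040   /* complement the accumulator */
    #define IAC 00001   /* increment the accumulator */

    static unsigned opr_group1(unsigned ac, unsigned bits)
    {
        if (bits & CLA) ac = 0;
        if (bits & CMA) ac = ~ac & 07777;        /* ones' complement, 12 bits */
        if (bits & IAC) ac = (ac + 1) & 07777;   /* then increment */
        return ac;
    }

    int main(void)
    {
        /* CMA | IAC is the microbit part of octal 7041: two operations, one word */
        printf("negate 1234: %04o\n", opr_group1(01234, CMA | IAC));
        return 0;
    }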

    The IOT instructions were extremely simple: a six-bit device address
    field that it sent out on the I/O bus, and three low bits that sent
    pulses out on three control lines. The devices did whatever they did:
    read the contents of the accumulator, send back a value to put in it,
    or tell the CPU to skip the next instruction, which was how you tested
    a flag.
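
    [Again as an illustration only, splitting the fields exactly as described
    above; 06031 is, if memory serves, the classic "skip on keyboard flag":]

    /* Splitting a 12-bit IOT word into device address and pulse bits,
     * per the description above.  Illustration, not a device model. */
    #include <stdio.h>

    int main(void)
    {
        unsigned iot = 06031;                  /* opcode 6, device 03, pulse 1 */
        unsigned device = (iot >> 3) & 077;    /* six-bit device address on the I/O bus */
        unsigned pulses = iot & 07;            /* three control-line pulses */

        printf("device %02o, pulses %o\n", device, pulses);
        return 0;
    }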

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From John Levine@21:1/5 to All on Sat Dec 9 02:42:43 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    The PDP-8 certainly is simple, and it does not have many instructions,
    but it certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    That's a reasonable set of criteria but I think the last one is the
    most important. The IBM 801 ruthlessly took everything out of the
    hardware that could be done as fast in software, and the rest of the
    design followed from that.

    Since then RISC-y designs have incorporated virtual memory and
    floating point even though the 801 didn't, because they aren't things
    they tried to make the 801 do, and they turn out to be a lot faster
    with hardware support.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Anton Ertl@21:1/5 to David Brown on Sat Dec 9 08:33:14 2023
    David Brown <david.brown@hesbynett.no> writes:
    On 08/12/2023 16:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    (I'm snipping because I pretty much agree with the rest of what you wrote.)


    Is Coldfire a load/store architecture? If not, it's not a RISC.


    I agree that there's a fairly clear boundary between a "load/store architecture" and a "non-load/store architecture". And I agree that it
    is usually a more important distinction than the number of instructions,
    or the complexity of the instructions, or any other distinctions.

    Not sure what you mean by "the complexity of the instructions".

    But does that mean LSA vs. NLSA should be used to /define/ RISC vs CISC?

    Skimming through Patterson and Ditzel's 1980 paper "The Case for the
    Reduced Instruction Set Computer", I fail to see a definition of RISC
    (which may be one of the reasons for it being particularly easy to
    claim RISCness for everything under the sky). In his 1985 paper
    "Reduced instruction set computers" <https://dl.acm.org/doi/pdf/10.1145/2465.214917>, Patterson mentioned
    the 801, Berkeley RISC-I and RISC-II, and Stanford MIPS as actual RISC machines, and saw the following common characteristics among them:

    |1. Operations are register-to-register, with only LOAD and STORE
    | accessing memory. [...]
    |
    |2. The operations and addressing modes are reduced. Operations
    | between registers complete in one cycle, permitting a simpler,
    | hardwired control for each RISC, instead of
    | microcode. Multiple-cycle instructions such as floating-point
    | arithmetic are either executed in software or in a special-purpose
    | coprocessor. (Without a coprocessor, RISCs have mediocre
    | floating-point performance.) Only two simple addressing modes,
    | indexed and PC-relative, are provided. More complicated addressing
    | modes can be synthesized from the simple ones.
    |
    |3. Instruction formats are simple and do not cross word boundaries. [...]
    |
    |4. RISC branches avoid pipeline penalties. [... delayed branches ...]

    One commonality that Patterson fails to mention is that these are all
    register machines (machines where most instructions working with
    addresses or scalar integers work on general-purpose registers) with
    16 or more general-purpose registers. He probably did not mention
    this because the VAX and the S/360 are also register machines. Let's
    call this commonality "5.".

    Looking at these criteria:

    2. is quite implementation-oriented. The commercial MIPS R2000 and
    R3000 (descendant of Stanford MIPS) have the FPU in a coprocessor,
    but the R4000 already has it on the same chip as the integer unit;
    is the R4000 not a RISC? Likewise for SPARC (descendant of
    Berkeley RISC) and the RSC implementation of Power (the descendant
    of the 801). And commercial MIPS already has multiply and divide
    instructions that do not complete in one cycle (and they involve
    special-purpose registers).

    RISC-V (with involvement from Patterson) has the M extension which
    includes integer multiply and divide instructions which typically
    take more than one cycle. Is RISC-V with the M extension not a
    RISC?

    Where 2. is ISA-oriented (in the addressing modes), ARM (A32), HPPA,
    and the 88000 provided more addressing modes. Are they not RISC? I
    think they have enough in common with these three RISCs that they
    can be considered RISC, and that this commonality between the three
    research RISCs is not a relevant criterion.

    3. The ROMP (the first commercial offspring of the 801) has mixed
    16-bit and 32-bit instructions, and the 32-bit instructions can
    cross word boundaries. This has later been adopted by the ARM
    Thumb2 (T32) instruction set (after experiments with the
    16-bit-only Thumb ISA), microMIPS (after the 16-bit only MIPS16e)
    and in the RISC-V C extension. Are these all not RISCs? I think
    they are RISCs, so that criterion has to be relaxed to include
    instruction sets with two instruction sizes with a factor of two
    between them. Interestingly, ARM A64 satisfies 3 in unrelaxed
    form.

    4. The delayed branches are an example of an implementation-oriented
    instruction set feature. The ARM architects had the wisdom to
    avoid it from the start, and early RISC architectures that have it
    found it to be a burden after a few years. The 88000 (1988)
    already had both delayed and nondelayed branches, Power (1990) and
    Alpha (1992) do not have delayed branches. Are they not RISCs?

    So the only commonalities that stood the test of time are 1. and 5.

    And looking at ARM A64 and RISC-V (say, RV64GC), we see two recent architectures for general-purpose computers that mostly satisfy these
    criteria. Interestingly, some RISC-V advocates (not sure if it is
    Patterson) now use ARM A64 as a counterexample to the RISC idea like
    Patterson used the VAX in the 1980s, but I have not heard hard
    criteria for that from them.

    And while AMD64 includes load-and-op and RMW instructions, it mostly
    just uses one memory location. The number of GPRs has been raised
    with AMD64 to 16, and APX will raise it to 32, so they took lessons
    from RISC principles.

    Things have changed a lot since the term "RISC" was first coined, and
    maybe architectural and ISA features are so mixed that the terms "RISC"
    and "CISC" have lost any real meaning.

    Architecture is ISA. Architectural features and ISA features are the
    same thing.

    Sure, many people have tried to put the "RISC" label on many things,
    and if you accept that, it really has no meaning; and actually, it did
    not have a meaning in the 1980 paper, and was only given meaning by
    looking at the commonalities of the 3-4 prototypes in 1985; and with
    hindsight, we see that only commonality 1 (and the unmentioned
    commonality 5) has stood the test of time.

    But we see that ARM A64 and RISC-V actually satisfy 1 and 5, more than
    30 years after the early research RISCs, so these criteria provide
    some benefits even in the very different implementation world of the
    2010s.

    ARM A64 also satisfies criterion 3, so that apparently has a benefit,
    too, and RISC-V C satisfies the relaxed version of 3, while AMD64 (and
    VAX and 68000) do not.

    So I claim that an architecture that satisfies criteria 1, 5, and the
    relaxed 3 is a RISC.

    As for "CISC", that term really has no meaning other than "not RISC".
    I am not aware of a proper definition, nor an enumeration of
    architectures that are labeled as CISCs vs. non-CISCs. Basically, we
    know that Patterson considered the VAX to be a CISC. Otherwise, the
    term has been often used as "non-RISC".

    If that's the case, then we
    should simply talk about LSA and NLSA architectures, and stop using
    "RISC" and "CISC".

    As we see above, RISC is still a little more than just load/store
    architecture.

    I don't think trying to redefine "RISC" to mean
    something different from its original purpose helps.

    That would mean that "RISC" has an original definition; what is it?

    As for the purpose, the purpose is still an architecture that is a
    good and stable interface to software, and can be implemented
    efficiently for a wide range of performance targets over a long time.
    ARM A64 seems to do quite well in that respect (although they leave
    the real low end to A32/T32, where they do very well), RISC-V
    currently only covers the low end of the market while the
    not-quite-RISC AMD64 only covers the high end. RISC criteria 1, 5,
    and relaxed 3 seem to work quite well after four decades, while
    criteria 2 and 4 went by the wayside quite soon.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Quadibloc@21:1/5 to Anton Ertl on Sat Dec 9 12:35:46 2023
    On Sat, 09 Dec 2023 08:33:14 +0000, Anton Ertl wrote:

    That would mean that "RISC" has an original definition; what is it?

    It certainly does make sense to say that the _description_ of
    characteristics of existing RISC implementations in the Patterson
    and Ditzel paper which you cited doesn't constitute an _original_
    definition of RISC.

    However, Patterson also wrote a popular article for _Scientific
    American_ about one of the early RISC processors with which he was
    connected, and he mentioned most or all of those characteristics
    there, including not having hardware floating-point, because it
    took more than one cycle to execute, and, although my memory may
    be wrong, it seems to me that in that article he did make the leap
    to treating that as a definition rather than just an observation.

    Whether or not that is true, it is indeed something like the list
    from Patterson and Ditzel that is being taken as the "definition"
    of RISC by those who say that what passes for RISC these days is
    such as to deprive the term of meaning.

    Obviously, not having hardware floating-point for the sake of
    RISC purism is such a stupid idea that basically no one does that
    any more. Since, however, there are still many current designs
    that incorporate most of the _other_ characteristics of RISC, it
    would be inappropriate to draw the conclusion from this that RISC
    is now a dead and obsolete concept.

    On the other hand, a lot of RISC architectures - all the instructions
    are 32 bits long, the register banks have at least 32 registers in
    them, the architecture is load-store - currently have OoO
    implementations. Like having hardware floating-point, this is done
    to get the best possible speed given the much larger number of transistors
    we can put on a die these days.

    Unlike allowing hardware floating-point, though, I think this
    change strikes directly at the _raison d'être_ of RISC itself.

    If RISC exists because it's designed around making pipelining
    fast and efficient, once you've got an OoO implementation, of
    what benefit is RISC? Maybe the OoO circuitry doesn't have to
    work so hard, or OoO plus 32 registers can delay register
    hazards even longer than OoO plus 8 registers. I don't find this
    so implausible as to dismiss RISC as being now nothing more than
    a marketing gimmick.

    Of course, there's CISC-that's-almost-as-good-as-RISC (IBM's
    z/Architecture, ColdFire, maybe even Mitch's MY 66000) and
    there's CISC than which _anything_ else would be better
    (such as the world's most popular ISA, devised by a company
    that made MOS memories, and then branched out into making
    some of the world's first single-chip microprocessors)...

    Given that situation, where "good CISC" is relatively
    minor in its market presence compared to bad, bad, very
    bad CISC, some architecture designers have chosen to
    incorporate as much of Patterson's original description,
    if not definition, of RISC into their designs as is
    practical in order to distance themselves more convincingly
    from x86 and x86-64.

    In designing Concertina II, which might well be described
    as a half-breed architecture from Hell that hasn't made
    up its mind whether to be RISC, CISC, or VLIW, even I have
    been affected by that concern.

    John Savard

  • From John Dallman@21:1/5 to Quadibloc on Sat Dec 9 13:03:00 2023
    In article <ul1mv2$26n3i$1@dont-email.me>, quadibloc@servername.invalid (Quadibloc) wrote:

    On the other hand, a lot of RISC architectures - all the
    instructions are 32 bits long, the register banks have at
    least 32 registers in them, the architecture is load-store
    - currently have OoO implementations. Like having hardware
    floating-point, this is done to get the best possible speed
    given the much larger number of transistors we can put on a
    die these days.

    Unlike allowing hardware floating-point, though, I think this
    change strikes directly at the _raison d'être_ of RISC itself.

    If RISC exists because it's designed around making pipelining
    fast and efficient . . .

    Simple pipelining can't deliver leading-edge performance these days (and
    hasn't for quite a while). And once you start having multiple pipelines, you
    start encountering interactions between them, and OoO becomes a more
    attractive method.

    "RISC" as a principle of /implementation/ has become obsolete.

    Load-store architectures and large register sets are still a method of
    coping with the slowness of accessing memory, even with good caches.

    Instructions with simple semantics and few side-effects can make OoO
    systems easier to implement. Making it easy to decode lots of
    instructions in parallel makes it possible to keep a larger number of instructions in the pool and thus find more re-ordering opportunities.

    "RISC" as an ideology was a product of its time. All the architectures
    designed that way have died out of commercial use.

    It's an interesting question if one should even try to design an
    architecture for a very long life, given that one can't anticipate how
    the implementation technologies will change over decades. An architecture
    that will obviously become obsolete within a decade probably isn't worth
    the start-up costs, but after that, who knows?

    John

  • From MitchAlsup@21:1/5 to Anton Ertl on Sat Dec 9 18:11:05 2023
    Anton Ertl wrote:



    I don't think trying to redefine "RISC" to mean
    something different from its original purpose helps.

    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    RISC was defined before CISC was coined as its contrapositive.

    - anton

  • From MitchAlsup@21:1/5 to Quadibloc on Sat Dec 9 18:20:22 2023
    Quadibloc wrote:



    Of course, there's CISC-that's-almost-as-good-as-RISC (IBM's
    z/Architecture, ColdFire, maybe even Mitch's MY 66000) and

    My 66000 is RISC in a purer form than ARM or RISC-V.

    there's CISC than which _anything_ else would be better
    (such as the world's most popular ISA, devised by a company
    that made MOS memories, and then branched out into making
    some of the world's first single-chip microprocessors)...

    Given that situation, where "good CISC" is relatively
    minor in its market presence compared to bad, bad, very
    bad CISC, some architecture designers have chosen to
    incorporate as much of Patterson's original description,
    if not definition, of RISC into their designs as is
    practical in order to distance themselves more convincingly
    from x86 and x86-64.

    Even x86 designers use RISC to distance themselves from CISC.

    In designing Concertina II, which might well be described
    as a half-breed architecture from Hell that hasn't made
    up its mind whether to be RISC, CISC, or VLIW, even I have
    been affected by that concern.

    The only thing in My 66000 that is not among the 7 Tenets of RISC
    is the attachment of constants as replacements for registers.

    John Savard

  • From Anton Ertl@21:1/5 to Quadibloc on Sat Dec 9 17:27:02 2023
    Quadibloc <quadibloc@servername.invalid> writes:
    However, Patterson also wrote a popular article for _Scientific
    American_ about one of the early RISC processors with which he was
    connected,

    A short web search on that came up empty. Do you have a reference?

    On the other hand, a lot of RISC architectures - all the instructions
    are 32 bits long, the register banks have at least 32 registers in
    them, the architecture is load-store - currently have OoO
    implementations. Like having hardware floating-point, this is done
    to get the best possible speed given the much larger number of transistors
    we can put on a die these days.

    Unlike allowing hardware floating-point, though, I think this
    change strikes directly at the _raison d'être_ of RISC itself.

    That's the interesting thing. It doesn't.

    If RISC exists because it's designed around making pipelining
    fast and efficient, once you've got an OoO implementation, of
    what benefit is RISC?

    It's still simpler. With a load-and-op instruction, you have to split
    the instruction into the load part and the op part, they find their
    way through the OoO engine, and eventually you have to combine them
    again in the ROB. With RMW, things become even more interesting; AMD
    had (maybe still has) an R_W uop (ROP in AMD parlance) which works
    before and after the ALU part. Obviously all doable, but it adds to
    the complexity. Which was first, the PowerPC 604 or the Pentium Pro?
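
    [To make the cracking step concrete, here is a purely illustrative sketch
    - a made-up decoder in C, not any real microarchitecture's - of a
    load-and-op instruction becoming two dependent micro-ops where a
    register-register op maps to just one:]

    /* Toy decoder: a load-and-op instruction cracks into a load micro-op plus
     * an ALU micro-op; a register-register op is a single micro-op.  The two
     * pieces of the cracked instruction must still retire together. */
    #include <stdio.h>

    enum uop_kind { UOP_LOAD, UOP_ALU };

    struct uop { enum uop_kind kind; const char *text; };

    /* "add r1, [r2]" style load-and-op */
    static int crack_load_op(struct uop out[])
    {
        out[0] = (struct uop){ UOP_LOAD, "tmp <- mem[r2]" };
        out[1] = (struct uop){ UOP_ALU,  "r1  <- r1 + tmp" };
        return 2;
    }

    /* "add r1, r2" register-register op */
    static int crack_reg_op(struct uop out[])
    {
        out[0] = (struct uop){ UOP_ALU, "r1 <- r1 + r2" };
        return 1;
    }

    int main(void)
    {
        struct uop u[2];
        printf("load-and-op cracks into %d micro-ops\n", crack_load_op(u));
        printf("reg-reg op  cracks into %d micro-op\n",  crack_reg_op(u));
        return 0;
    }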

    Maybe the OoO circuitry doesn't have to
    work so hard,

    Or the microarchitects and validators don't have to work so hard.

    or OoO plus 32 registers can delay register
    hazards even longer than OoO plus 8 registers.

    Doubtful; with a given number of physical registers, you then have 24
    fewer registers for reordering. The more relevant reason for 32
    registers is for code that has to deal with more than 16 values at the
    same time; they reduce the need for loads and stores for spilling, and
    for hardware-optimizing store-load dependencies. It's no surprise
    that both Tiger Lake (Intel) and Zen 3 (AMD) perform store-to-load
    forwarding at 0 cycle latency (which probably did cost quite a bit of
    design work and who knows how much silicon), while it takes 5 cycles
    or so on Firestorm (Apple); Firestorm does not need that optimization
    as dearly.

    Of course, there's CISC-that's-almost-as-good-as-RISC (IBM's
    z/Architecture, ColdFire,

    What makes you think so?

    there's CISC than which _anything_ else would be better
    (such as the world's most popular ISA, devised by a company
    that made MOS memories, and then branched out into making
    some of the world's first single-chip microprocessors)...

    AMD made MOS memories and some of the world's first single-chip microprocessors?

    Anyway, you seem to be referring to AMD64, but the rest is unclear.
    Intel and AMD have avoided the complexities of VAX in designing IA-32
    and AMD64; in particular, every instruction (but MOVS and the
    newfangled gather and scatter instructions, shame on Intel, and they
    were rewarded with Downfall) only refers to one memory location.

    some architecture designers have chosen to
    incorporate as much of Patterson's original description,
    if not definition, of RISC into their designs as is
    practical in order to distance themselves more convincingly
    from x86 and x86-64.

    I doubt that "distancing themselves from x86 and x86-64" was any
    consideration in the ARM A64 and RISC-V designs. Of course they would
    not design in architectural ideas where the patent has not expired,
    but for load-and-op, RMW, or three-memory-address-with-autoincrement instructions, any patents that may have existed have long expired.

    The ARM A64 architects seem to have had no qualms about introducing features like
    load-pair and store-pair that raise eyebrows among purists, so if they
    thought they would gain enough by deviating from A64 being a
    load-store architecture, or from sticking to fixed-width instructions,
    or from it having 32 registers, they would have gone there.
    Apparently they did not think so, and the IPC and performance per Watt
    of Firestorm indicates that they have designed well.

    The surviving RISC properties are no longer as important as they were
    in the late 1980s, but they still result in fewer problems for the microarchitects to solve, and all that goes with lower complexity.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From John Levine@21:1/5 to All on Sat Dec 9 19:41:47 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    Do you mean this paper by Patterson and Ditzel or something else?

    https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf

    Also compare this paper on the 801 which starts in a very different
    place but ends up with many of the same conclusions.

    https://dl.acm.org/doi/pdf/10.1145/800050.801824

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Robert Swindells@21:1/5 to John Levine on Sat Dec 9 20:06:54 2023
    On Sat, 9 Dec 2023 19:41:47 -0000 (UTC), John Levine wrote:

    According to MitchAlsup <mitchalsup@aol.com>:
    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    Do you mean this paper by Patterson and Ditzel or something else?

    Mitch may also be thinking of Manolis Katevenis' thesis:

    <http://users.ics.forth.gr/~kateveni/cv/katevenis_cv_full_v21.html#B1>

  • From MitchAlsup@21:1/5 to John Levine on Sat Dec 9 21:47:09 2023
    John Levine wrote:

    According to MitchAlsup <mitchalsup@aol.com>:
    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    Do you mean this paper by Patterson and Ditzel or something else?

    I actually meant "Reduced Instruction Set Computers for VLSI" K.

    https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf

    My memory ain't what it used to be........

    Several quotes from the Ditzel paper::


    By a judicious choice of the proper instruction set and the design of a corresponding
    architecture, we feel that it should be possible to have a very simple instruction set
    that can be very fast. This may lead to a substantial net gain in overall program
    execution speed. This is the concept of the Reduced Instruction Set Computer.


    Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
    the results to propose a better machine, and then repeating the cycle over a dozen times.
    Though the initial intent was not specifically to come up with a simple design, the result
    was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
    VAX [Johnson79].


    The paper also makes the case against seldom-used instructions in two ways:
    first, it argues against microcoded implementations that use microcode as
    a gas, filling up every square nanoAcre with seldom-used functionality; then it points out that if a simple sequence of instructions is faster than the equivalent µCode
    instruction, choosing µCode was a poor choice by the architect.

    In my case (My 66000)::
    I see enough use of indexed addressing that this feature makes the cut
    I see enough use of LD Reg,[GOT[k]] that big displacements make the cut
    I see enough use of LD IP, [GOT[k]] that simplifies cross module calling to make the cut
    I see immediates being used all sorts of places so immediates make the cut
    I see large displacements being used all sorts of places so displacements make the cut
    I see a few uses of the operand sign control but this reduces the name space the programmer
    .....needs to understand
    I see enough VEC-LOOP pairs that these make the cut
    Practically every non-leaf subroutine uses ENTER and EXIT
    .....But note I must remain vigilant that these don't end up slower than a series of insts
    The compiler happily produces transcendental instructions for those names which are in the
    .....LLVM intrinsic function list. When you can calculate practically any transcendental
    .....in fewer than 20 cycles (FDIV equivalent) it is time to do with these what happened
    .....to FP instructions around the time of the first commercial RISCs.

    Although not done yet:: I am not averse to adding encryption instructions {once I can
    figure out what they should actually be doing and how few I can get away with}

    Also compare this paper on the 801 which starts in a very different
    place but ends up with many of the same conclusions.

    https://dl.acm.org/doi/pdf/10.1145/800050.801824

  • From Tim Rentsch@21:1/5 to John Levine on Sat Dec 9 18:00:33 2023
    John Levine <johnl@taugh.com> writes:

    It appears that David Brown <david.brown@hesbynett.no> said:

    Based solely on the information Scott gave, however, I would suggest
    that the "OPR" instruction - "microcoded operation" - and the "IOT"
    operation would mean it certainly was not RISC. (This is true even if
    it has other attributes commonly associated with RISC architectures.)

    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The underlying ideas were looked at in the late 1970s, but I
    believe the name RISC goes back only to the early 1980s.

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    The LGP-30, first delivered in 1956, had 113 tubes and 1450
    diodes. I think it's fair to say the LGP-30 has a good
    claim to being the world's first minicomputer.

  • From John Levine@21:1/5 to All on Sun Dec 10 16:34:12 2023
    According to Tim Rentsch <tr.17687@z991.linuxsc.com>:
    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The underlying ideas were looked at in the late 1970s, but I
    believe the name RISC goes back only to the early 1980s.

    I checked, the 801 project started in 1975. The RISC-I paper was
    published in 1981 and I think they came up with the name in 1980, so
    close enough.

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    The LGP-30, first delivered in 1956, had 113 tubes and 1450
    diodes. I think it's fair to say the LGP-30 has a good
    claim to being the world's first minicomputer.

    I poked around one that was old and dead but just looking at it, you
    could see what a very elegant design it was to get useful work out of
    so little logic.

    The Bendix G-15 had 450 tubes and 3000 diodes so it's the other
    contender for the title. Both machines were introduced in 1956,
    cost about the same, and were about the same size, 800lb for the LGP-30,
    956lb for the G-15.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun Dec 10 16:56:10 2023
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    The ARM A64 architects seem to have had no qualms about introducing features like
    load-pair and store-pair that raise eyebrows among purists, so if they thought they would gain enough by deviating from A64 being a
    load-store architecture, or from sticking to fixed-width instructions,
    or from it having 32 registers, they would have gone there.
    Apparently they did not think so, and the IPC and performance per Watt
    of Firestorm indicates that they have designed well.

    Actually the 'RISC purity' of the A64 Architecture was not
    likely to have ever been a consideration when choosing which
    features to add to the architecture. They're in the money-making
    business, not some idealistic RISC business.

  • From John Levine@21:1/5 to All on Sun Dec 10 16:45:32 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    I actually meant "Reduced Instruction Set Computers for VLSI" K.

    https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf

    My memory ain't what it used to be........

    Several quotes from the Ditzel paper::

    By a judicious choice of the proper instruction set and the design of a corresponding
    architecture, we feel that it should be possible to have a very simple instruction set
    that can be very fast. This may lead to a substantial net gain in overall program
    execution speed. This is the concept of the Reduced Instruction Set Computer.

    Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
    the results to propose a better machine, and then repeating the cycle over a dozen times.
    Though the initial intent was not specifically to come up with a simple design, the result
    was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
    VAX [Johnson79].

    Yup. If anyone can find that Johnson tech report I'd like to read it.
    Some googlage only found references to it.

    It seems to me there were two threads to the RISC work. IBM designed
    the hardware and compiler together, a more sophisticated version of
    what Johnson did, so they were constantly trading off what they could
    do in hardware and what they could do in software, usually finding
    that software could do it better, e.g., splitting instruction and data
    caches because they knew their compiler's code never modified
    instructions.

    Berkeley used the old PCC compiler which wasn't terrible but did not
    do very sophisticated register allocation, so they invented hardware
    register windows. In retrospect, the 801 project was right, and windows,
    albeit clever, were a bad idea. Better to use that chip area for a
    bigger cache.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Scott Lurndal@21:1/5 to Tim Rentsch on Sun Dec 10 16:59:56 2023
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    John Levine <johnl@taugh.com> writes:

    It appears that David Brown <david.brown@hesbynett.no> said:

    Based solely on the information Scott gave, however, I would suggest
    that the "OPR" instruction - "microcoded operation" - and the "IOT"
    operation would mean it certainly was not RISC. (This is true even if
    it has other attributes commonly associated with RISC architectures.)

    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The underlying ideas were looked at in the late 1970s, but I
    believe the name RISC goes back only to the early 1980s.

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    The LGP-30, first delivered in 1956, had 113 tubes and 1450
    diodes. I think it's fair to say the LGP-30 has a good
    claim to being the world's first minicomputer.

    I'd suggest the ABC machine, but it was restricted to solving
    linear equations :-) I did have the last remaining
    component in my possession for a few months in 1981 (the
    memory drum).

    There's a modern replica at the CHM.

  • From Scott Lurndal@21:1/5 to John Levine on Sun Dec 10 17:02:02 2023
    John Levine <johnl@taugh.com> writes:
    According to Tim Rentsch <tr.17687@z991.linuxsc.com>:
    The PDP-5 was built in 1963 and the term RISC dates from the late 1970s
    so this is a purely hypothetical argument.

    The underlying ideas were looked at in the late 1970s, but I
    believe the name RISC goes back only to the early 1980s.

    I checked, the 801 project started in 1975. The RISC-I paper was
    published in 1981 and I think they came up with the name in 1980, so
    close enough.

    Keep in mind that the PDP-8 was built from 1400 discrete transistors
    and 10,000 diodes. It had to be simple.

    The LGP-30, first delivered in 1956, had 113 tubes and 1450
    diodes. I think it's fair to say the LGP-30 has a good
    claim to being the world's first minicomputer.

    I poked around one that was old and dead but just looking at it, you
    could see what a very elegant design it was to get useful work out of
    so little logic.

    The Bendix G-15 had 450 tubes and 3000 diodes so it's the other
    contender for the title. Both machines were introduced in 1956,
    cost about the same, and were about the same size, 800lb for the LGP-30, 956lb for the G-15.

    The ElectroData Datatron system shipped in 1954. I used to work in
    the plant where it was designed and built (albeit decades later
    when the B4800 was the high-end machine).

  • From John Dallman@21:1/5 to All on Sun Dec 10 17:56:00 2023
    In article <ul4pvc$2r22$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:

    Yup. If anyone can find that Johnson tech report I'd like to read
    it. Some googlage only found references to it.

    The Computer History Museum has hardcopy:

    <https://www.computerhistory.org/collections/catalog/102773566>

    The UK's Centre for Computing History also appears to have a copy: <https://www.computinghistory.org.uk/det/12205/Bell-Computing-Science-Technical-Report-80-A-32-Bit-Processor-Design/> Since they're local to me,
    I've asked them if they can make me a copy.

    If anyone else wants to hunt, the reference is in: <https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf>

    The author was Stephen C Johnson, who worked at Bell Labs in the 1970s,
    largely on languages; he was responsible for YACC. <https://research.google.com/pubs/archive/94.pdf>

    John

  • From John Dallman@21:1/5 to Lurndal on Sun Dec 10 17:56:00 2023
    In article <K4mdN.7691$83n7.220@fx18.iad>, scott@slp53.sl.home (Scott
    Lurndal) wrote:

    Actually the 'RISC purity' of the A64 Architecture was not
    likely to have ever been a consideration when choosing which
    features to add to the architecture. They're in the money
    making business, not some idealistic RISC business.

    Yup. A32 was never a pure RISC: they had understood the idea, but did not
    feel constrained by it.

    John

  • From MitchAlsup@21:1/5 to John Levine on Sun Dec 10 18:27:40 2023
    John Levine wrote:

    According to MitchAlsup <mitchalsup@aol.com>:
    I actually meant "Reduced Instruction Set Computers for VLSI" K.

    https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf

    My memory ain't what it used to be........

    Several quotes from the Ditzel paper::

    By a judicious choice of the proper instruction set and the design of a corresponding
    architecture, we feel that it should be possible to have a very simple instruction set
    that can be very fast. This may lead to a substantial net gain in overall program
    execution speed. This is the concept of the Reduced Instruction Set Computer.
    Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
    the results to propose a better machine, and then repeating the cycle over a dozen times.
    Though the initial intent was not specifically to come up with a simple design, the result
    was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
    VAX [Johnson79].

    Yup. If anyone can find that Johnson tech report I'd like to read it.
    Some googlage only found references to it.

    It seems to me there were two threads to the RISC work. IBM designed
    the hardware and compiler together,

    Something I keep pushing Quadrablock to do.

    a more sophisticated version of
    what Johnson did, so they were constantly trading off what they could
    do in hardware and what they could do in software, usually finding
    that software could do it better, e.g., splitting instruction and data
    caches because they knew their compiler's code never modified
    instructions.

    Apparently they had no notion of JiT compilation and JiT caches.

    Berkeley used the old PCC compiler which wasn't terrible but did not
    do very sophisticated register allocation, so they invented hardware
    register windows. In retrospect, the 801 project was right and windows
    albeit clever were a bad idea. Better to use that chip area for a
    bigger cache.

  • From Scott Lurndal@21:1/5 to John Dallman on Sun Dec 10 19:50:10 2023
    jgd@cix.co.uk (John Dallman) writes:
    In article <ul4pvc$2r22$1@gal.iecc.com>, johnl@taugh.com (John Levine)
    wrote:

    Yup. If anyone can find that Johnson tech report I'd like to read
    it. Some googlage only found references to it.

    The Computer History Museum has hardcopy:

    <https://www.computerhistory.org/collections/catalog/102773566>

    The UK's Centre for Computing History also appears to have a copy: <https://www.computinghistory.org.uk/det/12205/Bell-Computing-Science-Technical-Report-80-A-32-Bit-Processor-Design/> Since they're local to me,
    I've asked them if they can make me a copy.

    If anyone else wants to hunt, the reference is in: <https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf>

    The author was Stephen C Johnson, who worked at Bell Labs in the 1970s, largely on languages; he was responsible for YACC.

    He was also responsible for PCC, if I recall correctly.

  • From John Levine@21:1/5 to All on Sun Dec 10 21:17:58 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    It seems to me there were two threads to the RISC work. IBM designed
    the hardware and compiler together, ...
    that software could do it better, e.g., splitting instruction and data
    caches because they knew their compiler's code never modified
    instructions.

    Apparently they had no notion of JiT compilation and JiT caches.

    They certainly knew about JIT code since IBM sort programs have been
    generating comparison code since the 1960s if not longer. Back in the
    olden days, particularly on machines without index registers, you
    pretty much had to write code where one instruction would modify
    another to do address or length calculations.

    By the 1970s nobody did that, programs were loaded and didn't change
    once they were loaded. If you want to do JIT, write out the JIT code,
    then poke the cache to invalidate the area where you put the JIT code.
    It's the same thing it did when loading a program in the first place.
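
    [A minimal modern sketch of exactly that sequence, assuming x86-64 Linux
    and a GCC/Clang toolchain; the emitted bytes are just "mov eax, 42; ret".
    My illustration, not anything from the 801 papers:]

    /* Write out the generated code, then explicitly invalidate the cache range.
     * On x86 the invalidation is effectively a no-op; on machines with split
     * I and D caches it is the step being described above. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* mov eax, 42 ; ret */
        static const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        memcpy(buf, code, sizeof code);                              /* write the JIT code */
        __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);  /* poke the cache */

        int (*fn)(void) = (int (*)(void))buf;
        printf("%d\n", fn());
        return 0;
    }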

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From MitchAlsup@21:1/5 to John Levine on Sun Dec 10 21:59:37 2023
    John Levine wrote:

    According to MitchAlsup <mitchalsup@aol.com>:
    It seems to me there were two threads to the RISC work. IBM designed
    the hardware and compiler together, ...
    that software could do it better, e.g., splitting instruction and data
    caches because they knew their compiler's code never modified
    instructions.

    Apparently they had no notion of JiT compilation and JiT caches.

    They certainly knew about JIT code since IBM sort programs have been generating comparison code since the 1960s if not longer. Back in the
    olden days, particularly on machines without index registers, you
    pretty much had to write code where one instruction would modify
    another to do address or length calculations.

    By the 1970s nobody did that, programs were loaded and didn't change
    once they were loaded. If you want to do JIT, write out the JIT code,
    then poke the cache to invalidate the area where you put the JIT code.
    It's the same thing it did when loading a program in the first place.


    My point was that the original statement was: they knew their compiler's
    code never modified instructions.

    Yet a JiT compiler HAS to modify instructions.

    I am not poking fun at 801 {for which I have great admiration.}
    I am poking fun at the inspecificity of the statement.

  • From Quadibloc@21:1/5 to Anton Ertl on Sun Dec 10 22:04:50 2023
    On Sat, 09 Dec 2023 17:27:02 +0000, Anton Ertl wrote:
    Quadibloc <quadibloc@servername.invalid> writes:

    However, Patterson also wrote a popular article about one of the early
    RISC processors with which he was connected for _Scientific American_,

    A short web search on that came up empty. Do you have a reference?

    I was thinking of

    D. A. Patterson, "Microprogramming," Scientific American, vol. 248, no. 3,
    pp. 36-43, March 1983.

    despite its unlikely title.

    there's CISC than which _anything_ else would be better (such as the world's most popular ISA, devised by a company that made MOS memories,
    and then branched out into making some of the world's first single-chip microprocessors)...

    AMD made MOS memories and some of the world's first single-chip microprocessors?

    No, but Intel did.

    I think of x86 as the architecture, and x86-64 as a feature, like MMX or AVX-512. Although the transition from the 8086 to the 80386 might well
    be considered as moving to a new architecture, x86-64 is, to me, just
    a feature added to the 386 architecture.

    John Savard

  • From John Levine@21:1/5 to All on Mon Dec 11 00:30:18 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    My point was that the original statement was: they knew their compiler's
    code never modified instructions.

    Yet a JiT compiler HAS to modify instructions.

    I am not poking fun at 801 {for which I have great admiration.}
    I am poking fun at the inspecificity of the statement.

    I was summarizing. You might want to read the paper.

    The S/360 architecture says that one instruction can modify the one
    that immediately follows it. This sort of thing was very common before
    there were index registers and still somewhat common in the 1960s. On
    S/360 the most likely example was a string move or compare, where the
    string length was in the instruction. It turned out that in practice
    nobody did that; if you wanted a variable-length move, the EXecute
    instruction let you fake it by or'ing a register into the length byte
    in the executed instruction, but even now zSeries lets you store into
    the instruction stream so there is some little-used logic to detect
    that.
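
    [A loose model of that EXecute trick, for readers who have never seen
    S/360 code - my own sketch in plain C, not real assembler. The only
    S/360 facts relied on are that the second byte of an SS instruction such
    as MVC holds the length, and that EX applies the OR to the instruction
    as fetched for execution, not to the copy in storage:]

    /* Sketch of the EX idiom: the length byte of an MVC is filled in at
     * execution time from a register, without the program modifying itself. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* MVC assembled with a length byte of zero; base/displacement bytes elided. */
        unsigned char target[6] = { 0xD2, 0x00, 0x00, 0x00, 0x00, 0x00 };
        unsigned int  reg       = 0x1F;          /* desired length-1 in the low byte */
        unsigned char subject[6];

        memcpy(subject, target, sizeof subject); /* EX executes a modified copy... */
        subject[1] |= reg & 0xFF;                /* ...with the register OR'ed into byte 1 */

        printf("length byte actually executed: 0x%02X\n", subject[1]);
        printf("length byte still in storage:  0x%02X\n", target[1]);
        return 0;
    }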

    The 801 people said that's silly, it's so rare to change instructions
    that we'll require the program to explicitly invalidate the cache when
    they do.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

  • From Anton Ertl@21:1/5 to Tim Rentsch on Mon Dec 11 08:54:24 2023
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    mitchalsup@aol.com (MitchAlsup) writes:
    The PDP-8 certainly is simple, and it does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.

    What is that design philosophy supposed to be?

    The PDP-8 inspired the PDP-X and eventually the Nova. I have seen a
    claim that Nova led to RISC and PDP-11 to CISC, but when you look at
    Nova and its successors, including single-chip implementations, it's
    an accumulator architecture which consumed many cycles for each
    instruction, but it invested hardware in fast multiplication and
    division. The design seems to have further evolved in the direction
    of CISC, and we can read in "The Soul of a New Machine" about the
    headaches that the microarchitects of the Eclipse MV/8000 had dealing
    with virtual memory etc. in that architecture, and this architecture
    was replaced by the RISC 88000 architecture a decade later.

    Does anybody design architectures with properties like the PDP-8 or
    the Nova these days, or even in 1990? Not that I know of. By
    contrast, people are still designing architectures with some of the
    same properties that early RISCs have. Maybe the PDP-8 was designed
    with the same philosophy as the early RISCs, maybe not, but why would
    we call that philosophy RISC, rather than identifying the common
    properties of the architectures that were called RISCs by Patterson, the
    inventor of the term, and calling that RISC?

    Moreover, the major "philosophy" behind the PDP-8 is probably to make
    it as cheap as possible. For 20 years we have had the b16-small
    architecture that is cheaper to produce than any RISC; it takes
    0.16mm^2 in the XC035 process (a 350nm process <https://www.eetimes.com/x-fab-expands-mixed-signal-foundry-portfolio-with-0-35-micron-process/>),
    while an 8051 in the same process takes 1mm^2. According to ARM the
    Cortex-M0 takes 0.11mm^2 in 180ULL (presumably a 180nm process), so
    probably around 0.44mm^2 in a 350nm process.

    Does that make the b16-small a RISC and ARM a non-RISC? Not as far as
    I am concerned. It does not share enough characteristics with the architectures that have been called RISC.

    One interesting aspect is that b16-small can run at ~150MHz (and
    b16-dsp at ~200MHz) when manufactured in that process and connected to
    a memory subsystem that actually supports this speed, without being
    pipelined. Compare this to other CPUs manufactured in 350nm
    processes: the 5-stage P54CS (Pentium), which was sold at up to
    200MHz, and the 10-stage Klamath (Pentium II), which runs at up to
    300MHz, and the EV56 (Alpha 21164a), which runs at up to 666MHz.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to John Levine on Mon Dec 11 18:37:40 2023
    On Mon, 11 Dec 2023 00:30:18 +0000, John Levine wrote:

    The S/360 architecture says that one instruction can modify the one that immediately follows it. This sort of thing was very common before there
    were index registers and still somewhat common in the 1960s.

    The lower-end models of the System/360, of course, didn't have any microarchitectural features that would make this a problem. For that
    matter, the original top end of the series, the Model 75, didn't
    either.

    So, since it was so common back then for computers to allow this -
    even if they had registers, and so didn't need it - it's entirely
    possible that the architecture just had this property by default,
    and its possible usefulness with string instructions was not
    the actual reason.

    People might have just felt this was the normal, expected behavior
    of a computer that wasn't staggeringly difficult to understand
    because it had bizarre restrictions due to being designed for
    some niche application, or high speed beyond what the technology
    was really ready for.

    Today, now that pipelining is standard, the rules have changed.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Dec 11 19:31:36 2023
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    What is that design philosophy supposed to be?

    We don't have to guess, because DEC published an entire book about the
    way they built their computers. The PDP-5 was originally a front end
    system for a nuclear reactor in Canada, 12 bits both because the
    analog values it was handling needed that much precision, and also
    because they used ideas from the LINC lab computer. The PDP-5 is
    recognizable as a cut down PDP-4 which was in turn a cheaper redesign
    of the PDP-1 which was largely based on the MIT TX-0 computer built to
    test core memories in the 1950s. They all had word addresses and a
    single register, not surprising since that's what all scientific
    computers of the era had.

    The PDP-8 reimplemented the PDP-5 using newer components and packaging
    so was a lot smaller and cheaper. The book says it was important that
    it was the first computer small enough to sit on a lab bench, or in a
    rack leaving room for other equipment.

    The PDP-8 inspired the PDP-X and eventually the Nova. I have seen a
    claim that Nova led to RISC and PDP-11 to CISC, ...

    The PDP-11 certainly led to VAX, and the Berkeley RISC was explicitly
    a response to the largely unused complexity of the VAX, much as the
    801 was to S/360.

    when you look at
    Nova and its successors, including single-chip implementations, it's
    an accumulator architecture which consumed many cycles for each
    instruction, but it invested hardware in fast multiplication and
    division. The design seems to have further evolved in the direction
    of CISC, and we can read in "The Soul of a New Machine" about the
    headaches that the microarchitects of the Eclipse MV/8000 had dealing
    with virtual memory etc. in that architecture, and this architecture
    was replaced by the RISC 88000 architecture a decade later.

    Right. The Nova made excellent use of the hardware available at the
    time it was designed, e.g., some of the instruction bits went straight
    into the new ALU chips it used. But as usually happens, some of those
    decisions caused a great deal of pain later, particularly when they
    made the decision to shoehorn the Eclipse into the holes in the Nova's instruction set rather than adding a separate mode for larger addresses, as on
    the 386, zSeries, and ARM.

    Moreover, the major "philosophy" behind the PDP-8 is probably to make
    it as cheap as possible.

    Actually, it was to build the best computer they could for the target
    price, although those often end up around the same place.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to MitchAlsup on Mon Dec 11 12:23:51 2023
    On 12/9/2023 10:11 AM, MitchAlsup wrote:
    Anton Ertl wrote:



    I don't think trying to redefine "RISC" to mean something different
    from its original purpose helps.

    That would mean that "RISC" has an original definition; what is it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    RISC was defined before CISC was coined as its contrapositive.

    I agree, but this illustrates some of the semantic confusion we are
    having regarding defining RISC. In normal speech, the opposite of
    "reduced" is "increased", and the opposite of "complex" is "simple". So
    RISC and CISC are not really on the opposite end of a scale, but are on different scales!

    If we substitute "simple" for "reduced", a lot of nice things sort of
    fall out. Some examples

    Requiring a single instruction length simplifies decoding, as does no "dependent" code where you can't decode a later part of the instruction
    till you decode something in an earlier part.

    Requiring all instructions to be single cycle simplifies the pipeline
    design. I think this is related to having no mem-op instructions.

    Requiring no more than one memory reference (and relatedly prohibiting non-aligned memory accesses) simplifies some of the internal address-generation (agen) logic.

    etc.

    Of course, as time went on, both the number of gates on a chip and our understanding of how to do things more simply increased. So we were
    able to add more "complexity" to the design while keeping with the
    "spirit" of simplicity. So we got multi-cycle instructions in the CPU,
    not a co-processor, and non-aligned memory accesses, etc.

    Under this view, the number of instructions is not the key defining
    factor, but sort of a side effect of making the design "simple".

    So if they had used "simple" instead of "reduced" a lot of confusion
    would have been prevented. :-)



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Stephen Fuld on Tue Dec 12 00:38:53 2023
    On Mon, 11 Dec 2023 12:23:51 -0800
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 12/9/2023 10:11 AM, MitchAlsup wrote:
    Anton Ertl wrote:



    I don't think trying to redefine "RISC" to mean something
    different from its original purpose helps.

    That would mean that "RISC" has an original definition; what is
    it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    RISC was defined before CISC was coined as its contrapositive.

    I agree, but this illustrates some of the semantic confusion we are
    having regarding defining RISC. In normal speech, the opposite of
    "reduced" is "increased",

    In the context of a Reduced Instruction Set the opposite of "reduced" is
    "full" or "complete". IMHO.

    and the opposite of "complex" is "simple".
    So RISC and CISC are not really on the opposite end of a scale, but
    are on different scales!

    If we substitute "simple" for "reduced", a lot of nice things sort of
    fall out. Some examples

    Requiring a single instruction length simplifies decoding, as does no "dependent" code where you can't decode a later part of the
    instruction till you decode something in an earlier part.

    Requiring all instructions to be single cycle simplifies the pipeline
    design. I think this is related to having no mem-op instructions.

    Requiring no more than one memory reference (and relatedly
    prohibiting non-aligned memory accesses) simplifies some of the internal
    address-generation (agen) logic.

    etc.

    Of course, as time went on, both the number of gates on a chip and
    our understanding of how to do things more simply increased. So we
    were able to add more "complexity" to the design while keeping with
    the "spirit" of simplicity. So we got multi-cycle instructions in
    the CPU, not a co-processor, and non-aligned memory accesses, etc.

    Under this view, the number of instructions is not the key defining
    factor, but sort of a side effect of making the design "simple".

    So if they had used "simple" instead of "reduced" a lot of confusion
    would have been prevented. :-)




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Michael S on Mon Dec 11 16:12:24 2023
    On 12/11/2023 2:38 PM, Michael S wrote:
    On Mon, 11 Dec 2023 12:23:51 -0800
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 12/9/2023 10:11 AM, MitchAlsup wrote:
    Anton Ertl wrote:



    I don't think trying to redefine "RISC" to mean something
    different from its original purpose helps.

    That would mean that "RISC" has an original definition; what is
    it?

    See the classic "Case for the Reduced Instruction Set Computer"
    Katevenis.

    RISC was defined before CISC was coined as its contrapositive.

    I agree, but this illustrates some of the semantic confusion we are
    having regarding defining RISC. In normal speech, the opposite of
    "reduced" is "increased",

    In the context of a Reduced Instruction Set the opposite of "reduced" is
    "full" or "complete". IMHO.

    OK, but what does "full" or "complete" mean? There are always instructions/functionality that could be added, so in that sense, no instruction set is full or complete.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Stephen Fuld on Tue Dec 12 06:57:48 2023
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/11/2023 2:38 PM, Michael S wrote:
    In the context of a Reduced Instruction Set the opposite of "reduced" is
    "full" or "complete". IMHO.

    OK, but what does "full" or "complete" mean?

    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    There are always
    instructions/functionality that could be added, so in that sense, no instruction set is full or complete.

    If you start with a certain instruction set, and then leave things
    away, the result is reduced, while the starting point is complete.

    But of course RISC became a standalone term, and the Acorn RISC
    Machine was not designed as a reduced version of some instruction set,
    and looking at the 32 registers and register windows of Berkeley RISC,
    it was not a reduced VAX, either.

    So maybe SISC (simple-instruction SC) might have been more accurate,
    but RISC was probably a more marketable acronym.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to John Levine on Tue Dec 12 11:00:56 2023
    John Levine wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which had
    been a long running design philosophy in the hardware industry.

    https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond "not a Vax or S/360".

    R's,
    John

    The first paper on MIPS is

    MIPS: A VLSI Processor Architecture
    J Hennessy, N Jouppi, F Baskett, J Gill, 1981
    http://ai.eecs.umich.edu/people/conway/VLSI/ClassicDesigns/MIPS/MIPS.CMU81.pdf

    says at the start that "The basic philosophy of MIPS is to present an instruction set that is a compile-driven encoding of the microengine.
    Thus little or no decoding is needed and the instructions correspond
    closely to microcode instructions".

    The original Stanford MIPS had only 16 32-bit registers and had
    no byte or word instructions. The first commercial version, the R2000,
    had 32 registers and added byte and word instructions because
    of all the software porting problems they encountered.

    An interesting thing for both MIPS and RISC-I was the amount of design
    time they took. If you subtract out all the time they spent developing
    their own software tools, it looks like the chip design would have been
    about just 6 months for 2 persons. Compared to the many hundreds of person-years for a VAX, that got a lot of people's attention.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Dec 12 15:33:03 2023
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Looking at the genesis of the RISCs, full means the S/360 and S/370 >instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they started
    from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which had
    been a long running design philosophy in the hardware industry.

    https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond "not a Vax or S/360".

    R's,
    John
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Tue Dec 12 11:20:59 2023
    In case anyone is interested there are some
    DEC internal company confidential memos that
    show they were thinking about this internally too.

    VOR: VAX on a RISC, 1984
    http://bwlampson.site/35a-VOR/35a-VORAbstract.html
    http://bwlampson.site/35a-VOR/35a-VOR.pdf

    Ideas for a simple fast VAX, 1985
    http://bwlampson.site/35b-IdeasFastVAX/35b-IdeasFastVAXAbstract.html
    http://bwlampson.site/35b-IdeasFastVAX/35b-IdeasFastVAX.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Tue Dec 12 16:51:52 2023
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    What is that design philosophy supposed to be?

    We don't have to guess, because DEC published an entire book about the
    way they built their computers. The PDP-5 was originally a front end
    system for a nuclear reactor in Canada, 12 bits both because the
    analog values it was handling needed that much precision, and also
    because they used ideas from the LINC lab computer. The PDP-5 is
    recognizable as a cut down PDP-4 which was in turn a cheaper redesign
    of the PDP-1 which was largely based on the MIT TX-0 computer built to
    test core memories in the 1950s. They all had word addresses and a
    single register, not surprising since that's what all scientific
    computers of the era had.

    The PDP-8 reimplemented the PDP-5 using newer components and packaging
    so was a lot smaller and cheaper. The book says it was important that
    it was the first computer small enough to sit on a lab bench, or in a
    rack leaving room for other equipment.

    The PDP-8 inspired the PDP-X and eventually the Nova. I have seen a
    claim that Nova led to RISC and PDP-11 to CISC, ...

    So we have a long line of ancestry:

    TX-0, PDP-4, PDP-5, PDP-8, PDP-X, Nova, Eclipse (16-bit), Eclipse
    MV/8000 (32-bit)

    So no, Nova did not lead to RISC, it led to the Eclipse MV/8000, a
    CISC. And if the PDP-8 had the same design philosophy as RISCs, Ed de
    Castro (chief engineer of the PDP-8) and all the people that went with
    him and founded Data General forgot about it when they were working at
    Data General.

    Moreover, the major "philosophy" behind the PDP-8 is probably to make
    it as cheap as possible.

    Actually, it was to build the best computer they could for the target
    price, although those often end up around the same place.

    Either one (1. Satisfy the requirements at the lowest cost; 2. Build
    the best thing for a given price point) is a general engineering
    principle. The VAX architects were certainly convinced that they were
    designing the best architecture for the target price, too. And VAX is
    obviously not a RISC, so there is more to RISC than that philosophy.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Tue Dec 12 20:48:32 2023
    Anton Ertl wrote:


    Either one (1. Satisfy the requirements at the lowest cost; 2. Build
    the best thing for a given price point) is a general engineering
    principle. The VAX architects were certainly convinced that they were
    designing the best architecture for the target price, too. And VAX is
    obviously not a RISC, so there is more to RISC than that philosophy.

    In the days when store was more expensive than gates, VAX makes a lot
    of sense--this era corresponds to the 10-cycles per instruction going
    down towards 4-cycles per instruction. This era could not be extended
    into the 1-instruction-per-cycle realm with a VAX-like ISA.


    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Dec 12 21:12:15 2023
    According to MitchAlsup <mitchalsup@aol.com>:
    Anton Ertl wrote:

    Either one (1. Satisfy the requirements at the lowest cost; 2. Build
    the best thing for a given price point) is a general engineering
    principle. The VAX architects were certainly convinced that they were
    designing the best architecture for the target price, too. And VAX is
    obviously not a RISC, so there is more to RISC than that philosophy.

    In the days when store was more expensive than gates, VAX makes a lot
    of sense--this era corresponds to the 10-cycles per instruction going
    down towards 4-cycles per instruction. This era could not be extended
    into the 1-instruction-per-cycle realm with a VAX-like ISA.

    It also made sense in the era when microcode ROM was faster than main
    memory RAM. Unfortunately, by the time the Vax came out, that era was
    over.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to paaronclayton@gmail.com on Thu Dec 14 01:52:30 2023
    On Tue, 12 Dec 2023 18:27:50 -0500, "Paul A. Clayton"
    <paaronclayton@gmail.com> wrote:

    On 12/10/23 11:45 AM, John Levine wrote:
    According to MitchAlsup <mitchalsup@aol.com>:
    :
    Johnson used an iterative technique of proposing a machine, writing a compiler, measuring
    the results to propose a better machine, and then repeating the cycle over a dozen times.
    Though the initial intent was not specifically to come up with a simple design, the result
    was a RISC-like 32-bit architecture whose code density was as compact as the PDP-11 and
    VAX [Johnson79].

    Yup. If anyone can find that Johnson tech report I'd like to read it.
    Some googlage only found references to it.

    It looks like one can get a PDF from Semantic Scholar:
    https://www.semanticscholar.org/paper/A-32-bit-processor-design-Johnson/5ef2b3e8a755a2c29833eba8ab61117c296d95ac

    I have a PDF on my computer that I can email to anyone interested (paaronclayton is my gmail address).

    Seems the server that hosted that paper is no longer operating.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Mon Jan 1 12:15:28 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    mitchalsup@aol.com (MitchAlsup) writes:

    PDP-8 certainly is simple and does not have many instructions, but it
    certainly is NOT RISC.

    Did not have a large GPR register file
    Was Not pipelined
    Was Not single cycle execution
    Did not overlap instruction fetch with execution
    Did not rely on compiler for good code performance

    Of course the PDP-8 is a RISC. These properties may have been
    common among some RISC processors, but they don't define what
    RISC is. RISC is a design philosophy, not any particular set
    of architectural features.

    What is that design philosophy supposed to be?

    Mitch quoted it in an earlier posting, and it may be summarized as
    "simple instructions, simple architecture." The PDP-8 is
    consistent with that description, which is all that matters.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to John Levine on Mon Jan 1 12:20:23 2024
    John Levine <johnl@taugh.com> writes:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they
    started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which
    had been a long running design philosophy in the hardware
    industry.

    https://cs.stanford.edu/people/eroberts/
    courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond
    "not a Vax or S/360".

    Surely people don't view the Itanium as being a RISC. And what
    about the Mill? Is that a RISC or not?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Jan 2 01:05:38 2024
    According to Tim Rentsch <tr.17687@z991.linuxsc.com>:
    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    That it was designed before is irrelevant. All that matters is
    that the end result is consistent with that philosophy.

    I dunno, indirect addressing and those auto-index locations 10
    to 17 don't seem so RISCful. Nor does having only one register you
    have to use for everything.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Tim Rentsch on Tue Jan 2 10:42:32 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    Surely people don't view the Itanium as being a RISC.

    What makes you think so?

    It has a lot of RISC characteristics, in particular, it's a load/store architecture with a large number of general-purpose registers (and for
    the other registers, there are also many of them, avoiding the
    register allocation problems that compilers tend to have with unique registers). In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    And what
    about the Mill?

    The Mill is not even a paper design (I have yet to see a paper about
    it), so how would I know?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to MitchAlsup on Tue Jan 2 16:13:21 2024
    On Mon, 1 Jan 2024 20:28:29 +0000
    mitchalsup@aol.com (MitchAlsup) wrote:

    Tim Rentsch wrote:

    John Levine <johnl@taugh.com> writes:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    Looking at the genesis of the RISCs, full means the S/360 and
    S/370 instruction sets for the 801 project, and VAX for the
    Berkeley RISC project. Not sure what full means for Stanford
    MIPS.

    This web page suggests it was more from the other direction, they
    started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which
    had been a long running design philosophy in the hardware
    industry.

    https://cs.stanford.edu/people/eroberts/
    courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond
    "not a Vax or S/360".

    Surely people don't view the Itanium as being a RISC. And what
    about the Mill? Is that a RISC or not?

    Itanium is VLIW


    No, it isn't.
    Itanium is RISC with a few VLIW-inspired additions.
    Semantics are fully defined on the level of individual verbs rather
    than at the level of bundles or groups.

    Mill is Belted
    Both are dependent on compiler to perform code scheduling.

    In Itanium you can add ';' between verbs and, for a "defined" program, the
    result would be the same as without. Scheduling is needed for
    performance, but not for correctness, the same as on any wide in-order RISC.
    It is [Berkeley-style] RISC with a funny instruction encoding.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Anton Ertl on Tue Jan 2 14:19:25 2024
    On Tue, 02 Jan 2024 10:42:32 +0000, Anton Ertl wrote:
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Surely people don't view the Itanium as being a RISC.

    What makes you think so?

    It has a lot of RISC characteristics, in particular, it's a load/store architecture with a large number of general-purpose registers (and for
    the other registers, there are also many of them, avoiding the register allocation problems that compilers tend to have with unique registers).
    In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    The TMS320C6000 has individual instructions that are indeed very similar
    to those of a RISC machine. But because they're grouped in blocks of
    eight instructions, with a bit in each instruction to indicate whether
    or not a given instruction can execute in parallel with those that precede
    it, it is classed as a VLIW architecture.

    Intel didn't use the term VLIW in referring to the Itanium. I guess they
    didn't think that 128 bits (unlike 256 bits) was "very" long.

    But that's basically what the Itanium was, even if it shared a lot of characteristics with RISC. Three instructions were grouped into a 128-bit block; possible parallelism between them was indicated explicitly, and
    each of the three instructions even had a different format from the two
    others.
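    For reference, here is a minimal C sketch of pulling such a block apart.
    It assumes the published IA-64 bundle layout (a 5-bit template field in
    the low-order bits followed by three 41-bit instruction slots); the
    struct and helper names are mine, and the per-template meaning of the
    slots is ignored.

    #include <stdint.h>

    typedef struct {
        uint64_t lo, hi;             /* 128-bit bundle as two 64-bit halves */
    } bundle_t;

    /* 5-bit template field in the low-order bits of the bundle. */
    static uint8_t bundle_template(bundle_t b)
    {
        return (uint8_t)(b.lo & 0x1F);
    }

    /* Slot n (n = 0, 1, 2) occupies bits [5 + 41*n, 5 + 41*n + 40]. */
    static uint64_t bundle_slot(bundle_t b, int n)
    {
        const uint64_t mask = (1ULL << 41) - 1;
        int start = 5 + 41 * n;
        if (start + 41 <= 64)
            return (b.lo >> start) & mask;           /* slot 0: entirely in lo */
        if (start >= 64)
            return (b.hi >> (start - 64)) & mask;    /* slot 2: entirely in hi */
        /* slot 1 straddles the 64-bit boundary */
        return ((b.lo >> start) | (b.hi << (64 - start))) & mask;
    }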

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Quadibloc on Tue Jan 2 16:43:45 2024
    On Tue, 2 Jan 2024 14:19:25 -0000 (UTC)
    Quadibloc <quadibloc@servername.invalid> wrote:

    On Tue, 02 Jan 2024 10:42:32 +0000, Anton Ertl wrote:
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Surely people don't view the Itanium as being a RISC.

    What makes you think so?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers tend
    to have with unique registers). In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    The TMS320C6000 has individual instructions that are indeed very
    similar to those of a RISC machine. But because they're grouped in
    blocks of eight instructions, with a bit in each instruction to
    indicate whether or not a given instruction can execute in parallel
    with those that precede it, it is classed as a VLIW architecture.

    Intel didn't use the term VLIW in referring to the Itanium. I guess
    they didn't think that 128 bits (unlike 256 bits) was "very" long.

    But that's basically what the Itanium was, even if it shared a lot of characteristics with RISC. Three instructions were grouped into a
    128-bit block; possible parallelism between them was indicated
    explicitly, and each of the three instructions even had a different
    format from the two others.

    John Savard

    On the TI C6000 (or on the ADI TigerSharc, another VLIW that people actually
    used to do real work and not just to write papers about and to con VCs
    into investments) the pipeline is exposed to the programmer, i.e. visible
    through the results of execution and not just through the timing of execution.
    In Itanium, at least for legal well-formed programs, the pipeline is not
    exposed.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Tim Rentsch on Tue Jan 2 16:59:53 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    John Levine <johnl@taugh.com> writes:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they
    started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which
    had been a long running design philosophy in the hardware
    industry.

    https://cs.stanford.edu/people/eroberts/
    courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond
    "not a Vax or S/360".

    Surely people don't view the Itanium as being a RISC.

    It was an Epic Risk.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Scott Lurndal on Tue Jan 2 11:29:13 2024
    scott@slp53.sl.home (Scott Lurndal) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    John Levine <johnl@taugh.com> writes:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    Looking at the genesis of the RISCs, full means the S/360 and S/370
    instruction sets for the 801 project, and VAX for the Berkeley RISC
    project. Not sure what full means for Stanford MIPS.

    This web page suggests it was more from the other direction, they
    started from the compiler:

    The Stanford research group had a strong background in compilers,
    which led them to develop a processor whose architecture would
    represent the lowering of the compiler to the hardware level, as
    opposed to the raising of hardware to the software level, which
    had been a long running design philosophy in the hardware
    industry.

    https://cs.stanford.edu/people/eroberts/
    courses/soco/projects/risc/mips/index.html

    I agree that these days RISC doesn't really mean anything beyond
    "not a Vax or S/360".

    Surely people don't view the Itanium as being a RISC.

    It was an Epic Risk.

    Okay that made me laugh. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Anton Ertl on Tue Jan 2 19:55:51 2024
    On Tue, 02 Jan 2024 10:42:32 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    The Mill is not even a paper design (I have yet to see a paper about
    it), so how would I know?

    - anton

    List of patents:
    https://millcomputing.com/patents/


    The Mill has a fixed block level design, but details of the blocks are
    variable and/or customizable and may be different across instances.
    They don't really have model lines as we normally think of them -
    instead they have instances which can be one-offs or reproduced in
    bulk, as desired.

    They have settled on three demonstration instances - which they call
    "Gold", "Silver", and "Bronze" - whose purpose is to show how
    performance varies with the details.

    Ivan has said that inside information might be had with an NDA. If
    you really are interested you could ask them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to George Neuner on Wed Jan 3 07:20:23 2024
    George Neuner <gneuner2@comcast.net> writes:
    On Tue, 02 Jan 2024 10:42:32 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    The Mill is not even a paper design (I have yet to see a paper about
    it), so how would I know?

    - anton

    List of patents:
    https://millcomputing.com/patents/

    Patents are not written for comprehension, and this page acknowledges
    that with sentences like

    |Split-stream encoding is described here in a way that is more
    |accessible than the patent text.

    This particular link actually points to a white paper, the few other
    such links only point to videos.

    What I had in mind is an overview paper of the architecture, or maybe
    an architecture manual (if it is a short one like the RISC-V one, not
    a reference-only monstrosity like the ones produced by Intel and ARM).

    Ivan has said that inside information might be had with an NDA. If
    you really are interested you could ask them.

    Not that interested, but when somebody asks "What about the Mill?",
    the answer I give is what you cited above.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to John Levine on Thu Jan 4 04:01:56 2024
    John Levine <johnl@taugh.com> writes:

    According to Tim Rentsch <tr.17687@z991.linuxsc.com>:

    The PDP-8 is just a very small computer, with a very small instruction
    set, designed before the RISC design philosophy was even conceived of.

    That it was designed before is irrelevant. All that matters is
    that the end result is consistent with that philosophy.

    I dunno, indirect addressing and those auto-index locations 10
    to 17 don't seem so RISCful. Nor does having only one register you
    have to use for everything.

    You bring up some relevant points. Let me try to explain my
    perspective.

    First, I don't think having only one accumulator automatically
    disqualifies an architecture from being a RISC. Even though
    the architectures originally designed under the name "RISC"
    tended to share certain properties, the RISC concept was never
    meant to be tied to a specific technology; it's just that the
    technology of the time tended to favor certain properties, and
    people tended to glom on to those properties as defining RISC.
    A RISC of today would never have been designed in the 1980s.
    Similarly a RISC of the 1980s would not have been designed in
    the 1960s, when the PDP-8 was. Having X number of registers
    is RISC dogma, not RISC principle.

    Furthermore I view the PDP-8 not as having one register but as
    having 128 + 1 registers, with only one of those registers being
    fully capable of arithmetic. The non-accumulator registers offer
    limited arithmetic (increment and conditional skip) and a limited
    form of indexing (which if I am not mistaken was not available
    via the accumulator). The question of indexing brings us to
    indirect addressing.

    The PDP-8 does not have register-based indexing. Instead a
    rudimentary indexing capability is available by using indirect
    addressing: compute an address, store it in one of the page-0
    registers, and indirect through the address so formed. (In case
    this wasn't clear, page-0 memory may be thought of as "registers"
    partly because they are available from any instruction no matter
    where it is located.) Providing indirect addressing rather than
    general indexing simplifies and lightens the architecture.

    Indirect addressing provides another capability unrelated to
    indexing. Addresses in PDP-8 instructions don't allow access to
    the entire memory space. To be able to access any word in
    memory we need more bits than a single instruction has. Rather
    than a complicated scheme for two-word instructions, indirect
    addressing can be used to put those extra bits "somewhere else",
    conveniently and strategically located on the same page as the
    instruction. That makes for a simpler way of providing full
    memory access while maintaining a single fixed-length (and short)
    instruction format. Furthermore using indirect addressing to
    provide full memory access means that those addresses can be reused
    by several instructions on the same page, without having to pay
    the full price for additional uses.

    On the one remaining item - auto-increment for locations 10 to 17
    when used via indirect addressing, do I have that right? - I admit
    it is something of an architectural oddity. On the other hand it
    looks like it's pretty cheap to implement, and pragmatically very
    useful. That brings us to a key aspect of my comments. Most of
    my understanding of RISC comes almost entirely from reading the
    early writing that came out of the RISC group (which I did about
    the same time it came out, so 40 years ago give or take). My
    main takeaway was this: put in only those functionalities that
    carry their weight. Or, put more simply, No frills. (A good
    example of a frill is the evaluate polynomial instruction on
    the VAX.) This view explains why I would call the PDP-8 a
    RISC. The items you mention all carry their weight; none of
    them is a frill. Conversely, adding one or more additional
    accumulators, or providing more general indexing, would have
    added substantially to the architectural cost (and of course also
    the monetary cost). So the decision to provide only a single
    accumulator - in conjunction with other parts of the architecture -
    is a good fit to what I see as the essence of RISC, as explained
    by the people who first described and defined the term (not the
    concept necessarily, but the name RISC to refer to it).
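    Concretely, the addressing scheme described above can be modelled in a
    few lines of C. This is a sketch only, assuming the conventional PDP-8
    memory-reference encoding (3-bit opcode, indirect bit, current-page bit,
    7-bit offset) and the auto-index behaviour of locations 010-017; it is
    not a reference implementation.

    #include <stdint.h>

    #define MEM_WORDS 4096u                  /* 12-bit word-addressed memory */
    static uint16_t mem[MEM_WORDS];          /* each word holds 12 bits */

    static uint16_t effective_address(uint16_t pc, uint16_t instr)
    {
        uint16_t offset   = instr & 0177;        /* 7-bit page offset */
        int      cur_page = (instr >> 7) & 1;    /* 0 = page zero, 1 = current page */
        int      indirect = (instr >> 8) & 1;
        uint16_t addr     = (uint16_t)((cur_page ? (pc & 07600) : 0) | offset);

        if (indirect) {
            /* Auto-index: indirecting through 010..017 bumps the pointer first. */
            if (addr >= 010 && addr <= 017)
                mem[addr] = (uint16_t)((mem[addr] + 1) & 07777);
            addr = (uint16_t)(mem[addr] & 07777);
        }
        return addr;              /* where the operand is fetched or stored */
    }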

    As always, my aim is just to explain my point of view, not to
    convince anyone. I don't mind if people are persuaded, but
    my intention here has not been to persuade, just to explain.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Thu Jan 4 04:18:50 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    Surely people don't view the Itanium as being a RISC.

    What makes you think so?

    Perhaps I should have written that with a question mark:
    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Tim Rentsch on Thu Jan 4 13:04:13 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT
    instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, and indeed, it's implementations implemented
    the instructions without microcode (AFAIK) and typically with a
    single-cycle issue rate per functional unit. What makes you think it
    is not a RISC?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to BGB on Fri Jan 5 18:25:00 2024
    In article <un9alb$74e8$1@dont-email.me>, cr88192@gmail.com (BGB) wrote:

    IA64 had 3 instructions per 128-bit block, with some bits
    indicating how to process the other instructions. Typically, the
    instructions in the block would execute in parallel rather than
    serially (so it would take a big hit in code density if the code lacked sufficient ILP, as many of these spots would hold NOPs).

    Another code density hit was the common use of two instructions for
    functions where more conventional ISAs would use one. Advance load and
    check load, square root in steps rather than single instructions, things
    like that.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Sat Jan 6 09:30:30 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, [...]

    Begging the question.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Tim Rentsch on Sat Jan 6 18:01:10 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT
    instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, [...]

    Begging the question.

    Ok, let's make this more precise:

    |And I don't think so, either. That's because among the features that
    |the IBM 801 left away are instructions that access memory, but are not
    |just a load or a store, such as the EDIT instruction. OTOH, none of
    |the special features of IA-64 add instructions that access memory, but
    |are not just a load or a store, [...]

    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Sat Jan 6 14:13:45 2024
    Anton Ertl wrote:
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:
    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.
    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT
    instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, [...]
    Begging the question.

    Ok, let's make this more precise:

    |And I don't think so, either. That's because among the features that
    |the IBM 801 left away are instructions that access memory, but are not
    |just a load or a store, such as the EDIT instruction. OTOH, none of
    |the special features of IA-64 add instructions that access memory, but
    |are not just a load or a store, [...]

    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton

    I recall reading that the original HPPA left off MUL as it would take
    multiple clocks, violating their principle of 1 clock per instruction
    (and multi-cycle floating point was handled by a coprocessor).

    Hewlett-Packard Precision Architecture, Aug 1986 https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf

    "Single-Cycle Execution
    A primary design goal was that all functional computations in the basic instruction set could execute in one machine cycle in a pipelined implementation of the processor architecture.
    ...
    Complex operations that are necessary to support required software
    functions but cannot be implemented in a single execution cycle are
    broken down into primitive operations, each of which can be executed
    in a single cycle."

    They believed that, since most MULs were by smallish values,
    shift-add would be just as good. They found out they were wrong and
    added MUL back in.
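    The kind of replacement they had in mind looks roughly like the
    following C sketch (illustrative only): a multiply by a small constant
    becomes a couple of shifts and adds, and a general multiply falls back
    to a shift-and-add loop in a software multiply routine.

    #include <stdint.h>

    /* x * 10 == (x << 3) + (x << 1): the sort of thing a compiler emits
       inline for a constant multiplier. */
    static uint32_t mul10(uint32_t x)
    {
        return (x << 3) + (x << 1);
    }

    /* Generic shift-and-add multiply: one add per set bit of b. */
    static uint32_t mul_shift_add(uint32_t a, uint32_t b)
    {
        uint32_t result = 0;
        while (b) {
            if (b & 1)
                result += a;
            a <<= 1;
            b >>= 1;
        }
        return result;
    }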

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sun Jan 7 17:29:55 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton

    I recall reading that the original HPPA left off MUL as it would take multiple clocks, violating their principle of 1 clock per instruction
    (and multi-cycle floating point was handled by a coprocessor).

    Hewlett-Packard Precision Architecture, Aug 1986 https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf
    ...
    They believed that, since most MULs were by smallish values,
    shift-add would be just as good. They found out they were wrong and
    added MUL back in.

    On page 18 I find what I meant:

    |The architected assist instruction extensions include integer
    |multiply and divide functions for applications requiring higher
    |frequencies of multiplication and division.

    IIRC early HPPA implementations implemented these instructions by
    transferring the integer data to the FPU, using the multiplier or
    divider there, and transferring the result back to an integer
    register. At least I remember reading one paper that described it
    this way.

    It seems to me that the integer instruction set was first developed
    without considering the existence of an FPU, and then once they
    considered the FPU, they added the assist instruction extensions
    mentioned above. I wonder if there were any HPPA implementations that
    did not have these assist instruction extensions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Sun Jan 7 13:14:54 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton
    I recall reading that the original HPPA left off MUL as it would take
    multiple clocks, violating their principle of 1 clock per instruction
    (and multi-cycle floating point was handled by a coprocessor).

    Hewlett-Packard Precision Architecture, Aug 1986
    https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf
    ....
    They believed that, since most MULs were by smallish values,
    shift-add would be just as good. They found out they were wrong and
    added MUL back in.

    On page 18 I find what I meant:

    |The architected assist instruction extensions include integer
    |multiply and divide functions for applications requiring higher
    |frequencies of multiplication and division.

    IIRC early HPPA implementations implemented these instructions by
    transferring the integer data to the FPU, using the multiplier or
    divider there, and transferring the result back to an integer
    register. At least I remember reading one paper that described it
    this way.

    It seems to me that the integer instruction set was first developed
    without considering the existence of an FPU, and then once they
    considered the FPU, they added the assist instruction extensions
    mentioned above. I wonder if there were any HPPA implementations that
    did not have these assist instruction extensions.

    - anton

    It seems the first HW version had the assist instruction.
    The cover story is for the first implementation of HP-PA.
    The processor is used in both the HP 9000 Model 840 and HP 3000 Series 930.

    Hardware Design of the First HP Precision Architecture Computers, Mar-1987 https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1987-03.pdf

    "The HP 9000 Model 840 and the HP 3000 Series 930 are both based on the
    same processor, memory system, and I/O system. The processor consists of
    five printed circuit boards, each 8.4 by 11.3 inches, containing
    off-the-shelf TTL logic. It uses FAST TTL, 25-ns and 35-ns static RAMs,
    and 25-ns and 35-ns PALs. These five boards include the processor pipeline, which fetches and executes an instruction every 125 ns, a 4096-entry translation lookaside buffer (TLB) for high-speed address translation,
    and 128K bytes of cache memory. An additional (sixth) board contains the hardware floating-point coprocessor. Each board contains about 150 ICs."
    ...
    Execution Unit
    The E-unit (execution unit) performs arithmetic calculations on the
    operands. It executes the arithmetic instructions and creates the
    addresses for load and store instructions. It contains a 32-bit ALU (arithmetic logic unit) for arithmetic and logical calculations,
    a barrel shifter for shift instructions, and complex mask/merge circuitry
    for extracting and depositing bit strings. It also contains a preshifter
    on one input to the ALU. This is used in address calculations and for
    special instructions used in software multiply routines
    (the Model 840/Series 930 does not execute multiply instructions
    directly in hardware.)"
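
    As an aside, here is a minimal C sketch (illustrative only, not code from
    the article) of what such a software multiply routine boils down to for a
    small constant multiplier; the preshifter on the ALU input exists to make
    exactly this kind of step cheap:

    #include <stdint.h>

    /* Multiply by the constant 10 using only shifts and adds -- the style of
       sequence a software multiply routine would reduce to for small constant
       multipliers (illustrative sketch, not code from the article). */
    static uint32_t mul10(uint32_t x)
    {
        return (x << 3) + (x << 1);     /* 8*x + 2*x == 10*x */
    }

    A variable multiplier needs a loop of such conditional shift-and-add
    steps, which is the case where dropping MUL turned out to hurt.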

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Jan 8 11:49:29 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    IA-64 followed early RISC practice in leaving out integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    - anton

    I recall reading that the original HPPA left off MUL as it would take
    multiple clocks, violating their principle of 1 clock per instruction
    (and multi-cycle floating point was handled by a coprocessor).

    Hewlett-Packard Precision Architecture, Aug 1986
    https://www.hpl.hp.com/hpjournal/pdfs/IssuePDFs/1986-08.pdf
    ...
    They believed that since most MULs were by smallish values that
    shift-add would be just as good. They found out they were wrong and
    added MUL back in.

    On page 18 I find what I meant:

    |The architected assist instruction extensions include integer
    |multiply and divide functions for applications requiring higher
    |frequencies of multiplication and division.

    IIRC early HPPA implementations implemented these instructions by
    transferring the integer data to the FPU, using the multiplier or
    divider there, and transferring the result back to an integer
    register. At least I remember reading one paper that described it
    this way.

    It seems to me that the integer instruction set was first developed
    without considering the existence of an FPU, and then once they
    considered the FPU, they added the assist instruction extensions
    mentioned above. I wonder if there were any HPPA implementations that
    did not have these assist instruction extensions.

    - anton

    The Pentium (or the P6?) implemented MUL using the x87 multiplier, so
    with the added transport to and from the FPU part, it took about 10
    clock cycles. :-(

    DIV was (thankfully!) separate from FDIV, so it was slow (about 40 clock
    cycles for a 32-bit divisor running a bit/iteration sequencer) but correct.
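
    For reference, such a bit/iteration sequencer is essentially restoring
    division, one quotient bit per clock. A rough C sketch of the idea (my
    illustration, not Intel's actual hardware), using the x86 convention of a
    64-bit EDX:EAX dividend and a 32-bit divisor:

    #include <stdint.h>

    /* Restoring division, one quotient bit per iteration; 32 iterations for
       a 32-bit divisor, which is roughly where the ~40-cycle latency comes
       from. */
    static int div64by32(uint64_t dividend, uint32_t divisor,
                         uint32_t *quot, uint32_t *rem)
    {
        if (divisor == 0 || (dividend >> 32) >= divisor)
            return -1;                       /* x86 DIV would raise #DE here */

        uint64_t r = dividend >> 32;         /* partial remainder (EDX half) */
        uint32_t q = 0;
        for (int i = 31; i >= 0; i--) {
            r = (r << 1) | ((dividend >> i) & 1);  /* shift in next dividend bit */
            if (r >= divisor) {              /* trial subtract succeeds: bit = 1 */
                r -= divisor;
                q |= 1u << i;
            }
        }
        *quot = q;
        *rem  = (uint32_t)r;
        return 0;
    }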

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Tue Jan 9 06:36:42 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [Pentium]
    DIV was (thankfully!) separate from FDIV, so it was slow (about 40 clock
    cycles for a 32-bit divisor running a bit/iteration sequencer) but correct.

    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Tue Jan 9 10:45:24 2024
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [Pentium]
    DIV was (thankfully!) separate from FDIV, so it was slow (about 40 clock
    cycles for a 32-bit divisor running a bit/iteration sequencer) but correct.

    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.


    That is of course possible, but maybe not very likely?

    The integer part had 32-bit registers, so an integer DIV would
    concatenate EDX:EAX into a 64-bit source before dividing by a 32-bit
    source (that had to be larger than the EDX part) and still trigger one
    of the 5 missing SRT table entries. When the bug was originally found,
    it only generated wrong results in the last 4-5 bits, so most of these
    would still have given the same 32-bit result.

    OTOH, Tim Coe's magic test pair only needed two 7-digit integer divisor/dividend values to trigger a 1/256 final error!

    The fact that Tim was able to come up with this test pair purely on
    paper, with no PC to check it on, is really impressive IMHO.
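
    (If I recall the widely circulated pair correctly, it was 4195835 /
    3145727; a quick check in that spirit, as a C sketch -- the operand values
    are my recollection, not taken from this thread:)

    #include <stdio.h>

    int main(void)
    {
        /* Recalled published test pair; on a correct FPU the residual is
           essentially 0, on an affected Pentium it comes out around 256. */
        volatile double x = 4195835.0, y = 3145727.0;
        double r = x - (x / y) * y;
        printf("x - (x/y)*y = %g\n", r);
        return 0;
    }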

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Tue Jan 9 16:27:47 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.


    That is of course possible, but maybe not very likely?

    The integer part had 32-bit registers, so an integer DIV would
    concatenate EDX:EAX into a 64-bit source before dividing by a 32-bit
    source (that had to be larger than the EDX part) and still trigger one
    of the 5 missing SRT table entries. When the bug was originally found,
    it only generated wrong results in the last 4-5 bits, so most of these
    would still have given the same 32-bit result.

    The reason why I consider it more likely is because programs tend to
    crash or give very wrong results if an integer computation is wrong,
    because integer computations are used in addressing and for directing
    program flow. By contrast, if an FP computation is wrong, very few
    people notice (and the late discovery of the Pentium FDIV bug shows
    this); Seymour Cray decided to make his machines FP-divide quickly
    rather than precisely, because he knew his customers, and they indeed
    bought his machines. I don't think he did so with integer division.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Wed Jan 10 09:02:50 2024
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.


    That is of course possible, but maybe not very likely?

    The integer part had 32-bit registers, so an integer DIV would
    concatenate EDX:EAX into a 64-bit source before dividing by a 32-bit
    source (that had to be larger than the EDX part) and still trigger one
    of the 5 missing SRT table entries. When the bug was originally found,
    it only generated wrong results in the last 4-5 bits, so most of these
    would still have given the same 32-bit result.

    The reason why I consider it more likely is because programs tend to
    crash or give very wrong results if an integer computation is wrong,
    because integer computations are used in addressing and for directing
    program flow. By contrast, if an FP computation is wrong, very few
    people notice (and the late discovery of the Pentium FDIV bug shows
    this); Seymour Cray decided to make his machines FP-divide quickly
    rather than precisely, because he knew his customers, and they indeed
    bought his machines. I don't think he did so with integer division.

    OTOH, where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium timeline all compilers were already converting division by constant to reciprocal MUL, so it was only the few remaining variable divisor DIVs
    which remained, and those could only fail if you had one of the very few patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.
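
    As a concrete sketch of the reciprocal-MUL idea (illustrative, not any
    particular compiler's output): unsigned division by the constant 3 becomes
    a widening multiply and a shift, with no DIV executed at all.

    #include <stdint.h>

    /* 0xAAAAAAAB == (2^33 + 1)/3, so (x * 0xAAAAAAAB) >> 33 equals x/3
       exactly for every 32-bit unsigned x. */
    static uint32_t div3(uint32_t x)
    {
        return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
    }

    The same trick works for other constants (and for signed division) with a
    different magic constant and shift, which is why only the variable-divisor
    cases are left to reach the hardware divider.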

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Wed Jan 10 18:22:06 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Maybe they would have noticed the bug in the FP divider much earlier
    if they had used the FP divider for integer division.
    ...
    The reason why I consider it more likely is because programs tend to
    crash or give very wrong results if an integer computation is wrong,
    because integer computations are used in addressing and for directing
    program flow. By contrast, if an FP computation is wrong, very few
    people notice (and the late discovery of the Pentium FDIV bug shows
    this); Seymour Cray decided to make his machines FP-divide quickly
    rather than precisely, because he knew his customers, and they indeed
    bought his machines. I don't think he did so with integer division.

    OTOH, where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations?

    Conversion to strings (and BASE is, regrettably, a variable).

    I also see several uses of a division by the screen width (also not a constant).

    I also occasionally run a benchmark that spends most of its cycles in
    division IIRC.
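
    For instance, number-to-string conversion with a run-time base looks like
    this C sketch (illustrative, not Gforth's actual code); because BASE is a
    variable, the compiler cannot strength-reduce the division, so a real DIV
    runs for every digit:

    #include <stdio.h>

    /* Convert u to a string in the given base (2..36); with a variable base
       the / and % cannot be turned into reciprocal multiplies. */
    static char *to_string(unsigned long u, unsigned base, char *end)
    {
        *--end = '\0';
        do {
            unsigned digit = u % base;       /* variable divisor: stays a DIV */
            u /= base;
            *--end = "0123456789abcdefghijklmnopqrstuvwxyz"[digit];
        } while (u != 0);
        return end;
    }

    int main(void)
    {
        char buf[72];
        unsigned base = 16;                  /* read at run time in real code */
        puts(to_string(255, base, buf + sizeof buf));   /* prints "ff" */
        return 0;
    }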

    By the Pentium
    timeline all compilers were already converting division by constant to
    reciprocal MUL,

    Such a general claim is easy to disprove:

    [c8:~:533] vfx64
    VFX Forth 64 5.11 RC2 [build 0112] 2021-05-02 for Linux x64
    © MicroProcessor Engineering Ltd, 1998-2021

    : foo 3 / ; ok
    see foo
    FOO
    ( 004E3E60 B903000000 ) MOV ECX, # 00000003
    ( 004E3E65 488BC3 ) MOV RAX, RBX
    ( 004E3E68 4899 ) CQO
    ( 004E3E6A 48F7F9 ) IDIV RCX
    ( 004E3E6D 488BD8 ) MOV RBX, RAX
    ( 004E3E70 C3 ) RET/NEXT
    ( 17 bytes, 6 instructions )

    And no, they did not do it in 1993, either.

    But anyway, yes, you or I don't divide integers much, but among the
    millions of users of the Pentium, there were people who use division
    more frequently and for other things, and I expect that one of them
    would have noticed pretty early.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Jan 10 16:07:02 2024
    FWIW, I think these arguments remind me of the "No true Scotsman" kind
    of argument. It's even worse because everyone has their own (and
    often changing) definition of what is and what isn't "RISC".

    Luckily, it doesn't matter, because in practice being RISC or not is not
    a quality that affects anything practical, unlike performance,
    design costs, power consumption, compatibility, ...


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to terje.mathisen@tmsw.no on Wed Jan 10 20:03:28 2024
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium
    timeline all compilers were already converting division by constant to
    reciprocal MUL, so it was only the few remaining variable divisor DIVs
    which remained, and those could only fail if you had one of the very few
    patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by
    constant to reciprocal MUL?

    It was fairly well known that compiling integer-heavy code for i486
    would make it run faster on P5 than if compiled for P5. The speedup was
    necessarily code-specific, but on average was 3-4 percent.

    Somehow compiling for i486 allowed more use of the (simple) V
    pipeline. This trick worked on the original P5 through at least the
    (100Mhz) P54C. [Don't know if it worked on later P5 chips.]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Jan 11 01:59:45 2024
    According to Stefan Monnier <monnier@iro.umontreal.ca>:
    FWIW, I think these arguments remind me of the "No true Scotsman" kind
    of argument.

    Yeah. It occurs to me that part of the problem is that RISC is a process,
    not just a checklist of what's in an architecture.

    The R is key. Each of the projects we think of as RISC (Berkeley RISC,
    Stanford MIPS, IBM 801) were familiar with existing architectures and
    then started making tradeoffs to try to get better performance at
    lower design cost. They had less complexity in the hardware, usually
    balanced by more complexity in the compiler, with the less complex
    hardware allowing global performance increases like bigger cache,
    deeper pipeline, or faster clock rate.

    The PDP-8 wasn't like that. They started with the PDP-4, then asked
    themselves how much of this 18 bit machine can we cram into a 12 bit
    machine that we can afford to build and still be good enough to do
    useful work, ending up with the PDP-5. There were tradeoffs but of a
    very different kind.

    From the PDP-5 to -8 it was a reimplementation in faster cheaper
    logic, and put the program counter in flip flops rather than in core
    location zero.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Thu Jan 11 07:02:58 2024
    John Levine <johnl@taugh.com> writes:
    According to Stefan Monnier <monnier@iro.umontreal.ca>:
    FWIW, I think these arguments remind me of the "No true Scotsman" kind
    of argument.

    Yeah. It occurs to me that part of the problem is that RISC is a process,
    not just a checklist of what's in an architecture.

    The R is key. Each of the projects we think of as RISC (Berkeley RISC,
    Stanford MIPS, IBM 801) were familiar with existing architectures and
    then started making tradeoffs to try to get better performance at
    lower design cost. They had less complexity in the hardware, usually
    balanced by more complexity in the compiler, with the less complex
    hardware allowing global performance increases like bigger cache,
    deeper pipeline, or faster clock rate.

    Yes, but the results of these three projects had commonalities, and
    the other architectures that are commonly identified as RISCs shared
    many of these commonalities, while earlier architectures didn't.

    In 1980 Patterson and Ditzel ("The Case for the Reduced Instruction
    Set Computer") indeed did not give any criteria for what constitutes a
    RISC, which supports the process view.

    In 1985 Patterson wrote "Reduced instruction set computers" <https://dl.acm.org/doi/pdf/10.1145/2465.214917>, and there he
    identified 4 criteria:

    |1. Operations are register-to-register, with only LOAD and STORE
    | accessing memory. [...]
    |
    |2. The operations and addressing modes are reduced. Operations
    | between registers complete in one cycle, permitting a simpler,
    | hardwired control for each RISC, instead of
    | microcode. Multiple-cycle instructions such as floating-point
    | arithmetic are either executed in software or in a special-purpose
    | coprocessor. (Without a coprocessor, RISCs have mediocre
    | floating-point performance.) Only two simple addressing modes,
    | indexed and PC-relative, are provided. More complicated addressing
    | modes can be synthesized from the simple ones.
    |
    |3. Instruction formats are simple and do not cross word boundaries. [...]
    |
    |4. RISC branches avoid pipeline penalties. [... delayed branches ...]

    which I discussed in <2023Dec9.093314@mips.complang.tuwien.ac.at>,
    leaving mainly 1 and a relaxed version of 3. I also added

    5. register machine

    John Mashey has taken the criteria-based approach quite a bit further
    in his great postings on the question <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. Unfortunately,
    John Mashey ignored ARM.

    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    1. No (AFAIK), in particular not the indirect addressing

    2. No, but that criterion has not stood the test of time.

    3. Don't know, but that criterion has only partially stood the test of
    time.

    4. No, but that criterion has not stood the test of time.

    5. No.

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    The PDP-8 wasn't like that. They started with the PDP-4, then asked
    themselves how much of this 18 bit machine can we cram into a 12 bit
    machine that we can afford to build and still be good enough to do
    useful work, ending up with the PDP-5. There were tradeoffs but of a
    very different kind.

    Yes, a very good point in the process view. And looking at the
    descendants of the PDP-8 (Nova, 16-bit Eclipse, 32-bit Eclipse), you
    also see there that the process was not one that led to RISCs.

    Still, I think that a criteria-based way of classifying something as a
    RISC (or not) is more practical, because criteria are generally easier
    to determine than the process, and because the properties considered
    by the criteria are what the implementors of the architecture have to
    deal with and what the programmers play with.

    It would be interesting to take John Mashey's criteria and evaluate a
    few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
    I'll put that on my ToDo list.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to George Neuner on Thu Jan 11 10:31:11 2024
    George Neuner wrote:
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium
    timeline all compilers were already converting division by constant to
    reciprocal MUL, so it was only the few remaining variable divisor DIVs
    which remained, and those could only fail if you had one of the very few
    patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by constant to reciprocal MUL?

    It was fairly well known that compiling integer-heavy code for i486
    would make it run faster on P5 than if compiled for P5. The speedup was
    necessarily code-specific, but on average was 3-4 percent.

    Somehow compiling for i486 allowed more use of the (simple) V
    pipeline. This trick worked on the original P5 through at least the
    (100Mhz) P54C. [Don't know if it worked on later P5 chips.]

    That wasn't known to me!

    OTOH, while writing far too much hand-optimized Pentium code, I did
    almost completely limit myself to the instructions that a 486 could run
    in a single cycle, and then I would manually pair them up so that any of
    the harder opcodes (like shifts!) would be in the first (u) pipe and the simpler opcode would be in the second/less capable (v) pipe.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Fri Jan 12 00:48:47 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Yes, but the results of these three projects had commonalities, and
    the other architectures that are commonly identified as RISCs shared
    many of these commonalities, while earlier architectures didn't.

    True. But I think that was as much because they all started around the
    same time and were familiar with the same kinds of computers as because
    of any deep reason.

    |1. Operations are register-to-register, with only LOAD and STORE
    | accessing memory. [...]
    |
    |2. The operations and addressing modes are reduced. ...
    |
    |3. Instruction formats are simple and do not cross word boundaries. [...]
    |
    |4. RISC branches avoid pipeline penalties. [... delayed branches ...]

    5. register machine

    John Mashey has taken the criteria-based approach quite a bit further
    in his great postings on the question
    <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. Unfortunately,
    John Mashey ignored ARM.

    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    I'd say that it makes no sense to try to evaluate a 1960s single register machine to see whether it's a RISC. The IBM 650 had only one addressing
    mode, but it also used a spinning drum as main memory and the word size
    was 10 decimal digits. Was that a RISC? The question makes no sense.

    1. No (AFAIK), in particular not the indirect addressing

    Given that there's only one register, load/store architecture doesn't
    work. If you don't have an ADD instruction that references memory, how
    are you going to do any arithmetic?

    3. Don't know, but that criterion has only partially stood the test of
    time.

    The PDP-8 had one instruction format for all of the memory reference instructions, one for the operate and skip group, and one for I/O.
    It was pretty simple but it had to be.

    All of the standard instructions were a single word but the amazing
    680 TTY multiplexer, which could handle 64 TTY lines, scanning
    characters in and out a bit at a time, modified the CPU to add a
    three-word instruction that did one bit of line scanning. You could do
    that kind of stuff when each card in the machine had maybe one flip
    flop, and the backplane was wire wrapped.

    4. No, but that criterion has not stood the test of time.

    Pipeline? What's that?

    5. No.

    Well, it had one register.

    Still, I think that a criteria-based way of classifying something as a
    RISC (or not) is more practical, because criteria are generally easier
    to determine than the process, and because the properties considered
    by the criteria are what the implementors of the architecture have to
    deal with and what the programmers play with.

    The criteria are fine so long as you limit the scope to machines designed
    when the criteria make sense.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Fri Jan 12 01:33:45 2024
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    The PDP-8 had one instruction format for all of the memory reference
    instructions, one for the operate and skip group, and one for I/O.
    It was pretty simple but it had to be.

    All of the standard instructions were a single word but the amazing
    680 TTY multiplexer, which could handle 64 TTY lines, scanning
    characters in and out a bit at a time, modified the CPU to add a
    three-word instruction that did one bit of line scanning. You could do
    that kind of stuff when each card in the machine had maybe one flip
    flop, and the backplane was wire wrapped.

    As I understand it, the PDP-8 just put the entire instruction word on
    the bus so all cards saw it, and the card that decoded the instruction
    would respond. That's how the IOT instructions worked, anyway.

    Not sure how that could be used to support three word instructions,
    unless they were in the IOT space and the card could increment the
    PC over the backplane somehow.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Fri Jan 12 06:47:40 2024
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    I'd say that it makes no sense to try to evaluate a 1960s single register
    machine to see whether it's a RISC.

    Why not? If the result is what we would arrive at with other methods,
    it certainly makes sense. If the result is different, one may wonder
    whether the criteria are wrong or the other methods are wrong, and, as
    a result, may gain additional insights.

    The IBM 650 had only one addressing
    mode, but it also used a spinning drum as main memory and the word size
    was 10 decimal digits. Was that a RISC? The question makes no sense.

    Again, why not. Maybe, like I added the register-machine criterion
    that Patterson had not considered in 1985 because the machines that he
    compared were all register machines, one might want to add criteria
    about random-access memory (I expect that the drum memory resulted in
    each instruction having a next-instruction field, right?) and binary
    data.

    1. No (AFAIK), in particular not the indirect addressing

    Given that there's only one register, load/store architecture doesn't
    work. If you don't have an ADD instruction that references memory, how
    are you going to do any arithmetic?

    So definitely "No".

    3. Don't know, but that criterion has only partially stood the test of
    time.

    The PDP-8 had one instruction format for all of the memory reference
    instructions, one for the operate and skip group, and one for I/O.
    It was pretty simple but it had to be.

    All of the standard instructions were a single word but the amazing
    680 TTY multiplexer, which could handle 64 TTY lines, scanning
    characters in and out a bit at a time, modified the CPU to add a
    three-word instruction that did one bit of line scanning. You could do
    that kind of stuff when each card in the machine had maybe one flip
    flop, and the backplane was wire wrapped.

    So: Yes for the base instruction set, no with the 680 TTY multiplexer.
    In a way like RISC-V, which is "yes" for the base instruction set,
    "no" with the C extension, and it has provisions for longer
    instruction encodings.

    4. No, but that criterion has not stood the test of time.

    Pipeline? What's that?

    A technique that was used in the ILLIAC II (1962), in the IBM Stretch (1961)
    and the CDC 6600 (1964). But that's not an architectural criterion,
    except the existence of branch-delay slots.

    5. No.

    Well, it had one register.

    Does that make it a register machine? Ok, the one register has all
    the purposes that registers have in that architecture, so one can
    argue that it is a general-purpose register. However, as far as the
    way to use it is concerned, the point of a register machine is that
    you have multiple GPRs so that the programmer or compiler can just use
    another one when one is occupied, and if you run out, you spill and
    refill. So one register is definitely too few to make it a register
    machine.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Fri Jan 12 09:21:48 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    It would be interesting to take John Mashey's criteria and evaluate a
    few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
    I'll put that on my ToDo list.

    Ok, let's see. Taking his criteria from <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. It is worth
    reading the posting in full (in particular, he explains why certain
    features are not in RISC architectures), but here I cite the parts
    that are the criteria he listed.

    |MOST RISCs:
    |
    | 3.
    | a. Have 1 size of instruction in an instruction stream
    | b. And that size is 4 bytes
    | c. Have a handful (1-4) addressing modes) (* it is VERY hard to count
    | these things; will discuss later).
    | d. Have NO indirect addressing in any form (i.e., where you need one
    | memory access to get the address of another operand in memory)
    | 4.
    | a. Have NO operations that combine load/store with arithmetic, i.e.,
    | like add from memory, or add to memory. (note: this means
    | especially avoiding operations that use the value of a load as
    | input to an ALU operation, especially when that operation can
    | cause an exception. Loads/stores with address modification can
    | often be OK as they don't have some of the bad effects)
    | b. Have no more than 1 memory-addressed operand per instruction
    | 5.
    | a. Do NOT support arbitrary alignment of data for loads/stores
    | b. Use an MMU for a data address no more than once per instruction
    | c. Have >=5 bits per integer register specifier
    | d. Have >= 4 bits per FP register specifier
    [...]
    |So, here's a table:
    |
    | * C: number of years since first implementation sold in this family (or
    | first thing which with this is binary compatible). Note: this table was
    | first done in 1991, so year = 1991-(age in table).
    | * 3a: # instruction sizes
    | * 3b: maximum instruction size in bytes
    | * 3c: number of distinct addressing modes for accessing data (not jumps)
    | I didn't count register or literal, but only ones that referenced
    | memory, and I counted different formats with different offset sizes
    | separately. This was hard work ... Also, even when a machine had
    | different modes for register-relative and PC_relative addressing, I
    | counted them only once.
    | * 3d: indirect addressing (0 - no, 1 - yes)
    | * 4a: load/store combined with arithmetic (0 - no, 1 - yes)
    | * 4b: maximum number of memory operands
    | * 5a: unaligned addressing of memory references allowed in load/store,
    | without specific instructions ( 0 - no never [MIPS, SPARC, etc], 1 -
    | sometimes [as in RS/6000], 2 - just about any time)
    | * 5b: maximum number of MMU uses for data operands in an instruction
    | * 6a: number of bits for integer register specifier
    | * 6b:number of bits for 64-bit or more FP register specifier, distinct
    | from integer registers
    [...]
    |So, here's a table of 12 implementations of various architectures, one per
    |architecture, with the attributes above. [...] I'm going to draw a line
    |between H1 and L4 (obviously, the RISC-CISC Line), and also, at the head of
    |each column, I'm going to put a rule, which, in that column, most of the
    |RISCs obey. Any RISC that does not obey it is marked with a +; any CISC
    |that DOES obey it is marked with a *. So ...

    ... here's the table, with the entries for Clipper, i960KB and CDC6600
    inserted and those for the additional instruction sets appended
    (starting with ARM1):

    CPU   Age     3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   #ODD
          (1991)
    RULE  <6      =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
    A1    4       1    4    1    0    0    1    0    1    8    3+   1     AMD 29K
    B1    5       1    4    1    0    0    1    0    1    5    4    -     MIPS R2000
    C1    2       1    4    2    0    0    1    0    1    5    4    -     SPARC V7
    D1    2       1    4    3    0    0    1    0    1    5    0+   1     MC88000
                                                                                RISC
    E1    5       1    4    10+  0    0    1    0    1    5    4    1     HP PA
    F1    5       2+   4    1    0    0    1    0    1    4+   3+   3     IBM RT/PC
    G1    1       1    4    4    0    0    1    1    1    5    5    -     IBM RS/6000
    H1    2       1    4    4    0    0    1    0    1    5    4    -     Intel i860

    J1    5       4+   8+   9+   0    0    1    0    2    4+   3+   5     Clipper
    K1    3       2+   8+   9+   0    0    1    2+   -    5    3+   5     Intel 960KB
    Q1    27+     2+   4    1    0    0    1    0    1    3+   3+   4     CDC 6600

    L4    26      4    8    2*   0*   1    2    2    4    4    2    2     IBM 3090
    M2    12      12   12   15   0*   1    2    2    4    3    3    1     Intel i486
    N1    10      21   21   23   1    1    2    2    4    3    3    -     NSC 32016
                                                                                CISC
    O3    11      11   22   44   1    1    2    2    8    4    3    -     MC 68040
    P3    13      56   56   22   1    1    6    2    24   4    0    -     VAX

          6+      1    4    7+   0    0    1    0    1    4+   -    3/8   ARM1
          -10     1    5+   1    0    0    1    0    1    7    7    1/10  IA-64
          -12     2+   4    7+   0    0    1    1    2+   4+   5    4/7   ARMv7 T32
          -12     15+  15+  7+   0    1+   2+   1    4+   4+   4    7/4   AMD64
          -22     1    4    15+  0    0    1    1    2+   5    5    2/9   ARM A64
          -28     2+   4    1    0    0    1    1    2+   5    5    2/9   RV64GC

    Notes:

    I did not want to pre-classify the architectures, but used + for all
    the criteria that Mashey considered non-RISC (and consequently nothing
    rather than * for those that he considered RISC). The ODD column
    contains two values: first the number of criteria according to which
    the architecture is non-RISC, then the number according to which it is RISC.

    CDC6600: there are some mistakes in the entry that I fixed, and I also
    added +s to those cases that did not satisfy Mashey's RISC criteria.
    I classified 30-bit instructions as 4-byte instructions. I used
    Mashey's entry for 5b.

    Apparently you could buy some ARM1 in 1985 as an add-on board, but the
    wide release came with ARM2 in 1987. I used the 1985 date; Anyway,
    the Age criterion is just ageism.

    IA-64: Specifying instruction size is interesting, but in any case, it
    does not satisfy 3b

    In ARMv7 T32 is required, and the M profile only includes T32.
    Alignment always works for some loads, and never for others, resulting
    in 2 MMU accesses being required for some memory accesses. But the
    same has been true for PowerPC long before ARMv6 required unaligned
    accesses to work. There is VFP[345]-D16 with only 16 FP registers on
    some implementations, but that would also satisfy 6b; most use 32
    64-bit FP registers.

    AMD64 can have instructions up to 15 bytes (with prefixes and all).
    All addressing modes with displacement are counted twice according to
    Mashey's rule, RIP-relative is not counted extra. The MOVS
    instruction has two memory operands; I did not count REP MOVS/STOS
    etc. as separate, although they arguably are; anyway, AMD64 is
    non-RISC according to 4b. SSE2 instructions require alignment; that
    blunder was fixed in AVX, but base AMD64 only has SSE2.

    ARM A64: I used the availability date of the iPhone 5s in 2013 for
    Age. A64 has not only interesting offset options, but also addressing
    modes that sign/zero-extend the index operand; how to count them? In
    any case, A64 does not satisfy 3c. I counted the one memory operand
    of a load/store-pair instruction as one memory operand. There are a
    very few cases where unaligned accesses produce an alignment fault, so
    I gave 1 for 5a. As usual, unaligned accesses may cause two MMU uses.

    It's hard to get a date for RISC-V, with things like the Rocket in
    2016, but the document with the ratified architecture parts of RV64GC
    only became available in 2019, so I used the latter year for the age. 5a:
    atomics are required to be aligned; everything else may be unaligned.

    So, looking at the table, ARM1, ARMv7 T32, ARM A64, and RV64GC are
    more RISC than non-RISC according to these criteria, and AMD64 is more
    non-RISC than RISC, which is exactly what I would have said without
    this table. So Mashey's criteria still seem to be mostly valid 32
    years later.

    However, I would not classify the CDC6600 as RISC, because it does not
    have general-purpose registers. It can be seen as a precursor,
    though, and the fact that it shares many of the RISC criteria shows
    that despite changing technology, when you want to design an
    architecture for performance, the architectural features you design in
    and especially those that you don't design in have stayed similar
    across 59 years.

    Some features, though, have been designed into relatively recent
    architectures despite making implementation difficult and despite being
    classified as non-RISC by Mashey; in particular, allowing unaligned
    accesses has won, and consequently they all require up to 2 MMU uses
    per instruction for data operands.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Anton Ertl on Fri Jan 12 13:55:02 2024
    This is a reposting of <2024Jan12.102148@mips.complang.tuwien.ac.at>
    with some corrections: I added Alpha, and made IA-64 corrections.

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    It would be interesting to take John Mashey's criteria and evaluate a
    few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
    I'll put that on my ToDo list.

    Ok, let's see. Taking his criteria from <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. It is worth
    reading the posting in full (in particular, he explains why certain
    features are not in RISC architectures), but here I cite the parts
    that are the criteria he listed.

    |MOST RISCs:
    |
    | 3.
    | a. Have 1 size of instruction in an instruction stream
    | b. And that size is 4 bytes
    | c. Have a handful (1-4) addressing modes) (* it is VERY hard to count
    | these things; will discuss later).
    | d. Have NO indirect addressing in any form (i.e., where you need one
    | memory access to get the address of another operand in memory)
    | 4.
    | a. Have NO operations that combine load/store with arithmetic, i.e.,
    | like add from memory, or add to memory. (note: this means
    | especially avoiding operations that use the value of a load as
    | input to an ALU operation, especially when that operation can
    | cause an exception. Loads/stores with address modification can
    | often be OK as they don't have some of the bad effects)
    | b. Have no more than 1 memory-addressed operand per instruction
    | 5.
    | a. Do NOT support arbitrary alignment of data for loads/stores
    | b. Use an MMU for a data address no more than once per instruction
    | c. Have >=5 bits per integer register specifier
    | d. Have >= 4 bits per FP register specifier
    [...]
    |So, here's a table:
    |
    | * C: number of years since first implementation sold in this family (or
    | first thing which with this is binary compatible). Note: this table was
    | first done in 1991, so year = 1991-(age in table).
    | * 3a: # instruction sizes
    | * 3b: maximum instruction size in bytes
    | * 3c: number of distinct addressing modes for accessing data (not jumps)
    | I didn't count register or literal, but only ones that referenced
    | memory, and I counted different formats with different offset sizes
    | separately. This was hard work ... Also, even when a machine had
    | different modes for register-relative and PC_relative addressing, I
    | counted them only once.
    | * 3d: indirect addressing (0 - no, 1 - yes)
    | * 4a: load/store combined with arithmetic (0 - no, 1 - yes)
    | * 4b: maximum number of memory operands
    | * 5a: unaligned addressing of memory references allowed in load/store,
    | without specific instructions ( 0 - no never [MIPS, SPARC, etc], 1 -
    | sometimes [as in RS/6000], 2 - just about any time)
    | * 5b: maximum number of MMU uses for data operands in an instruction
    | * 6a: number of bits for integer register specifier
    | * 6b:number of bits for 64-bit or more FP register specifier, distinct
    | from integer registers
    [...]
    |So, here's a table of 12 implementations of various architectures, one per
    |architecture, with the attributes above. [...] I'm going to draw a line
    |between H1 and L4 (obviously, the RISC-CISC Line), and also, at the head of
    |each column, I'm going to put a rule, which, in that column, most of the
    |RISCs obey. Any RISC that does not obey it is marked with a +; any CISC
    |that DOES obey it is marked with a *. So ...

    ... here's the table, with the entries for Clipper, i960KB and CDC6600
    inserted and those for the additional instruction sets appended
    (starting with ARM1):

    CPU   Age     3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   #ODD
          (1991)
    RULE  <6      =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
    A1    4       1    4    1    0    0    1    0    1    8    3+   1     AMD 29K
    B1    5       1    4    1    0    0    1    0    1    5    4    -     MIPS R2000
    C1    2       1    4    2    0    0    1    0    1    5    4    -     SPARC V7
    D1    2       1    4    3    0    0    1    0    1    5    0+   1     MC88000
                                                                                RISC
    E1    5       1    4    10+  0    0    1    0    1    5    4    1     HP PA
    F1    5       2+   4    1    0    0    1    0    1    4+   3+   3     IBM RT/PC
    G1    1       1    4    4    0    0    1    1    1    5    5    -     IBM RS/6000
    H1    2       1    4    4    0    0    1    0    1    5    4    -     Intel i860

    J1    5       4+   8+   9+   0    0    1    0    2    4+   3+   5     Clipper
    K1    3       2+   8+   9+   0    0    1    2+   -    5    3+   5     Intel 960KB
    Q1    27+     2+   4    1    0    0    1    0    1    3+   3+   4     CDC 6600

    L4    26      4    8    2*   0*   1    2    2    4    4    2    2     IBM 3090
    M2    12      12   12   15   0*   1    2    2    4    3    3    1     Intel i486
    N1    10      21   21   23   1    1    2    2    4    3    3    -     NSC 32016
                                                                                CISC
    O3    11      11   22   44   1    1    2    2    8    4    3    -     MC 68040
    P3    13      56   56   22   1    1    6    2    24   4    0    -     VAX

          6+      1    4    7+   0    0    1    0    1    4+   -    3/8   ARM1
          -1      1    4    1    0    0    1    0    1    5    5    0/11  Alpha
          -10     2+   10+  1    0    0    1    0    1    7    7    2/9   IA-64
          -12     2+   4    7+   0    0    1    1    2+   4+   5    4/7   ARMv7 T32
          -12     15+  15+  7+   0    1+   2+   1    4+   4+   4    7/4   AMD64
          -22     1    4    15+  0    0    1    1    2+   5    5    2/9   ARM A64
          -28     2+   4    1    0    0    1    1    2+   5    5    2/9   RV64GC

    Notes:

    I did not want to pre-classify the architectures, but used + for all
    the criteria that Mashey considered non-RISC (and consequently nothing
    rather than * for those that he considered RISC). The ODD column
    contains two values: first the number of criteria according to which
    the architecture is non-RISC, then the number according to which it is RISC.

    CDC6600: there are some mistakes in the entry that I fixed, and I also
    added +s to those cases that did not satisfy Mashey's RISC criteria.
    I classified 30-bit instructions as 4-byte instructions. I used
    Mashey's entry for 5b.

    Apparently you could buy some ARM1 in 1985 as an add-on board, but the
    wide release came with ARM2 in 1987. I used the 1985 date; Anyway,
    the Age criterion is just ageism.

    IA-64: There are instructions that occupy 2/3 of a bundle, while most
    occupy 1/3, so I counted two instruction sizes. Specifying the
    maximum instruction size is interesting (how do you count the 5 extra
    bits in a bundle?), but in any case, IA-64 does not satisfy 3b.

    In ARMv7 T32 is required, and the M profile only includes T32.
    Alignment always works for some loads, and never for others, resulting
    in 2 MMU accesses being required for some memory accesses. But the
    same has been true for PowerPC long before ARMv6 required unaligned
    accesses to work. There is VFP[345]-D16 with only 16 FP registers on
    some implementations, but that would also satisfy 6b; most use 32
    64-bit FP registers.

    AMD64 can have instructions up to 15 bytes (with prefixes and all).
    All addressing modes with displacement are counted twice according to
    Mashey's rule, RIP-relative is not counted extra. The MOVS
    instruction has two memory operands; I did not count REP MOVS/STOS
    etc. as separate, although they arguably are; anyway, AMD64 is
    non-RISC according to 4b. SSE2 instructions require alignment; that
    blunder was fixed in AVX, but base AMD64 only has SSE2.

    ARM A64: I used the availability date of the iPhone 5s in 2013 for
    Age. A64 has not only interesting offset options, but also addressing
    modes that sign/zero-extend the index operand; how to count them? In
    any case, A64 does not satisfy 3c. I counted the one memory operand
    of a load/store-pair instruction as one memory operand. There are a
    very few cases where unaligned accesses produce an alignment fault, so
    I gave 1 for 5a. As usual, unaligned accesses may cause two MMU uses.

    It's hard to get a date for RISC-V, with things like the Rocket in
    2016, but the document with the ratified architecture parts of RV64GC
    only became available in 2019, so I used the latter year for the age. 5a:
    atomics are required to be aligned; everything else may be unaligned.

    So, looking at the table, ARM1, ARMv7 T32, ARM A64, and RV64GC are
    more RISC than non-RISC according to these criteria, and AMD64 is more
    non-RISC than RISC, which is exactly what I would have said without
    this table. So Mashey's criteria still seem to be mostly valid 32
    years later.

    However, I would not classify the CDC6600 as RISC, because it does not
    have general-purpose registers. It can be seen as a precursor,
    though, and the fact that it shares many of the RISC criteria shows
    that despite changing technology, when you want to design an
    architecture for performance, the architectural features you design in
    and especially those that you don't design in have stayed similar
    across 59 years.

    Some features, though, have been designed into relatively recent
    architectures despite making implementation difficult and despite being
    classified as non-RISC by Mashey; in particular, allowing unaligned
    accesses has won, and consequently they all require up to 2 MMU uses
    per instruction for data operands.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Fri Jan 12 12:27:45 2024
    Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    According to Stefan Monnier <monnier@iro.umontreal.ca>:
    FWIW, I think these arguments remind me of the "No true Scotsman" kind
    of argument.
    Yeah. It occurs to me that part of the problem is that RISC is a process,
    not just a checklist of what's in an architecture.

    The R is key. Each of the projects we think of as RISC (Berkeley RISC,
    Stanford MIPS, IBM 801) were familiar with existing architectures and
    then started making tradeoffs to try to get better performance at
    lower design cost. They had less complexity in the hardware, usually
    balanced by more complexity in the compiler, with the less complex
    hardware allowing global performance increases like bigger cache,
    deeper pipeline, or faster clock rate.

    Yes, but the results of these three projects had commonalities, and
    the other architectures that are commonly identified as RISCs shared
    many of these commonalities, while earlier architectures didn't.

    In 1980 Patterson and Ditzel ("The Case for the Reduced Instruction
    Set Computer") indeed did not give any criteria for what constitutes a
    RISC, which supports the process view.

    In 1985 Patterson wrote "Reduced instruction set computers" <https://dl.acm.org/doi/pdf/10.1145/2465.214917>, and there he
    identified 4 criteria:

    |1. Operations are register-to-register, with only LOAD and STORE
    | accessing memory. [...]
    |
    |2. The operations and addressing modes are reduced. Operations
    | between registers complete in one cycle, permitting a simpler,
    | hardwired control for each RISC, instead of
    | microcode. Multiple-cycle instructions such as floating-point
    | arithmetic are either executed in software or in a special-purpose
    | coprocessor. (Without a coprocessor, RISCs have mediocre
    | floating-point performance.) Only two simple addressing modes,
    | indexed and PC-relative, are provided. More complicated addressing
    | modes can be synthesized from the simple ones.
    |
    |3. Instruction formats are simple and do not cross word boundaries. [...]
    |
    |4. RISC branches avoid pipeline penalties. [... delayed branches ...]

    which I discussed in <2023Dec9.093314@mips.complang.tuwien.ac.at>,
    leaving mainly 1 and a relaxed version of 3. I also added

    5. register machine

    John Mashey has taken the criteria-based approach quite a bit further
    in his great postings on the question <https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC>. Unfortunately,
    John Mashey ignored ARM.

    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    1. No (AFAIK), in particular not the indirect addressing

    2. No, but that criterion has not stood the test of time.

    3. Don't know, but that criterion has only partially stood the test of
    time.

    4. No, but that criterion has not stood the test of time.

    5. No.

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    The PDP-8 wasn't like that. They started with the PDP-4, then asked
    themselves how much of this 18 bit machine can we cram into a 12 bit
    machine that we can afford to build and still be good enough to do
    useful work, ending up with the PDP-5. There were tradeoffs but of a
    very different kind.

    Yes, a very good point in the process view. And looking at the
    descendants of the PDP-8 (Nova, 16-bit Eclipse, 32-bit Eclipse), you
    also see there that the process was not one that led to RISCs.

    Still, I think that a criteria-based way of classifying something as a
    RISC (or not) is more practical, because criteria are generally easier
    to determine than the process, and because the properties considered
    by the criteria are what the implementors of the architecture have to
    deal with and what the programmers play with.

    It would be interesting to take John Mashey's criteria and evaluate a
    few more recent architectures: Alpha, IA-64, AMD64, ARM A64, RV64GC.
    I'll put that on my ToDo list.

    - anton

    The basic difference between RISC and CISC is that, with some exceptions,
    CISC cores are each a single monolithic state machine serially executing
    multiple states for each instruction, whereas RISC is a bunch of concurrent
    state machines with handshakes between them, and the ISAs for each of
    these are designed from those different points of view. Now, after the fact,
    some CISCs added limited HW concurrency, but the ISAs were designed from
    the monolithic point of view.

    Although it doesn't say this, the Patterson and Ditzel paper gives criteria
    to guide HW designers used to looking at the problem from the monolithic
    view into how to decompose the problem from a concurrent view.

    If you look at MIPS-1, RISC-1 and ARM-1, the instructions and their
    formats are all chosen to fit nicely into the concurrent HW view.

    If one looks at it this way, then one can assess how well an ISA would fit
    into that RISC concurrent HW model. It's not whether it has a single
    accumulator register, but what effect a single accumulator has
    on HW concurrency - it creates RAW, WAW and WAR dependencies that don't
    need to exist and are costly to eliminate later in HW.
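
    A small illustration of that point in C (the instruction schedules in the
    comments are my sketches, not any particular machine's code): evaluating
    (a+b)+(c+d) through a single accumulator serializes every step, while
    three registers leave the two inner adds independent.

    /* accumulator style:                three-register style:
         ACC  = a                          r1 = a + b      <- independent
         ACC += b                          r2 = c + d      <- independent
         t1   = ACC   (spill)              r3 = r1 + r2    <- the only join
         ACC  = c
         ACC += d
         ACC += t1
       Every accumulator step has a RAW dependence on the previous one, and
       reusing ACC for the second sum adds WAR/WAW hazards if reordered; the
       three-register form has none of that.                               */
    int sum4(int a, int b, int c, int d)
    {
        int r1 = a + b;
        int r2 = c + d;
        return r1 + r2;
    }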

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Fri Jan 12 18:13:44 2024
    Anton Ertl wrote:

    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    If we apply the criteria of Patterson (and mine) to the PDP-8, we get:

    I don't know enough about the PDP-8 to evaluate it by John Mashey's criteria.

    I'd say that it makes no sense to try to evaluate a 1960s single register
    machine to see whether it's a RISC.

    Why not? If the result is what we would arrive at with other methods,
    it certainly makes sense. If the result is different, one may wonder
    whether the criteria are wrong or the other methods are wrong, and, as
    a result, may gain additional insights.

    The IBM 650 had only one addressing
    mode, but it also used a spinning drum as main memory and the word size
    was 10 decimal digits. Was that a RISC? The question makes no sense.

    Again, why not. Maybe, like I added the register-machine criterion
    that Patterson had not considered in 1985 because the machines that he compared were all register machines, one might want to add criteria
    about random-access memory (I expect that the drum memory resulted in
    each instruction having a next-instruction field, right?) and binary
    data.

    1. No (AFAIK), in particular not the indirect addressing

    Given that there's only one register, load/store architecture doesn't
    work. If you don't have an ADD instruction that references memory, how
    are you going to do any arithmetic?

    So definitely "No".

    3. Don't know, but that criterion has only partially stood the test of
    time.

    The PDP-8 had one instruction format for all of the memory reference
    instructions, one for the operate and skip group, and one for I/O.
    It was pretty simple but it had to be.

    All of the standard instructions were a single word but the amazing
    680 TTY multiplexer, which could handle 64 TTY lines, scanning
    characters in and out a bit at a time, modified the CPU to add a
    three-word instruction that did one bit of line scanning. You could do
    that kind of stuff when each card in the machine had maybe one flip
    flop, and the backplane was wire wrapped.

    So: Yes for the base instruction set, no with the 680 TTY multiplexer.
    In a way like RISC-V, which is "yes" for the base instruction set,
    "no" with the C extension, and it has provisions for longer
    instruction encodings.

    4. No, but that criterion has not stood the test of time.

    Pipeline? What's that?

    A technique that was used in the ILLIAC II (1962), in the IBM Stretch (1961)
    and the CDC 6600 (1964). But that's not an architectural criterion,
    except the existence of branch-delay slots.


    The CDC 6600 was concurrent but not pipelined;*
    the CDC 7600 was pipelined.

    (*) Instruction fetch was pipelined but calculations were not.

    5. No.

    Well, it had one register.

    Does that make it a register machine? Ok, the one register has all
    the purposes that registers have in that architecture, so one can
    argue that it is a general-purpose register. However, as far as the
    way to use it is concerned, the point of a register machine is that
    you have multiple GPRs so that the programmer or compiler can just use another one when one is occupied, and if you run out, you spill and
    refill. So one register is definitely too few to make it a register
    machine.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to All on Fri Jan 12 18:41:50 2024
    From:: https://homepages.cwi.nl/%7Erobertl/mash/RISCvsCISC

    1 2 3 4 5 6 7 8 9 1011 12 13 14 15 16 1718 1920 21 22
    r r
    r r r +d1 +d1
    r r r r r r r+ +d+d1 I +s
    r r r +d+x +s s+ s+ s+ +d +d r+ +dI I I +s I
    r +d +x+s >r>r >r r+ -r a a r+ -r +x +s I I +s +s+d2 +d2 +d2
    AMD 29K 1 . . . . . . . . . . . . . . . . . . . . .
    Rxxx . 1 . . . . . . . . . . . . . . . . . . . .
    SPARC . 1 1. . . . . . . . . . . . . . . . . . .
    88K . 1 1 1 . . . . . . . . . . . . . . . . . .
    HP PA . 2 1 1 4 1 1 . . . . . . . . . . . . . . .
    ROMP 1 2 . . . . . . . . . . . . . . . . . . . .
    POWER . 1 1. 1 1 . . . . . . . . . . . . . . . .
    i860 . 1 1. 1 1 . . . . . . . . . . . . . . . .
    Swrdfish1 1 1. . . . . . 1. . . . . . . . . . . .
    ARM 2 2 . 2 1. 1 1 1 . . . . . . . . . . . . .
    Clipper 1 3 1. . . . 1 1 2. . . . . . . . . . . .
    i960KB 1 1 1 1 . . . . . 2 2 . . 1 . . . . . . . .
    . . . . . . . . . . . . . . . . . . . . . .
    S/360 . 1 . . . . . . . . . . . 1 . . . . . . . .
    i486 1 3 1 1 . . . 1 1 2. . . 2 3 . . . . . . .
    NSC32K . 3 . . . . . 1 1 3 3 . . . 3 . . . . 9 . .
    MC68000 1 1 . . . . . 1 1 2. . . 2 . . . . . . . .
    MC68020 1 1 . . . . . 1 1 2. . . 2 4 . . . . . 16 16
    VAX 1 3 . 1 . . . 1 1 1 1 1 1 . 3 1 3 1 3 . . .

    My 66000 . 3 . 1 1 . . . . . . . . . 2 . . . . . . .

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to Anton Ertl on Fri Jan 12 18:50:44 2024
    Anton Ertl wrote:

    Just adding My 66000 data::


    CPU   Age  3a   3b   3c   3d  4a  4b  5a  5b  6a   6b  #ODD
         (1991)
    RULE   <6  =1   =4   <5   =0  =0  =1  <2  =1  >4   >3
    A1      4   1    4    1    0   0   1   0   1   8   3+     1   AMD 29K
    B1      5   1    4    1    0   0   1   0   1   5    4     -   MIPS R2000
    C1      2   1    4    2    0   0   1   0   1   5    4     -   SPARC V7
    D1      2   1    4    3    0   0   1   0   1   5   0+     1   MC88000

    ??      ?   5   20    4    0   0   1   2   1   5    0     ?   My 66000
                                                                   RISC
    E1      5   1    4  10+    0   0   1   0   1   5    4     1   HP PA
    F1      5  2+    4    1    0   0   1   0   1  4+   3+     3   IBM RT/PC
    G1      1   1    4    4    0   0   1   1   1   5    5     -   IBM RS/6000
    H1      2   1    4    4    0   0   1   0   1   5    4     -   Intel i860

    J1      5  4+   8+   9+    0   0   1   0   2  4+   3+     5   Clipper
    K1      3  2+   8+   9+    0   0   1  2+   -   5   3+     5   Intel 960KB
    Q1    27+  2+    4    1    0   0   1   0   1  3+   3+     4   CDC 6600

    L4     26   4    8   2*   0*   1   2   2   4   4    2     2   IBM 3090
    M2     12  12   12   15   0*   1   2   2   4   3    3     1   Intel i486
    N1     10  21   21   23    1   1   2   2   4   3    3     -   NSC 32016  CISC
    O3     11  11   22   44    1   1   2   2   8   4    3     -   MC 68040
    P3     13  56   56   22    1   1   6   2  24   4    0     -   VAX

           6+   1    4   7+    0   0   1   0   1  4+    -   3/8   ARM1
          -10   1   5+    1    0   0   1   0   1   7    7  1/10   IA-64
          -12  2+    4   7+    0   0   1   1  2+  4+    5   4/7   ARMv7 T32
          -12 15+  15+   7+    0  1+  2+   1  4+  4+    4   7/4   AMD64
          -22   1    4  15+    0   0   1   1  2+   5    5   2/9   ARM A64
          -28  2+    4    1    0   0   1   1  2+   5    5   2/9   RV64GC

    ??      ?   5   20    4    0   0   1   2   1   5    0     ?   My 66000

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to George Neuner on Sat Jan 13 10:44:10 2024
    George Neuner <gneuner2@comcast.net> schrieb:
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium >>timeline all compilers were already converting division by constant to >>reciprocal MUL, so it was only the few remaining variable divisor DIVs >>which remained, and those could only fail if you had one of the very few >>patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by constant to reciprocal MUL?

    I've had a look at the gcc 2.4.5 sources (around when the Pentium came
    out), and it seems it didn't do it.
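    [For concreteness, a minimal sketch, mine and not gcc's, of the
    reciprocal-MUL transformation under discussion: an unsigned divide by
    the constant 10 replaced by a multiply with a precomputed fixed-point
    reciprocal and a shift. The function name is mine.]

    /* Hedged sketch of division-by-constant via reciprocal multiply.
     * 0xCCCCCCCD is ceil(2^35 / 10); for every 32-bit unsigned x,
     * (x * 0xCCCCCCCD) >> 35 equals x / 10 exactly (cf. Hacker's Delight,
     * the chapter on integer division by constants). */
    #include <assert.h>
    #include <stdint.h>

    static uint32_t div10(uint32_t x)
    {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    }

    int main(void)
    {
        for (uint64_t x = 0; x <= 0xFFFFFFFFu; x += 97)   /* spot check */
            assert(div10((uint32_t)x) == (uint32_t)x / 10);
        return 0;
    }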

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Sat Jan 13 18:33:57 2024
    Thomas Koenig wrote:
    George Neuner <gneuner2@comcast.net> schrieb:
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium
    timeline all compilers were already converting division by constant to
    reciprocal MUL, so it was only the few remaining variable divisor DIVs
    which remained, and those could only fail if you had one of the very few >>> patterns (leading bits) in the divisor that we had to check for in the
    FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by
    constant to reciprocal MUL?

    I've had a look at the gcc 2.4.5 sources (around when the Pentium came
    out), and it seems it didn't do it.

    I guess I'm mixing up my own (re-)discovery of the technique which I
    then promptly used in a couple(*) of my most favorite asm algorithms and
    the timing of when it became standard for compiled code.

    *) Those were the julian day number to Y-M-D and the unsigned binary to
    ascii conversions.
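    [A minimal sketch, my reconstruction rather than Terje's asm, of the
    second idiom: unsigned binary to decimal ASCII with the per-digit divide
    by 10 replaced by a reciprocal multiply. The function name is mine.]

    /* Hedged sketch: 32-bit unsigned to decimal ASCII; 0xCCCCCCCD is
     * ceil(2^35 / 10), so the multiply+shift is an exact divide by 10. */
    #include <stdint.h>
    #include <stdio.h>

    static char *u32_to_ascii(uint32_t x, char buf[11])
    {
        char *p = buf + 10;
        *p = '\0';
        do {
            uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
            *--p = (char)('0' + (x - q * 10));   /* remainder from the quotient */
            x = q;
        } while (x != 0);
        return p;
    }

    int main(void)
    {
        char buf[11];
        puts(u32_to_ascii(4294967295u, buf));    /* prints 4294967295 */
        return 0;
    }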

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Jan 13 21:31:12 2024
    According to EricP <ThatWouldBeTelling@thevillage.com>:
    The basic difference between RISC and CISC is that, with some exceptions, >CISC cores are all a single monolithic state machines serially executing >multiple states for each instruction, whereas RISC is a bunch of concurrent >state machines with handshakes between them, and the ISA's for each of
    these is designed from those different points of view. Now after the fact >some CISC's added limited HW concurrency but the ISA's were designed from
    the monolithic point of view. ...

    That's a great insight. It's easy to see how stuff like multiple registers, and load/store memory references follow from that.

    You can also see why a single register machine would be hopeless even
    if the instruction set is as simple as a PDP-8, because everything
    interlocks on that register.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup@21:1/5 to EricP on Sat Jan 13 22:40:33 2024
    EricP wrote:


    The basic difference between RISC and CISC is that, with some exceptions, CISC cores are all a single monolithic state machines serially executing multiple states for each instruction,

    I can tell you that 68000, 68010, 68020, 68030 all used 3 microcode pointers simultaneously, 1 running the address section, 1 running the Data section, and 1 running the Fetch-Decode section. The 3 pointers could access different lines in µcode and have the 3 reads ORed together as they exit µstore.

    whereas RISC is a bunch of concurrent state machines with handshakes between them, and the ISA's for each of
    these is designed from those different points of view. Now after the fact some CISC's added limited HW concurrency but the ISA's were designed from
    the monolithic point of view.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Sun Jan 14 11:14:03 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    George Neuner <gneuner2@comcast.net> schrieb:
    On Wed, 10 Jan 2024 09:02:50 +0100, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    ..., where are you using DIV in integer code?

    Some modulus operations, a few perspective calculations? By the Pentium >>>> timeline all compilers were already converting division by constant to >>>> reciprocal MUL, so it was only the few remaining variable divisor DIVs >>>> which remained, and those could only fail if you had one of the very few >>>> patterns (leading bits) in the divisor that we had to check for in the >>>> FDIV workaround.

    I.e. it could well have gone undetected for a long time.

    Terje

    But were i486 compilers at that time routinely converting division by
    constant to reciprocal MUL?

    I've had a look at the gcc 2.4.5 sources (around when the Pentium came
    out), and it seems it didn't do it.

    I guess I'm mixing up my own (re-)discovery of the technique which I
    then promptly used in a couple(*) of my most favorite asm algorithms and
    the timing of when it became standard for compiled code.

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bill Findlay@21:1/5 to John Levine on Sun Jan 14 11:47:33 2024
    On 13 Jan 2024, John Levine wrote
    (in article <unuvf0$qko$1@gal.iecc.com>):

    According to EricP<ThatWouldBeTelling@thevillage.com>:
    The basic difference between RISC and CISC is that, with some exceptions, CISC cores are all a single monolithic state machines serially executing multiple states for each instruction, whereas RISC is a bunch of concurrent state machines with handshakes between them, and the ISA's for each of these is designed from those different points of view. Now after the fact some CISC's added limited HW concurrency but the ISA's were designed from the monolithic point of view. ...

    That's a great insight. It's easy to see how stuff like multiple registers, and load/store memory references follow from that.

    You can also see why a single register machine would be hopeless even
    if the instruction set is as simple as a PDP-8, because everything
    interlocks on that register.

    While respecting the caveat "with some exceptions",
    KDF9 (designed 1960-62) was made of as many as 24
    concurrently running state machines.

    --
    Bill Findlay

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to MitchAlsup on Sun Jan 14 14:30:51 2024
    MitchAlsup wrote:
    EricP wrote:


    The basic difference between RISC and CISC is that, with some exceptions,
    CISC cores are all a single monolithic state machines serially executing
    multiple states for each instruction,

    I can tell you that 68000, 68010, 68020, 68030 all used 3 microcode
    pointers
    simultaneously, 1 running the address section, 1 running the Data
    section, and
    1 running the Fetch-Decode section. The 3 pointers could access
    different lines
    in µcode and have the 3 reads ORed together as they exit µstore.

    Well... I don't know about the others but for the 68000 and its 68008
    brother any such concurrency is extremely limited. It had I think an 8 byte instruction prefetch queue, and Decode could look up the next microcode
    start address in parallel with the end of a prior macro instruction.
    But it has *only one micro sequencer* and can only execute one macro instruction at once.

    And while it does have some parallel resources like separate address and
    data registers and separate address and data buses, their use is sequenced
    by microcode so any concurrent use is *hard coded* into a micro sequence.

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared, so it takes multiple sequential cycles
    to perform those overlapped address and data movements anyway.
    (Move the high data and low address at once, then low data and high address,
    so it takes 2 cycles to move an address and data pair instead of 4).

    But the single data bus means that each data operand register needs
    separate sequential access, so 4 cycles for 2 operands,
    then 2 cycles to write the result.

    68000 used a 2 level control store of microcode and nanocode as a means
    of compressing the microword. Basically they took a flat, wide uWord,
    and moved the control bits common to multiple uWords out to a single
    nano word, then put the address of the nWord in the uWord.
    The sequential micro sequencer addresses the uCode ROM which addresses
    the nCode ROM which drives the execute function unit.
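    [A rough sketch, with entirely hypothetical field names and widths rather
    than Motorola's actual layout, of the two-level control store just
    described: each uWord handles sequencing and points at a shared nWord
    holding the control bits common to many uWords.]

    /* Hedged sketch of a two-level (micro + nano) control store. */
    #include <stdint.h>

    typedef struct {
        uint16_t next_uaddr;    /* sequencing: next microword or branch target */
        uint16_t naddr;         /* index into the nanocode ROM */
        uint8_t  branch_cond;   /* condition select for micro-branches */
    } uword_t;

    typedef struct {
        uint32_t control_bits;  /* datapath controls shared by many uWords */
    } nword_t;

    /* One control-store step: the uWord selects sequencing and a nWord;
     * the nWord's control bits drive the execution unit.  Pipeline
     * registers at the ROM outputs would delay micro-branch effects. */
    uint32_t control_step(const uword_t *urom, const nword_t *nrom,
                          uint16_t uaddr, uint16_t *next_uaddr)
    {
        const uword_t *u = &urom[uaddr];
        *next_uaddr = u->next_uaddr;
        return nrom[u->naddr].control_bits;
    }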

    It is possible to add pipeline stage registers in such designs,
    at the uCode ROM and nCode ROM outputs to access ROMs concurrently,
    and apparently the 68000 has such ROM pipeline registers.
    But the inherently serial nature of uCode means you wind up with uCode
    branch delay slots for each new pipeline stage, *two* in this case,
    so a conditional uCode branch performed at time T1 takes effect at T4.
    Any uCode words that might be folded into the branch delay slots would be
    from the same macro instruction sequence, otherwise they are NOPs.

    So yes there are opportunities for overlapping some actions.
    But it is limited and hard coded into the microcode sequence.
    And it is one microcode sequence for one macro instruction at once.

    whereas RISC is a bunch of
    concurrent
    state machines with handshakes between them, and the ISA's for each of
    these is designed from those different points of view. Now after the fact
    some CISC's added limited HW concurrency but the ISA's were designed from
    the monolithic point of view.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jan 14 19:38:19 2024
    According to Bill Findlay <findlaybill@blueyonder.co.uk>:
    On 13 Jan 2024, John Levine wrote
    (in article <unuvf0$qko$1@gal.iecc.com>):

    According to EricP<ThatWouldBeTelling@thevillage.com>:
    That's a great insight. It's easy to see how stuff like multiple registers, >> and load/store memory references follow from that.

    You can also see why a single register machine would be hopeless even
    if the instruction set is as simple as a PDP-8, because everything
    interlocks on that register.

    While respecting the caveat "with some exceptions",
    KDF9 (designed 1960-62) was made of as many as 24
    concurrently running state machines.

    I believe you but from what I can see it had hardware stacks and 16
    index registers, so it was hardly a single register machine.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Thomas Koenig on Sun Jan 14 19:50:05 2024
    On Sun, 14 Jan 2024 11:14:03 +0000, Thomas Koenig wrote:

    Division code is often scary,
    Long division's downright scary.

    Shouldn't one of the occurrences of "scary", most
    probably the first, be replaced by "hairy"?

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Quadibloc@21:1/5 to Quadibloc on Sun Jan 14 19:54:59 2024
    On Sun, 14 Jan 2024 19:50:05 +0000, Quadibloc wrote:

    On Sun, 14 Jan 2024 11:14:03 +0000, Thomas Koenig wrote:

    Division code is often scary,
    Long division's downright scary.

    Shouldn't one of the occurrences of "scary", most
    probably the first, be replaced by "hairy"?

    I finally did locate the original source, and indeed
    my guess was correct.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Quadibloc on Sun Jan 14 20:45:03 2024
    Quadibloc <quadibloc@servername.invalid> schrieb:
    On Sun, 14 Jan 2024 11:14:03 +0000, Thomas Koenig wrote:

    Division code is often scary,
    Long division's downright scary.

    Shouldn't one of the occurrences of "scary", most
    probably the first, be replaced by "hairy"?

    Yes, I mistyped that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bill Findlay@21:1/5 to John Levine on Sun Jan 14 23:48:16 2024
    On 14 Jan 2024, John Levine wrote
    (in article <uo1d7b$284u$1@gal.iecc.com>):

    According to Bill Findlay<findlaybill@blueyonder.co.uk>:
    On 13 Jan 2024, John Levine wrote
    (in article <unuvf0$qko$1@gal.iecc.com>):

    According to EricP<ThatWouldBeTelling@thevillage.com>:
    That's a great insight. It's easy to see how stuff like multiple registers,
    and load/store memory references follow from that.

    You can also see why a single register machine would be hopeless even
    if the instruction set is as simple as a PDP-8, because everything interlocks on that register.

    While respecting the caveat "with some exceptions",
    KDF9 (designed 1960-62) was made of as many as 24
    concurrently running state machines.

    I believe you but from what I can see it had hardware stacks and 16
    index registers, so it was hardly a single register machine.

    More to the point, it was not a RISC.

    --
    Bill Findlay

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Mon Feb 12 20:19:26 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Stefan Monnier on Mon Feb 12 20:44:29 2024
    Stefan Monnier <monnier@iro.umontreal.ca> writes:

    FWIW, I think these arguments remind me of the "No true Scottsman"
    kind of argument. [...]

    They shouldn't. Different people here have expressed different
    ideas, but each person has expressed more or less definite ideas.
    The essential element of "No true Scotsman" is that whatever the
    distinguishing property or quality is supposed to be is never
    identified, and cannot be, because it is chosen after the fact to
    make the "prediction" be correct. That's not what's happening in
    the RISC discussions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Anton Ertl on Mon Feb 12 20:25:30 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    surely people don't view the Itanium as being a RISC?

    It has a lot of RISC characteristics, in particular, it's a
    load/store architecture with a large number of general-purpose
    registers (and for the other registers, there are also many of
    them, avoiding the register allocation problems that compilers
    tend to have with unique registers). In 1999 I wrote
    <http://www.complang.tuwien.ac.at/anton/ia-64-1999.txt>:

    |It's basically a RISC with lots of special features:

    I think the same description could be said of the IBM System/360.
    I don't think of System/360 as a RISC, even if a subset of it
    might be.

    And I don't think so, either. That's because some of the features
    that the IBM 801 left away are non-RISC features, such as the EDIT
    instruction. OTOH, none of the special features of IA-64 are
    particularly non-RISCy, [...]

    Begging the question.

    Ok, let's make this more precise:

    |And I don't think so, either. That's because among the features that
    |the IBM 801 left away are instructions that access memory, but are not
    |just a load or a store, such as the EDIT instruction. OTOH, none of
    |the special features of IA-64 add instructions that access memory, but
    |are not just a load or a store, [...]

    IA-64 followed early RISC practice in leaving away integer and FP
    division (a feature which is interestingly already present in
    commercial RISC, and, I think HPPA, as well as the 88000 and Power,
    but in the integer case not in the purist Alpha).

    It appears we have different sources for our respective ideas
    of what qualities or properties are the essential elements of
    RISC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Feb 13 15:22:39 2024
    Tim Rentsch wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    Even in DFP you can keep the mantissa in binary, in which case the
    problem is exactly the same (modulo some minor differences in how to round).

    Assuming DPD (BCD/base 1000 more or less) you could still do division
    with an approximate reciprocal and a loop, or via a sufficiently precise reciprocal.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Tim Rentsch on Tue Feb 13 07:52:30 2024
    On 2/12/2024 8:19 PM, Tim Rentsch wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    Shouldn't one of those two "scary"s be a "hairy"?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Robert Finch on Tue Feb 13 17:51:13 2024
    Robert Finch wrote:

    On 2024-02-13 2:35 a.m., BGB wrote:
    On 2/12/2024 10:19 PM, Tim Rentsch wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

     From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    One thing I have noted is that for floating-point division via
    Newton-Raphson, there is often a need for a first stage that converges
    less aggressively.


    Say, for most stages in finding the reciprocal, one can do:
      y=y*(2.0-x*y);
    But, this may not always converge, so one might need to do a first stage
    of, say:
      y=y*((2.0-x*y)*0.375+0.625);

    Where, one may generate the initial guess as, say:
      *(uint64_t *)(&y)=0x7FE0000000000000ULL-(*(uint64_t *)(&x));

    ...

    Without something to nudge the value closer than the initial guess, the
    iteration might sometimes become unstable and "fly off into space"
    instead of converging towards the answer.

    ...

    Could an initial guess come from estimating the reciprocal, then doing a multiply, then doing the NR iterations?

    Goldschmidt division generally starts with 9 bits from the HoBs of the
    denominator and gets 11 bits from a table indexed by those HoBs. Then
    the first multiply is known to have 8 bits of precision. We use 11 bits
    here so that the first multiplication drives the denominator into
    [8'B01111111..8'B100000000]. A 9-bit-in, 9-bit-out table generates the
    range [8'B01111111..8'B10000001], which has oscillatory convergence,
    whereas the 9-bit-in, 11-bit-out table always converges from the same side.
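    [Pulling the quoted pieces together, a minimal sketch, mine and not the
    posters', of the Newton-Raphson reciprocal under discussion: the bit-trick
    initial guess and the damped first step are taken from the post above;
    the function name and iteration count are my choices.]

    /* Hedged sketch of a Newton-Raphson reciprocal: bit-trick initial guess,
     * one damped first iteration, then standard y = y*(2 - x*y) steps. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    double nr_recip(double x)
    {
        uint64_t xb, yb;
        double y;

        memcpy(&xb, &x, sizeof xb);          /* read the bits of x */
        yb = 0x7FE0000000000000ULL - xb;     /* crude exponent-negating guess */
        memcpy(&y, &yb, sizeof y);

        y = y * ((2.0 - x * y) * 0.375 + 0.625);   /* damped first stage */
        for (int i = 0; i < 4; i++)                /* full-rate NR stages */
            y = y * (2.0 - x * y);
        return y;
    }

    int main(void)
    {
        printf("%.17g\n", nr_recip(3.0));    /* approximately 1/3 */
        return 0;
    }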

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Tim Rentsch on Tue Feb 13 19:21:40 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    ... which Hacker's Delight also describes (especially the
    signed version, which is not quite as straightforward
    as the unsigned version).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Tim Rentsch on Tue Feb 13 10:45:56 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    They shouldn't. Different people here have expressed different
    ideas, but each person has expressed more or less definite ideas.
    The essential element of "No true Scotsman" is that whatever the distinguishing property or quality is supposed to be is never
    identified, and cannot be, because it is chosen after the fact to
    make the "prediction" be correct. That's not what's happening in
    the RISC discussions.

    I had impression from John
    https://en.wikipedia.org/wiki/John_Cocke
    that 801/risc was to do the opposite of the failed Future System effort http://www.jfsowa.com/computer/memo125.htm https://people.computing.clemson.edu/~mark/fs.html
    but there is also account of some risc work overlapping with FS https://www.ibm.com/history/risc
    In the early 1970s, telephone calls didn't instantly bounce between
    handheld devices and cell towers. Back then, the connection process
    required human operators to laboriously plug cords into the holes of a switchboard. Come 1974, a team of IBM researchers led by John Cocke set
    out in search of ways to automate the process. They envisioned a
    telephone exchange controller that would connect 300 calls per second (1 million per hour). Hitting that mark would require tripling or even
    quadrupling the performance of the company's fastest mainframe at the
    time -- which would require fundamentally reimagining high-performance computing.

    ... snip ...

    End of 70s, 801/risc Iliad chip was going to be microprocessor for
    running 370 (for low&mid range 370 computers) & other architecture
    emulators ... effort floundered and even had some 801/risc engineers
    leaving IBM for other vendor risc efforts (like AMD 29k).

    801/risc ROMP chip was going to be for the displaywriter followon
    ... but when that was killed, they decided to pivot to unix workstation
    market ... and got the company that did PC/IX for the IBM/PC to do port
    for ROMP ... announced as AIX for PC/RT.

    Then there was six chip RIOS for RS/6000 ... we were doing HA/6000
    originally for NYTimes to move their newspaper system (ATEX) off
    VAXcluster to RS/6000. I renamed it HA/CMP when we started doing
    technical/scientific cluster scaleup with national labs and commercial
    cluster scaleup with RDBMS vendors (oracle, sybase, informix,
    ingres). At the time 801/risc didn't have cache coherency for
    multiprocessor scaleup.

    The executive we were reporting to, then went over to head up Somerset
    ... single chip for AIM
    https://en.wikipedia.org/wiki/AIM_alliance https://en.wikipedia.org/wiki/IBM_Power_microprocessors https://en.wikipedia.org/wiki/Motorola_88000
    In the early 1990s Motorola joined the AIM effort to create a new RISC architecture based on the IBM POWER architecture. They worked a few
    features of the 88000 (such as a compatible bus interface[10]) into the
    new PowerPC architecture to offer their customer base some sort of
    upgrade path. At that point the 88000 was dumped as soon as possible

    ... snip ...

    https://en.wikipedia.org/wiki/PowerPC https://en.wikipedia.org/wiki/IBM_Power_microprocessors#PowerPC
    After two years of development, the resulting PowerPC ISA was introduced
    in 1993. A modified version of the RSC architecture, PowerPC added single-precision floating point instructions and general
    register-to-register multiply and divide instructions, and removed some
    POWER features. It also added a 64-bit version of the ISA and support
    for SMP.

    trivia, telco work postdates ACS/360 ... folklore is IBM killed the
    effort because they were afraid that it would advance the
    state-of-the-art too fast and they would lose control of the market
    ... also references features that would show up more than two decades
    later in the 90s with ES/9000 https://people.computing.clemson.edu/~mark/acs_end.html

    trivia2: We had early Jan92 meeting with Oracle CEO Ellison and
    AWD/Hester where Hester tells Ellison HA/CMP would have 16-way clusters
    by mid92 and 128-way clusters by ye92. Then end of Jan92, the official
    Kingston supercomputer group pivots and HA/CMP cluster scaleup is
    transferred to Kingston (for announce as IBM supercomputer for technical/scientific *ONLY*) and we are told we can't work on anything
    with more than four processors. Then Computerworld news 17feb1992 (from
    wayback machine) ... IBM establishes laboratory to develop parallel
    systems (pg8)
    https://archive.org/details/sim_computerworld_1992-02-17_26_7

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to BGB on Wed Feb 14 07:49:57 2024
    BGB <cr88192@gmail.com> writes:

    On 2/12/2024 10:19 PM, Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    One thing I have noted is that for floating-point division via Newton-Raphson, there is often a need for a first stage that
    converges less aggressively.

    Right. It's important to be inside the radius of convergence
    before using a more accelerating form that is also less stable.

    Say, for most stages in finding the reciprocal, one can do:
    y=y*(2.0-x*y);
    But, this may not always converge, so one might need to do a
    first stage of, say:
    y=y*((2.0-x*y)*0.375+0.625);

    Where, one may generate the initial guess as, say:
    *(uint64_t *)(&y)=0x7FE0000000000000ULL-(*(uint64_t *)(&x));

    The left hand side of this assignment has undefined behavior,
    by virtue of violating effective type rules.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Wed Feb 14 07:53:17 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    Even in DFP you can keep the mantissa in binary, in which case the
    problem is exactly the same (modulo som minor differences in how
    to round).

    Assuming DPD (BCD/base 1000 more or less) you could still do
    division with an approximate reciprocal and a loop, or via a
    sufficiently precise reciprocal.

    In some sense this is part of the point I'm making, namely,
    division isn't hard if we don't mind using a simple algorithm.
    It's only if we insist on it being fast that division becomes
    complicated.
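    [For the record, a minimal sketch, mine, of the straightforward base-2
    algorithm being alluded to: a restoring shift-and-subtract divider that
    produces one quotient bit per step, with no guessing and no reciprocal.
    The function name is mine.]

    /* Hedged sketch: restoring (shift-and-subtract) unsigned division.
     * One quotient bit per iteration; the caller must ensure d != 0. */
    #include <assert.h>
    #include <stdint.h>

    static void udivmod64(uint64_t n, uint64_t d, uint64_t *q, uint64_t *r)
    {
        uint64_t quo = 0, rem = 0;
        for (int i = 63; i >= 0; i--) {
            rem = (rem << 1) | ((n >> i) & 1);  /* bring down the next bit */
            if (rem >= d) {                     /* trial subtraction succeeds */
                rem -= d;
                quo |= 1ULL << i;
            }
        }
        *q = quo;
        *r = rem;
    }

    int main(void)
    {
        uint64_t q, r;
        udivmod64(1000000007ULL, 97ULL, &q, &r);
        assert(q == 1000000007ULL / 97 && r == 1000000007ULL % 97);
        return 0;
    }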

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Lynn Wheeler on Wed Feb 14 08:17:54 2024
    Lynn Wheeler <lynn@garlic.com> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    They shouldn't. Different people here have expressed different
    ideas, but each person has expressed more or less definite ideas.
    The essential element of "No true Scotsman" is that whatever the
    distinguishing property or quality is supposed to be is never
    identified, and cannot be, because it is chosen after the fact to
    make the "prediction" be correct. That's not what's happening in
    the RISC discussions.

    I had impression from John
    https://en.wikipedia.org/wiki/John_Cocke
    that 801/risc was to do the opposite of the failed Future System
    effort
    http://www.jfsowa.com/computer/memo125.htm https://people.computing.clemson.edu/~mark/fs.html
    but there is also account of some risc work overlapping with FS https://www.ibm.com/history/risc [...]

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragraphs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Wed Feb 14 17:37:13 2024
    Tim Rentsch wrote:

    Lynn Wheeler <lynn@garlic.com> writes:



    I had impression from John
    https://en.wikipedia.org/wiki/John_Cocke
    that 801/risc was to do the opposite of the failed Future System
    effort
    http://www.jfsowa.com/computer/memo125.htm
    https://people.computing.clemson.edu/~mark/fs.html
    but there is also account of some risc work overlapping with FS
    https://www.ibm.com/history/risc [...]

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Architecture is as much about what you leave out as what you put in.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer architecture).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to BGB on Wed Feb 14 10:24:01 2024
    BGB <cr88192@gmail.com> writes:

    On 2/14/2024 9:49 AM, Tim Rentsch wrote:

    BGB <cr88192@gmail.com> writes:

    On 2/12/2024 10:19 PM, Tim Rentsch wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    [... division ...]

    From Hacker's Delight:

    I think I shall never envision
    an op unlovely as division.

    An op whose answer must be guessed
    and then, through multiply, assessed.

    An op for which we dearly pay,
    In cycles wasted every day.

    Division code is often scary,
    Long division's downright scary.

    The proofs can overtax your brain,
    The ceiling and floor may drive you insane.

    Good code to divide takes a Knuthian hero,
    but even God can't divide by zero!

    Division in base 2 is quite straightforward.

    One thing I have noted is that for floating-point division via
    Newton-Raphson, there is often a need for a first stage that
    converges less aggressively.

    Right. It's important to be inside the radius of convergence
    before using a more accelerating form that is also less stable.

    Not sure how big the radius is, only that the first-stage
    approximation may fall outside of it...


    Say, for most stages in finding the reciprocal, one can do:
    y=y*(2.0-x*y);
    But, this may not always converge, so one might need to do a
    first stage of, say:
    y=y*((2.0-x*y)*0.375+0.625);

    Where, one may generate the initial guess as, say:
    *(uint64_t *)(&y)=0x7FE0000000000000ULL-(*(uint64_t *)(&x));

    The left hand side of this assignment has undefined behavior,
    by virtue of violating effective type rules.

    Yeah, but:
    Relying on the underlying representation of 'double' is UB;

    No, it isn't. The representation of double, along with every
    other type, is implementation-defined. Furthermore the bit-level representation of double can be verified at startup with code
    that is portable and 100% safe. These two properties mean
    relying on the representation of double is a whole other kettle
    of fish than violating effective type rules.
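    [As a concrete illustration of both points, a minimal sketch, mine: a
    startup check of the expected IEEE-754 binary64 bit patterns, and the
    upthread bit trick done through memcpy so that no effective type rule
    is violated. Function names are mine.]

    /* Hedged sketch: verify the bit-level representation of double at
     * startup, then manipulate the bits via memcpy rather than a pointer
     * cast (memcpy copies the object representation, which is defined). */
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    static uint64_t double_bits(double d)
    {
        uint64_t u;
        memcpy(&u, &d, sizeof u);
        return u;
    }

    static double bits_double(uint64_t u)
    {
        double d;
        memcpy(&d, &u, sizeof d);
        return d;
    }

    int main(void)
    {
        /* startup verification of the representation */
        assert(sizeof(double) == sizeof(uint64_t));
        assert(double_bits(1.0)  == 0x3FF0000000000000ULL);
        assert(double_bits(-2.0) == 0xC000000000000000ULL);

        /* the initial-guess trick from upthread, without type punning */
        double x = 3.0;
        double y = bits_double(0x7FE0000000000000ULL - double_bits(x));
        (void)y;
        return 0;
    }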

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Tim Rentsch on Wed Feb 14 13:26:05 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer architecture).

    one of the final nails in the FS coffin was a study by the IBM Houston
    Science Center: if 370/195 apps were redone for an FS machine made out of
    the fastest available technology, they would have the throughput of a
    370/145 (about a factor of 30 drop in throughput).

    during the FS period, which was completely different than 370 and was
    going to completely replace it, internal politics was killing off 370
    efforts ... the lack of new 370 during the period is credited with giving
    clone 370 makers their market foothold. When FS finally imploded there
    was a mad rush getting stuff back into the 370 product pipelines.

    trivia: I continued to work on 360&370 stuff all through FS period, even periodically ridiculing what they were doing (drawing analogy with a
    long running cult film playing at theater down the street in central
    sq), which wasn't exactly career enhancing activity ... it was as if
    there was nobody that bothered to think about how all the "wonderful"
    stuff might actually be implemented (or even if it was possible).

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lynn Wheeler on Thu Feb 15 00:59:08 2024
    Lynn Wheeler wrote:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    one of the final nails in the FS coffin was study by the IBM Houston
    Science Center if 370/195 apps were redone for FS machine made out of
    the fastest available technology, they would have throughput of 370/145 (about fractor of 30 times drop in throughput).

    during the FS period, which was completely different than 370 and was
    going to completely replace it, internal politics was killing off 370
    efforts ... the lack of new 370 during the period is credited with giving clone 370 makers their market foothold. when FS finally implodes there
    as mad rush getting stuff back into the 370 product pipelines.

    trivia: I continued to work on 360&370 stuff all through FS period, even periodically ridiculing what they were doing (drawing analogy with a
    long running cult film playing at theater down the street in central
    sq), which wasn't exactly career enhancing activity ... it was as if
    there was nobody that bothered to think about how all the "wonderful"
    stuff might actually be implemented (or even if was possible).

    Sounds similar to Intel 432 ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Thu Feb 15 21:26:00 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    Lynn Wheeler wrote:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    one of the final nails in the FS coffin was study by the IBM Houston
    Science Center if 370/195 apps were redone for FS machine made out of
    the fastest available technology, they would have throughput of 370/145
    (about fractor of 30 times drop in throughput).

    during the FS period, which was completely different than 370 and was
    going to completely replace it, internal politics was killing off 370
    efforts ... the lack of new 370 during the period is credited with giving
    clone 370 makers their market foothold. when FS finally implodes there
    as mad rush getting stuff back into the 370 product pipelines.

    trivia: I continued to work on 360&370 stuff all through FS period, even
    periodically ridiculing what they were doing (drawing analogy with a
    long running cult film playing at theater down the street in central
    sq), which wasn't exactly career enhancing activity ... it was as if
    there was nobody that bothered to think about how all the "wonderful"
    stuff might actually be implemented (or even if was possible).

    Sounds similar to Intel 432 ...

    A page with a bunch of links on IBM future systems:

    https://people.computing.clemson.edu/~mark/fs.html#:~:text=The%20IBM%20Future%20System%20(FS,store%20with%20automatic%20data%20management.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Brett on Fri Feb 16 09:36:05 2024
    Brett <ggtgp@yahoo.com> writes:
    A page with a bunch of links on IBM future systems:

    https://people.computing.clemson.edu/~mark/fs.html#:~:text=The%20IBM%20Future%20System%20(FS,store%20with%20automatic%20data%20management.

    trivia: upthread post I also mention web page https://people.computing.clemson.edu/~mark/fs.html
    and Smotherman references archive of my posts that mention future system http://www.garlic.com/~lynn/subtopic.html#futuresys
    but around two decades ago, I split subtopic.html web page into several,
    now
    http://www.garlic.com/~lynn/submain.html#futuresys

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Lynn Wheeler on Sat Feb 17 15:34:28 2024
    Lynn Wheeler <lynn@garlic.com> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    one of the final nails in the FS coffin was study by the IBM
    Houston Science Center if 370/195 apps were redone for FS machine
    made out of the fastest available technology, they would have
    throughput of 370/145 (about fractor of 30 times drop in
    throughput). [...]

    Looks like they should have called it Back to the Future Systems. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Sun Feb 18 17:51:22 2024
    Tim Rentsch wrote:

    Lynn Wheeler <lynn@garlic.com> writes:

    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    In case people are interested, I think the rest of the book is
    worth reading also, but to be fair there should be a warning
    that much of what is said is more about management than it is
    about technical matters. Still I expect y'all will find it
    interesting (and it does have some things to say about computer
    architecture).

    one of the final nails in the FS coffin was study by the IBM
    Houston Science Center if 370/195 apps were redone for FS machine
    made out of the fastest available technology, they would have
    throughput of 370/145 (about fractor of 30 times drop in
    throughput). [...]

    Looks like they should have called it Back to the Future Systems. :)

    And power it with 2.2 GW ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Rentsch on Thu Feb 29 11:21:00 2024
    In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com (Tim Rentsch) wrote:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Page 73 of the Addison Wesley paperback. I don't know much about Future Systems, but it seems to have had a problem that I first encountered with IA-64: complexity presented as /completeness/, reassuring many people
    that it must be good because it has everything you could want. My doubts started when I was skimming the IA-64 instruction set reference and ran
    into instructions that did not seem to make any sense. I went back to
    them a few times, but could not figure them out.

    In contrast, the most recent weird instructions I ran into were Aarch64's Branch Target Indicator family. They are not well described in the ISA reference, but after a couple of readings, they made sense. AArch64 has annoying complexity in its more obscure corners, but that's better than
    the seductive complexity of IA-64.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to John Dallman on Thu Feb 29 08:27:09 2024
    John Dallman wrote:
    In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com (Tim Rentsch) wrote:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Page 73 of the Addison Wesley paperback. I don't know much about Future Systems, but it seems to have had a problem that I first encountered with IA-64: complexity presented as /completeness/, reassuring many people
    that it must be good because it has everything you could want. My doubts started when I was skimming the IA-64 instruction set reference and ran
    into instructions that did not seem to make any sense. I went back to
    them a few times, but could not figure them out.

    In contrast, the most recent weird instructions I ran into were Aarch64's Branch Target Indicator family. They are not well described in the ISA reference, but after a couple of readings, they made sense. AArch64 has annoying complexity in its more obscure corners, but that's better than
    the seductive complexity of IA-64.

    John

    I believe you mean the Branch Target Identification BTI instruction.
    Looks like a landing pad for branch/calls to catch
    Return Oriented Programming exploits.

    I nominate A64 logical immediate instructions for a WTF award.
    They split a 12-bit immediate into two 6-bit fields that encode:

    "C3.4.2 Logical (immediate)
    The Logical (immediate) instructions accept a bitmask immediate value that
    is a 32-bit pattern or a 64-bit pattern viewed as a vector of identical elements of size e = 2, 4, 8, 16, 32 or, 64 bits. Each element contains
    the same sub-pattern, that is a single run of 1 to (e - 1) nonzero bits
    from bit 0 followed by zero bits, then rotated by 0 to (e - 1) bits.
    This mechanism can generate 5334 unique 64-bit patterns as 2667 pairs of pattern and their bitwise inverse.

    Note
    Values that consist of only zeros or only ones cannot be described
    in this way."
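    [To make the quoted rule concrete, a small sketch, mine rather than
    ARM's DecodeBitMasks pseudocode: it enumerates the patterns exactly as
    the text describes them and confirms the 5334 count. Function names
    are mine.]

    /* Hedged sketch: enumerate A64 logical-immediate patterns as described:
     * an element of e bits holding a run of 1..e-1 ones starting at bit 0,
     * rotated within the element by 0..e-1, then replicated to 64 bits.
     * 2*1 + 4*3 + 8*7 + 16*15 + 32*31 + 64*63 = 5334 combinations. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t rotr_within(uint64_t v, unsigned r, unsigned e)
    {
        uint64_t mask = (e == 64) ? ~0ULL : ((1ULL << e) - 1);
        return r ? (((v >> r) | (v << (e - r))) & mask) : (v & mask);
    }

    static uint64_t replicate(uint64_t elem, unsigned e)
    {
        uint64_t out = 0;
        for (unsigned i = 0; i < 64; i += e)
            out |= elem << i;
        return out;
    }

    int main(void)
    {
        unsigned long count = 0;
        for (unsigned e = 2; e <= 64; e <<= 1)
            for (unsigned ones = 1; ones < e; ones++)
                for (unsigned rot = 0; rot < e; rot++) {
                    uint64_t elem = (1ULL << ones) - 1;
                    uint64_t imm  = replicate(rotr_within(elem, rot, e), e);
                    (void)imm;
                    count++;
                }
        printf("%lu patterns\n", count);   /* prints "5334 patterns" */
        return 0;
    }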

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Thu Feb 29 15:45:42 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    John Dallman wrote:
    In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com (Tim
    Rentsch) wrote:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragrahs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Page 73 of the Addison Wesley paperback. I don't know much about Future
    Systems, but it seems to have had a problem that I first encountered with
    IA-64: complexity presented as /completeness/, reassuring many people
    that it must be good because it has everything you could want. My doubts
    started when I was skimming the IA-64 instruction set reference and ran
    into instructions that did not seem to make any sense. I went back to
    them a few times, but could not figure them out.

    In contrast, the most recent weird instructions I ran into were Aarch64's
    Branch Target Indicator family. They are not well described in the ISA
    reference, but after a couple of readings, they made sense. AArch64 has
    annoying complexity in its more obscure corners, but that's better than
    the seductive complexity of IA-64.

    John

    I believe you mean the Branch Target Identification BTI instruction.
    Looks like a landing pad for branch/calls to catch
    Return Oriented Programming exploits.

    I nominate A64 logical immediate instructions for a WTF award.
    They split a 12-bit immediate into two 6-bit fields that encode:

    "C3.4.2 Logical (immediate)
    The Logical (immediate) instructions accept a bitmask immediate value that
    is a 32-bit pattern or a 64-bit pattern viewed as a vector of identical >elements of size e = 2, 4, 8, 16, 32 or, 64 bits. Each element contains
    the same sub-pattern, that is a single run of 1 to (e - 1) nonzero bits
    from bit 0 followed by zero bits, then rotated by 0 to (e - 1) bits.
    This mechanism can generate 5334 unique 64-bit patterns as 2667 pairs of >pattern and their bitwise inverse.

    That was a fun one to implement in our ARM64 simulator. AArch32 has
    some odd logical-operand instruction encodings as well.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to EricP on Thu Feb 29 17:49:00 2024
    In article <wD%DN.356177$vFZa.109267@fx13.iad>, ThatWouldBeTelling@thevillage.com (EricP) wrote:
    John Dallman wrote:
    In contrast, the most recent weird instructions I ran into were
    Aarch64's Branch Target Indicator family.

    I believe you mean the Branch Target Identification BTI instruction.

    I do, I keep getting that name wrong.

    Looks like a landing pad for branch/calls to catch
    Return Oriented Programming exploits.

    That is what they are for, but the different forms, the way they only
    apply to branches that take addresses from registers, and the necessity
    for support in the object file format was quite confusing at first.

    I nominate A64 logical immediate instructions for a WTF award.
    They split a 12-bit immediate into two 6-bit fields that encode:

    "C3.4.2 Logical (immediate)
    The Logical (immediate) instructions accept a bitmask immediate
    value that is a 32-bit pattern or a 64-bit pattern viewed as a
    vector of identical elements of size e = 2, 4, 8, 16, 32 or,
    64 bits. Each element contains the same sub-pattern, that is a
    single run of 1 to (e - 1) nonzero bits from bit 0 followed by
    zero bits, then rotated by 0 to (e - 1) bits. This mechanism
    can generate 5334 unique 64-bit patterns as 2667 pairs of
    pattern and their bitwise inverse.

    OK, that is twisty. I'm glad I don't have to write an emulator for it.
    I'm only likely to run into it in debugging compiler-generated code,
    where I can hope the disassembler and debugger will take care of it.

    Note
    Values that consist of only zeros or only ones cannot be described
    in this way."

    Since that's confined to AND/OR/XOR/Test instructions, are values of all
    zeroes or all ones particularly useful?

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Feb 29 19:32:19 2024
    EricP wrote:

    John Dallman wrote:

    I nominate A64 logical immediate instructions for a WTF award.
    They split a 12-bit immediate into two 6-bit fields that encode:

    "C3.4.2 Logical (immediate)
    The Logical (immediate) instructions accept a bitmask immediate value that
    is a 32-bit pattern or a 64-bit pattern viewed as a vector of identical
    elements of size e = 2, 4, 8, 16, 32, or 64 bits. Each element contains
    the same sub-pattern, that is a single run of 1 to (e - 1) nonzero bits
    from bit 0 followed by zero bits, then rotated by 0 to (e - 1) bits.
    This mechanism can generate 5334 unique 64-bit patterns as 2667 pairs of pattern and their bitwise inverse.

    My 66000 takes a 12-bit immediate and uses it to specify two 6-bit
    constants: the latter is the shift count, range [0:63]; the former is the
    result width, also [0:63], but 0 means 64 bits in width.
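
    Read purely as an encoding illustration, the pairing might look like the
    C sketch below. Which 6 bits hold the width versus the shift, and what
    the instruction then does with the resulting field, are assumptions here,
    not something the post spells out.

    /* Hedged sketch: treat the 12-bit immediate as a (width, shift)
     * descriptor for a contiguous bit field, width 0 meaning 64 bits. */
    #include <stdint.h>

    static uint64_t field_mask(unsigned imm12)
    {
        unsigned width = (imm12 >> 6) & 0x3f;   /* assumed upper 6 bits */
        unsigned shift = imm12 & 0x3f;          /* assumed lower 6 bits */
        uint64_t ones  = width ? ((1ull << width) - 1) : ~0ull;
        return ones << shift;                   /* bits above 63 simply fall off */
    }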

    Note
    Values that consist of only zeros or only ones cannot be described
    in this way."

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to John Dallman on Sat Mar 2 20:00:49 2024
    On Thu, 29 Feb 2024 11:21 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <86wmr710lp.fsf@linuxsc.com>, tr.17687@z991.linuxsc.com
    (Tim Rentsch) wrote:

    By coincidence I have recently been reading The Design of Design,
    by Fred Brooks. In an early chapter he briefly relates (in just
    a few paragraphs) an experience of doing a review of the Future
    Systems architecture, and it's clear Brooks was impressed by a
    lot of what he heard. It's worth reading. But I can't resist
    giving away the punchline, which appears at the start of the
    fourth (and last) paragraph:

    I knew then that the project was doomed.

    Page 73 of the Addison Wesley paperback. I don't know much about
    Future Systems, but it seems to have had a problem that I first
    encountered with IA-64: complexity presented as /completeness/,
    reassuring many people that it must be good because it has everything
    you could want. My doubts started when I was skimming the IA-64
    instruction set reference and ran into instructions that did not seem
    to make any sense. I went back to them a few times, but could not
    figure them out.

    In contrast, the most recent weird instructions I ran into were
    Aarch64's Branch Target Indicator family. They are not well described
    in the ISA reference, but after a couple of readings, they made
    sense. AArch64 has annoying complexity in its more obscure corners,
    but that's better than the seductive complexity of IA-64.

    John

    Obviously, I know nothing specific about Future Systems, but my
    impression is that it was both more complicated than IPF and, more
    importantly, the sort of complexity was different. If we want to compare
    FS to Intel products, then I'd expect Future Systems complexity to be
    similar to that of the i432, or of the 80286 additions to the x86
    architecture, or of those parts of BiiN that did not become the i960.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Michael S on Sat Mar 2 19:35:00 2024
    In article <20240302200049.00000d7c@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    If we want to compare FS to Intel products then I'd expect Future
    Systems complexity to be similar to i432 complexity or to complexity
    of 80286 additions to x86 architecture or to those parts of BiiN
    that did not become i960.

    i432 is more like the impressions I have of FS. An idealised architecture,
    with little thought for the practicalities of implementation.

    My own view for a couple of decades has been that architecture is the
    art of compromise between "acceptably fast with the initial
    implementation technology" and "capable of maintaining software
    compatibility and acceptable performance in unpredictable future
    implementation technologies."

    Since about 1980 that has amounted to "making use of increasing
    transistor counts to increase speed while being useful with existing
    software, at least at source level."


    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to EricP on Tue Apr 16 00:35:06 2024
    On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared ...

    No, in the 68000 family the A- and D- registers are 32 bits.

    If you compare the earlier members with the 68020 and later, it becomes
    clear that the architecture was designed as full 32-bit from the
    beginning, and then implemented in a cut-down form for the initial 16-bit products. Going full 32-bit was just a matter of filling in the gaps.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Tue Apr 16 08:23:47 2024
    On 16/04/2024 02:35, Lawrence D'Oliveiro wrote:
    On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared ...

    No, in the 68000 family the A- and D- registers are 32 bits.

    If you compare the earlier members with the 68020 and later, it becomes
    clear that the architecture was designed as full 32-bit from the
    beginning, and then implemented in a cut-down form for the initial 16-bit products. Going full 32-bit was just a matter of filling in the gaps.

    Yes, the 68000 was designed to have full support for 32-bit types and a
    32-bit future, but (primarily for cost reasons) used a 16-bit ALU and
    16-bit buses internally and externally. Some 68000 compilers had 16-bit
    int, some had 32-bit int, and some let you choose either, since 16-bit
    types could be significantly faster on the 68000 even though the general-purpose registers were 32-bit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to David Brown on Tue Apr 16 07:30:32 2024
    David Brown wrote:
    On 16/04/2024 02:35, Lawrence D'Oliveiro wrote:
    On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared ...

    No, in the 68000 family the A- and D- registers are 32 bits.

    If you compare the earlier members with the 68020 and later, it becomes
    clear that the architecture was designed as full 32-bit from the
    beginning, and then implemented in a cut-down form for the initial 16-bit
    products. Going full 32-bit was just a matter of filling in the gaps.

    Yes, the 68000 was designed to have full support for 32-bit types and a 32-bit future, but (primarily for cost reasons) used a 16-bit ALU and
    16-bit buses internally and externally. Some 68000 compilers had 16-bit
    int, some had 32-bit int, and some let you choose either, since 16-bit
    types could be significantly faster on the 68000 even though the general-purpose registers were 32-bit.

    Yes, I was referring to its 16-bit internal bus structure.
    This M68000 patent from 1978 shows it in Fig 2:

    Patent US4296469 Execution unit for data processor using
    segmented bus structure, 1978
    https://patents.google.com/patent/US4296469A/en

    Other M68000 patents (the last one US4514803 appears to be for
    when it was reworked into the IBM XT/370 PC in 1983):

    Patent US4307445 Microprogrammed control apparatus having
    a two-level control store for data processor, 1978

    Patent US4312034 ALU and Condition code control unit for
    data processor, 1979

    Patent US4325121 Two-level control store for microprogrammed
    data processor, 1979

    Patent US4514803 Methods for partitioning mainframe instruction sets, 1982

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Tue Apr 16 20:26:09 2024
    EricP wrote:


    Yes, I was referring to its 16-bit internal bus structure.
    This M68000 patent from 1978 shows it in Fig 2:

    Patent US4296469 Execution unit for data processor using
    segmented bus structure, 1978
    https://patents.google.com/patent/US4296469A/en

    There are a number of interesting things about those segmented busses::
    a) the busses were true-complement
    b) the busses were precharged
    c) the busses were coupled with 2 pass gates on either side of a
    3 transistor sense amplifier
    d) to copy data from one bus to the next bus, one
    1) opened up the pass gates on the active bus
    2) fired the sense amplifier
    3) opened up the pass gate to the next bus

    So, in 7 transistors, one got::
    a) bus to bus isolation
    b) bus to bus data copy in either direction
    c) and a bus flip-flop (fired sense amplifier)

    This would take at least 30 transistors with today's technology
    to do what they did in 7.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Sat May 25 22:02:13 2024
    EricP wrote:
    David Brown wrote:

    Yes, the 68000 was designed to have full support for 32-bit types and
    a 32-bit future, but (primarily for cost reasons) used a 16-bit ALU
    and 16-bit buses internally and externally. Some 68000 compilers had
    16-bit int, some had 32-bit int, and some let you choose either, since
    16-bit types could be significantly faster on the 68000 even though
    the general-purpose registers were 32-bit.

    Yes, I was referring to its 16-bit internal bus structure.
    This M68000 patent from 1978 shows it in Fig 2:

    Patent US4296469 Execution unit for data processor using
    segmented bus structure, 1978
    https://patents.google.com/patent/US4296469A/en

    I found a book on the uArch design of the 68000 and Micro/370
    written by their senior designer Nick Tredennick.

    Microprocessor Logic Design, Tredennick, 1987 https://archive.org/download/tredennick-microprocessor-logic-design/Tredennick-Microprocessor-logic-Design_text.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun May 26 03:16:12 2024
    EricP wrote:

    I found a book on the uArch design of the 68000 and Micro/370
    written by their senior designer Nick Tredennick.

    Microprocessor Logic Design, Tredennick, 1987 https://archive.org/download/tredennick-microprocessor-logic-design/Tredennick-Microprocessor-logic-Design_text.pdf


    Reading the text you can almost hear Nick talking--he had a very
    peculiar and very recognizable talking style, even after not hearing
    it for 3 decades...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Tue Oct 1 19:02:23 2024
    On Tue, 16 Apr 2024 6:23:47 +0000, David Brown wrote:

    On 16/04/2024 02:35, Lawrence D'Oliveiro wrote:
    On Sun, 14 Jan 2024 14:30:51 -0500, EricP wrote:

    Furthermore, the address and data registers and buses are 16 bits and
    the high 16-bits are shared ...

    No, in the 68000 family the A- and D- registers are 32 bits.

    If you compare the earlier members with the 68020 and later, it becomes
    clear that the architecture was designed as full 32-bit from the
    beginning, and then implemented in a cut-down form for the initial
    16-bit
    products. Going full 32-bit was just a matter of filling in the gaps.

    Yes, the 68000 was designed to have full support for 32-bit types and a 32-bit future, but (primarily for cost reasons) used a 16-bit ALU and
    16-bit buses internally and externally.

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Some 68000 compilers had 16-bit
    int, some had 32-bit int, and some let you choose either, since 16-bit
    types could be significantly faster on the 68000 even though the general-purpose registers were 32-bit.

    Just count the bus cycles.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Oct 1 20:00:28 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Oct 1 21:04:55 2024
    On Tue, 1 Oct 2024 20:00:28 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    It was pin limited and internal area limited; and close to being
    power limited (NMOS).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Tue Oct 1 23:38:10 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 1 Oct 2024 20:00:28 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    It was pin limited and internal area limited; and close to being
    power limited (NMOS).


    My wish list for the 68k was a barrel roller for fast shifts, would have
    made a HUGE difference for the first Macintosh.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Schultz@21:1/5 to Thomas Koenig on Wed Oct 2 10:07:25 2024
    On 10/1/24 3:00 PM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    Saving an extra pass through the 16 bit ALU for a 32 bit operation would
    be faster. Assuming that you didn't have to wait for another bus cycle
    to get the other half of an operand.

    Making it faster for register to register operations and not much else.

    --
    http://davesrocketworks.com
    David Schultz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to David Schultz on Wed Oct 2 16:08:43 2024
    David Schultz <david.schultz@earthlink.net> wrote:
    On 10/1/24 3:00 PM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    Saving an extra pass through the 16 bit ALU for a 32 bit operation would
    be faster. Assuming that you didn't have to wait for another bus cycle
    to get the other half of an operand.

    Making it faster for register to register operations and not much else.

    A 16 bit barrel roller does not make sense, and Motorola had no idea that shifts would be so important.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Schultz@21:1/5 to Brett on Wed Oct 2 13:51:24 2024
    On 10/2/24 11:08 AM, Brett wrote:
    A 16 bit barrel roller does not make sense, and Motorola had no idea that shifts would be so important.

    They might have guessed. The Xerox Alto had been around for a while.

    --
    http://davesrocketworks.com
    David Schultz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Oct 2 20:23:53 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 1 Oct 2024 20:00:28 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    It was pin limited and internal area limited; and close to being
    power limited (NMOS).

    Somebody did a good job of optimizing it, then, at the limit of
    several constraints. Not necessarily a global optimum, though;
    that could have been closer to the (much later) ARM.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Wed Oct 2 21:34:33 2024
    On Wed, 2 Oct 2024 16:08:43 +0000, Brett wrote:

    David Schultz <david.schultz@earthlink.net> wrote:
    On 10/1/24 3:00 PM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    A 32-bit bus would have priced the 68K at 30%-50% higher simply
    due to the number of pins on available packages. This would have
    eliminated any chance at competing for the broad markets at that
    time.

    Would have an external 16-bit bus and an internal 32-bit bus have
    been advantageous, or would this have blown a likely transistor
    budget for little gain?

    Saving an extra pass through the 16 bit ALU for a 32 bit operation would
    be faster. Assuming that you didn't have to wait for another bus cycle
    to get the other half of an operand.

    Making it faster for register to register operations and not much else.

    A 16 bit barrel roller does not make sense, and Motorola had no idea
    that shifts would be so important.

    In the original 68000, a barrel shifter would have blown the area
    budget--it would have been about equal to the d-section; even in
    16-bit form. Remember this was a 1 layer metal design before poly
    silicon was in the process.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Schultz@21:1/5 to All on Wed Oct 2 18:55:14 2024
    On 10/2/24 4:34 PM, MitchAlsup1 wrote:
    In the original 68000, a barrel shifter would have blown the area
    budget--it would have been about equal to the d-section; even in
    16-bit form. Remember this was a 1 layer metal design before poly
    silicon was in the process.

    Before polysilicon? I find that hard to believe. Looking at the die shot
    at 6502.org I see either two layers of metal or metal plus polysilicon.

    http://www.visual6502.org/images/pages/Motorola_68000.html

    --
    http://davesrocketworks.com
    David Schultz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Thomas Koenig on Thu Oct 3 00:30:21 2024
    On Tue, 1 Oct 2024 20:00:28 -0000 (UTC), Thomas Koenig wrote:

    Would have an external 16-bit bus and an internal 32-bit bus have been advantageous, or would this have blown a likely transistor budget for
    little gain?

    I thought that’s exactly how the original 68000 chip worked.

    I recall being told that the main factor in determining the cost of a CPU
    chip was the number of pins it had.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Brett on Thu Oct 3 00:31:59 2024
    On Tue, 1 Oct 2024 23:38:10 -0000 (UTC), Brett wrote:

    My wish list for the 68k was a barrel roller for fast shifts, would have
    made a HUGE difference for the first Macintosh.

    The 68020 certainly had that. It also added bit-field instructions, on top
    of the single-bit instructions of the original 68000.

    And in typical big-endian fashion, they added yet another inconsistency to
    the way the bits were numbered ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 01:26:44 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Tue, 1 Oct 2024 23:38:10 -0000 (UTC), Brett wrote:

    My wish list for the 68k was a barrel roller for fast shifts, would have
    made a HUGE difference for the first Macintosh.

    The 68020 certainly had that. It also added bit-field instructions, on top
    of the single-bit instructions of the original 68000.

    And in typical big-endian fashion, they added yet another inconsistency to the way the bits were numbered ...

    I noticed that; there is a tiny chance that Apple wanted that to make
    QuickDraw one cycle faster. But other changes hint that the clue train was
    not stopping at Motorola.

    Everyone was muddling through back in that era, and being quick to market was
    the rule, so it's hard to blame them. I can see myself making the very same
    choices, or much more likely seeing my coworkers make bad choices and not
    being able to do anything about it. You do what you can with what you have.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 06:28:21 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    The 68020 certainly had that. It also added bit-field instructions, on top
    of the single-bit instructions of the original 68000.

    And in typical big-endian fashion, they added yet another inconsistency to >the way the bits were numbered ...

    How typical is that?

    Certainly the Power(PC) numbers its bits in big-endian fashion. But
    does PowerPC have instructions where that matters? If so, I expect
    that's not so great with the switch to little-endian in Linux-PPC
    (IIRC AIX is still big-endian).

    I expect that the s390x uses the same bit numbering as Power. Does it
    have instructions where that matters?

    The 88000 has instructions where that matters and has little-endian bit-ordering (like the 68000, so Motorola is stubborn in its mistakes;
    OTOH, with Apple as a prospective customer, they probably did not want
    to require software changes, even if those changes would make the code shorter). It supports both byte orders, but AFAIK all 88K systems are big-endian.

    MIPS-I has no instructions where the bit numbering plays a role. It
    was available in big- and little-endian systems.

    SPARC uses little-endian bit ordering, but AFAICS has no instructions
    where that plays a role. AFAIK there is no little-endian SPARC
    system.

    So, yes, using little-endian bit ordering with big-endian byte
    ordering is frequent, but OTOH instructions that actually use bit
    numbers exist only in few instruction sets.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 09:21:13 2024
    On 03/10/2024 02:31, Lawrence D'Oliveiro wrote:
    On Tue, 1 Oct 2024 23:38:10 -0000 (UTC), Brett wrote:

    My wish list for the 68k was a barrel roller for fast shifts, would have
    made a HUGE difference for the first Macintosh.

    The 68020 certainly had that. It also added bit-field instructions, on top
    of the single-bit instructions of the original 68000.

    And in typical big-endian fashion, they added yet another inconsistency to the way the bits were numbered ...

    Since you mentioned POWER and PowerPC elsewhere, the bit numbering
    challenges of the m68k world are nothing compared to the PowerPC world. Originally, the PowerPC was 32-bit and numbered bits from 0 as the MSB
    down to 31 as the LSB. So your 32-bit address bus had lines from A0
    down to A31. Then it got extended to 64-bit (some devices had only
    partial 64-bit extensions), and the chips got a wider address bus (you
    never need all 64-bit lines physically) - the pins for the higher
    address lines were numbered A-1, A-2, and so on. For the internal
    registers, some are 64-bit numbered bit 0 (MSB) down to bit 63. Some
    were original 32-bit and got extended to 64-bit, and so are numbered bit
    -32 down to bit 31 for consistency. Others are 32-bit but numbered from
    bit 32 down to bit 63.

    I am not really complaining - one of our customers hired us precisely
    because they wanted to use a particular PowerPC microcontroller but
    after one look at the ten thousand page reference manual full of this
    kind of thing, they paid us to write a library and drivers for the
    peripherals they wanted so that they never had to think about that mess.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Thu Oct 3 09:39:03 2024
    David Brown <david.brown@hesbynett.no> writes:
    Since you mentioned POWER and PowerPC elsewhere, the bit numbering
    challenges of the m68k world are nothing compared to the PowerPC world. >Originally, the PowerPC was 32-bit and numbered bits from 0 as the MSB
    down to 31 as the LSB. So your 32-bit address bus had lines from A0
    down to A31. Then it got extended to 64-bit (some devices had only
    partial 64-bit extensions), and the chips got a wider address bus (you
    never need all 64-bit lines physically) - the pins for the higher
    address lines were numbered A-1, A-2, and so on. For the internal
    registers, some are 64-bit numbered bit 0 (MSB) down to bit 63. Some
    were original 32-bit and got extended to 64-bit, and so are numbered bit
    -32 down to bit 31 for consistency. Others are 32-bit but numbered from
    bit 32 down to bit 63.

    Maybe they should have started with the MSB as bit -31 or -63, which
    would have allowed them to always use bit 0 for the LSB while having
    big-endian bit ordering.

    For bit ordering big-endian (as in the PowerPC manual) looked more
    wrong to me than for byte ordering; I thought that that was just a
    matter of getting used to the unfamiliar bit ordering, but maybe the
    advantage of little-endian becomes more apparent in bit ordering, and
    maybe that's why Motorola and Sun chose little-endian bit ordering
    despite having big-endian byte ordering.

    For both bit and byte ordering, the advantage of little-endian shows
    up when there are several widths involved. So why is it more obvious
    for bit-ordering?

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is
    a 64-bit architecture, and that the manual describes only the 32-bit
    subset. Maybe the original Power was 32-bit.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Thu Oct 3 14:34:35 2024
    On 03/10/2024 11:39, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    Since you mentioned POWER and PowerPC elsewhere, the bit numbering
    challenges of the m68k world are nothing compared to the PowerPC world.
    Originally, the PowerPC was 32-bit and numbered bits from 0 as the MSB
    down to 31 as the LSB. So your 32-bit address bus had lines from A0
    down to A31. Then it got extended to 64-bit (some devices had only
    partial 64-bit extensions), and the chips got a wider address bus (you
    never need all 64-bit lines physically) - the pins for the higher
    address lines were numbered A-1, A-2, and so on. For the internal
    registers, some are 64-bit numbered bit 0 (MSB) down to bit 63. Some
    were original 32-bit and got extended to 64-bit, and so are numbered bit
    -32 down to bit 31 for consistency. Others are 32-bit but numbered from
    bit 32 down to bit 63.

    Maybe they should have started with the MSB as bit -31 or -63, which
    would have allowed them to always use bit 0 for the LSB while having big-endian bit ordering.


    That's a very "outside the box thinking" solution!

    For bit ordering big-endian (as in the PowerPC manual) looked more
    wrong to me than for byte ordering; I thought that that was just a
    matter of getting used to the unfamiliar bit ordering, but maybe the advantage of little-endian becomes more apparent in bit ordering, and
    maybe that's why Motorola and Sun chose little-endian bit ordering
    despite having big-endian byte ordering.

    For both bit and byte ordering, the advantage of little-endian shows
    up when there are several widths involved. So why is it more obvious
    for bit-ordering?

    I have certainly found big-endian bit numbering harder to get my head
    around than big-endian byte ordering. One possible explanation is that
    with little-endian ordering, (1 << bit_no) gives you a 1 in the right
    bit number. Another is that with little-endian bit ordering, the same
    bit number has the same value regardless of the size of the type. And I
    work with electronics as well as software - virtually everything in
    hardware (except PowerPC microcontrollers!) uses little-endian bit
    numbering. Smallest to largest, counting upwards from 0 or 1, is just
    more natural.
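
    A small, self-contained C illustration of that point (names and values
    are purely for the example):

    /* With LSB-is-bit-0 numbering, bit n has value 2^n in every operand
     * width and (1 << n) builds the mask directly; with MSB-is-bit-0
     * numbering the mask depends on the operand size. */
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* Little-endian bit numbering: "bit 5" means the same thing in a
         * byte and in a 32-bit word. */
        uint8_t  b = (uint8_t)(1u << 5);
        uint32_t w = UINT32_C(1) << 5;
        assert(b == 32 && w == 32);

        /* Big-endian (MSB = bit 0) numbering: "bit 5" of a byte is 2^2,
         * but "bit 5" of a 32-bit word is 2^26. */
        assert((1u << (8 - 1 - 5)) == 4);
        assert((UINT32_C(1) << (32 - 1 - 5)) == 0x04000000);
        return 0;
    }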


    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is
    a 64-bit architecture, and that the manual describes only the 32-bit
    subset. Maybe the original Power was 32-bit.


    As I understand it - and my history here might not be completely
    accurate - PowerPC was specified for both 32-bit and 64-bit from early
    on, but first made in the 32-bit version. There were quite a number of optional parts of the PowerPC architecture, including 64-bit width,
    floating point units, and support for little-endian data modes - IIRC
    these were referred to as books of various colours. And yes, you could
    then have weird things like it being a 64-bit architecture where none of
    the 64-bit features were actually implemented. One microcontroller I
    used had 64-bit GPRs, but almost no 64-bit instructions - I don't think
    you could even load or save all 64 bits at a time. The only use of them
    was for transferring to and from the 64-bit double precision floating
    point registers (which could be loaded and saved in full).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Thu Oct 3 23:49:00 2024
    In article <2024Oct3.113903@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I expect that the s390x uses the same bit numbering as Power.

    You're correct. I started reading up on the architecture a few years ago,
    and found this very confusing.

    Does it have instructions where that matters?

    I didn't get far enough to find out.

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC
    is a 64-bit architecture, and that the manual describes only the
    32-bit subset. Maybe the original Power was 32-bit.

    It was. It stayed that way for a while, but grew 64-bit extensions in the
    late 1990s.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Thu Oct 3 22:17:23 2024
    On Thu, 03 Oct 2024 09:39:03 GMT, Anton Ertl wrote:

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is a 64-bit architecture, and that the manual describes only the 32-bit
    subset. Maybe the original Power was 32-bit.

    I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.

    The PowerPC 601 was first shown publicly in 1993; I can’t remember when
    the fully 64-bit 620 came out, but it can’t have been long after.

    Motorola did a similar thing with the 68000 family: if you compare the
    original 68000 instruction set with the 68020, you will see the latter
    only needed to fill in a few gaps to become fully 32-bit.

    Compare this with the pain the x86 world went through, over a much longer
    time, to move to 32-bit.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to Lawrence D'Oliveiro on Thu Oct 3 15:33:54 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.

    The PowerPC 601 was first shown publicly in 1993; I can’t remember when
    the fully 64-bit 620 came out, but it can’t have been long after.

    Motorola did a similar thing with the 68000 family: if you compare the original 68000 instruction set with the 68020, you will see the latter
    only needed to fill in a few gaps to become fully 32-bit.

    Compare this with the pain the x86 world went through, over a much longer time, to move to 32-bit.

    power/pc was done at Somerset ... part of AIM; apple, ibm, motorola ...
    and some amount of motorola risc 88k contributed to power/pc https://en.wikipedia.org/wiki/PowerPC
    https://en.wikipedia.org/wiki/PowerPC_600 https://en.wikipedia.org/wiki/PowerPC_600#60x_bus
    Using the 88110 bus as the basis for the 60x bus helped schedules in a
    number of ways. It helped the Apple Power Macintosh team by reducing the
    amount of redesign of their support ASICs and it reduced the amount of
    time required for the processor designers and architects to propose,
    document, negotiate, and close a new bus interface (successfully
    avoiding the "Bus Wars" expected by the 601 management team if the 88110
    bus or the previous RSC buses hadn't been adopted). Worthy to note is
    that accepting the 88110 bus for the benefit of Apple's efforts and the alliance was at the expense of the first IBM RS/6000 system design
    team's efforts who had their support ASICs already implemented around
    the RSC's totally different bus structure.

    ... note that RS/6000 didn't have a design that supported cache
    consistency or shared-memory multiprocessing ... (one of the reasons ha/cmp
    had to resort to cluster operation for scale-up)

    https://en.wikipedia.org/wiki/PowerPC_600#PowerPC_620 https://wiki.preterhuman.net/The_Somerset_Design_Center

    the executive we reported to when we were doing HA/CMP https://en.wikipedia.org/wiki/IBM_High_Availability_Cluster_Multiprocessing went over to head up Somerset

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Lawrence D'Oliveiro on Fri Oct 4 11:23:50 2024
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    On Thu, 03 Oct 2024 09:39:03 GMT, Anton Ertl wrote:

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC is a
    64-bit architecture, and that the manual describes only the 32-bit
    subset. Maybe the original Power was 32-bit.

    I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.

    The PowerPC 601 was first shown publicly in 1993; I can’t remember when
    the fully 64-bit 620 came out, but it can’t have been long after.


    I don't remember the history well enough here.

    Motorola did a similar thing with the 68000 family: if you compare the original 68000 instruction set with the 68020, you will see the latter
    only needed to fill in a few gaps to become fully 32-bit.


    The m68k was always designed as a 32-bit ISA. But the first 68000 implementation used a 16-bit ALU and internal buses for size and cost
    reasons. I would not describe the additional instructions in the 68020
    as "filling gaps to 32-bit", but merely a natural expansion of the ISA
    with a few more useful instructions.

    Compare this with the pain the x86 world went through, over a much longer time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which
    is a very different path.

    And much of the reason for it being a slow development is that the world
    was held back by MS's lack of progress in using new features. The 80386
    was produced in 1986, but the MS world was firmly at 16-bit until it
    gained a bit of 32-bit features with Windows 95. (Windows NT was 32-bit
    from 1993, and Win32s was from around the same time, but these were
    relatively small in the market.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Fri Oct 4 17:30:07 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    Compare this with the pain the x86 world went through, over a much longer
    time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which
    is a very different path.

    Still, the question is why they did the 286 (released 1982) with its
    protected mode instead of adding IA-32 to the architecture, maybe at
    the start with a 386SX-like package and with real-mode only, or with
    the MMU in a separate chip (like the 68020/68851).

    And much of the reason for it being a slow development is that the world
    was held back by MS's lack of progress in using new features. The 80386
    was produced in 1986, but the MS world was firmly at 16-bit until it
    gained a bit of 32-bit features with Windows 95. (Windows NT was 32-bit
    from 1993, and Win32s was from around the same time, but these were
    relatively small in the market.)

    At that time the market was moving much slower than nowadays. Systems
    with a 286 (and maybe even the 8088) were sold for a long time after
    the 386 was introduced. E.g., the IBM PS/1 Model 2011 was released in
    1990 with a 10MHz 286, and the successor Model 2121 with a 386SX was
    not introduced until 1992. I think it's hard to blame MS for
    targeting the machines that were out there. And looking at <https://en.wikipedia.org/wiki/Windows_2.1x>, Windows 2.1 in 1988
    already was available in a Windows/386 version (but the programs were
    running in virtual 8086 mode, i.e., were still 16-bit programs).

    And it was not just MS who was going in that direction. MS and IBM
    worked on OS/2, and despite ambitious goals IBM insisted that the
    software had to run on a 286.

    The fact that the 386SX only appeared in 1988 also did not help.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to BGB on Fri Oct 4 23:06:21 2024
    On Fri, 4 Oct 2024 19:05:15 +0000, BGB wrote:

    On 10/4/2024 12:30 PM, Anton Ertl wrote:

    Say, pretty much none of the modern graphics programs (that I am aware
    of) really support working with 16-color and 256-color bitmap images
    with a manually specified color palette.

    Typically, any modern programs are true-color internally, typically only supporting 256-color as an import/export format with an automatically generated "optimized" palette, and often not bothering with 16-color
    images at all. Not so useful if one is doing something that does
    actually have a need for an explicit color palette (and does not have so
    much need for any "photo manipulation" features).

    The 1996 version of CorelDraw 3 suffers from none of this, supporting all
    sorts of palettes {RGB, CMY, CMYK, at least 3 more} with various
    user-specified limitations, 24-bit, 32-bit, ... with all sorts of
    fills mixing any 2 of the colors previously mentioned with various patterns
    {gradient, polka dot, you define which pixel gets which color}.

    Still have the CD-ROM if anyone wants to try.


    And, most people generally haven't bothered with this stuff since the
    Win16 era (even the people doing "pixel art" are still generally doing
    so using true-color PNGs or similar).

    Blame PowerPoint ... No more evil tool ever existed.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Sat Oct 5 06:34:28 2024
    On Fri, 4 Oct 2024 23:06:21 +0000, MitchAlsup1 wrote:

    Blame PowerPoint ... No more evil tool ever existed.

    Competitors existed, at one time, e.g. Adobe Persuasion, Harvard Graphics, others I’ve forgotten.

    Somehow Microsoft made PowerPoint the most attractive of the lot ... were
    the others even worse?

    Actually, it’s not that it doesn’t produce pretty graphics, it’s that people end up believing in the prettiness of the graphics, instead of considering the facts they’re supposed to (mis)represent.

    Edward Tufte, come back, all is forgiven!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Sat Oct 5 06:31:02 2024
    On Fri, 04 Oct 2024 17:30:07 GMT, Anton Ertl wrote:

    The fact that the 386SX only appeared in 1988 also did not help.

    As a software guy, I liked the idea of the 386SX, and encouraged friends/ colleagues to choose it over a 286.

    Of course, they wanted to compare price/performance, but I saw things in
    terms of future software compatibility, and the sooner the move away from braindead x86 segmentation towards a nice, flat, expansive, linear address space, the better for everybody.

    Sometimes I felt like a voice crying in the wilderness ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to BGB on Sat Oct 5 06:35:56 2024
    On Fri, 4 Oct 2024 19:44:40 -0500, BGB wrote:

    MS PaintBrush became MS Paint and seemingly mostly got dumbed down as
    time went on.

    Side excursion into 3D Paint (or is that Paint 3D?), which failed to take
    off, and is now being abandoned.

    Closest modern alternative is Paint.NET, but still doesn't allow manual palette control in the same way as BitEdit.

    Inkscape has good palette control. It does scalable vector graphics
    natively. Give it a try.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Lawrence D'Oliveiro on Sat Oct 5 17:52:09 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Fri, 04 Oct 2024 17:30:07 GMT, Anton Ertl wrote:

    The fact that the 386SX only appeared in 1988 also did not help.

    As a software guy, I liked the idea of the 386SX, and encouraged friends/ colleagues to choose it over a 286.

    Of course, they wanted to compare price/performance, but I saw things in terms of future software compatibility, and the sooner the move away from braindead x86 segmentation towards a nice, flat, expansive, linear address space, the better for everybody.

    Sometimes I felt like a voice crying in the wilderness ...


    Didn’t it take a decade for the 386 to get a 32 bit OS, by which time the early machines were long since in the garbage bin, making the extra cost a waste.

    The AMD 286 was faster and cheaper, better lifetime value for the money.

    You were a voice crying in the wilderness, because you were wrong. ;)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Brett on Sat Oct 5 18:11:55 2024
    Brett <ggtgp@yahoo.com> writes:
    Didn’t it take a decade for the 386 to get a 32 bit OS

    386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and Windows/386 appeared in 1987.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Sat Oct 5 22:53:35 2024
    On Sat, 05 Oct 2024 18:11:55 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Brett <ggtgp@yahoo.com> writes:
    Didn’t it take a decade for the 386 to get a 32 bit OS

    386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and Windows/386 appeared in 1987.

    - anton

    SunOS for i386 in 1988.
    Netware 3x in 1990.
    The latter sold in very high volumes.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Sun Oct 6 11:58:01 2024
    On 04/10/2024 19:30, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    Compare this with the pain the x86 world went through, over a much longer
    time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which
    is a very different path.

    Still, the question is why they did the 286 (released 1982) with its protected mode instead of adding IA-32 to the architecture, maybe at
    the start with a 386SX-like package and with real-mode only, or with
    the MMU in a separate chip (like the 68020/68851).


    I can only guess the obvious - it is what some big customer(s) were
    asking for. Maybe Intel didn't see the need for 32-bit computing in the markets they were targeting, or at least didn't see it as worth the cost.

    And much of the reason for it being a slow development is that the world
    was held back by MS's lack of progress in using new features. The 80386
    was produced in 1986, but the MS world was firmly at 16-bit until it
    gained a bit of 32-bit features with Windows 95. (Windows NT was 32-bit
    from 1993, and Win32s was from around the same time, but these were
    relatively small in the market.)

    At that time the market was moving much slower than nowadays. Systems
    with a 286 (and maybe even the 8088) were sold for a long time after
    the 386 was introduced. E.g., the IBM PS/1 Model 2011 was released in
    1990 with a 10MHz 286, and the successor Model 2121 with a 386SX was
    not introduced until 1992. I think it's hard to blame MS for
    targeting the machines that were out there.

    It is fair enough to target the existing market, but they were also slow
    (IMHO) to take advantage of new opportunities in hardware, reinforcing
    the situation. I think MS and their monopoly on markets caused a
    stagnation - lack of real competition meant lack of progress.

    And looking at
    <https://en.wikipedia.org/wiki/Windows_2.1x>, Windows 2.1 in 1988
    already was available in a Windows/386 version (but the programs were
    running in virtual 8086 mode, i.e., were still 16-bit programs).

    And it was not just MS who was going in that direction. MS and IBM
    worked on OS/2, and despite ambitious goals IBM insisted that the
    software had to run on a 286.


    IBM were famous for poor (and perhaps cowardly) decisions at the time,
    and MS happily screwed them over again and again in regards to OS/2. It
    takes a special kind of bad management for a company of IBM's size to
    make PC's, and to make a PC OS, and yet that OS could not run on their
    own PC's. Later, once OS/2 /did/ run on IBM PC's, they would not sell computers with their own OS pre-installed - you had to first buy the
    machine with the competitor's OS, then buy IBM's OS at retail prices,
    and install it yourself (from some 50-60 floppy disks, IIRC).

    The fact that the 386SX only appeared in 1988 also did not help.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Sun Oct 6 13:04:15 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 19:30, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    Compare this with the pain the x86 world went through, over a much longer
    time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which
    is a very different path.

    Still, the question is why they did the 286 (released 1982) with its
    protected mode instead of adding IA-32 to the architecture, maybe at
    the start with a 386SX-like package and with real-mode only, or with
    the MMU in a separate chip (like the 68020/68851).


    I can only guess the obvious - it is what some big customer(s) were
    asking for. Maybe Intel didn't see the need for 32-bit computing in the
    markets they were targeting, or at least didn't see it as worth the cost.

    Anyone could see the problems that the PDP-11 had with its 16-bit
    limitation. Intel saw it in the iAPX 432 starting in 1975. It is
    obvious that, as soon as memory grows beyond 64KB (and already the
    8086 catered for that), the protected mode of the 80286 would be more
    of a hindrance than even the real mode of the 8086. I find it hard to
    believe that many customers would ask Intel for something like the 80286
    protected mode with segments limited to 64KB, and even if they did, that
    Intel would listen to them. This looks much more like an idee fixe to me
    that one or more of the 286 project leaders had, and all customer
    input was made to fit into this idea, or was ignored.

    Concerning the cost, the 80286 has 134,000 transistors, compared to
    supposedly 68,000 for the 68000, and the 190,000 of the 68020. I am
    sure that Intel could have managed a 32-bit 8086 (maybe even with the
    nice addressing modes that the 386 has in 32-bit mode) with those
    134,000 transistors if Motorola could build the 68000 with half of
    that.

    It is fair enough to target the existing market, but they were also slow
    (IMHO) to take advantage of new opportunities in hardware, reinforcing
    the situation.

    They introduced Windows/386 in 1987.

    I think MS and their monopoly on markets caused a
    stagnation - lack of real competition meant lack of progress.

    Monopoly? These were the times with lots of competition from
    different hardware and software manufacturers. Apple with the Apple
    II, Lisa, and Macintosh, Atari with their 8-bit line and their Atari ST
    line, Commodore with their 8-bit line and their Amiga line, and, on
    the software side, Digital Research with CP/M(-86/68K) and GEM, and
    various Unix offerings, including Xenix. Were they all no real
    competition? Not in my book. It's just that Microsoft eventually
    won, maybe accidentally (as it happens in a winner-takes-all market).

    IBM were famous for poor (and perhaps cowardly) decisions at the time,
    and MS happily screwed them over again and again in regards to OS/2.

    Another interpretation is that MS went faithfully into OS/2, assigning
    not just their Xenix team to it (although according to Wikipedia the
    Xenix abandonment by MS was due to AT&T entering the Unix market) and reportedly also assigned the best MS-DOS developers to OS/2. They
    tried to stick to OS/2 for several years, but eventually were fed up
    with all the bad decisions coming from IBM, and bowed out.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sun Oct 6 16:34:00 2024
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idee fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and
    considered viable a decade after the original inventors had learned
    better.

    Another interpretation is that MS went faithfully into OS/2,
    assigning not just their Xenix team to it (although according
    to Wikipedia the Xenix abandonment by MS was due to AT&T
    entering the Unix market) and reportedly also assigned the best
    MS-DOS developers to OS/2. They tried to stick to OS/2 for
    several years, but eventually were fed up with all the bad
    decisions coming from IBM, and bowed out.

    It's known that they split the work with IBM, such that MS would do a
    redesigned OS/2 that was intended to be version 3.0, while IBM
    concentrated on 2.0. A friend of mine was working on OS/2 within IBM at
    the time, until he left with serious stress and depression: the people management was not good.

    Then MS switched emphasis, so that the Windows API was the primary
    personality of OS/2 3.0, and renamed it Windows NT. That also had an OS/2 personality at the start, along with a POSIX personality.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Sun Oct 6 18:50:01 2024
    On Thu, 3 Oct 2024 22:17:23 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Thu, 03 Oct 2024 09:39:03 GMT, Anton Ertl wrote:

    BTW, at least in my 32-bit PowerPC manual the claim is that PowerPC
    is a 64-bit architecture, and that the manual describes only the
    32-bit subset. Maybe the original Power was 32-bit.

    I would say IBM designed 32-bit POWER/PowerPC as a cut-down 64-bit architecture, needing only a few gaps filled to make it fully 64-bit.

    The PowerPC 601 was first shown publicly in 1993; I can’t remember
    when the fully 64-bit 620 came out, but it can’t have been long after.


    For all practical purposes, PPC/MPC620 never came out.
    That is, the chip was formally shipped 2-3 years later than originally
    planned in order to fulfill contractual obligations by somebody I don't remember to somebody else I also don't remember. But that 2nd somebody
    barely used it.
    By that time, IBM itself had another 64-bit POWER CPU working. That one
    had a less ambitious microarchitecture than the 620, but also had one advanced feature that didn't appear in "big" POWER CPUs until POWER5 7 years
    later - multi-threading.

    https://en.wikipedia.org/wiki/IBM_RS64


    Motorola did a similar thing with the 68000 family: if you compare
    the original 68000 instruction set with the 68020, you will see the
    latter only needed to fill in a few gaps to become fully 32-bit.


    Not similar at all.

    Compare this with the pain the x86 world went through, over a much
    longer time, to move to 32-bit.

    As far as Intel is responsible, it took only one year longer - 8 years vs 7
    years.
    And if we count only mainline CPUs, then it would be 8 years vs 8 years
    (from POWER to POWER3).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Michael S on Sun Oct 6 22:07:11 2024
    Michael S wrote:
    On Sat, 05 Oct 2024 18:11:55 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Brett <ggtgp@yahoo.com> writes:
    Didn’t it take a decade for the 386 to get a 32 bit OS

    386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and
    Windows/386 appeared in 1987.

    - anton

    SunOS for i386 in 1988.
    NetWare 3.x in 1990.
    The latter sold in very high volumes.

    It deserved to do so:

    For what it was doing (file/print service), it was by far the most
    efficient product I've ever heard of!

    Drew Major managed to get the total latency of the "ack network
    interrupt, parse incoming packet, determine that it is a read request
    for which the client has the required access rights, locate the
    requested data somewhere in the memory cache, construct a response
    packet and hand it off to the network card" down to 300 clock cycles.

    Those clock cycles _might_ have been measured on a 486 with mostly single-cycle instructions, instead of the original 386, which needed 2+
    clock cycles for lots of things. The point still stands: it was amazingly efficient.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Michael S on Sun Oct 6 21:53:36 2024
    Michael S <already5chosen@yahoo.com> wrote:
    On Sat, 05 Oct 2024 18:11:55 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Brett <ggtgp@yahoo.com> writes:
    Didn’t it take a decade for the 386 to get a 32 bit OS

    386/ix appeared in 1985, exactly 0 years after the 386. Xenix/386 and
    Windows/386 appeared in 1987.

    - anton

    SunOS for i386 in 1988.
    NetWare 3.x in 1990.
    The latter sold in very high volumes.

    The first 32-bit Windows was Windows 95, a full decade later.
    Windows/386 was 16-bit, as was Windows 2.x.

    I do concede to being wrong about the unix ports.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Brett on Mon Oct 7 06:29:12 2024
    On Sun, 6 Oct 2024 21:53:36 -0000 (UTC), Brett wrote:

    The first 32 bit windows was Windows 95 ...

    Windows NT 3.1, 1993.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Mon Oct 7 06:32:46 2024
    On Sun, 6 Oct 2024 16:34 +0100 (BST), John Dallman wrote:

    Then MS switched emphasis, so that the Windows API was the primary personality of OS/2 3.0, and renamed it Windows NT.

    Dave Cutler came from DEC (where he was one of the resident Unix-haters)
    to mastermind the Windows NT project in 1988. When did the OS/2→NT pivot
    take place?

    That also had an OS/2 personality at the start, along with a POSIX personality.

    Funny, you’d think they would use that same “personality” system to implement WSL1, the Linux-emulation layer. But they didn’t.

    I think the whole “personality” concept, along with the supposed portability to non-x86 architectures, had just bit-rotted away by that
    point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Oct 7 06:33:59 2024
    On Sun, 6 Oct 2024 18:50:01 +0300, Michael S wrote:

    Motorola did a similar thing with the 68000 family: if you compare the
    original 68000 instruction set with the 68020, you will see the latter
    only needed to fill in a few gaps to become fully 32-bit.

    Not similar at all.

    Planning for the next major leap forward, and building the current
    generation as a cut-down version of that? Of course it is similar.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Mon Oct 7 07:33:14 2024
    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.
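
    The 8192-segment figure falls out of the 16-bit selector format: 13 bits
    of table index, one table-indicator bit, and two privilege bits. A minimal
    sketch of that decoding in C (the field and function names here are
    invented for illustration, not Intel's):

        #include <stdint.h>
        #include <stdio.h>

        /* A 286 protected-mode selector is 16 bits:
         * index (13) | table indicator (1) | requested privilege level (2).
         * 2^13 = 8192 possible descriptors per table, hence the limit of
         * 8192 segments discussed above. */
        struct selector_fields {
            unsigned index; /* descriptor index, 0..8191       */
            unsigned ti;    /* 0 = GDT, 1 = LDT                */
            unsigned rpl;   /* requested privilege level, 0..3 */
        };

        static struct selector_fields decode_selector(uint16_t sel)
        {
            struct selector_fields f;
            f.rpl   = sel & 0x3;
            f.ti    = (sel >> 2) & 0x1;
            f.index = sel >> 3;
            return f;
        }

        int main(void)
        {
            struct selector_fields f = decode_selector(0x001F); /* index 3, LDT, RPL 3 */
            printf("index=%u ti=%u rpl=%u\n", f.index, f.ti, f.rpl);
            return 0;
        }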

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lars Poulsen@21:1/5 to Anton Ertl on Mon Oct 7 12:42:20 2024
    On 2024-10-07, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.

    I completely agree. Back when the 8086 was designed, 640K seemed like a
    lot. They never expected it to grow beyond the mainframes of their time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Lars Poulsen on Mon Oct 7 15:17:36 2024
    Lars Poulsen wrote:
    On 2024-10-07, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.

    I completely agree. Back when the 8086 was designed, 640K seemed like a
    lot. They never expected it to grow beyond the mainframes of their time.

    640K was an artifact of the frame buffer placement selected by the
    original IBM PC; it could just as well have been 900+ K.

    AFAIR the PC also mishandled interrupts?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Mon Oct 7 17:45:19 2024
    On Mon, 7 Oct 2024 15:17:36 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Lars Poulsen wrote:
    On 2024-10-07, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode,
    and in the 80286 they finally got around to it. And the idea was
    (like AFAIK in the iAPX432) to have one segment per object and per
    procedure, i.e., the large memory model. The smaller memory
    models were possible, but not really intended. The Huge memory
    model was completely alien to protected mode, as was direct
    hardware access, as was common on the IBM PC. And computing with
    segment register contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would
    have been too inefficient, and also that 8192 segments is not
    enough for that kind of usage, given 640KB of RAM (not to mention
    the 16MB that the 286 supported); and with 640KB having the
    segments limited to 64KB is too restrictive for a number of
    applications.

    I completely agree. Back when the 8086 was designed, 640K seemed
    like a lot. They never expected it to grow beyond the mainframes of
    their time.

    640K was an artifact of the frame buffer placement selected by the
    original IBM PC; it could just as well have been 900+ K.

    AFAIR the PC also mishandled interrupts?

    Terje


    Yes it did.
    IIRC, Intel reserved interrupt vectors 0 to 31 for CPU exceptions and
    future use, but IBM ignored that and put the hardware IRQs at vectors
    8 to 15 and various BIOS services at vectors 16 to 31.
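
    To make the clash concrete, a short sketch of a few of the overlapping
    vector assignments (numbers from Intel's reservation of vectors 0-31 and
    the original PC BIOS; the enum itself is just illustrative):

        /* Intel reserved vectors 0-31 for CPU exceptions and future use;
         * IBM put the 8259 hardware IRQs at 8-15 and BIOS services at 16-31,
         * so e.g. vector 0x0D is both IRQ5 and, on the 286 and later, the
         * general protection fault. */
        enum pc_vectors {
            VEC_DIVIDE_ERROR = 0x00, /* Intel: CPU exception                        */
            VEC_IRQ0_TIMER   = 0x08, /* IBM: timer tick; Intel: double fault (286+) */
            VEC_IRQ1_KBD     = 0x09, /* IBM: keyboard                               */
            VEC_IRQ5         = 0x0D, /* IBM: IRQ5; Intel: #GP on 286+               */
            VEC_BIOS_VIDEO   = 0x10, /* IBM: INT 10h video services                 */
            VEC_BIOS_DISK    = 0x13  /* IBM: INT 13h disk services                  */
        };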

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Lawrence D'Oliveiro on Mon Oct 7 16:16:32 2024
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 6 Oct 2024 21:53:36 -0000 (UTC), Brett wrote:

    The first 32 bit windows was Windows 95 ...

    Windows NT 3.1, 1993.

    So 8 years; that PC would still be in the trash can by then.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Anton Ertl on Mon Oct 7 16:32:34 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and
    considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.

    I have for decades pointed out that the four-bit segment shift of the 8086
    was planned obsolescence. An eight-bit shift, giving 16 megabytes of address
    space, would have kept the low end alive for too long in Intel's eyes. To
    control the market you need to drive complexity onto the users, which weeds
    out licensed competition.

    Everything Intel did drove needless patentable complexity into the follow
    on CPUs.
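
    To put numbers on that: real-mode 8086 addresses are formed by shifting
    the 16-bit segment value left four bits and adding the 16-bit offset,
    giving a 20-bit (1 MB) space; the eight-bit shift described above would
    have given 24 bits (16 MB). A small sketch (the "hypothetical" variant is
    of course not real hardware):

        #include <stdint.h>
        #include <stdio.h>

        /* Actual 8086/8088 address formation: 16-byte segment granularity,
         * 20-bit physical addresses, 1 MB address space. */
        static uint32_t phys_8086(uint16_t seg, uint16_t off)
        {
            return (((uint32_t)seg << 4) + off) & 0xFFFFF;
        }

        /* The alternative suggested above: 256-byte granularity would have
         * produced 24-bit addresses, i.e. a 16 MB address space. */
        static uint32_t phys_hypothetical_8bit(uint16_t seg, uint16_t off)
        {
            return (((uint32_t)seg << 8) + off) & 0xFFFFFF;
        }

        int main(void)
        {
            /* The CGA text frame buffer mentioned elsewhere in the thread
             * sits at segment B800h. */
            printf("4-bit shift: %05lX\n",
                   (unsigned long)phys_8086(0xB800, 0x0000));              /* B8000  */
            printf("8-bit shift: %06lX\n",
                   (unsigned long)phys_hypothetical_8bit(0xB800, 0x0000)); /* B80000 */
            return 0;
        }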

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Brett on Mon Oct 7 19:57:44 2024
    On Mon, 7 Oct 2024 16:16:32 -0000 (UTC)
    Brett <ggtgp@yahoo.com> wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Sun, 6 Oct 2024 21:53:36 -0000 (UTC), Brett wrote:

    The first 32 bit windows was Windows 95 ...

    Windows NT 3.1, 1993.

    So 8 years, that PC would still be


    Wikipedia:
    Development of i386 technology began in 1982 under the internal name of
    P3.[4] The tape-out of the 80386 development was finalized in July
    1985.[4] The 80386 was introduced as pre-production samples for
    software development workstations in October 1985.[5] Manufacturing of
    the chips in significant quantities commenced in June 1986.

    in the trash can by then.

    Not every PC made in those years was crap. Some of them were quite
    reliable and lasted long.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Brett on Mon Oct 7 20:03:35 2024
    On Mon, 7 Oct 2024 16:32:34 -0000 (UTC)
    Brett <ggtgp@yahoo.com> wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and
    considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and
    in the 80286 they finally got around to it. And the idea was (like
    AFAIK in the iAPX432) to have one segment per object and per
    procedure, i.e., the large memory model. The smaller memory models
    were possible, but not really intended. The Huge memory model was completely alien to protected mode, as was direct hardware access,
    as was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would
    have been too inefficient, and also that 8192 segments is not
    enough for that kind of usage, given 640KB of RAM (not to mention
    the 16MB that the 286 supported); and with 640KB having the
    segments limited to 64KB is too restrictive for a number of
    applications.

    I have for decades pointed out that the four-bit segment shift of the
    8086 was planned obsolescence. An eight-bit shift, giving 16 megabytes
    of address space, would have kept the low end alive for too long in
    Intel's eyes. To control the market you need to drive complexity onto
    the users, which weeds out licensed competition.

    Everything Intel did drove needless patentable complexity into the
    follow on CPUs.

    - anton




    You forget that Intel didn't and couldn't expect that the 8088 would be
    such a stunning success. Not just that: according to oral histories, they
    didn't realize what they had on their hands until 1983.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Michael S on Mon Oct 7 17:40:04 2024
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 7 Oct 2024 16:32:34 -0000 (UTC)
    Brett <ggtgp@yahoo.com> wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and
    considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and
    in the 80286 they finally got around to it. And the idea was (like
    AFAIK in the iAPX432) to have one segment per object and per
    procedure, i.e., the large memory model. The smaller memory models
    were possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access,
    as was common on the IBM PC. And computing with segment register
    contents was also not intended.

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would
    have been too inefficient, and also that 8192 segments is not
    enough for that kind of usage, given 640KB of RAM (not to mention
    the 16MB that the 286 supported); and with 640KB having the
    segments limited to 64KB is too restrictive for a number of
    applications.

    I have for decades pointed out that the four-bit segment shift of the
    8086 was planned obsolescence. An eight-bit shift, giving 16 megabytes
    of address space, would have kept the low end alive for too long in
    Intel's eyes. To control the market you need to drive complexity onto
    the users, which weeds out licensed competition.

    Everything Intel did drove needless patentable complexity into the
    follow on CPUs.

    You forget that Intel didn't and couldn't expect that the 8088 would be
    such a stunning success. Not just that: according to oral histories, they
    didn't realize what they had on their hands until 1983.

    Today the 8088 would be a joke of a microcontroller, but that was not the case when it was introduced. The 8088 was a major project with major profits, not some afterthought.

    Yes, the success eventually dwarfed expectations, but that was a lightning strike; the plan was in place, so the lightning strike could be taken advantage of to build a monopoly instead of the small walled fortress with a moat that was planned.

    You saw what happened to the MC680x0 series, which did not have a moat or a
    good plan.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Oct 7 16:00:38 2024
    Not every PC made in those years was crap. Some of them were quite
    reliable and lasted long.

    But back then, Dennard scaling meant that an 8-year-old PC was so much
    slower than a current PC that it was difficult to find people willing to
    still use it.

    Nowadays, for a large proportion of tasks, you can't really tell the
    difference between a last-generation CPU and an 8-year-old CPU, so reliability is much more of a factor.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Stefan Monnier on Tue Oct 8 00:11:31 2024
    On Mon, 07 Oct 2024 16:00:38 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    Not every PC made in those years was crap. Some of them were quite reliable and lasted long.

    But back then, Dennard scaling meant that an 8 year-old PC was so much
    slower than a current PC that it was difficult to find people willing
    to still use it.

    Nowadays, for a large proportion of tasks, you can't really tell the difference between a last-generation CPU and an 8 year-old CPU, so the reliability is much more of a factor.


    Stefan

    In March 1992, as a new employee, I was given a PC based on a 386SX.
    I don't remember whether the clock was 16 MHz or 20 MHz, but it was no more than 20.
    A year and a half later, when I started working at a client's site most of
    the time, this PC was still my only desktop whenever I came back to the
    office.
    A high-end PC made in 1986, e.g. a Compaq Deskpro 386, would have been
    non-trivially faster than this cheap (but far from the cheapest)
    computer that I was still using daily 7.5 years later.

    Did it feel so slow that it was difficult to use? No, not for what I was
    doing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Mon Oct 7 21:46:28 2024
    On Mon, 7 Oct 2024 19:57:44 +0300, Michael S wrote:

    The 80386 was introduced as pre-production samples for software
    development workstations in October 1985.[5] Manufacturing of the chips
    in significant quantities commenced in June 1986.

    And the first vendor to offer a Microsoft-compatible PC product based on
    that chip? Compaq, with its “Deskpro 386” that same year, I believe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Anton Ertl on Mon Oct 7 21:52:31 2024
    On Mon, 07 Oct 2024 07:33:14 GMT, Anton Ertl wrote:

    Here's another speculation: The 286 protected mode was what they already
    had in mind when they built the 8086, but there were not enough
    transistors to do it in the 8086, so they did real mode, and in the
    80286 they finally got around to it.

    Nah. Intel were never capable of thinking that far ahead. Each bodge to
    the x80/x86 line was made just to take the product one step further,
    without regard to any future growth. The 8086/8088 was designed to make it
    easy to port across 8080/8085 code, with the segment registers tacked on
    to give you a bit more address space if you needed it, if you could figure
    out how to use it -- that was their idea of “technological progress”.

    And then the next step was to add this new-fangled “hardware memory protection” that the folks using the Big Computers were always going on about, so they bodged the 8086 segmentation scheme into kind of a memory-management scheme in the 80286.

    Finally, in the 80386, they gave everybody the large, linear address space
    they had been crying out for. That is, everybody who wasn’t already using more sensibly-designed chips from companies like Motorola, NatSemi and the
    RISC vendors.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Terje Mathisen on Mon Oct 7 21:55:05 2024
    On Mon, 7 Oct 2024 15:17:36 +0200, Terje Mathisen wrote:

    640K was an artifact of the frame buffer placement selected by the
    original IBM PC, it could just as well have been 900+ K.

    Another MS-DOS machine, the DEC Rainbow, could be upgraded to 896KiB of
    RAM. I know because our Comp Sci department had one.

    That was the one with the dual Z80 and 8086 (8088?) chips, so it could run
    3 different OSes: CP/M-80, CP/M-86, and MS-DOS. Not more than one at once, though (that would have been some trick).

    But it was not fully hardware-compatible with the IBM PC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Oct 7 23:13:25 2024
    On Mon, 7 Oct 2024 7:33:14 +0000, Anton Ertl wrote:

    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Oct6.150415@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    I find it hard to believe that many customers would ask Intel
    for something like the 80286 protected mode with segments limited
    to 64KB, and even if they did, that Intel would listen to them. This
    looks much more like an idée fixe to me that one or more of
    the 286 project leaders had, and all customer input was made
    to fit into this idea, or was ignored.

    Either half-remembered from older architectures, or re-invented and considered viable a decade after the original inventors had learned
    better.

    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model. The smaller memory models were
    possible, but not really intended. The Huge memory model was
    completely alien to protected mode, as was direct hardware access, as
    was common on the IBM PC. And computing with segment register
    contents was also not intended.

    Is protected mode not "how Pascal" thinks of memory and objects
    in memory ??

    If programmers had used the 8086 in the intended way, porting to
    protected mode would have been easy, but the programmers used it in
    other ways, and the protected mode flopped.

    Whereas by the time 286 got out, everybody was wanting flat
    memory ala C.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, given 640KB of RAM (not to mention the 16MB that
    the 286 supported); and with 640KB having the segments limited to 64KB
    is too restrictive for a number of applications.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Oct 8 06:16:12 2024
    On Mon, 7 Oct 2024 23:13:25 +0000, MitchAlsup1 wrote:

    Is protected mode not "how Pascal" thinks of memory and objects in
    memory ??

    How is that different from C?

    Whereas by the time 286 got out, everybody was wanting flat memory ala
    C.

    When did they not want that?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Tue Oct 8 07:28:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Is protected mode not "how Pascal" thinks of memory and objects
    in memory ??

    You can map the objects of standard C (essentially what one malloc()
    call gives you, or what a variable contains) and the equivalent in
    Pascal into a segment on the 286, but then that object is limited to
    64KB in size, and the program is limited to 8192 objects. And the
    program runs very slowly. So I doubt that Pascal compiler
    implementors thought that the 80286 was their dream machine, especially
    given that you can implement Pascal just as well, with fewer performance disadvantages, on hardware with flat memory (and still perform bounds
    checking where necessary). In particular, I doubt that Turbo Pascal
    used the 286 in this way.

    Whereas by the time 286 got out, everybody was wanting flat
    memory ala C.

    It's interesting that, when C was standardized, segmentation found
    its way into it through the rule that disallows subtracting and comparing
    addresses in different objects. This rules out performing certain
    forms of induction variable elimination by hand. So while flat memory
    is so much a part of C culture that you write "flat memory ala C", the
    standardized subset of C (what standard C fanatics claim is the only
    meaning of "C") actually specifies a segmented memory model.

    An interesting case is the Forth standard. It specifies "contiguous
    regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue Oct 8 10:40:09 2024
    Lawrence D'Oliveiro wrote:
    On Mon, 7 Oct 2024 19:57:44 +0300, Michael S wrote:

    The 80386 was introduced as pre-production samples for software
    development workstations in October 1985.[5] Manufacturing of the chips
    in significant quantities commenced in June 1986.

    And the first vendor to offer a Microsoft-compatible PC product based on
    that chip? Compaq, with its “Deskpro 386” that same year, I believe.

    I got one of those that fall; most impressive was the fact that you
    could order it with a 130 MB hard drive, an almost unheard-of size at
    the time.

    Even though this was an expensive PC, it cost no more with that drive
    (i.e. the highest-end version) than a Micropolis hard drive of the same
    size on its own. I.e. the PC was effectively free. :-)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Tue Oct 8 10:44:54 2024
    Lawrence D'Oliveiro wrote:
    On Mon, 7 Oct 2024 15:17:36 +0200, Terje Mathisen wrote:

    640K was an artifact of the frame buffer placement selected by the
    original IBM PC, it could just as well have been 900+ K.

    Another MS-DOS machine, the DEC Rainbow, could be upgraded to 896KiB of
    RAM. I know because our Comp Sci department had one.

    That was the one with the dual Z80 and 8086 (8088?) chips, so it could run
    3 different OSes: CP/M-80, CP/M-86, and MS-DOS. Not more than one at once, though (that would have been some trick).

    But it was not fully hardware-compatible with the IBM PC.

    When I was hired by Hydro in 1984, my boss decided that he liked the
    Rainbow best, so he took responsibility for that model, while I got all
    the IBM compatibles: hardware, software, add-on cards, etc. for a company
    with 77K employees in 130 countries.

    By the time Hydro was broken up into 3-4 separate companies (around
    2008?), I had to share that responsibility with 200-300 other people.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Oct 8 16:00:17 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model.

    If you look at the 8086 manuals, that's clearly what they had in mind.

    What I don't get is that the 286's segment stuff was so slow. Even
    if you wanted to write code that put objects in segments, you really
    couldn't. You had to minimize the number of segment switches to get
    decent performance.

    Would it have been different if the 8086/8088 had already had
    protected mode? I think that having one segment per object would have
    been too inefficient, and also that 8192 segments is not enough for
    that kind of usage, ...

    That's all true, but I'd think the slowness of segment switches would
    be even more of a problem.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Tue Oct 8 16:23:32 2024
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Here's another speculation: The 286 protected mode was what they
    already had in mind when they built the 8086, but there were not
    enough transistors to do it in the 8086, so they did real mode, and in
    the 80286 they finally got around to it. And the idea was (like AFAIK
    in the iAPX432) to have one segment per object and per procedure,
    i.e., the large memory model.

    If you look at the 8086 manuals, that's clearly what they had in mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.
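
    For a sense of what that load involves: each 286 descriptor is an 8-byte
    table entry that has to be fetched and validated on every segment-register
    write. A rough sketch of the layout (field names are mine, not Intel's):

        #include <stdint.h>

        /* Approximate layout of an 80286 segment descriptor, as stored in
         * the GDT or LDT and loaded into a hidden descriptor cache whenever
         * a segment register is written. Fetching and checking these extra
         * bytes is the setup cost being discussed. */
        struct descriptor_286 {
            uint16_t limit;     /* segment limit, 0..65535                   */
            uint16_t base_low;  /* base address bits 0..15                   */
            uint8_t  base_high; /* base address bits 16..23                  */
            uint8_t  access;    /* present bit, DPL, type, accessed bit      */
            uint16_t reserved;  /* must be zero on the 286 (used by the 386) */
        };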

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Oct 8 20:53:00 2024
    On Tue, 8 Oct 2024 6:16:12 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 7 Oct 2024 23:13:25 +0000, MitchAlsup1 wrote:

    Is protected mode not "how Pascal" thinks of memory and objects in
    memory ??

    How is that different from C?

    In Pascal you cannot subtract pointers to different objects;
    in C you can.

    Whereas by the time 286 got out, everybody was wanting flat memory ala
    C.

    When did they not want that?

    The Algol family of block structure gave the illusion that flat memory
    was less necessary and that it could all be done with lexical
    addressing and block scoping rules.

    Then malloc() and mmap() came along.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Tue Oct 8 21:03:40 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    If you look at the 8086 manuals, that's clearly what they had in mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.

    Right, and they appeared not to care or realize it was a performance problem.

    They didn't even do obvious things like checking whether you're reloading the same value into the segment register and skipping the rest of the setup. Sure, you could put such checks in your code and skip the segment load yourself, but that would make your code a lot bigger and uglier.
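
    In source form, the check being described would look roughly like this
    (load_es() is a stand-in for whatever inline assembly actually writes the
    segment register; it is not a real API):

        #include <stdint.h>

        /* Hypothetical helper: the real thing would be a MOV ES,reg or
         * similar in assembly. */
        extern void load_es(uint16_t selector);

        static uint16_t cached_es; /* last selector loaded into ES */

        /* Skip the slow protected-mode descriptor load when the selector has
         * not changed - the software workaround for a check the 286 could
         * have done in hardware. */
        static void set_es(uint16_t selector)
        {
            if (selector != cached_es) {
                load_es(selector);
                cached_es = selector;
            }
        }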

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to D'Oliveiro on Tue Oct 8 22:28:00 2024
    In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    Dave Cutler came from DEC (where he was one of the resident
    Unix-haters) to mastermind the Windows NT project in 1988. When did
    the OS/2-NT pivot take place?

    1990, after the release of Windows 3.0, which was an immediate commercial success. It was the first version that you could get serious work out of.
    It's been compared to a camel: a vicious brute at times, but capable of
    doing a lot of carrying.

    <https://en.wikipedia.org/wiki/OS/2#1990:_Breakup>

    Funny, you'd think they would use that same _personality_ system to
    implement WSL1, the Linux-emulation layer. But they didn't.

    They were called subsystems in Windows NT, and ran on top of the NT
    kernel. The POSIX one came first, and was very limited, followed by the
    Interix one that was called Windows Services for Unix. Programs for both
    of these were in PE-COFF format, not ELF. There was also the OS/2
    subsystem, but it only ran text-mode programs.

    The POSIX subsystem was there to meet US government purchasing
    requirements, not to be used for anything serious. I can't imagine Dave
    Cutler was keen on it.

    WSL1 seems to have been something odd: rather than a single subsystem, a
    bunch of mini-subsystems. However, VMS/NT-style kernels just make different assumptions about programs than Unix-style kernels do, so they went to
    lightweight virtualisation in WSL2.

    <https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux#History>

    The same problem seems to have messed up all the attempts to provide good
    Unix emulation on VMS. It's notable that MICA started out trying to
    provide both VMS and Unix APIs, but this was dropped in favour of a
    separate Unix OS before MICA was cancelled.

    <https://en.wikipedia.org/wiki/DEC_MICA#Design_goals>

    I think the whole _personality_ concept, along with the supposed
    portability to non-x86 architectures, had just bit-rotted away by
    that point.

    Some combination of that, Microsoft confidence that "of course we can do something better now!" - they are very prone to overconfidence - and the terrible tendency of programmers to ignore the details of the old code.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Wed Oct 9 08:48:26 2024
    On 08/10/2024 22:53, MitchAlsup1 wrote:
    On Tue, 8 Oct 2024 6:16:12 +0000, Lawrence D'Oliveiro wrote:

    On Mon, 7 Oct 2024 23:13:25 +0000, MitchAlsup1 wrote:

    Is protected mode not "how Pascal" thinks of memory and objects in
    memory ??

    How is that different from C?

    In pascal you cannot subtract pointers to different objects,
    in C you can.

    No, you can't - unless the pointers are of compatible types, and each
    points to a sub-object within the same encompassing object. So if you
    have two pointers that point within the same array, you can subtract
    them. If they point to different objects, trying to subtract them is
    undefined behaviour.


    Whereas by the time 286 got out, everybody was wanting flat memory ala
    C.

    When did they not want that?

    The Algol family of block structure gave the illusion that flat
    was less necessary and it could all be done with lexical address-
    ing and block scoping rules.

    Then malloc() and mmap() came along.

    malloc() does not need a flat memory space. C does not need a flat
    memory space. Indeed, people use C all the time on systems where memory
    is disjoint.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Wed Oct 9 10:24:34 2024
    On 08/10/2024 09:28, Anton Ertl wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    Whereas by the time 286 got out, everybody was wanting flat
    memory ala C.

    It's interesting that, when C was standardized, the segmentation found
    its way into it by disallowing subtracting and comparing between
    addresses in different objects.

    It is difficult to talk about the timing of features (either things that
    are allowed, or things explicitly disallowed) before the standardisation
    of C, as there was no single language "C". Different variants supported
    by different compilers had different rules.

    This disallows performing certain
    forms of induction variable elimination by hand. So while flat memory
    is C culture so much that you write "flat memory ala C", the
    standardized subset of C (what standard C fanatics claim is the only
    meaning of "C") actually specifies a segmented memory model.


    No, the C standard does not in any sense specify a segmented memory
    model. Nor does it specify a non-segmented or flat or contiguous memory.

    The nearest it gets is the description of converting between pointers
    and integers, where it says that the result of converting a pointer to an
    integer might not fit in any integer type, in which case the conversion
    is undefined behaviour - but if pointers /are/ convertible, the intention
    is that the value (of type "uintptr_t") should be consistent with "the addressing structure of the execution environment".

    The way C is specified is intended to be strong enough to allow
    programmers to do all they generally need to do using portable code
    (i.e., code that doesn't rely on anything other than standard
    behaviour), without unnecessarily restricting the kinds of systems that
    can implement C, and without unnecessarily restricting what people can
    write in non-portable code.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library. (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    In practice, on all but the most niche or specialised platforms, if you
    do feel you need to compare random pointers, you can cast them to
    uintptr_t and compare these. That will generally work on segmented, non-contiguous or flat memories.
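
    A sketch of that idiom; as noted, it is not guaranteed by the standard,
    but it works on typical implementations where pointers convert cleanly to
    integers:

        #include <stdint.h>
        #include <stdbool.h>

        /* Compare two possibly-unrelated pointers by converting them to
         * uintptr_t. Not strictly portable C - the standard only promises
         * that converting back yields the original pointer - but fine on the
         * usual flat-memory (and many segmented) implementations. */
        static bool ptr_before(const void *a, const void *b)
        {
            return (uintptr_t)a < (uintptr_t)b;
        }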


    An interesting case is the Forth standard. It specifies "contiguous regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.


    Forth does not require a flat memory model in the hardware, as far as I
    am aware, any more than C does. (I appreciate that your knowledge of
    Forth is /vastly/ greater than mine.) A Forth implementation could
    interpret part of the address value as the segment or other memory block identifier and part of it as an index into that block, just as a C implementation can.

    A flat address model is almost certainly more /efficient/, for C, Forth
    and many other languages. But that does not mean a particular model is /required/ or specified by the language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Wed Oct 9 16:28:19 2024
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library. (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Oct 9 16:42:38 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library. (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    In almost every mainstream implementation, memmove() is written
    in assembler in order to inject the appropriate prefetches and
    follow the recommended instruction usage per the target architecture's
    software optimization guide.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Wed Oct 9 18:10:44 2024
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot: their own implementation
    details. (Some do not, such as f2c.)
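
    One common way to get such a sentinel without relying on implementation
    details is to reserve a distinct static object for it, so that only
    pointer equality (which is always defined) is needed; the names below are
    invented for the example:

        #include <stdio.h>

        struct expr { int op; /* ... */ };

        /* A distinct object whose address can never compare equal to a real
         * node or to NULL, used as an "error occurred while parsing" marker. */
        static struct expr parse_error_obj;
        #define PARSE_ERROR (&parse_error_obj)

        static struct expr *parse_subexpr(const char *src)
        {
            if (src == NULL || *src == '\0')
                return PARSE_ERROR; /* an error, distinct from NULL */
            /* ... real parsing would go here ... */
            return NULL;            /* "no subexpression" in this toy example */
        }

        int main(void)
        {
            if (parse_subexpr("") == PARSE_ERROR)
                puts("parse error");
            return 0;
        }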

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to John Dallman on Wed Oct 9 13:37:41 2024
    John Dallman wrote:
    In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:

    Dave Cutler came from DEC (where he was one of the resident
    Unix-haters) to mastermind the Windows NT project in 1988. When did
    the OS/2-NT pivot take place?

    1990, after the release of Windows 3.0, which was an immediate commercial success. It was the first version that you could get serious work out of. It's been compared to a camel: a vicious brute at times, but capable of
    doing a lot of carrying.

    <https://en.wikipedia.org/wiki/OS/2#1990:_Breakup>

    Funny, you'd think they would use that same _personality_ system to
    implement WSL1, the Linux-emulation layer. But they didn't.

    They were called subsystems in Windows NT, and ran on top of the NT
    kernel. The POSIX one came first, and was very limited, followed by the Interix one that was called Windows Services for Unix. Programs for both
    of these were in PE-COFF format, not ELF. There was also the OS/2
    subsystem, but it only ran text-mode programs.

    The POSIX subsystem was there to meet US government purchasing
    requirements, not to be used for anything serious. I can't imagine Dave Cutler was keen on it.

    The Posix interface support was there so *MS* could bid on US government
    and military contracts which, in that time frame, were making noise about
    it being standard for all their contracts.
    The Posix DLLs didn't come with WinNT; you had to ask MS for them specially.

    The US government eventually stopped pushing for Posix and Windows
    support for it quietly disappeared.

    WinNT's OS2 subsystem also quietly disappeared.

    WSL1 seems to have been something odd: rather than a single subsystem, a bunch of mini-subsystems. However, VMS/NT kernels just have different assumptions about programs from Unix-style kernels, so they went to lightweight virtualisation in WSL2.

    Yes. VMS and WinNT handle memory sections differently than *nix.
    That difference makes the fork() system call essentially impossible to
    implement on VMS/WinNT except by copying the address space.

    Note that back then Posix did not require that fork() be supported,
    just fork-exec (aka spawn), which does not require duplicating the memory space, just carrying file and socket handles over to the child process, which
    NT handles natively.
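
    That fork-exec-without-fork path is roughly what posix_spawn(), added to
    POSIX much later, packages up; a minimal sketch with error handling
    trimmed:

        #include <spawn.h>
        #include <stdio.h>
        #include <sys/types.h>
        #include <sys/wait.h>

        extern char **environ;

        /* Launch a child without first duplicating the parent's address
         * space - the "spawn" model that maps naturally onto VMS/NT process
         * creation. */
        int main(void)
        {
            pid_t pid;
            char *argv[] = { "ls", "-l", NULL };

            if (posix_spawnp(&pid, "ls", NULL, NULL, argv, environ) != 0) {
                perror("posix_spawnp");
                return 1;
            }
            waitpid(pid, NULL, 0);
            return 0;
        }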

    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations
    in page table and page file handling.

    Whereas in *nix a process can map a file, with just one user of that section; then it forks, and now there are multiple section users. Then that child can
    change the address space and fork again. *nix needs to maintain various
    data structures to support forking memory just in case it happens.

    WSL1 was an _emulation_ of Linux, essentially as a subsystem in the way OS/2 and
    Posix were supported. WSL1 apparently supported fork(), but did so by
    copying the memory space, making it slow, whereas fork-exec/spawn would be fast. Trying to emulate Linux with a privileged subsystem of helper processes
    was likely (I never used it) a lot of work, slow, and flaky.

    WSL2 sounds like they tossed the whole WSL1 thing and built a Hyper-V
    virtual machine to run native Linux on top of WinNT as a host.

    <https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux#History>

    The same problem seems to have messed up all the attempts to provide good Unix emulation on VMS. It's notable that MICA started out trying to
    provide both VMS and Unix APIs, but this was dropped in favour of a
    separate Unix OS before MICA was cancelled.

    <https://en.wikipedia.org/wiki/DEC_MICA#Design_goals>

    I think the whole _personality_ concept, along with the supposed
    portability to non-x86 architectures, had just bit-rotted away by
    that point.

    Some combination of that, Microsoft confidence that "of course we can do something better now!" - they are very prone to overconfidence - and the terrible tendency of programmers to ignore the details of the old code.

    John

    Back then "object oriented" and "micro-kernel" buzzwords were all the rage.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Oct 9 16:01:42 2024
    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations
    in page table and page file handling.

    Interesting. Do you happen to have a pointer for further reading
    about it?

    *nix needs to maintain various data structures to support forking
    memory just in case it happens.

    I can't imagine what those data structures would be (which might be just
    another way to say that I was brought up on POSIX and can't imagine the
    world differently).


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Wed Oct 9 22:22:16 2024
    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Wed Oct 9 22:20:42 2024
    On 09/10/2024 18:28, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".  Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.  (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    They don't have to write it in standard, portable C. Standard libraries
    will, sometimes, use "magic" - they can be in assembly, or use compiler extensions, or target-specific features, or "-fsecret-flag-for-std-lib" compiler flags, or implementation-dependent features, or whatever they want.

    You will find that most implementations of memmove() are done by
    converting the pointers to an unsigned integer type and comparing those
    values. The type chosen may be implementation-dependent, or it may be "uintptr_t" (even if you are using C90 for your code, the library
    writers can use C99 for theirs).
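
    A sketch of the kind of implementation being described, with the same
    caveat: it leans on pointer-to-integer conversion behaving sensibly, which
    a library author may assume for their own platform but portable code may
    not (this is a simplified illustration, not any particular libc's code):

        #include <stddef.h>
        #include <stdint.h>

        /* Simplified byte-at-a-time memmove: pick the copy direction by
         * comparing the addresses as integers. Real implementations copy in
         * wider units and add prefetching. */
        void *my_memmove(void *dst, const void *src, size_t n)
        {
            unsigned char *d = dst;
            const unsigned char *s = src;

            if ((uintptr_t)d < (uintptr_t)s) {
                for (size_t i = 0; i < n; i++)      /* copy forwards  */
                    d[i] = s[i];
            } else {
                for (size_t i = n; i > 0; i--)      /* copy backwards */
                    d[i - 1] = s[i - 1];
            }
            return dst;
        }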

    Such implementations will not be portable to all systems. They won't
    work on a target that has some kind of "fat" pointers or segmented
    pointers that can't be translated properly to integers.

    That's okay, of course. For targets that have such complications, that standard library function will be written in a different way.

    The avrlibc library used by gcc for the AVR has its memmove()
    implemented in assembly for speed, as does musl for some architectures.

    There are lots of parts of the standard C library that cannot be written completely in portable standard C. (How would you write a function that handles files? You need non-portable OS calls.) That's why these
    things are in the standard library in the first place.
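
    For concreteness, a minimal sketch of the technique described above - not any particular library's actual code. The direction test converts the pointers to uintptr_t, which relies on implementation-defined behaviour, so this is library/implementation code rather than portable standard C:

    #include <stddef.h>
    #include <stdint.h>

    void *my_memmove(void *s1, const void *s2, size_t n)
    {
        unsigned char *d = s1;
        const unsigned char *s = s2;
        if ((uintptr_t)d < (uintptr_t)s) {
            for (size_t i = 0; i < n; i++)       /* destination below source: copy forwards */
                d[i] = s[i];
        } else {
            for (size_t i = n; i > 0; i--)       /* destination above source: copy backwards */
                d[i - 1] = s[i - 1];
        }
        return s1;
    }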

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Wed Oct 9 21:37:30 2024
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to David Brown on Wed Oct 9 14:52:39 2024
    On 10/9/2024 1:20 PM, David Brown wrote:
    On 09/10/2024 18:28, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 8:24:34 +0000, David Brown wrote:

    On 08/10/2024 09:28, Anton Ertl wrote:.

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".  Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.  (Standard
    library implementations don't need to be portable, and can rely on
    extensions or other compiler-specific features.)

    Somebody has to write memmove() and they want to use C to do it.

    They don't have to write it in standard, portable C.  Standard libraries will, sometimes, use "magic" - they can be in assembly, or use compiler extensions, or target-specific features, or "-fsecret-flag-for-std-lib" compiler flags, or implementation-dependent features, or whatever they
    want.

    You will find that most implementations of memmove() are done by
    converting the pointers to an unsigned integer type and comparing those values.  The type chosen may be implementation-dependent, or it may be "uintptr_t" (even if you are using C90 for your code, the library
    writers can use C99 for theirs).

    Such implementations will not be portable to all systems.  They won't
    work on a target that has some kind of "fat" pointers or segmented
    pointers that can't be translated properly to integers.

    That's okay, of course.  For targets that have such complications, that standard library function will be written in a different way.

    The avrlibc library used by gcc for the AVR has its memmove()
    implemented in assembly for speed, as does musl for some architectures.

    There are lots of parts of the standard C library that cannot be written completely in portable standard C.  (How would you write a function that handles files?  You need non-portable OS calls.)  That's why these
    things are in the standard library in the first place.

    I agree with everything you say up until the last sentence. There are
    several languages, mostly older ones like Fortran and COBOL, where the
    file handling/I/O are defined portably within the language proper, not
    in a separate library. It just moves the non-portable stuff from the
    library writer (as in C) to the compiler writer (as in Fortran, COBOL, etc.)


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stefan Monnier on Wed Oct 9 23:16:34 2024
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations
    in page table and page file handling.

    Interesting. Do you happen to have a pointer for further reading
    about it?

    *nix needs to maintain various data structures to support forking
    memory just in case it happens.

    I can't imagine what those datastructures would be (which might be just another way to say that I was brought up on POSIX and can't imagine the
    world differently).


    http://bitsavers.org/pdf/dec/vax/vms/training/EY-8264E-DP_VMS_Internals_and_Data_Structures_4.4_1988.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Oct 10 00:33:41 2024
    On Wed, 9 Oct 2024 21:52:39 +0000, Stephen Fuld wrote:

    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be written
    completely in portable standard C.  (How would you write a function that
    handles files? 

    Do you mean things other than open(), close(), read(), write(), lseek()
    ??

    You need non-portable OS calls.)  That's why these
    things are in the standard library in the first place.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Thu Oct 10 08:30:37 2024
    On 10/10/2024 02:33, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 21:52:39 +0000, Stephen Fuld wrote:

    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be written completely in portable standard C.  (How would you write a function that handles files?

    Do you mean things other than open(), close(), read(), write(), lseek()
    ??


    The C standard library provides functions like fopen(), fclose(),
    fwrite(), etc. It provides them because programs often need such functionality, and you cannot write them yourself in portable standard
    C. (As Stephen pointed out, C could have had them built into the
    language - for many good reasons, C did not go that route.)

    The functions you list here are the POSIX names - not the C standard
    library names. Those POSIX functions cannot be implemented in portable standard C either if you exclude making wrappers around the standard
    library functions.

    In both cases - implementing the standard library functions or
    implementing the POSIX functions - you need something beyond standard C,
    such as a way to call OS APIs.

    You need non-portable OS calls.)  That's why these
    things are in the standard library in the first place.
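
    As a sketch of that point (illustrative only; my_FILE and my_fopen are hypothetical and hugely simplified): even a bare-bones fopen() has to bottom out in something outside standard C, here the POSIX open()/close() calls, which in turn wrap OS system calls.

    #include <fcntl.h>     /* POSIX open() - already beyond standard C */
    #include <stdlib.h>
    #include <unistd.h>    /* POSIX close() */

    struct my_FILE { int fd; };            /* hypothetical, no buffering etc. */

    struct my_FILE *my_fopen(const char *path, const char *mode)
    {
        int flags = (mode[0] == 'r') ? O_RDONLY
                  : (mode[0] == 'w') ? (O_WRONLY | O_CREAT | O_TRUNC)
                  :                    (O_WRONLY | O_CREAT | O_APPEND);  /* "a" */
        int fd = open(path, flags, 0666);
        if (fd < 0)
            return NULL;
        struct my_FILE *f = malloc(sizeof *f);
        if (f == NULL) {
            close(fd);
            return NULL;
        }
        f->fd = fd;
        return f;
    }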

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Stephen Fuld on Thu Oct 10 08:24:32 2024
    On 09/10/2024 23:52, Stephen Fuld wrote:
    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be
    written completely in portable standard C.  (How would you write a
    function that handles files?  You need non-portable OS calls.)  That's
    why these things are in the standard library in the first place.

    I agree with everything you say up until the last sentence.  There are several languages, mostly older ones like Fortran and COBOL, where the
    file handling/I/O are defined portably within the language proper, not
    in a separate library.  It just moves the non-portable stuff from the library writer (as in C) to the compiler writer (as in Fortran, COBOL,
    etc.)



    I meant that this is why these features have to be provided, rather than
    left for the user to implement themselves. They could also have been
    provided in the language itself (as was done in many other languages) -
    the point is that you cannot write the file access functions in pure
    standard C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Thu Oct 10 08:31:52 2024
    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not. It has absolutely /nothing/ to do with the ISA.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Thu Oct 10 18:38:55 2024
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects? For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not. It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Thu Oct 10 20:00:29 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:


    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C. That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers. Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code generator,
    its run-time support library, and C standard libraries that can work
    better if they are optimised for each new generation of processor.
    Sometimes you just need to re-compile the library with a newer compiler
    and appropriate flags, other times you need to modify the library source code. None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcopy() if you have an ISA that supports scalable vector
    processing of some kind, such as ARM and RISC-V have, rather than
    explicitly sized SIMD registers.


    Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle memcpy
    and memset.

    They're three-instruction sets; prolog/body/epilog. There are separate
    sets for forward vs. forward-or-backward copies.

    The prolog instruction preconditions the copy and copies
    an IMPDEF portion.

    The body instruction performs an IMPDEF portion, and
    the epilog instruction finalizes the copy.

    The three instructions are issued consecutively.
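
    For concreteness, a hedged sketch of how the forward-only forms (CPYFP/CPYFM/CPYFE) might be wrapped from C with inline assembly, assuming a GCC/Clang toolchain and assembler with FEAT_MOPS support (e.g. -march=armv8.8-a) and a core that implements the instructions:

    #include <stddef.h>

    static void mops_memcpy_fwd(void *dst, const void *src, size_t n)
    {
        void *d = dst;
        const void *s = src;
        __asm__ volatile(
            "cpyfp [%0]!, [%1]!, %2!\n\t"   /* prolog: precondition, copy an IMPDEF portion */
            "cpyfm [%0]!, [%1]!, %2!\n\t"   /* body:   copy an IMPDEF portion */
            "cpyfe [%0]!, [%1]!, %2!"       /* epilog: finalize the copy */
            : "+r"(d), "+r"(s), "+r"(n)
            :
            : "memory");
    }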

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Thu Oct 10 21:21:20 2024
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects? For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C. That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers. Often that can be eliminated when the compiler optimises the functions inline - when the
    compiler knows the size of the move/copy, it can optimise directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code generator,
    its run-time support library, and C standard libraries that can work
    better if they are optimised for each new generation of processor.
    Sometimes you just need to re-compile the library with a newer compiler
    and appropriate flags, other times you need to modify the library source
    code. None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcpy() if you have an ISA that supports scalable vector
    processing of some kind, such as ARM and RISC-V have, rather than
    explicitly sized SIMD registers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Thu Oct 10 23:54:15 2024
    On Thu, 10 Oct 2024 20:00:29 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:


    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C. That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly
    or using inline assembly, rather than in non-portable C (which is
    the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers. Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise directly.

    The use of wider register sizes can help to some extent, but not
    once you have reached the width of the internal buses or cache
    bandwidth.

    In general, there will be many aspects of a C compiler's code
    generator, its run-time support library, and C standard libraries
    that can work better if they are optimised for each new generation
    of processor. Sometimes you just need to re-compile the library with
    a newer compiler and appropriate flags, other times you need to
    modify the library source code. None of this is specific to
    memmove().

    But it is true that you get an easier and more future-proof
    memmove() and memcopy() if you have an ISA that supports scalable
    vector processing of some kind, such as ARM and RISC-V have, rather
    than explicitly sized SIMD registers.


    Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
    memcpy and memset.

    They're three-instruction sets; prolog/body/epilog. There are
    separate sets for forward vs. forward-or-backward copies.

    The prolog instruction preconditions the copy and copies
    an IMPDEF portion.

    The body instruction performs an IMPDEF Portion and

    the epilog instruction finalizes the copy.

    The three instructions are issued consecutively.

    People with more of a clue about Arm Inc.'s schedule than I have
    expect Arm Cortex cores that implement these instructions to be
    announced next May and to appear in actual [expensive] phones in 2026.
    That probably means 2027 at best for Neoverse cores.

    It's hard to make an educated guess about the schedules of other Arm core
    designers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Thu Oct 10 21:03:33 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 10 Oct 2024 20:00:29 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    David Brown <david.brown@hesbynett.no> writes:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:


    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C. That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly
    or using inline assembly, rather than in non-portable C (which is
    the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers. Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    The use of wider register sizes can help to some extent, but not
    once you have reached the width of the internal buses or cache
    bandwidth.

    In general, there will be many aspects of a C compiler's code
    generator, its run-time support library, and C standard libraries
    that can work better if they are optimised for each new generation
    of processor. Sometimes you just need to re-compile the library with
    a newer compiler and appropriate flags, other times you need to
    modify the library source code. None of this is specific to
    memmove().

    But it is true that you get an easier and more future-proof
    memmove() and memcopy() if you have an ISA that supports scalable
    vector processing of some kind, such as ARM and RISC-V have, rather
    than explicitly sized SIMD registers.


    Note that ARMv8 (via FEAT_MOPS) does offer instructions that handle
    memcpy and memset.

    They're three-instruction sets; prolog/body/epilog. There are
    separate sets for forward vs. forward-or-backward copies.

    The prolog instruction preconditions the copy and copies
    an IMPDEF portion.

    The body instruction performs an IMPDEF Portion and

    the epilog instruction finalizes the copy.

    The three instructions are issued consecutively.

    People that have more clue about Arm Inc schedule than myself
    expect Arm Cortex cores that implement these instructions to be
    announced next May and to appear in actual [expensive] phones in 2026.
    Which probably means 2027 at best for Neoverse cores.

    It's hard to make an educated guess about schedule of other Arm core designers.

    In the meantime, they've had "DC ZVA" for the special case of
    memset(,0,) since ARMv8.0.
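
    A minimal sketch of that special case, assuming AArch64 with GCC/Clang inline assembly; the pointer and length are assumed to be already aligned to the ZVA block size, and the DCZID_EL0.DZP "prohibited" check is omitted for brevity:

    #include <stddef.h>
    #include <stdint.h>

    /* Block size in bytes zeroed by one DC ZVA, from DCZID_EL0 (the BS
       field is log2 of the size in 4-byte words). */
    static size_t zva_block_size(void)
    {
        uint64_t dczid;
        __asm__("mrs %0, dczid_el0" : "=r"(dczid));
        return (size_t)4 << (dczid & 0xf);
    }

    static void zero_aligned(char *p, size_t len)
    {
        size_t blk = zva_block_size();
        for (size_t i = 0; i + blk <= len; i += blk)
            __asm__ volatile("dc zva, %0" : : "r"(p + i) : "memory");
    }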

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Thu Oct 10 21:30:38 2024
    On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:

    On 10/10/2024 20:38, MitchAlsup1 wrote:
    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.

    {
    memmove( p, q, size );
    }

    Where the compiler produces the MM instruction itself. Looks damn
    close to standard C to me !!
    OR
    for( int i = 0; i < size; i++ )
    p[i] = q[i];

    Which gets compiled to memcpy()--also looks to be standard C.
    OR

    p_struct = q_struct;

    gets compiled to::

    memmove( &p_struct, &q_struct, sizeof( q_struct ) );

    also looks to be std C.

    The thing is you are no longer writing memmove(), you are simply
    teaching the compiler to recognize its _use_ cases directly. In
    addition, these will always be within spitting distance of as fast
    as one can perform those activities.

    That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers.

    Given that we are talking about GBOoO machines here, the several
    AGEN units[1,2,3] have plenty of calculation BW to determine order
    without wasting cycles getting started.

    But given an LBIO machine, the ability to process memory-to-memory moves
    at cache port width is always an advantage except for cases needing
    only 1 read or 1 write--if you build the HW with these in mind.

    Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.

    In HW they should always be optimized.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Fri Oct 11 14:10:13 2024
    On 10/10/2024 23:30, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 19:21:20 +0000, David Brown wrote:

    On 10/10/2024 20:38, MitchAlsup1 wrote:
    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.

          {
               memmove( p, q, size );
          }


    What is that circular reference supposed to do? The whole discussion
    has been about the /fact/ that you cannot implement the "memmove"
    function in a C standard library using fully portable standard C code.

    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?


    You can implement "memcpy" in portable standard C, using a loop and
    array or pointer syntax (somewhat like your loop below, but with the
    correct type for the index). But you cannot do so for memmove() because
    you cannot identify the direction you need to run your loop in an
    efficient and fully portable manner.
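
    For instance, this kind of copy /is/ expressible in pure standard C (a sketch, with size_t as the index type); the same loop cannot be turned into a conforming memmove() because nothing in standard C tells you which direction is safe when the objects overlap:

    #include <stddef.h>

    void *my_memcpy(void *s1, const void *s2, size_t n)
    {
        unsigned char *d = s1;
        const unsigned char *s = s2;
        for (size_t i = 0; i < n; i++)       /* forward copy; fine when there is no overlap */
            d[i] = s[i];
        return s1;
    }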

    It does not matter what the target is - the target is totally irrelevant
    for /portable/ standard C code. If the target made a difference, it
    would not be portable!

    I can't understand why this is causing you difficulty.

    Perhaps you simply didn't understand what you wrote a few posts back,
    when you claimed that the reason people writing portable standard C code
    cannot write an efficient memmove() implementation is "a symptom of bad
    ISA design".


    Where the compiler produces the MM instruction itself. Looks damn
    close to standard C to me !!
    OR
          for( int i = 0, i < size; i++ )
               p[i] = q[i];

    Which gets compiled to memcpy()--also looks to be standard C.
    OR

          p_struct = q_struct;

    gets compiled to::

          memmove( &p_struct, &q_struct, sizeof( q_struct ) );

    also looks to be std C.


    Those are standard C, yes. And a good compiler will optimise such code.
    And if the target has some kind of scalable vector support or other
    dedicated instructions for moving or copying memory, it can do a better
    job of optimising the code.

    That has /nothing/ to do with the point under discussion.


    I think you are simply confused about what you are talking about here.
    Either you don't know what is meant by writing portable standard C, or
    you don't know what is meant by implementing a C standard library, or
    you haven't actually been reading the posts you replied to. You seem determined to make the point that /your/ ISA has useful and efficient instructions and features for memory copy functionality, while the x86
    ISA does not, and that means /your/ ISA is good design and the x86 ISA
    is bad design.

    Now, I will fully agree with you that the x86 is not a good design. The
    modern x86 processor devices are proof that you /can/ polish a turd.
    And I fully agree with you that arbitrary-length vector instructions of various sorts (of which memory copying is the simplest operation) have many advantages over SIMD using fixed-size vector
    registers. (ARM and RISC-V also agree with you there.)

    But that is all irrelevant to the discussion.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Brian G. Lucas on Fri Oct 11 13:37:03 2024
    On 10/10/2024 23:19, Brian G. Lucas wrote:
    On 10/10/24 2:21 PM, David Brown wrote:
    [ SNIP]

    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said there
    was no connection between the two concepts.

    If the compiler generates the memmove instruction, then one doesn't
    have to write memmove() in C - it is never called/used.


    The common case is that a good compiler will generate inline code for
    some cases - typically known (at compile-time) small sizes - and call a
    generic library function when the size is not known or is over a certain
    size. Then there are some targets where it will always call the library
    code, and some where it will always generate inline code.

    Even if the compiler /can/ generate inline code, there can be
    circumstances when it will not do so - such as if you have not enabled optimisation, or are optimising for size, or using a weaker compiler, or calling the function indirectly.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    The use of wider register sizes can help to some extent, but not once
    you have reached the width of the internal buses or cache bandwidth.

    In general, there will be many aspects of a C compiler's code
    generator, its run-time support library, and C standard libraries that
    can work better if they are optimised for each new generation of
    processor. Sometimes you just need to re-compile the library with a
    newer compiler and appropriate flags, other times you need to modify
    the library source code.  None of this is specific to memmove().

    But it is true that you get an easier and more future-proof memmove()
    and memcopy() if you have an ISA that supports scalable vector
    processing of some kind, such as ARM and RISC-V have, rather than
    explicitly sized SIMD registers.


    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable to
    /what/ ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Fri Oct 11 15:13:17 2024
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his my66k
    LLVM back end.

    But I am pretty sure that what you suggest is applicable, though a bad idea
    for a memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification, i.e.
    exactly the same mechanism that is done on "non-scalable"
    architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
    effort.
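
    A minimal sketch of that dispatch idea; all the names here (cpu_has_sve(), memcpy_sve(), memcpy_generic()) are hypothetical placeholders rather than any real library's API, and a production library would more likely resolve the choice at load time (e.g. via an ifunc) than with a lazy check:

    #include <stddef.h>

    void *memcpy_sve(void *d, const void *s, size_t n);      /* hypothetical SVE version */
    void *memcpy_generic(void *d, const void *s, size_t n);  /* hypothetical fallback */
    int cpu_has_sve(void);                                    /* hypothetical probe, e.g. via hwcaps */

    static void *(*memcpy_impl)(void *, const void *, size_t);

    void *my_memcpy(void *d, const void *s, size_t n)
    {
        if (!memcpy_impl)                     /* resolve once, on first use */
            memcpy_impl = cpu_has_sve() ? memcpy_sve : memcpy_generic;
        return memcpy_impl(d, s, n);
    }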

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Fri Oct 11 16:54:13 2024
    On 11/10/2024 14:13, Michael S wrote:
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that that it is not applicable to his my66k
    LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad idea
    for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification, i.e.
    exactly the same mechanism that is done on "non-scalable"
    architectures, would provide better performance. And memcpy/memmove is certainly sufficiently important to justify an additional development
    effort.


    That explanation helps a little, but only a little. I wasn't suggesting anything - or if I was, it was several posts ago and the context has
    long since been snipped. Can you be more explicit about what you think
    I was suggesting, and why it might not be a good idea for targeting a
    "my66k" ISA? (That is not a processor I have heard of, so you'll have
    to give a brief summary of any particular features that are relevant here.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Stephen Fuld on Fri Oct 11 08:15:29 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:

    On 10/9/2024 1:20 PM, David Brown wrote:

    There are lots of parts of the standard C library that cannot be
    written completely in portable standard C. (How would you write
    a function that handles files? You need non-portable OS calls.)
    That's why these things are in the standard library in the first
    place.

    I agree with everything you say up until the last sentence. There
    are several languages, mostly older ones like Fortran and COBOL,
    where the file handling/I/O are defined portably within the
    language proper, not in a separate library. It just moves the
    non-portable stuff from the library writer (as in C) to the
    compiler writer (as in Fortran, COBOL, etc.)

    What I think you mean is that I/O and file handling are defined as
    part of the language rather than being written in the language.
    Assuming that's true, what you're saying is not at odds with what
    David said. I/O and so forth cannot be written in unaugmented
    standard C without changing the language. Given the language as
    it is, these things must be put in the standard library, because
    they cannot be provided in the existing language.

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library. In
    particular, it makes for a very clean distinction between two
    kinds of implementation, what the C standard calls a freestanding implementation (which excludes most of the library) and a hosted
    implementation (which includes the whole library). This facility
    is what allows C to run easily on very small processors, because
    there is no overhead for non-essential language features. That is
    not to say such things couldn't be arranged for Fortran or COBOL,
    but it would be harder, because those languages are not designed
    to be separable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Fri Oct 11 18:55:29 2024
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Fri Oct 11 15:21:47 2024
    Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations in page table and page file handling.
    Interesting. Do you happen to have a pointer for further reading
    about it?

    *nix needs to maintain various data structures to support forking
    memory just in case it happens.
    I can't imagine what those datastructures would be (which might be just
    another way to say that I was brought up on POSIX and can't imagine the
    world differently).


    http://bitsavers.org/pdf/dec/vax/vms/training/EY-8264E-DP_VMS_Internals_and_Data_Structures_4.4_1988.pdf

    Yeah, that's a great book on how VMS works in detail.
    My copy is v1.0 from 1981.
    It describes the various data structures, some down to the bit level.
    Then chapter 15 Paging Dynamics walks through the details of how
    paging works.

    A book of comparable detail on Linux (but dated) would be:

    Understanding the Linux Virtual Memory Manager, Gorman, 2007 https://www.kernel.org/doc/gorman/pdf/understand.pdf

    Of a similar nature on Windows but without the detail of the above two is:

    (this appears to be two volumes jammed together)
    Windows Internals 6th ed vol 1&2, 2012 https://empyreal96.github.io/nt-info-depot/Windows-Internals-PDFs/Windows%20Internals%206e%20Part1%2B2.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Sat Oct 12 00:02:32 2024
    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

          .global memmove
    memmove:
          MM     R2,R1,R3
          RET

    sure !

    You are either totally clueless, or you are trolling. And I know you
    are not clueless.

    This discussion has become pointless.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Fri Oct 11 23:32:20 2024
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

          .global memmove
    memmove:
          MM     R2,R1,R3
          RET

    sure !

    You are either totally clueless, or you are trolling. And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed, these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers'
    efforts over decades, so they don't have to re-write libc every
    time a new set of instructions comes out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sat Oct 12 05:06:05 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const? That causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Sat Oct 12 05:11:44 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:

    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different
    objects? For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL. A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details. (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they
    can implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard
    library memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    Throughout this long thread you keep missing the point. Having
    different instructions available doesn't change the definition
    of the C language. It is possible to write code in standard C
    (which means, C that does NOT depend on any internal details of
    any implementation) to copy bytes from one place to another with
    semantics matching those of memmove(), BUT that code is clunky.
    To get a decent implementation of memmove() semantics requires
    knowledge of some internal implementation details that are not
    part of standard C. Whether those details are part of the
    compiler or part of the runtime environment (the library) is
    irrelevant - they still aren't part of standard C. Adding new
    instructions to the ISA, no matter what those new instructions
    are, cannot change that.
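
    For example, the "clunky" fully portable route might look like this sketch: correct in pure standard C, but it pays for a temporary buffer and an extra pass, and it can fail where a real memmove() cannot.

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    void *portable_memmove(void *s1, const void *s2, size_t n)
    {
        if (n == 0)
            return s1;
        unsigned char *tmp = malloc(n);
        if (tmp == NULL)
            return NULL;             /* the real memmove() has no failure case */
        memcpy(tmp, s2, n);          /* source to scratch buffer */
        memcpy(s1, tmp, n);          /* scratch buffer to destination */
        free(tmp);
        return s1;
    }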

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Sat Oct 12 15:20:13 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    In the VMS/WinNT way, each memory section is defined as either shared
    or private when created and cannot be changed. This allows optimizations in page table and page file handling.
    Interesting. Do you happen to have a pointer for further reading
    about it?

    *nix needs to maintain various data structures to support forking
    memory just in case it happens.
    I can't imagine what those datastructures would be (which might be just
    another way to say that I was brought up on POSIX and can't imagine the
    world differently).


    http://bitsavers.org/pdf/dec/vax/vms/training/EY-8264E-DP_VMS_Internals_and_Data_Structures_4.4_1988.pdf

    Yeah, that's a great book on how VMS works in detail.
    My copy is v1.0 from 1981.

    I also have a printed copy from 1981, along with the
    internals class notes and the microfiche.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Sat Oct 12 17:16:44 2024
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of different ways that functions like memcpy() and memset() can be implemented in different architectures and optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bernd Linsel@21:1/5 to David Brown on Sat Oct 12 19:26:30 2024
    On 12.10.24 17:16, David Brown wrote:


    [snip rant]



    You are aware that this is c.arch, not c.lang.c?



    --
    Bernd Linsel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to Brian G. Lucas on Sat Oct 12 18:17:18 2024
    Brian G. Lucas <bagel99@gmail.com> wrote:
    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Excellent!

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    What is the default virtual loop count if the register count is not
    available?

    Worst case the source and dest are in cache, and the count is 150 cycles
    away in memory. So hundreds of chars could be copied until the value is
    loaded and that count value could be say 5. Lots of work and time
    discarded, so you play the odds, perhaps to the low side and over prefetch
    to cover being wrong.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to David Brown on Sat Oct 12 18:33:17 2024
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would enjoy hearing about comparisons of different ways things functions like memcpy() and memset() can be implemented in different architectures and optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.

    There are only two decisions to make in memcpy: are the copies less than
    copy-size aligned, and do the pointers overlap within the copy size.

    For hardware this simplifies down to perhaps two types of copies, easy and hard.

    If you make hard fast, and you will, then two versions is all you need, not
    the dozens of choices with 1k of code you need in C.

    Often you know which of the two you want at compile time from the pointer
    type.

    In short, your complaints are wrong-headed in not understanding what
    hardware memcpy can do.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sat Oct 12 18:32:48 2024
    On Sat, 12 Oct 2024 5:06:05 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch prediction is easier and the code is shorter.

    The 3rd Operand can, indeed, be a constant.
    That causes no restartability problem when you have a place to
    store the current count==index, so that when control returns
    and you re-execute MM, it sees that x amount has already been
    done, and C-X is left.

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    That is what Predication is for.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Brett on Sat Oct 12 18:37:35 2024
    On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:

    Brian G. Lucas <bagel99@gmail.com> wrote:
    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Excellent!

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    What is the default virtual loop count if the register count is not available?

    There is always a count available; it can come from a register or an
    immediate.

    Worst case the source and dest are in cache, and the count is 150 cycles
    away in memory. So hundreds of chars could be copied until the value is loaded and that count value could be say 5.

    The instruction cannot start until the count is known. You don't start
    an FMAC until all 3 operands are ready, either.

    Lots of work and time
    discarded, so you play the odds, perhaps to the low side and over
    prefetch to cover being wrong.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to mitchalsup@aol.com on Sun Oct 13 01:25:13 2024
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Sat, 12 Oct 2024 18:17:18 +0000, Brett wrote:

    Brian G. Lucas <bagel99@gmail.com> wrote:
    On 10/12/24 12:06 AM, Brett wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const, that causes issues for restartability, but branch
    prediction is easier and the code is shorter.

    Yes.
    #include <string.h>

    void memmoverr(char to[], char fm[], size_t cnt)
    {
    memmove(to, fm, cnt);
    }

    void memmoverd(char to[], char fm[])
    {
    memmove(to, fm, 0x100000000);
    }
    Yields:
    memmoverr: ; @memmoverr
    mm r1,r2,r3
    ret
    memmoverd: ; @memmoverd
    mm r1,r2,#4294967296
    ret

    Excellent!

    Though I guess forwarding a const is probably a thing today to improve
    branch prediction, which is normally HORRIBLE for short branch counts.

    What is the default virtual loop count if the register count is not
    available?

    There is always a count available; it can come from a register or an immediate.

    Worst case the source and dest are in cache, and the count is 150 cycles
    away in memory. So hundreds of chars could be copied until the value is
    loaded and that count value could be say 5.

    The instruction cannot start until the count is known. You don't start
    an FMAC until all 3 operands are ready, either.

    That simplifies a lot of issues, thanks!

    Lots of work and time
    discarded, so you play the odds, perhaps to the low side and over
    prefetch to cover being wrong.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Sun Oct 13 10:56:20 2024
    On Sat, 12 Oct 2024 18:32:48 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Sat, 12 Oct 2024 5:06:05 +0000, Brett wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void * s2, size_t n)
    {
    return memmove(s1, s2, n);
    }

    in your library's source?

    .global memmove
    memmove:
    MM R2,R1,R3
    RET

    sure !


    Can R3 be a const? That causes issues for restartability, but branch prediction is easier and the code is shorter.

    The 3rd Operand can, indeed, be a constant.
    That causes no restartability problem when you have a place to
    store the current count==index, so that when control returns
    and you re-execute MM, it sees that x amount has already been
    done, and C-X is left.

    I don't understand this paragraph.
    Does a constant as the 3rd operand cause a restartability problem?
    Or does it not?
    If it does not, then how?
    Do you have a private field in thread state? Saved on the stack by
    interrupt uCode?
    OS people would not like it. They prefer to have full control even when
    they don't use it 99.999% of the time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Brett on Sun Oct 13 10:31:49 2024
    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of different ways things functions like
    memcpy() and memset() can be implemented in different architectures and
    optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.



    [ snip discussion of HW ]


    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that David
    is defending is that memmove() cannot be implemented "efficiently" in /standard/ C source code, on /any/ HW, because it would require
    comparing /C pointers/ that point to potentially different /C objects/,
    which is not defined behavior in standard C, whether compiled to machine
    code, or executed by an interpreter of C code, or executed by a human programmer performing what was called "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficiently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU instructions, or by dedicated instructions such as Mitch's MM, and David
    is not disputing that. But Mitch seems not to understand or not to see
    the issue about standard C vs memmove().
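
    For illustration, a typical non-standard formulation is a sketch like the
    following, which assumes that converting pointers to uintptr_t yields a
    meaningful address order (something the C standard does not guarantee):

    #include <stddef.h>
    #include <stdint.h>

    void *memmove_sketch(void *dest, const void *src, size_t n)
    {
        unsigned char *d = dest;
        const unsigned char *s = src;
        if ((uintptr_t)d < (uintptr_t)s) {
            for (size_t i = 0; i < n; i++)      /* copy upwards */
                d[i] = s[i];
        } else {
            for (size_t i = n; i > 0; i--)      /* copy downwards */
                d[i - 1] = s[i - 1];
        }
        return dest;
    }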

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Sun Oct 13 12:00:48 2024
    On Fri, 11 Oct 2024 16:54:13 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 11/10/2024 14:13, Michael S wrote:
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his
    my66k LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad
    idea for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification,
    i.e. exactly the same mechanism that is done on "non-scalable" architectures, would provide better performance. And memcpy/memmove
    is certainly sufficiently important to justify an additional
    development effort.


    That explanation helps a little, but only a little. I wasn't
    suggesting anything - or if I was, it was several posts ago and the
    context has long since been snipped.

    You suggested that "scalable" vector extensions are preferable for a memcpy/memmove implementation over "non-scalable" SIMD.

    Can you be more explicit about
    what you think I was suggesting, and why it might not be a good idea
    for targeting a "my66k" ISA? (That is not a processor I have heard
    of, so you'll have to give a brief summary of any particular features
    that are relevant here.)


    The proper spelling appears to be My 66000.
    For starters, My 66000 has no SIMD. It does not even have a dedicated FP
    register file. Both FP and Int share a common 32 x 64-bit register space.

    More importantly, it has a dedicated instruction with exactly the same
    semantics as memmove(). Pretty much the same as ARM64. In both cases the
    instruction is defined, but not yet implemented in production silicon.
    The difference is that in the case of ARM64 we can be reasonably sure
    that it will eventually be implemented in production silicon. Which means
    that in at least several out of the multitude of implementations it will
    suck.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Niklas Holsti on Sun Oct 13 12:26:22 2024
    On Sun, 13 Oct 2024 10:31:49 +0300
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I
    know you are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply
    to? Are you interested in replying, and engaging in the
    discussion? Or are you just looking for a chance to promote your
    own architecture, no matter how tenuous the connection might be to
    other posts?

    Again, let me say that I agree with what you are saying - I agree
    that an ISA should have instructions that are efficient for what
    people actually want to do. I agree that it is a good thing to
    have instructions that let performance scale with advances in
    hardware ideally without needing changes in compiled binaries, and
    at least without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I
    would enjoy hearing about comparisons of different ways things
    functions like memcpy() and memset() can be implemented in
    different architectures and optimised for different sizes, or how
    scalable vector instructions can work in comparison to fixed-size
    SIMD instructions.

    But at the moment, this potential is lost because you are posting
    total shite about implementing memmove() in standard C. It is
    disappointing that someone with your extensive knowledge and
    experience cannot see this. I am finding it all very frustrating.




    [ snip discussion of HW ]


    In short your complaints are wrong headed in not understanding what hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that
    David is defending is that memmove() cannot be implemented
    "efficiently" in /standard/ C source code, on /any/ HW, because it
    would require comparing /C pointers/ that point to potentially
    different /C objects/, which is not defined behavior in standard C,
    whether compiled to machine code, or executed by an interpreter of C
    code, or executed by a human programmer performing what was called
    "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficiently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU instructions, or by dedicated instructions such as Mitch's MM, and
    David is not disputing that. But Mitch seems not to understand or not
    to see the issue about standard C vs memmove().


    A sufficiently advanced compiler can recognize patterns and replace them
    with built-in sequences.

    In the case of memmove() the most easily recognizable pattern in 100%
    standard C99 appears to be:

    void *memmove(void *dest, const void *src, size_t count)
    {
        if (count > 0) {
            char tmp[count];
            memcpy(tmp, src, count);
            memcpy(dest, tmp, count);
        }
        return dest;
    }

    I don't suggest that the real implementation in Brian's compiler is like
    that. Much more likely his implementation uses non-standard C and looks
    approximately like:
    void *memmove(void *dest, const void *src, size_t count)
    {
        return __builtin_memmove(dest, src, count);
    }

    However, implementing the first variant efficiently is well within the
    abilities of a good compiler.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Bernd Linsel on Sun Oct 13 12:57:06 2024
    On 12/10/2024 19:26, Bernd Linsel wrote:
    On 12.10.24 17:16, David Brown wrote:


    [snip rant]



    You are aware that this is c.arch, not c.lang.c?


    Absolutely, yes.

    But in a thread branch discussing C, details of C are relevant.

    I don't expect any random regular here to know "language lawyer" details
    of the C standards. I don't expect people here to care about them.
    People in comp.lang.c care about them - for people here, the main
    interest in C is for programs to run on the computer architectures that
    are the real interest.


    But if someone engages in a conversation about C, I /do/ expect them to understand some basics, and I /do/ expect them to read and think about
    what other posters write. The point under discussion was that you
    cannot implement an efficient "memmove()" function in fully portable
    standard C. That's a fact - it is a well-established fact. Another
    clear and inarguable fact is that particular ISAs or implementations are completely irrelevant to fully portable standard C - that is both the
    advantage and the disadvantage of writing code in portable standard C.

    All I am asking Mitch to do is to understand this, and to stop saying
    silly things (such as implementing memmove() by calling memmove(), or
    that the /reason/ you can't implement memmove() efficiently in portable standard C is weaknesses in the x86 ISA), so that we can clear up his misunderstandings and move on to the more interesting computer
    architecture discussions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Niklas Holsti@21:1/5 to Michael S on Sun Oct 13 13:33:55 2024
    On 2024-10-13 12:26, Michael S wrote:
    On Sun, 13 Oct 2024 10:31:49 +0300
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2024-10-12 21:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:

    [ snip ]

    But at the moment, this potential is lost because you are posting
    total shite about implementing memmove() in standard C. It is
    disappointing that someone with your extensive knowledge and
    experience cannot see this. I am finding it all very frustrating.




    [ snip discussion of HW ]


    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.


    I think your reply proves David's complaint: you did not read, or did
    not understand, what David is frustrated about. The true fact that
    David is defending is that memmove() cannot be implemented
    "efficiently" in /standard/ C source code, on /any/ HW, because it
    would require comparing /C pointers/ that point to potentially
    different /C objects/, which is not defined behavior in standard C,
    whether compiled to machine code, or executed by an interpreter of C
    code, or executed by a human programmer performing what was called
    "desk testing" in the 1960s.

    Obviously memmove() can be implemented efficiently in non-standard C
    where such pointers can be compared, or by sequences of ordinary ALU
    instructions, or by dedicated instructions such as Mitch's MM, and
    David is not disputing that. But Mitch seems not to understand or not
    to see the issue about standard C vs memmove().


    A sufficiently advanced compiler can recognize patterns and replace them
    with built-in sequences.


    Sure.


    In the case of memmove() the most easily recognizable pattern in 100%
    standard C99 appears to be:

    void *memmove(void *dest, const void *src, size_t count)
    {
        if (count > 0) {
            char tmp[count];
            memcpy(tmp, src, count);
            memcpy(dest, tmp, count);
        }
        return dest;
    }


    Yes.


    I don't suggest that the real implementation in Brian's compiler is like
    that. Much more likely his implementation uses non-standard C and looks
    approximately like:
    void *memmove(void *dest, const void *src, size_t count)
    {
        return __builtin_memmove(dest, src, count);
    }

    However, implementing the first variant efficiently is well within the
    abilities of a good compiler.


    Yes, but it is not required by the C standard, so the fact remains that
    there is no standard way of implementing memmove() in a way that is
    "efficient" in the sense that it ensures that a copy to and from a
    temporary will /not/ happen.

    In practice, of course, memmove() is implemented in a non-portable way
    or by in-line code, as everybody understands.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Sun Oct 13 14:10:20 2024
    On 13/10/2024 11:00, Michael S wrote:
    On Fri, 11 Oct 2024 16:54:13 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 11/10/2024 14:13, Michael S wrote:
    On Fri, 11 Oct 2024 13:37:03 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 10/10/2024 23:19, Brian G. Lucas wrote:

    Not applicable.


    I don't understand what you mean by that. /What/ is not applicable
    to /what/ ?


    Brian probably meant to say that it is not applicable to his
    my66k LLVM back end.

    But I am pretty sure that what you suggest is applicable, but bad
    idea for memcpy/memmove routine that targets Arm+SVE.
    Dynamic dispatch based on concrete core features/identification,
    i.e. exactly the same mechanism that is done on "non-scalable"
    architectures, would provide better performance. And memcpy/memmove
    is certainly sufficiently important to justify an additional
    development effort.


    That explanation helps a little, but only a little. I wasn't
    suggesting anything - or if I was, it was several posts ago and the
    context has long since been snipped.

    You suggested that "scalable" vector extensions are preferable for a memcpy/memmove implementation over "non-scalable" SIMD.

    I certainly suggested that they have some advantages, yes. I don't know
    nearly enough details about implementations and practical usage to know
    if scalable vector instructions are /always/ better than non-scalable
    SIMD with fixed-size registers, either from the viewpoint of their
    efficiency at runtime or their implementation in hardware.

    It seems to me that if the compiler knows the size of a memcpy/memmove,
    then the best results would probably be achieved by the compiler
    inlining the copy using fixed size registers of a suitable size. If it
    does not know the size, then I would expect (but I don't know for sure)
    that a hardware scalable vector instruction should be more efficient
    than using fixed-size registers. If that were not the case, then I
    wonder why scalable vector hardware has become popular recently in ISAs.

    If you - or someone else - knows enough to say more about this, then I'd
    be glad to learn about it.


    Can you be more explicit about
    what you think I was suggesting, and why it might not be a good idea
    for targeting a "my66k" ISA? (That is not a processor I have heard
    of, so you'll have to give a brief summary of any particular features
    that are relevant here.)


    The proper spelling appears to be My 66000.
    For starters, My 66000 has no SIMD. It does not even have a dedicated FP register file. Both FP and Int share a common 32 x 64-bit register space.


    OK.

    More importantly, it has a dedicated instruction with exactly the same
    semantics as memmove(). Pretty much the same as ARM64. In both cases the
    instruction is defined, but not yet implemented in production silicon.
    The difference is that in the case of ARM64 we can be reasonably sure
    that it will eventually be implemented in production silicon. Which means
    that in at least several out of the multitude of implementations it will
    suck.


    So if I understand you correctly, your argument is that scalable vector instructions - at least for copying memory - are slow in hardware implementations, and thus it would be better to simply copy memory in a
    loop using larger fixed-size registers? I would find that surprising,
    but as I said, I don't know the details of implementations.

    (I do know that in the 68k family, the hardware division instruction was dropped for later devices after it was realised that a software division routine was faster than the hardware instruction. So such strange
    things have happened.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Brett on Sun Oct 13 13:58:14 2024
    On 12/10/2024 20:33, Brett wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 01:32, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 22:02:32 +0000, David Brown wrote:

    On 11/10/2024 20:55, MitchAlsup1 wrote:
    On Fri, 11 Oct 2024 12:10:13 +0000, David Brown wrote:


    Do you think you can just write this :

    void * memmove(void * s1, const void  * s2, size_t n)
    {
        return memmove(s1, s2, n);
    }

    in your library's source?

           .global memmove
    memmove:
           MM     R2,R1,R3
           RET

    sure !

    You are either totally clueless, or you are trolling.  And I know you
    are not clueless.

    This discussion has become pointless.

    The point is that there are a few things that may be hard to do
    with {decode, pipeline, calculations, specifications...}; but
    because they are so universally needed; these, too, should
    "get into ISA".

    One good reason to put them in ISA is to preserve the programmers
    efforts over decades, so they don't have to re-write libc every-
    time a new set of instructions come out.

    Moving an arbitrary amount of memory from point a to point b
    happens to fall into that universal need. Setting an arbitrary
    amount of memory to a value also falls into that universal
    need.

    Again, I have to ask - do you bother to read the posts you reply to?
    Are you interested in replying, and engaging in the discussion? Or are
    you just looking for a chance to promote your own architecture, no
    matter how tenuous the connection might be to other posts?

    Again, let me say that I agree with what you are saying - I agree that
    an ISA should have instructions that are efficient for what people
    actually want to do. I agree that it is a good thing to have
    instructions that let performance scale with advances in hardware
    ideally without needing changes in compiled binaries, and at least
    without needing changes in source code.

    I believe there is an interesting discussion to be had here, and I would
    enjoy hearing about comparisons of different ways things functions like
    memcpy() and memset() can be implemented in different architectures and
    optimised for different sizes, or how scalable vector instructions can
    work in comparison to fixed-size SIMD instructions.

    But at the moment, this potential is lost because you are posting total
    shite about implementing memmove() in standard C. It is disappointing
    that someone with your extensive knowledge and experience cannot see
    this. I am finding it all very frustrating.

    There are only two decisions to make in memcpy, are the copies less than
    copy sized aligned, and do the pointers overlap in copy size.


    Are you confused about memcpy() and memmove()? If so, let's clear that
    one up from the start. For memcpy(), there are no overlap issues - the
    person using it promises that the source and destination areas do not
    overlap, and no one cares what might happen if they do. For memmove(),
    the areas /may/ overlap, and the copy is done as though the source were
    copied first to a temporary area, and then from the temporary area to
    the destination.

    For memcpy(), there can be several issues to consider for efficient implementations that can be skipped for a simple loop copying byte for
    byte. An efficient implementation will probably want to copy with
    larger sizes, such as using 32-bit, 64-bit, or bigger registers. For
    some targets, that is only possible for aligned data (and for some,
    unaligned accesses may be allowed but emulated by traps, making them
    massively slower than byte-by-byte accesses). The best choice of size
    will be implementation and target dependent, as will methods of
    determining alignment (if that is relevant). I'm guessing that by your somewhat muddled phrase "are the copies less than copy sized aligned",
    you meant something on those lines.
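
    To make the larger-size point concrete, a minimal sketch of such an
    implementation-specific memcpy() (illustrative only; it assumes 8-byte
    accesses are the profitable width and that the casts are acceptable on
    the target, as they usually are inside a C library) might test alignment
    once and then copy in 64-bit chunks:

    #include <stddef.h>
    #include <stdint.h>

    void *memcpy_sketch(void *dest, const void *src, size_t n)
    {
        unsigned char *d = dest;
        const unsigned char *s = src;
        /* Copy 64-bit chunks while both pointers are 8-byte aligned. */
        if (((uintptr_t)d % 8 == 0) && ((uintptr_t)s % 8 == 0)) {
            uint64_t *dw = (uint64_t *)d;
            const uint64_t *sw = (const uint64_t *)s;
            while (n >= 8) {
                *dw++ = *sw++;
                n -= 8;
            }
            d = (unsigned char *)dw;
            s = (const unsigned char *)sw;
        }
        while (n--)                /* remaining tail, byte by byte */
            *d++ = *s++;
        return dest;
    }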

    For memmove(), you generally also need to decide if your copy loop
    should run upwards or downwards, and that must be done in an implementation-dependent manner. It is conceivable that for a target
    with more complex memory setups - perhaps allowing the same memory to be accessible in different ways via different segments - that this is not
    enough.

    For hardware this simplifies down to perhaps two types of copies, easy and hard.

    For most targets, yes.


    If you make hard fast, and you will, then two versions are all you need, not the dozens of choices with 1k of code you need in C.


    That makes little sense. What "1k of code" do you need in C?
    Implementations of memcpy() and memmove() are implementation and target-specific, not general portable standard C. There is no single C implementation of these functions.

    It is an obvious truism that if you have hardware instructions that can implement an efficient memcpy() and/or memmove() on a target, then the implementation-specific implementations of these functions on that
    target will be small, simple and efficient.

    Often you know which of the two you want at compile time from the pointer type.

    In short your complaints are wrong headed in not understanding what
    hardware memcpy can do.


    What complaints? I haven't made any complaints about implementing these functions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Sun Oct 13 15:45:37 2024
    David Brown <david.brown@hesbynett.no> writes:
    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.

    When you implement something like, say

    vsum(double *a, double *b, double *c, size_t n);

    where a, b, and c may point to arrays in different objects, or may
    point to overlapping parts of the same object, and the result vector c
    in the overlap case should be the same as in the no-overlap case
    (similar to memmove()), being able to compare pointers to possibly
    different objects also comes in handy.
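
    A sketch of such a vsum (illustrative; the overlap test converts the
    pointers to uintptr_t, which is an implementation-specific assumption
    rather than portable standard C) falls back to a temporary buffer when
    the output aliases an input, so the result matches the no-overlap case:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static int overlaps(const void *p, const void *q, size_t bytes)
    {
        uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
        return a < b + bytes && b < a + bytes;
    }

    void vsum(double *a, double *b, double *c, size_t n)
    {
        size_t bytes = n * sizeof *c;
        if (n == 0)
            return;
        if (overlaps(c, a, bytes) || overlaps(c, b, bytes)) {
            double *tmp = malloc(bytes);   /* error handling omitted */
            for (size_t i = 0; i < n; i++)
                tmp[i] = a[i] + b[i];
            memcpy(c, tmp, bytes);
            free(tmp);
        } else {
            for (size_t i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }
    }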

    Another example is when the programmer uses the address as a key in,
    e.g., a binary search tree. And, as you write, casting to intptr_t is
    not guaranteed to work by the C standard, either.

    An example that probably compares pointers to the same object as far
    as the C standard is concerned, but feel like different objects to the programmer, is logic variables (in, e.g., a Prolog implementation).
    When you have two free variables, and you unify them, in the
    implementation one variable points to the other one. Now which should
    point to which? The younger variable should point to the older one,
    because it will die sooner. How do you know which variable is
    younger? You compare the addresses; the variables reside on a stack,
    so the younger one is closer to the top.

    If that stack is one object as far as the C standard is concerned,
    there is no problem with that solution. If the stack is implemented
    as several objects (to make it easier growable; I don't know if there
    is a Prolog implementation that does that), you first have to check in
    which piece it is (maybe with a binary search), and then possibly
    compare within the stack piece at hand.
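
    As a sketch of that binding step (illustrative names; it assumes the
    variable cells live in one stack array, i.e. a single C object growing
    towards higher addresses, so the element comparison is well defined):

    typedef struct cell {
        struct cell *ref;   /* NULL while the variable is unbound */
    } cell;

    /* Bind the younger variable (nearer the top of the stack, here the
       higher address) to the older one, since it will die sooner. */
    void bind_free_vars(cell *x, cell *y)
    {
        if (x > y)
            x->ref = y;
        else
            y->ref = x;
    }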

    An interesting case is the Forth standard. It specifies "contiguous
    regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.


    Forth does not require a flat memory model in the hardware, as far as I
    am aware, any more than C does. (I appreciate that your knowledge of
    Forth is /vastly/ greater than mine.) A Forth implementation could
    interpret part of the address value as the segment or other memory block identifier and part of it as an index into that block, just as a C implementation can.

    I.e., what you are saying is that one can simulate a flat-memory model
    on a segmented memory model. Certainly. In the case of the 8086 (and
    even more so on the 286) the costs of that are so high that no
    widely-used Forth system went there.

    One can also simulate segmented memory (a natural fit for many
    programming languages) on flat memory. In this case the cost is much
    smaller, plus it gives the maximum flexibility about segment/object
    sizes and numbers. That is why flat memory has won.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to David Brown on Sun Oct 13 19:36:04 2024
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/10/2024 19:26, Bernd Linsel wrote:
    On 12.10.24 17:16, David Brown wrote:


    [snip rant]



    You are aware that this is c.arch, not c.lang.c?


    Absolutely, yes.

    But in a thread branch discussing C, details of C are relevant.

    I don't expect any random regular here to know "language lawyer" details
    of the C standards. I don't expect people here to care about them.
    People in comp.lang.c care about them - for people here, the main
    interest in C is for programs to run on the computer architectures that
    are the real interest.


    But if someone engages in a conversation about C, I /do/ expect them to understand some basics, and I /do/ expect them to read and think about
    what other posters write. The point under discussion was that you
    cannot implement an efficient "memmove()" function in fully portable
    standard C. That's a fact - it is a well-established fact. Another
    clear and inarguable fact is that particular ISAs or implementations are completely irrelevant to fully portable standard C - that is both the advantage and the disadvantage of writing code in portable standard C.

    All I am asking Mitch to do is to understand this, and to stop saying
    silly things (such as implementing memmove() by calling memmove(), or
    that the /reason/ you can't implement memmove() efficiently in portable standard C is weaknesses in the x86 ISA), so that we can clear up his misunderstandings and move on to the more interesting computer
    architecture discussions.

    MemMove in C is fundamentally two void pointers and a count of bytes to
    move.

    C does not care what the alignment of those two void pointers is.

    ALUs are so cheap as to be free: a dedicated MM unit can have a shifter
    and mask with a buffer, and happily copy aligned chunks from the source
    and write aligned chunks to the dest, even though both are oddly aligned
    in different ways, and overlapping the same buffer.

    Note that writes have byte enables; you can write 5 bytes in one go to
    cache, to finish off the end of a series of aligned writes.

    My 66000 only has one MM instruction because when you throw enough hardware
    at the problem, one instruction is all you need.

    And it also covers MemCopy, and yes there is a backwards copy version.

    I detailed the hardware to do this several years ago on Real World Tech.
    And such hardware has been available for many decades in DMA units.

    The 6502 based GameBoy had a MemMove DMA unit as it was many times faster copying bytes than the 6502 was, and doubled the overall performance of the GameBoy.

    One ring to rule them all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Sun Oct 13 21:21:11 2024
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you write
    an efficient memmove() in standard C.  That's why I said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up that
    is proportionally more costly for small transfers.  Often that can be eliminated when the compiler optimises the functions inline - when the compiler knows the size of the move/copy, it can optimise directly.

    What you are missing here David is the fact that Mitch's MM is a single instruction which does the entire memmove() operation, and has the
    inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very close
    to optimal manner, for both short and long transfers.

    I.e. totally removing the need for compiler tricks or wide register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today), and
    then the memmove() calls will usually be inlined.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Brett on Sun Oct 13 19:43:34 2024
    Brett <ggtgp@yahoo.com> writes:
    David Brown <david.brown@hesbynett.no> wrote:

    All I am asking Mitch to do is to understand this, and to stop saying
    silly things (such as implementing memmove() by calling memmove(), or
    that the /reason/ you can't implement memmove() efficiently in portable
    standard C is weaknesses in the x86 ISA), so that we can clear up his
    misunderstandings and move on to the more interesting computer
    architecture discussions.
    <snip>
    My 66000 only has one MM instruction because when you throw enough hardware at the problem, one instruction is all you need.

    And it also covers MemCopy, and yes there is a backwards copy version.

    I detailed the hardware to do this several years ago on Real World Tech.

    Such hardware (memcpy/memmove/memfill) was available in 1965 on the Burroughs medium systems mainframes. In the 80s, support was added for hashing
    strings as well.

    It's not a new concept. In fact, there were some tricks that could
    be used with overlapping source and destination buffers that would
    replicate chunks of data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Sun Oct 13 23:01:53 2024
    On Sun, 13 Oct 2024 19:43:34 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Brett <ggtgp@yahoo.com> writes:
    David Brown <david.brown@hesbynett.no> wrote:

    All I am asking Mitch to do is to understand this, and to stop
    saying silly things (such as implementing memmove() by calling
    memmove(), or that the /reason/ you can't implement memmove()
    efficiently in portable standard C is weaknesses in the x86 ISA),
    so that we can clear up his misunderstandings and move on to the
    more interesting computer architecture discussions.
    <snip>
    My 66000 only has one MM instruction because when you throw enough
    hardware at the problem, one instruction is all you need.

    And it also covers MemCopy, and yes there is a backwards copy
    version.

    I detailed the hardware to do this several years ago on Real World
    Tech.

    Such hardware (memcpy/memmove/memfill) was available in 1965 on the
    Burroughs medium systems mainframes. In the 80s, support was added
    for hashing strings as well.

    It's not a new concept. In fact, there were some tricks that could
    be used with overlapping source and destination buffers that would
    replicate chunks of data).

    The difference is that today, for strings of a certain size, say from 200
    bytes to half of your L1D cache, if your precious HW copies less than 50
    bytes per clock then people would complain that it is slower than a snail.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Mon Oct 14 15:19:32 2024
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different
    objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    What you are missing here David is the fact that Mitch's MM is a single instruction which does the entire memmove() operation, and has the
    inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very close
    to optimal manner, for both short and long transfers.

    I am not missing that at all. And I agree that an advanced hardware MM instruction could be a very efficient way to implement both memcpy and
    memmove. (For my own kind of work, I'd worry about such looping
    instructions causing an unbounded increase in interrupt latency, but
    that too is solvable given enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction, you
    don't need to re-write the implementation for your memmove() and
    memcpy() library functions for every new generation of processors of a
    given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time. You will /sometimes/ get benefits from doing so, but it is not as simple as Mitch made out.


    I.e. totally removing the need for compiler tricks or wide register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today), and
    then the memmove() calls will usually be inlined.


    The original compiler library issue was that it is impossible to write an efficient memmove() implementation using pure portable standard C. That
    is independent of any ISA, any specialist instructions for memory moves,
    and any compiler optimisations. And it is independent of the fact that
    some good compilers can inline at least some calls to memcpy() and
    memmove() today, using whatever instructions are most efficient for the
    target.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Mon Oct 14 17:04:28 2024
    On 13/10/2024 17:45, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never". Pretty much the
    only example people ever give of needing such comparisons is to
    implement memmove() efficiently - but you don't need to implement
    memmove(), because it is already in the standard library.

    When you implements something like, say

    vsum(double *a, double *b, double *c, size_t n);

    where a, b, and c may point to arrays in different objects, or may
    point to overlapping parts of the same object, and the result vector c
    in the overlap case should be the same as in the no-overlap case
    (similar to memmove()), being able to compare pointers to possibly
    different objects also comes in handy.


    OK, I can agree with that - /if/ you need such a function. I'd suggest
    that when you are writing code that might call such a function, you've a
    very good idea whether you want to do "vec_c = vec_a + vec_b;", or
    "vec_c += vec_a;" (that is, "b" and "c" are the same). In other words,
    the programmer calling vsum already knows if there are overlaps, and
    you'd get the best results if you had different functions for the
    separate cases.

    It is conceivable that you don't know if there is an overlap, especially
    if you are only dealing with parts of arrays rather than full arrays,
    but I think such cases will be rare.

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it. Since a fully
    defined portable method might not be possible (or at least, not
    efficiently possible) for some weird targets, and it's a good thing that
    C supports weird targets, I think perhaps the ideal would be to have
    some feature that exists if and only if you can do sensible comparisons.
    This could be an additional <stdint.h> pointer type, or some pointer
    compare macros, or a pre-defined macro to say if you can simply use
    uintptr_t for the purpose (as you can on most modern C implementations).

    Another example is when the programmer uses the address as a key in,
    e.g., a binary search tree. And, as you write, casting to intptr_t is
    not guaranteed to work by the C standard, either.

    Casting to uintptr_t (why would one want a /signed/ address?) is all you
    need for most systems - and for any target where casting to uintptr_t
    will not be sufficient here, the type uintptr_t will not exist and you
    get a nice, safe hard compile-time error rather than silently UB code.
    For uses like this, you don't need to compare pointers - comparing the
    integers converted from the pointers is fine. (Imagine a system where converted addresses consist of a 16-bit segment number and a 16-bit
    offset, where the absolute address is the segment number times a scale
    factor, plus the offset. You can't easily compare two pointers for real address ordering by converting them to an integer type, but the result
    of casting to uintptr_t is still fine for your binary tree.)
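
    A minimal sketch of that binary-tree use (illustrative names; the only
    assumptions are that uintptr_t exists and that distinct pointers convert
    to distinct integers, as they do on most modern implementations):

    #include <stddef.h>
    #include <stdint.h>

    struct node {
        const void  *key;
        struct node *left, *right;
    };

    /* Order nodes by (uintptr_t)key; the ordering need not correspond to
       any physical address order, it only has to be consistent. */
    struct node **find_slot(struct node **root, const void *key)
    {
        uintptr_t k = (uintptr_t)key;
        while (*root != NULL) {
            uintptr_t cur = (uintptr_t)(*root)->key;
            if (k == cur)
                break;
            root = (k < cur) ? &(*root)->left : &(*root)->right;
        }
        return root;
    }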


    An example that probably compares pointers to the same object as far
    as the C standard is concerned, but feel like different objects to the programmer, is logic variables (in, e.g., a Prolog implementation).
    When you have two free variables, and you unify them, in the
    implementation one variable points to the other one. Now which should
    point to which? The younger variable should point to the older one,
    because it will die sooner. How do you know which variable is
    younger? You compare the addresses; the variables reside on a stack,
    so the younger one is closer to the top.

    If that stack is one object as far as the C standard is concerned,
    there is no problem with that solution. If the stack is implemented
    as several objects (to make it easier growable; I don't know if there
    is a Prolog implementation that does that), you first have to check in
    which piece it is (maybe with a binary search), and then possibly
    compare within the stack piece at hand.


    My only experience of Prolog was working through a short tutorial
    article when I was a teenager - I have no idea about implementations!

    But again I come back to the same conclusion - there are situations
    where being able to compare addresses can be useful, but it is very rare
    for most programmers to ever actually need to do so. And I think it is
    good that there is a widely portable way to achieve this, by casting to uintptr_t and comparing those integers. There are things that people
    want to do with C programming that can be done with
    implementation-specific code, but which cannot be done with fully
    portable standard code. While it is always nice if you /can/ use fully portable solutions (while still being clear and efficient), it's okay to
    have non-portable code when you need it.

    An interesting case is the Forth standard. It specifies "contiguous
    regions", which correspond to objects in C, but in Forth each address
    is a cell and can be added, subtracted, compared, etc. irrespective of
    where it came from. So Forth really has a flat-memory model. It has
    had that since its origins in the 1970s. Some of the 8086
    implementations had some extensions for dealing with more than 64KB,
    but they were never standardized and are now mostly forgotten.


    Forth does not require a flat memory model in the hardware, as far as I
    am aware, any more than C does. (I appreciate that your knowledge of
    Forth is /vastly/ greater than mine.) A Forth implementation could
    interpret part of the address value as the segment or other memory block
    identifier and part of it as an index into that block, just as a C
    implementation can.

    I.e., what you are saying is that one can simulate a flat-memory model
    on a segmented memory model.

    Yes.

    Certainly. In the case of the 8086 (and
    even more so on the 286) the costs of that are so high that no
    widely-used Forth system went there.


    OK.

    That's much the same as C on segmented targets.

    One can also simulate segmented memory (a natural fit for many
    programming languages) on flat memory. In this case the cost is much smaller, plus it gives the maximum flexibility about segment/object
    sizes and numbers. That is why flat memory has won.


    Sure, flat memory is nicer in many ways.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Mon Oct 14 16:40:26 2024
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers,
    rather than having only a valid pointer or NULL.  A compiler,
    for example, might want to store the fact that an error occurred
    while parsing a subexpression as a special pointer constant.

    Compilers often have the unfair advantage, though, that they can
    rely on what application programmers cannot, their implementation
    details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they can
    implement an efficient memmove() even though a pure standard C
    programmer cannot (other than by simply calling the standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA.
    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    or desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said there
    was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly or
    using inline assembly, rather than in non-portable C (which is the
    common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that
    can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and has
    the inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very close
    to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced hardware MM instruction could be a very efficient way to implement both memcpy and memmove.  (For my own kind of work, I'd worry about such looping instructions causing an unbounded increase in interrupt latency, but
    that too is solvable given enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction, you
    don't need to re-write the implementation for your memmove() and
    memcpy() library functions for every new generation of processors of a
    given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will /sometimes/ get benefits from doing so, but it is not as simple as Mitch made out.


    I.e. totally removing the need for compiler tricks or wide register
    operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today),
    and then the memmove() calls will usually be inlined.


    The original compiler library issue was that it is impossible to write an
    is independent of any ISA, any specialist instructions for memory moves,
    and any compiler optimisations.  And it is independent of the fact that some good compilers can inline at least some calls to memcpy() and
    memmove() today, using whatever instructions are most efficient for the target.

    David, you and Mitch are among my most cherished writers here on c.arch.
    I really don't think any of us really disagree; it is just that we have
    been discussing two (mostly) orthogonal issues.

    a) memmove/memcpy are so important that people have been spending a lot
    of time & effort trying to make them faster, with the complication that
    in general they cannot be implemented in pure C (which disallows direct
    comparison of arbitrary pointers).

    b) Mitch has, like Andy ("Crazy") Glew many years before, realized that
    if a cpu architecture actually has an instruction designed to do this particular job, it behooves cpu architects to make sure that it is in
    fact so fast that it obviates any need for tricky coding to replace it.

    Ideally, it should be able to copy a single object, up to a cache line
    in size, in the same or less time needed to do so manually with a SIMD
    512-bit load followed by a 512-bit store (both ops masked to not touch anything it shouldn't)

    REP MOVSB on x86 does the canonical memcpy() operation, originally by
    moving single bytes, and this was so slow that we also had REP MOVSW
    (moving 16-bit entities) and then REP MOVSD on the 386 and REP MOVSQ on
    64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in fact
    handle any kinds of alignments and sizes, while doing the actual
    transfer at maximum bus speeds, i.e. at least one cache line/cycle for
    things already in $L1.
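    (As a concrete illustration of handing the whole job to the hardware, a
    memcpy() wrapper around REP MOVSB might look like the sketch below on
    x86-64 with GCC-style inline assembly.  The function name is mine and
    this is not the code of any particular libc; it merely shows why such
    routines live outside portable standard C.)

    #include <stddef.h>

    /* Minimal sketch: RDI = destination, RSI = source, RCX = count.
       REP MOVSB updates all three, so they are read-write operands.
       The SysV ABI guarantees the direction flag is clear on entry. */
    static void *memcpy_rep_movsb(void *dst, const void *src, size_t n)
    {
        void *d = dst;
        __asm__ volatile ("rep movsb"
                          : "+D" (d), "+S" (src), "+c" (n)
                          :
                          : "memory");
        return dst;
    }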

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Mon Oct 14 19:08:56 2024
    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to
    different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in
    pointers, rather than having only a valid pointer or
    NULL.  A compiler, for example, might want to store the >>>>>>>>> fact that an error occurred while parsing a subexpression
    as a special pointer constant.

    Compilers often have the unfair advantage, though, that
    they can rely on what application programmers cannot, their >>>>>>>>> implementation details.  (Some do not, such as f2c). >>>>>>>>
    Standard library authors have the same superpowers, so that
    they can
    implement an efficient memmove() even though a pure standard >>>>>>>> C programmer cannot (other than by simply calling the
    standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of
    libc writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the >>>>>> ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let
    you write an efficient memmove() in standard C.  That's why I
    said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in
    assembly or using inline assembly, rather than in non-portable C
    (which is the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things
    up that is proportionally more costly for small transfers.
    Often that can be eliminated when the compiler optimises the
    functions inline - when the compiler knows the size of the
    move/copy, it can optimise directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and
    has the inside knowledge about cache (residency at level x? width
    in bytes)/memory ranges/access rights/etc needed to do so in a
    very close to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced
    hardware MM instruction could be a very efficient way to implement
    both memcpy and memmove.  (For my own kind of work, I'd worry
    about such looping instructions causing an unbounded increased in
    interrupt latency, but that too is solvable given enough hardware
    effort.)

    And I agree that once you have an "MM" (or similar) instruction,
    you don't need to re-write the implementation for your memmove()
    and memcpy() library functions for every new generation of
    processors of a given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will
    /sometimes/ get benefits from doing so, but it is not as simple as
    Mitch made out.

    I.e. totally removing the need for compiler tricks or wide
    register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and
    to recognize common patterns (just as most compilers already do
    today), and then the memmove() calls will usually be inlined.


    The original compile library issue was that it is impossible to
    write an efficient memmove() implementation using pure portable
    standard C. That is independent of any ISA, any specialist
    instructions for memory moves, and any compiler optimisations.
    And it is independent of the fact that some good compilers can
    inline at least some calls to memcpy() and memmove() today, using
    whatever instructions are most efficient for the target.

    David, you and Mitch are among my most cherished writers here on
    c.arch, I really don't think any of us really disagree, it is just
    that we have been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" situation.

    I have a huge respect for Mitch, his knowledge and experience, and
    his willingness to share that freely with others. That's why I have
    found this very frustrating.


    a) memmove/memcpy are so important that people have been spending a
    lot of time & effort trying to make it faster, with the
    complication that in general it cannot be implemented in pure C
    (which disallows direct comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a
    simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation-dependent behaviour to determine alignments, or it
    must rely on unaligned accesses being allowed by the implementation.)
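    (As a concrete form of that parenthetical point, the fully portable
    byte-at-a-time version is sketched below; the name is mine, chosen only
    to avoid colliding with the library function.)

    #include <stddef.h>

    /* Portable byte copy: legal in standard C because the caller of
       memcpy() guarantees the regions do not overlap, so no pointer
       comparison is needed. */
    void *my_memcpy(void *restrict dst, const void *restrict src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dst;
    }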

    b) Mitch have, like Andy ("Crazy") Glew many years before, realized
    that if a cpu architecture actually has an instruction designed to
    do this particular job, it behooves cpu architects to make sure
    that it is in fact so fast that it obviates any need for tricky
    coding to replace it.

    Yes.

    Ideally, it should be able to copy a single object, up to a cache
    line in size, in the same or less time needed to do so manually
    with a SIMD 512-bit load followed by a 512-bit store (both ops
    masked to not touch anything it shouldn't)


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally
    by moving single bytes, and this was so slow that we also had REP
    MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
    REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in
    fact handle any kinds of alignments and sizes, while doing the
    actual transfer at maximum bus speeds, i.e. at least one cache
    line/cycle for things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do
    these basic operations faster than a software loop or the x86 "rep" instructions.

    No, that's not true. And according to my understanding, that's not what
    Terje wrote.
    REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
    details - fixed registers for src, dest, len and the Direction flag in
    the PSW instead of being part of the opcode).
    REP MOVSW/D/Q were introduced because back then processors were small
    and stupid. When your processor is big and smart you don't need them
    any longer. REP MOVSB is sufficient.
    The new Arm64 instructions that are hopefully coming next year are akin
    to REP MOVSB rather than to MOVSW/D/Q.
    Instructions for memmove, also defined by Arm and by Mitch, are the next
    logical step. IMHO, the main gain here is not a measurable improvement
    in performance, but a saving of code size when inlined.

    Now, is all that a good idea? I am not 100% convinced.
    One can argue that the streaming alignment hardware that is necessary
    for a 1st-class implementation of these instructions is useful not only
    for memory copy.
    So, maybe, it makes sense to expose this hardware in more generic ways.
    Maybe via Load Multiple Register? It was present in Arm's A32/T32,
    but didn't make it into ARM64. Or maybe there are even better ways
    that I was not thinking about.

    And I fully agree that these would be useful features
    in general-purpose processors.

    My only point of contention is that the existence or lack of such instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving the goalposts.
    One does not need a "good implementation" in the sense you have in mind.
    All one needs is an implementation that the pattern-matching logic of a
    compiler unmistakably recognizes as memmove/memcpy. That is very easily
    done in standard C. For memmove, I had shown how to do it in one of the
    posts below. For memcpy it's very obvious, so no need to show.
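    (The implementation referred to is not reproduced in this part of the
    thread; one plausible shape of it, copying through a temporary so that
    no pointer comparison is needed, is sketched below.  Whether a compiler
    actually recognises this as memmove() is exactly what is disputed in
    the replies.)

    #include <stddef.h>

    /* Sketch of a pattern-matchable memmove() in plain standard C: copy via
       a temporary buffer instead of comparing the pointers.  A recognising
       compiler may collapse this to a single memmove; one that does not
       recognise it copies the data twice, and the VLA can overflow the
       stack for large n - the objections raised later in the thread. */
    void *memmove_via_tmp(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = (const unsigned char *) src;
        if (n == 0)
            return dst;
        unsigned char tmp[n];              /* C99 variable-length array */
        for (size_t i = 0; i < n; i++)
            tmp[i] = s[i];
        for (size_t i = 0; i < n; i++)
            d[i] = tmp[i];
        return dst;
    }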

    They would make it easier to write efficient
    implementations of these standard library functions for targets that
    had such instructions - but that would be implementation-specific
    code. And that is one of the reasons that C standard library
    implementations are tied to the specific compiler and target, and the
    writers of these libraries have "superpowers" and are not limited to
    standard C.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Mon Oct 14 17:19:40 2024
    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to different >>>>>>>>>> objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in pointers, >>>>>>>>> rather than having only a valid pointer or NULL.  A compiler, >>>>>>>>> for example, might want to store the fact that an error occurred >>>>>>>>> while parsing a subexpression as a special pointer constant. >>>>>>>>>
    Compilers often have the unfair advantage, though, that they can >>>>>>>>> rely on what application programmers cannot, their implementation >>>>>>>>> details.  (Some do not, such as f2c).

    Standard library authors have the same superpowers, so that they >>>>>>>> can
    implement an efficient memmove() even though a pure standard C >>>>>>>> programmer cannot (other than by simply calling the standard
    library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of libc
    writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the ISA. >>>>>
    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let you
    write an efficient memmove() in standard C.  That's why I said
    there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in assembly
    or using inline assembly, rather than in non-portable C (which is
    the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things up
    that is proportionally more costly for small transfers.  Often that >>>> can be eliminated when the compiler optimises the functions inline -
    when the compiler knows the size of the move/copy, it can optimise
    directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and has
    the inside knowledge about cache (residency at level x? width in
    bytes)/memory ranges/access rights/etc needed to do so in a very
    close to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced hardware
    MM instruction could be a very efficient way to implement both memcpy
    and memmove.  (For my own kind of work, I'd worry about such looping
    instructions causing an unbounded increased in interrupt latency, but
    that too is solvable given enough hardware effort.)

    And I agree that once you have an "MM" (or similar) instruction, you
    don't need to re-write the implementation for your memmove() and
    memcpy() library functions for every new generation of processors of a
    given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will /sometimes/
    get benefits from doing so, but it is not as simple as Mitch made out.


    I.e. totally removing the need for compiler tricks or wide register
    operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and to
    recognize common patterns (just as most compilers already do today),
    and then the memmove() calls will usually be inlined.


    The original compile library issue was that it is impossible to write
    an efficient memmove() implementation using pure portable standard C.
    That is independent of any ISA, any specialist instructions for memory
    moves, and any compiler optimisations.  And it is independent of the
    fact that some good compilers can inline at least some calls to
    memcpy() and memmove() today, using whatever instructions are most
    efficient for the target.

    David, you and Mitch are among my most cherished writers here on c.arch,
    I really don't think any of us really disagree, it is just that we have
    been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" situation.

    I have a huge respect for Mitch, his knowledge and experience, and his willingness to share that freely with others. That's why I have found
    this very frustrating.


    a) memmove/memcpy are so important that people have been spending a lot
    of time & effort trying to make it faster, with the complication that in general it cannot be implemented in pure C (which disallows direct
    comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation dependent behaviour to determine alignments, or it must
    rely on unaligned accesses being allowed by the implementation.)

    b) Mitch have, like Andy ("Crazy") Glew many years before, realized that
    if a cpu architecture actually has an instruction designed to do this particular job, it behooves cpu architects to make sure that it is in
    fact so fast that it obviates any need for tricky coding to replace it.


    Yes.

    Ideally, it should be able to copy a single object, up to a cache line
    in size, in the same or less time needed to do so manually with a SIMD 512-bit load followed by a 512-bit store (both ops masked to not touch anything it shouldn't)


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally by
    moving single bytes, and this was so slow that we also had REP MOVSW
    (moving 16-bit entities) and then REP MOVSD on the 386 and REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in fact handle any kinds of alignments and sizes, while doing the actual
    transfer at maximum bus speeds, i.e. at least one cache line/cycle for
    things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do these
    basic operations faster than a software loop or the x86 "rep"
    instructions. And I fully agree that these would be useful features in general-purpose processors.

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C. They would make it easier to write efficient
    implementations of these standard library functions for targets that had
    such instructions - but that would be implementation-specific code. And
    that is one of the reasons that C standard library implementations are
    tied to the specific compiler and target, and the writers of these
    libraries have "superpowers" and are not limited to standard C.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to David Brown on Mon Oct 14 19:02:51 2024
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Mon Oct 14 22:20:42 2024
    On Mon, 14 Oct 2024 19:02:51 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for equality).
    Rarely needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    That's their problem. The rest of the C world shouldn't suffer because
    of odd birds.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Oct 14 19:39:41 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).
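    (Purely to illustrate that layout, a decoder might look like the sketch
    below.  Packing one BCD digit per nibble of a 32-bit word is my
    assumption for the sketch, not a statement about the real hardware.)

    #include <stdint.h>

    /* Hypothetical decode of the 8-digit BCD pointer described above:
       digit 1 = sign (C or D), digit 2 = segment (0-7), digits 3-8 =
       decimal offset within the segment. */
    struct b_ptr { unsigned sign; unsigned segment; unsigned long offset; };

    static struct b_ptr decode_bcd_pointer(uint32_t p)
    {
        struct b_ptr r;
        r.sign    = (p >> 28) & 0xF;       /* sign digit, 0xC or 0xD */
        r.segment = (p >> 24) & 0xF;       /* segment number, 0-7    */
        r.offset  = 0;
        for (int d = 5; d >= 0; d--)       /* six BCD offset digits  */
            r.offset = r.offset * 10 + ((p >> (4 * d)) & 0xF);
        return r;
    }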

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Mon Oct 14 23:46:10 2024
    On Tue, 8 Oct 2024 20:53:00 +0000, MitchAlsup1 wrote:

    The Algol family of block structure gave the illusion that flat was less
    necessary and it could all be done with lexical addressing and block
    scoping rules.

    Then malloc() and mmap() came along.

    Algol-68 already had heap allocation and flex arrays. (The folks over in MULTICS land were working on mmap.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to EricP on Mon Oct 14 23:55:59 2024
    On Wed, 09 Oct 2024 13:37:41 -0400, EricP wrote:

    The Posix interface support was there so *MS* could bid on US government
    and military contracts which, at that time frame, were making noise
    about it being standard for all their contracts.
    The Posix DLLs didn't come with WinNT, you had to ask MS for them
    specially.

    And that whole POSIX subsystem was so sadistically, unusably awful, it
    just had to be intended for show as a box-ticking exercise, nothing more.

    <https://www.youtube.com/watch?v=BOeku3hDzrM>

    Back then "object oriented" and "micro-kernel" buzzwords were all the
    rage.

    OO still lives on in higher-level languages. Microsoft’s one attempt to incorporate its OO architecture--Dotnet--into the lower layers of the OS,
    in Windows Vista, was an abject, embarrassing failure which hopefully
    nobody will try to repeat.

    On the other hand, some stubborn holdouts are still fond of microkernels
    -- you just have to say the whole idea is pointless, and they come out of
    the woodwork in a futile attempt to disagree ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Mon Oct 14 23:51:27 2024
    On Tue, 8 Oct 2024 22:28 +0100 (BST), John Dallman wrote:

    The same problem seems to have messed up all the attempts to provide
    good Unix emulation on VMS.

    Was it the Perl build scripts that, at some point in their compatibility
    tests on a *nix system, would announce “Congratulations! You’re not
    running EUNICE!”?

    In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:

    I think the whole _personality_ concept, along with the supposed
    portability to non-x86 architectures, had just bit-rotted away by that
    point.

    Some combination of that, Microsoft confidence that "of course we can do something better now!" - they are very prone to overconfidence - and the terrible tendency of programmers to ignore the details of the old code.

    It was the Microsoft management that did it -- the culmination of a whole sequence of short-term, profit-oriented decisions over many years ...
    decades. What may have started out as an “elegant design” finally became unrecognizable as such.

    Compare what was happening to Linux over the same time interval, where the programmers were (largely) not beholden to managers and bean counters.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Oct 15 00:14:25 2024
    On Mon, 14 Oct 2024 19:20:42 +0000, Michael S wrote:

    On Mon, 14 Oct 2024 19:02:51 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for equality).
    Rarely needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    That's their problem. The rest of the C world shouldn't suffer because
    of odd birds.

    So, you are saying that the 286 in its heyday was/is odd ?!?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Tue Oct 15 00:17:04 2024
    On Mon, 14 Oct 2024 23:51:27 +0000, Lawrence D'Oliveiro wrote:

    On Tue, 8 Oct 2024 22:28 +0100 (BST), John Dallman wrote:

    The same problem seems to have messed up all the attempts to provide
    good Unix emulation on VMS.

    Was it the Perl build scripts that, at some point their compatibility
    tests on a *nix system, would announce “Congratulations! You’re not running EUNICE!”.

    In article <vdvvae$1k931$2@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    I think the whole _personality_ concept, along with the supposed
    portability to non-x86 architectures, had just bit-rotted away by that
    point.

    Some combination of that, Microsoft confidence that "of course we can do
    something better now!" - they are very prone to overconfidence - and the
    terrible tendency of programmers to ignore the details of the old code.

    It was the Microsoft management that did it -- the culmination of a
    whole
    sequence of short-term, profit-oriented decisions over many years ... decades. What may have started out as an “elegant design” finally became unrecognizable as such.

    Compare what was happening to Linux over the same time interval, where
    the
    programmers were (largely) not beholden to managers and bean counters.

    Last 5 words are unnecessary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Oct 15 00:15:49 2024
    On Mon, 14 Oct 2024 19:39:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality). Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    Stick to the question asked. Registers were 16-binary digits,
    and segment registers enabled access to 24-bit address space.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Levine on Tue Oct 15 05:20:10 2024
    On Tue, 8 Oct 2024 21:03:40 -0000 (UTC), John Levine wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:

    If you look at the 8086 manuals, that's clearly what they had in mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.

    Right, and they appeared not to care or realize it was a performance
    problem.

    They didn’t expect anybody to make serious use of it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Tue Oct 15 10:41:41 2024
    On Tue, 15 Oct 2024 00:14:25 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 19:20:42 +0000, Michael S wrote:

    On Mon, 14 Oct 2024 19:02:51 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??

    That's their problem. The rest of the C world shouldn't suffer
    because of odd birds.

    So, you are saying that 286 in its hey-day was/is odd ?!?

    In its heyday the 80286 was used as a MUCH faster 8088.
    The 286-as-286 was/is an odd creature. I'd dare to say that it had no heyday.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Tue Oct 15 11:16:55 2024
    On Mon, 14 Oct 2024 23:55:59 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 09 Oct 2024 13:37:41 -0400, EricP wrote:

    The Posix interface support was there so *MS* could bid on US
    government and military contracts which, at that time frame, were
    making noise about it being standard for all their contracts.
    The Posix DLLs didn't come with WinNT, you had to ask MS for them specially.

    And that whole POSIX subsystem was so sadistically, unusably awful,
    it just had to be intended for show as a box-ticking exercise,
    nothing more.

    <https://www.youtube.com/watch?v=BOeku3hDzrM>

    Back then "object oriented" and "micro-kernel" buzzwords were all
    the rage.

    OO still lives on in higher-level languages. Microsoft’s one attempt
    to incorporate its OO architecture--Dotnet--into the lower layers of
    the OS, in Windows Vista, was an abject, embarrassing failure which
    hopefully nobody will try to repeat.


    It sounds like you are confusing .net with something unrelated.
    Probably with Microsoft's failed WinFS filesystem.
    WinFS was *not* object-oriented.

    AFAIK, .net is a hugely successful application development technology
    that was never incorporated into the lower layers of the OS.

    If you are interested in failed attempts to incorporate .net into
    something it does not fit then please consider Silverlight.
    But then, the story of Silverlight is not dissimilar to the story of
    in-browser Java, with the main difference that the latter was more
    harmful to the industry.

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless, and
    they come out of the woodwork in a futile attempt to disagree ...

    Seems, you are not ashamed to admit your trolling tactics.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Tue Oct 15 10:53:30 2024
    On 14/10/2024 18:08, Michael S wrote:
    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 16:40, Terje Mathisen wrote:
    David Brown wrote:
    On 13/10/2024 21:21, Terje Mathisen wrote:
    David Brown wrote:
    On 10/10/2024 20:38, MitchAlsup1 wrote:
    On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

    On 09/10/2024 23:37, MitchAlsup1 wrote:
    On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

    On 09/10/2024 20:10, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    When would you ever /need/ to compare pointers to
    different objects?
    For almost all C programmers, the answer is "never".

    Sometimes, it is handy to encode certain conditions in
    pointers, rather than having only a valid pointer or
    NULL.  A compiler, for example, might want to store the >>>>>>>>>>> fact that an error occurred while parsing a subexpression >>>>>>>>>>> as a special pointer constant.

    Compilers often have the unfair advantage, though, that
    they can rely on what application programmers cannot, their >>>>>>>>>>> implementation details.  (Some do not, such as f2c). >>>>>>>>>>
    Standard library authors have the same superpowers, so that >>>>>>>>>> they can
    implement an efficient memmove() even though a pure standard >>>>>>>>>> C programmer cannot (other than by simply calling the
    standard library
    memmove() function!).

    This is more a symptom of bad ISA design/evolution than of
    libc writers needing superpowers.

    No, it is not.  It has absolutely /nothing/ to do with the >>>>>>>> ISA.

    For example, if ISA contains an MM instruction which is the
    embodiment of memmove() then absolutely no heroics are needed
    of desired in the libc call.


    The existence of a dedicated assembly instruction does not let
    you write an efficient memmove() in standard C.  That's why I
    said there was no connection between the two concepts.

    For some targets, it can be helpful to write memmove() in
    assembly or using inline assembly, rather than in non-portable C
    (which is the common case).

    Thus, it IS a symptom of ISA evolution that one has to rewrite
    memmove() every time wider SIMD registers are available.

    It is not that simple.

    There can often be trade-offs between the speed of memmove() and
    memcpy() on large transfers, and the overhead in setting things
    up that is proportionally more costly for small transfers.
    Often that can be eliminated when the compiler optimises the
    functions inline - when the compiler knows the size of the
    move/copy, it can optimise directly.

    What you are missing here David is the fact that Mitch's MM is a
    single instruction which does the entire memmove() operation, and
    has the inside knowledge about cache (residency at level x? width
    in bytes)/memory ranges/access rights/etc needed to do so in a
    very close to optimal manner, for both short and long transfers.

    I am not missing that at all.  And I agree that an advanced
    hardware MM instruction could be a very efficient way to implement
    both memcpy and memmove.  (For my own kind of work, I'd worry
    about such looping instructions causing an unbounded increased in
    interrupt latency, but that too is solvable given enough hardware
    effort.)

    And I agree that once you have an "MM" (or similar) instruction,
    you don't need to re-write the implementation for your memmove()
    and memcpy() library functions for every new generation of
    processors of a given target family.

    What I /don't/ agree with is the claim that you /do/ need to keep
    re-writing your implementations all the time.  You will
    /sometimes/ get benefits from doing so, but it is not as simple as
    Mitch made out.

    I.e. totally removing the need for compiler tricks or wide
    register operations.

    Also apropos the compiler library issue:

    You start by teaching the compiler about the MM instruction, and
    to recognize common patterns (just as most compilers already do
    today), and then the memmove() calls will usually be inlined.


    The original compile library issue was that it is impossible to
    write an efficient memmove() implementation using pure portable
    standard C. That is independent of any ISA, any specialist
    instructions for memory moves, and any compiler optimisations.
    And it is independent of the fact that some good compilers can
    inline at least some calls to memcpy() and memmove() today, using
    whatever instructions are most efficient for the target.

    David, you and Mitch are among my most cherished writers here on
    c.arch, I really don't think any of us really disagree, it is just
    that we have been discussing two (mostly) orthogonal issues.

    I agree. It's a "god dag mann, økseskaft" situation.

    I have a huge respect for Mitch, his knowledge and experience, and
    his willingness to share that freely with others. That's why I have
    found this very frustrating.


    a) memmove/memcpy are so important that people have been spending a
    lot of time & effort trying to make it faster, with the
    complication that in general it cannot be implemented in pure C
    (which disallows direct comparison of arbitrary pointers).


    Yes.

    (Unlike memmove(), memcpy() can be implemented in standard C as a
    simple byte-copy loop, without needing to compare pointers. But an
    implementation that copies in larger blocks than a byte requires
    implementation dependent behaviour to determine alignments, or it
    must rely on unaligned accesses being allowed by the implementation.)

    b) Mitch have, like Andy ("Crazy") Glew many years before, realized
    that if a cpu architecture actually has an instruction designed to
    do this particular job, it behooves cpu architects to make sure
    that it is in fact so fast that it obviates any need for tricky
    coding to replace it.

    Yes.

    Ideally, it should be able to copy a single object, up to a cache
    line in size, in the same or less time needed to do so manually
    with a SIMD 512-bit load followed by a 512-bit store (both ops
    masked to not touch anything it shouldn't)


    Yes.

    REP MOVSB on x86 does the canonical memcpy() operation, originally
    by moving single bytes, and this was so slow that we also had REP
    MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
    REP MOVSQ on 64-bit cpus.

    With a suitable chunk of logic, the basic MOVSB operation could in
    fact handle any kinds of alignments and sizes, while doing the
    actual transfer at maximum bus speeds, i.e. at least one cache
    line/cycle for things already in $L1.


    I agree on all of that.

    I am quite happy with the argument that suitable hardware can do
    these basic operations faster than a software loop or the x86 "rep"
    instructions.

    No, that's not true. And according to my understanding, that's not what
    Terje wrote.
    REP MOVSB _is_ almost ideal instruction for memcpy (modulo minor
    details - fixed registers for src, dest, len and Direction flag in PSW instead of being part of the opcode).

    My understanding of what Terje wrote is that REP MOVSB /could/ be an
    efficient solution if it were backed by a hardware block to run well
    (i.e., transferring as many bytes per cycle as memory bus bandwidth
    allows). But REP MOVSB is /not/ efficient - and rather than making it
    work faster, Intel introduced variants with wider fixed sizes.

    Could REP MOVSB realistically be improved to be as efficient as the instructions in ARMv9, RISC-V, and Mitch's "MM" instruction? I don't
    know. Intel and AMD have had many decades to do so, so I assume it's
    not an easy improvement.

    REP MOVSW/D/Q were introduced because back then processors were small
    and stupid. When your processor is big and smart you don't need them
    any longer. REP MOVSB is sufficient.
    New Arm64 instruction that are hopefully coming next year are akin to
    REP MOVSB rather than to MOVSW/D/Q.
    Instructions for memmove, also defined by Arm and by Mitch, is the next logical step. IMHO, the main gain here is not measurable improvement in performance, but saving of code size when inlined.

    Now, is all that a good idea?

    That's a very important question.

    I am not 100% convinced.
    One can argue that streaming alignment hardware that is necessary for 1st-class implementation of these instructions is useful not only for
    memory copy.
    So, may be, it makes sense to expose this hardware in more generic ways.

    I believe that is the idea of "scalable vector" instructions as an
    alternative philosophy to wide explicit SIMD registers. My expectation
    is that SVE implementations will be more effort in the hardware than
    SIMD for any specific SIMD-friendly size point (i.e., power-of-two
    widths). That usually corresponds to lower clock rates and/or higher
    latency and more coordination from extra pipeline stages.

    But once you have SVE support in place, then memcpy() and memset() are
    just examples of vector operations that you get almost for free when you
    have hardware for vector MACs and other operations.
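    (A sketch of what that looks like with the Arm SVE ACLE intrinsics,
    assuming a compiler and target with SVE support; the function name is
    mine.  The loop is length-agnostic, so the same object code serves any
    vector width.)

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Length-agnostic byte copy: the predicate from whilelt masks off the
       tail, so no scalar clean-up loop is needed. */
    void *sve_memcpy(void *dst, const void *src, size_t n)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;
        for (size_t i = 0; i < n; i += svcntb()) {
            svbool_t pg = svwhilelt_b8((uint64_t) i, (uint64_t) n);
            svuint8_t v = svld1_u8(pg, s + i);
            svst1_u8(pg, d + i, v);
        }
        return dst;
    }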

    May be, via Load Multiple Register? It was present in Arm's A32/T32,
    but didn't make it into ARM64. Or, may be, there are even better ways
    that I was not thinking about.

    And I fully agree that these would be useful features
    in general-purpose processors.

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving a goalpost.

    No, my goalposts have been in the same place all the time. Some others
    have been kicking the ball at a completely different set of goalposts,
    but I have kept the same point all along.

    One does not need "good implementation" in a sense you have in mind.

    Maybe not - but /that/ would be moving the goalposts.

    All one needs is an implementation that pattern matching logic of
    compiler unmistakably recognizes as memove/memcpy. That is very easily
    done in standard C. For memmove, I had shown how to do it in one of the
    posts below. For memcpy its very obvious, so no need to show.


    But that would /not/ be an efficient implementation of memmove() in
    plain portable standard C.

    What do I mean by an "efficient" implementation in fully portable
    standard C? There are two possible ways to think about that. One is
    that the operations on the abstract machine are efficient. The other is
    that the code is likely to result in efficient code over a wide range of real-world compilers, options, and targets. And I think it goes without
    saying that the implementation must not rely on any
    implementation-defined behaviour or anything beyond the minimal limits
    given in the C standards, and it must not introduce any new real or
    potential UB.

    Your "memmove()" implementation fails on several counts. It is
    inefficient in the abstract machine - it copies everything twice instead
    of once. It is inefficient in real-world implementations of all sorts
    and countless targets - being efficient for some compilers with some
    options on some targets (most of them hypothetical) does /not/ qualify
    as an efficient implementation. And quite clearly it risks causing
    failures from stack overflow in situations where the user would normally
    expect memmove() to function safely (on implementations other than those
    few that turn it into efficient object code).

    They would make it easier to write efficient
    implementations of these standard library functions for targets that
    had such instructions - but that would be implementation-specific
    code. And that is one of the reasons that C standard library
    implementations are tied to the specific compiler and target, and the
    writers of these libraries have "superpowers" and are not limited to
    standard C.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to John Levine on Tue Oct 15 11:59:27 2024
    On Tue, 8 Oct 2024 21:03:40 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    If you look at the 8086 manuals, that's clearly what they had in
    mind.

    What I don't get is that the 286's segment stuff was so slow.

    It had to load the whole segment descriptor from RAM and possibly
    perform some additional setup.

    Right, and they appeared not to care or realize it was a performance
    problem.

    They didn't even do obvious things like see if you're reloading the
    same value into the segment register and skip the rest of the setup.
    Sure, you could put checks in your code and skip the segment load but
    that would make your code a lot bigger and uglier.


    The question is how the slowness of 80286 segments compares to that of
    contemporaries that used segment-based protected memory.
    Wikipedia lists following machines as examples of segmentation:
    - Burroughs B5000 and following Burroughs Large Systems
    - GE 645 -> Honeywell 6080
    - Prime 400 and successors
    - IBM System/38
    They also mention S/370, but to me segmentation in S/370 looks very
    different and probably not intended for fine-grained protection.

    Of those, the Burroughs B5900 looks to me the most comparable to the 80286.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Tue Oct 15 12:38:40 2024
    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard way to
    compare independent pointers (other than just for equality).  Rarely
    needing something does not mean /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual
    addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to
    statically allocated data. Then you would expect the segment to be the
    same in each case, and the uintptr_t comparison will be fine for
    memmove(). Maybe they come from malloc() and are in different segments.
    Then the comparison here might not give the same result as a full
    virtual address comparison - but that does not matter. If the pointers
    came from different mallocs, they could not overlap and memmove() can
    run either direction.
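    (For instance, a memmove() built around that comparison might look like
    the sketch below; the name is mine, and it relies on the point above
    that the uintptr_t ordering only has to be meaningful within a single
    object, which is all that overlap handling needs.)

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: pick the copy direction from a uintptr_t comparison.  For
       pointers into the same object the ordering is meaningful; for
       pointers into different objects either direction is safe anyway. */
    void *memmove_by_compare(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = (const unsigned char *) src;
        if ((uintptr_t) d <= (uintptr_t) s) {
            for (size_t i = 0; i < n; i++)      /* copy forwards  */
                d[i] = s[i];
        } else {
            for (size_t i = n; i-- > 0; )       /* copy backwards */
                d[i] = s[i];
        }
        return dst;
    }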

    The same applies to other uses, such as indexing in a binary search tree
    or a hash map - the comparison above will be correct when it matters.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Tue Oct 15 14:22:46 2024
    On Tue, 15 Oct 2024 12:38:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality).  Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to statically allocated data. Then you would expect the segment to be
    the same in each case, and the uintptr_t comparison will be fine for memmove(). Maybe they come from malloc() and are in different
    segments. Then the comparison here might not give the same result as
    a full virtual address comparison - but that does not matter. If the pointers came from different mallocs, they could not overlap and
    memmove() can run either direction.

    The same applies to other uses, such as indexing in a binary search
    tree or a hash map - the comparison above will be correct when it
    matters.




    It's all fine for as long as there are no objects bigger than 64KB.
    But with 16MB of virtual memory and with several* MB of physical memory
    one does want objects that are bigger than 64KB!

    ---
    * https://theretroweb.com/motherboards/s/compaq-deskpro-286e-p-n-001226

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Tue Oct 15 14:09:58 2024
    On 15/10/2024 13:22, Michael S wrote:
    On Tue, 15 Oct 2024 12:38:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality).  Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual
    addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to
    statically allocated data. Then you would expect the segment to be
    the same in each case, and the uintptr_t comparison will be fine for
    memmove(). Maybe they come from malloc() and are in different
    segments. Then the comparison here might not give the same result as
    a full virtual address comparison - but that does not matter. If the
    pointers came from different mallocs, they could not overlap and
    memmove() can run either direction.

    The same applies to other uses, such as indexing in a binary search
    tree or a hash map - the comparison above will be correct when it
    matters.




    It's all fine for as long as there are no objects bigger than 64KB.
    But with 16MB of virtual memory and with several* MB of physical memory
    one does want objects that are bigger than 64KB!


    I don't know how such objects would be allocated and addressed in such a system. (I didn't do much DOS/Win16 programming, and on the few
    occasions when I needed structures bigger than 64KB in total, they were structured in multiple levels.)

    But I would expect that in almost any practical system where you can use
    "p++" to step through big arrays, you can also convert the pointer to a uintptr_t and compare as shown above.

    The exceptions would be systems where pointers hold more than just
    addresses, such as access control information or bounds that mean they
    are larger than the largest integer type on the target.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to D'Oliveiro on Tue Oct 15 18:40:00 2024
    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS components, but not many people actually want that.

    <https://en.wikipedia.org/wiki/Hybrid_kernel>

    Windows NT and Apple's XNU, used in all their operating systems, are both hybrid kernels, so the idea is somewhat practical.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Michael S on Tue Oct 15 18:40:00 2024
    In article <20241015111655.000064b3@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    OO still lives on in higher-level languages. Microsoft's one
    attempt to incorporate its OO architecture--Dotnet--into the
    lower layers of the OS, in Windows Vista, was an abject,
    embarrassing failure which hopefully nobody will try to repeat.

    AFAIK, .net is hugely successful application development technology
    that was never incorporated into lower layers of the OS.

    You're correct. There was an experimental Microsoft OS that was almost
    entirely written in .NET but it was never commercialised.

    <https://en.wikipedia.org/wiki/Singularity_(operating_system)>

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Tue Oct 15 18:57:07 2024
    jgd@cix.co.uk (John Dallman) writes:
    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    It's useful to note that the primary shortcoming of a
    microkernel (domain crossing latency) is mostly not a problem
    on RISC processors (like ARM64) where the ring change
    takes about the same amount of time as a function call.

    One might also argue that in many aspects, a hypervisor is
    a 'microkernel' with some hardware support on most modern
    CPUs.

    Disclaimer: I spent most of the 90's working with the
    Chorus microkernel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to David Brown on Tue Oct 15 19:46:23 2024
    David Brown <david.brown@hesbynett.no> wrote:
    On 15/10/2024 13:22, Michael S wrote:
    On Tue, 15 Oct 2024 12:38:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2024 21:02, MitchAlsup1 wrote:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality).  Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare to segmented
    pointers ??


    void * p = ...
    void * q = ...

    uintptr_t pu = (uintptr_t) p;
    uintptr_t qu = (uintptr_t) q;

    if (pu > qu) {
    ...
    } else if (pu < qu) {
    ...
    } else {
    ...
    }


    If your comparison needs to actually match up with the real virtual
    addresses, then this will not work. But does that actually matter?

    Think about using this comparison for memmove().

    Consider where these pointers come from. Maybe they are pointers to
    statically allocated data. Then you would expect the segment to be
    the same in each case, and the uintptr_t comparison will be fine for
    memmove(). Maybe they come from malloc() and are in different
    segments. Then the comparison here might not give the same result as
    a full virtual address comparison - but that does not matter. If the
    pointers came from different mallocs, they could not overlap and
    memmove() can run either direction.

    The same applies to other uses, such as indexing in a binary search
    tree or a hash map - the comparison above will be correct when it
    matters.
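
    For concreteness, a minimal sketch of a memmove-style copy built on exactly
    that comparison (the helper name is mine; the only assumption is that
    converting to uintptr_t preserves the relative order of pointers into the
    same object, which is the practical case discussed above):

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: pick the copy direction from a uintptr_t comparison.  If dst
       and src point into different objects they cannot overlap, so either
       direction is safe and the comparison result does not matter. */
    static void *move_bytes(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        if ((uintptr_t) d < (uintptr_t) s) {
            for (size_t i = 0; i < n; i++)         /* copy forwards  */
                d[i] = s[i];
        } else if ((uintptr_t) d > (uintptr_t) s) {
            for (size_t i = n; i > 0; i--)         /* copy backwards */
                d[i - 1] = s[i - 1];
        }
        return dst;
    }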




    It's all fine for as long as there are no objects bigger than 64KB.
    But with 16MB of virtual memory and with several* MB of physical memory
    one does want objects that are bigger than 64KB!


    I don't know how such objects would be allocated and addressed in such a system. (I didn't do much DOS/Win16 programming, and on the few
    occasions when I needed structures bigger than 64KB in total, they were structured in multiple levels.)

    But I would expect that in almost any practical system where you can use "p++" to step through big arrays, you can also convert the pointer to a uintptr_t and compare as shown above.

    The exceptions would be systems where pointers hold more than just
    addresses, such as access control information or bounds that mean they
    are larger than the largest integer type on the target.

    EGA graphics had more than 64K of display memory; smart software would group
    one or more scan lines per segment when bit-mapping the array. A bit mapper
    works a scan line at a time, so segment changes were not that expensive. This
    was profoundly faster than using pixel pokes and the other default methods of
    changing bits.
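
    A rough sketch of that scan-line grouping (illustrative only - the sizes,
    names and band count here are made up, and allocating the bands is not
    shown): keep the big bitmap as one allocation per band of scan lines, each
    band small enough for a single segment, so the "segment" only changes
    between bands, never in the middle of a scan line.

    #include <stddef.h>

    /* Illustrative: a 1 bpp bitmap split into bands of scan lines,
       each band kept under 64KB so it fits in one segment. */
    #define BYTES_PER_LINE 80u        /* e.g. 640 pixels / 8 */
    #define LINES_PER_BAND 512u       /* 512 * 80 = 40960 bytes < 64KB */

    struct bitmap { unsigned char *band[8]; };    /* up to 4096 lines */

    static unsigned char *scan_line(struct bitmap *bm, unsigned y)
    {
        /* Everything within one scan line stays inside one band. */
        return bm->band[y / LINES_PER_BAND]
             + (size_t)(y % LINES_PER_BAND) * BYTES_PER_LINE;
    }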

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Oct 15 17:26:29 2024
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue Oct 15 21:55:44 2024
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.
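
    A cut-down sketch of that point in portable C with <stdarg.h> - only %c,
    %s and %d, no field widths, but enough to show that nothing magic is
    needed (the name tiny_printf is just for illustration):

    #include <stdarg.h>
    #include <stdio.h>

    /* Toy printf: handles %c, %s, %d and literal characters only. */
    static void tiny_printf(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        for (; *fmt; fmt++) {
            if (*fmt != '%') { putchar(*fmt); continue; }
            if (*++fmt == '\0') break;             /* stray trailing '%' */
            switch (*fmt) {
            case 'c': putchar(va_arg(ap, int)); break;
            case 's': for (const char *s = va_arg(ap, char *); *s; s++)
                          putchar(*s);
                      break;
            case 'd': {
                int v = va_arg(ap, int);
                unsigned u = v < 0 ? (putchar('-'), 0u - (unsigned) v)
                                   : (unsigned) v;
                char buf[24]; int i = 0;
                do { buf[i++] = '0' + u % 10; u /= 10; } while (u);
                while (i) putchar(buf[--i]);
                break;
            }
            default: putchar(*fmt); break;         /* e.g. "%%" */
            }
        }
        va_end(ap);
    }

    Called as tiny_printf("%s = %d%c", "x", 42, '\n') it prints "x = 42" and a
    newline; the real printf() differs mainly in how much formatting it layers
    on top of the same va_arg machinery.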

    In an ideal world, it would be better if we could define `malloc` and `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Oct 15 22:05:56 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/malloc.html

    POSIX adds some extensions (marked 'CX').

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Tue Oct 15 19:51:27 2024
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.


    <https://en.wikipedia.org/wiki/Hybrid_kernel>

    Windows NT and Apple's XNU, used in all their operating systems, are both
    hybrid kernels, so the idea is somewhat practical.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Oct 16 00:24:07 2024
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/malloc.html

    POSIX adds some extensions (marked 'CX').

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to George Neuner on Wed Oct 16 07:36:29 2024
    George Neuner wrote:
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.

    This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
    As soon as you have more than a single instance of a particular
    server/service, then you replace them in groups so that the service sees
    zero downtime even though all the servers have been updated/replaced.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Wed Oct 16 09:17:03 2024
    On 16/10/2024 07:36, Terje Mathisen wrote:
    George Neuner wrote:
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of
    the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.

    This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
    As soon as you have more than a single instance of a particular server/service, then you replace them in groups so that the service sees
    zero downtime even though all the servers have been updated/replaced.


    That's fine - /if/ you have a service that can easily be spread across
    multiple systems, and you can justify the cost of that. Setting up a
    database server is simple enough.

    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    But clouds /do/ mean that your virtual machine can be migrated (with
    zero, or almost zero, downtime) to another physical server if there are hardware problems or during hardware maintenance. And if you can do
    easy snapshots with your cloud / VM infrastructure, then you can roll
    back if things go badly wrong. So you have a single server instance,
    you plan a short period of downtime, take a snapshot, stop the service, upgrade, restart. That's what almost everyone does, other than the
    /really/ big or /really/ critical service providers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Stefan Monnier on Wed Oct 16 09:21:59 2024
    On 15/10/2024 23:26, Stefan Monnier wrote:
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    In an ideal world, it would be better if we could define `malloc` and `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.


    I don't see an advantage in being able to implement them in standard C.
    I /do/ see an advantage in being able to do so well in non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require specific
    time constraints on these functions. In such cases, you are not
    interested in writing fully portable software - it will already contain
    many implementation-specific features or use compiler extensions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Wed Oct 16 09:38:20 2024
    On 15/10/2024 23:55, MitchAlsup1 wrote:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    It's a very good philosophy in programming language design that the core language should only contain what it has to contain - if a desired
    feature can be put in a library and be equally efficient and convenient
    to use, then it should be in the standard library, not the core
    language. It is much easier to develop, implement, enhance, adapt, and otherwise change things in libraries than the core language.

    And it is also fine, IMHO, that some things in the standard library need non-standard C - the standard library is part of the implementation.


    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    The function has always been available in C since the language was standardised, and AFAIK it was in K&R C. But no one (in authority) ever claimed it could be implemented purely in standard C. What do you think
    has changed?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Wed Oct 16 11:18:19 2024
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.
    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".
    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.
    I don't see an advantage in being able to implement them in standard C.

    It means you can likely also implement a related yet different API
    without having your code "demoted" to non-standard.
    E.g. say if your application wants to use a region/pool/zone-based
    memory management.

    The fact that malloc can't be implemented in standard C is evidence
    that standard C may not be general-purpose enough to accommodate an
    application that wants to use a custom-designed allocator.

    I don't disagree with you, from a practical perspective:

    - in practice, C serves us well for Emacs's GC, even though that can't
    be written in standard C.
    - it's not like there are lots of other languages out there that offer
    you portability together with the ability to define your own `malloc`.

    But it's still a weakness, just a fairly minor one.

    The reason why you might want your own special memmove, or your own special malloc, is that you are doing niche and specialised software.

    Region/pool/zone-based memory management is common enough that I would
    not call it "niche", FWIW, and it's also used in applications that do want portability (GCC and Apache come to mind).
    Can't think of a practical reason to implement my own `memmove`, OTOH.
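
    For what it's worth, the pooling part itself can be written in plain ISO C
    once malloc() is taken as given - only obtaining fresh memory from the
    system is implementation territory. A minimal region sketch (C11 for
    max_align_t; the names are mine, not GCC's obstacks or Apache's pools):

    #include <stddef.h>
    #include <stdlib.h>

    /* A trivial growing region: chunks come from malloc() and the whole
       region is released at once - there is no per-object free(). */
    struct chunk  { struct chunk *next; unsigned char *mem; size_t used, size; };
    struct region { struct chunk *head; };       /* initialise with { NULL } */

    static void *region_alloc(struct region *r, size_t n)
    {
        size_t a = _Alignof(max_align_t);
        n = (n + a - 1) / a * a;             /* keep every object aligned */

        if (!r->head || r->head->size - r->head->used < n) {
            size_t csz = n > 4096 ? n : 4096;
            struct chunk *c = malloc(sizeof *c);
            unsigned char *m = c ? malloc(csz) : NULL;
            if (!m) { free(c); return NULL; }
            c->next = r->head; c->mem = m; c->used = 0; c->size = csz;
            r->head = c;
        }
        void *p = r->head->mem + r->head->used;
        r->head->used += n;
        return p;
    }

    static void region_free_all(struct region *r)
    {
        for (struct chunk *c = r->head, *next; c; c = next) {
            next = c->next; free(c->mem); free(c);
        }
        r->head = NULL;
    }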


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Oct 16 15:38:47 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Stefan Monnier on Wed Oct 16 19:57:03 2024
    (Please do not snip or omit attributions. There are Usenet standards
    for a reason.)

    On 16/10/2024 17:18, Stefan Monnier wrote:
    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.
    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".
    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.
    I don't see an advantage in being able to implement them in standard C.

    It means you can likely also implement a related yet different API
    without having your code "demoted" to non-standard.

    That makes no sense to me. We are talking about implementing standard
    library functions. If you want to implement other functions, go ahead.

    Or do you mean that it is only possible to implement related functions
    (such as memory pools) if you also can implement malloc in fully
    portable standard C? That would make a little more sense if it were
    true, but it is not. First, you can implement such functions in implementation-specific code, so you are not hindered from writing the
    code you want. Secondly, standard C provides functions such as malloc()
    and aligned_alloc() that give you the parts you need - the fact that you
    need something outside of standard C to implement malloc() does not
    imply that you need those same features to implement your additional
    functions.

    E.g. say if your application wants to use a region/pool/zone-based
    memory management.

    The fact that malloc can't be implemented in standard C is evidence
    that standard C may not be general-purpose enough to accommodate an application that wants to use a custom-designed allocator.


    No, it is not - see above.

    And remember how C was designed and how it was intended to be used. The
    aim was to be able to write portable code that could be reused on many
    systems, and /also/ implementation, OS and target specific code for
    maximum efficiency, systems programming, and other non-portable work. A typical C program combines these - some parts can be fully portable,
    other parts are partially portable (such as to any POSIX system, or
    targets with 32-bit int and 8-bit char), and some parts may be very compiler-specific or target specific.

    That's not an indication of failure of C for general-purpose
    programming. (But I would certainly not suggest that C is the best
    choice of language for many "general" programming tasks.)

    I don't disagree with you, from a practical perspective:

    - in practice, C serves us well for Emacs's GC, even though that can't
    be written in standard C.
    - it's not like there are lots of other languages out there that offer
    you portability together with the ability to define your own `malloc`.

    But it's still a weakness, just a fairly minor one.

    The reason why you might want your own special memmove, or your own special
    malloc, is that you are doing niche and specialised software.

    Region/pool/zone-based memory management is common enough that I would
    not call it "niche", FWIW, and it's also used in applications that do want portability (GCC and Apache come to mind).
    Can't think of a practical reason to implement my own `memove`, OTOH.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Oct 16 20:00:27 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    But more problematic is the implementation of free() without
    knowing how to compare pointers.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Oct 16 22:18:49 2024
    On Wed, 16 Oct 2024 20:00:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    Agreed, but once you HAVE a way of getting memory (by whatever name)
    you can write malloc in std. C.
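
    As a toy illustration of that position - a static array stands in for
    whatever brk()/mmap()-style source actually provides the memory, and it
    never frees, which dodges the free() question quoted just below. (Whether
    re-using declared char storage for objects of other types is strictly
    conforming is, of course, exactly the sort of language-lawyer question
    this subthread is about.)

    #include <stddef.h>

    /* Bump allocator over a fixed pool; never frees, never reuses. */
    static _Alignas(max_align_t) unsigned char pool[1u << 16];
    static size_t pool_used;

    static void *toy_malloc(size_t n)
    {
        size_t a = _Alignof(max_align_t);
        n = (n + a - 1) / a * a;            /* round up for alignment */
        if (n > sizeof pool - pool_used)
            return NULL;                    /* pool exhausted */
        void *p = pool + pool_used;
        pool_used += n;
        return p;
    }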

    But more problematic is the implementation of free() without
    knowing how to compare pointers.

    Never wrote a program that actually needs free--I have re-written
    programs that used free to avoid using free, though.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to david.brown@hesbynett.no on Wed Oct 16 21:19:34 2024
    On Wed, 16 Oct 2024 09:17:03 +0200, David Brown
    <david.brown@hesbynett.no> wrote:

    On 16/10/2024 07:36, Terje Mathisen wrote:
    George Neuner wrote:
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.

    This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
    As soon as you have more than a single instance of a particular
    server/service, then you replace them in groups so that the service sees
    zero downtime even though all the servers have been updated/replaced.


    That's fine - /if/ you have a service that can easily be spread across
    multiple systems, and you can justify the cost of that. Setting up a
    database server is simple enough.

    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to
    duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    But clouds /do/ mean that your virtual machine can be migrated (with
    zero, or almost zero, downtime) to another physical server if there are
    hardware problems or during hardware maintenance. And if you can do
    easy snapshots with your cloud / VM infrastructure, then you can roll
    back if things go badly wrong. So you have a single server instance,
    you plan a short period of downtime, take a snapshot, stop the service,
    upgrade, restart. That's what almost everyone does, other than the
    /really/ big or /really/ critical service providers.

    For various definitions of "short period of downtime". 8-)

    Fortunately, Linux installs updates - or stages updates for restart -
    much faster than Windoze. But rebooting to the point that all the
    services are running still can take several minutes.

    That can feel like an eternity when it's the only <whatever> server in
    a small business.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Oct 17 02:35:07 2024
    According to David Brown <david.brown@hesbynett.no>:
    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to
    duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    AWS provides a database service that does most of that. You can spin
    up databases, read-only mirrors, failover from one region to another,
    staging environments to test upgrades. They offer MySQL and
    PostgreSQL, as well as Oracle and DB2.

    It's still a fair amount of work, but way less than doing it all yourself.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Wed Oct 16 23:06:24 2024
    On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    Yes, but malloc() subdivides an already provided space. Because that
    space can be treated as a single array of char, and comparing pointers
    to elements of the same array is legal, the only thing I can see that
    prevents writing malloc() in standard C would be the need to somehow
    define the array from the /language's/ POV (not the compiler's) prior
    to using it.

    Which circles back to why something like

    char (*heap)[ULONG_MAX] = ... ;

    would/does not satisfy the language's requirement. All the compilers
    I have ever seen would have been happy with it, but none of them ever
    needed something like it anyway. Conversion to <an integer type> also
    would always work, but also was never needed.

    I am not a language lawyer - I don't even pretend to understand the
    arguments against allowing general pointer comparison.


    Aside: I have worked on architectures (DSPs) having disjoint memory
    spaces, spaces with differing bit widths, and even spaces where [sans
    MMU] the same physical address had multiple logical addresses whose
    use depended on the type of access.

    I have written allocators and even a GC for such architectures. Never
    had a problem convincing C compilers to compare pointers - the only
    issue I ever faced was whether the result actually was meaningful to
    the program.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to david.brown@hesbynett.no on Wed Oct 16 23:32:41 2024
    On Wed, 16 Oct 2024 09:38:20 +0200, David Brown
    <david.brown@hesbynett.no> wrote:


    It's a very good philosophy in programming language design that the core
    language should only contain what it has to contain - if a desired
    feature can be put in a library and be equally efficient and convenient
    to use, then it should be in the standard library, not the core
    language. It is much easier to develop, implement, enhance, adapt, and
    otherwise change things in libraries than the core language.

    And it is also fine, IMHO, that some things in the standard library need
    non-standard C - the standard library is part of the implementation.

    But it is a problem if the library has to be written using a different compiler. [For this purpose I would consider specifying different
    compiler flags to be using a different compiler.]

    Why? Because once these things are discovered, many programmers will
    see their advantages and lack the discipline to avoid using them for
    more general application work.


    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    The function has always been available in C since the language was
    standardised, and AFAIK it was in K&R C. But no one (in authority) ever
    claimed it could be implemented purely in standard C. What do you think
    has changed?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Thomas Koenig on Thu Oct 17 00:40:34 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C. I am not asking
    if it is still in the std libraries, I am asking what happened
    to make it impossible to write malloc() in std. C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    Right. And that is why malloc(), or some essential internal component
    of malloc(), has to be platform specific, and thus malloc() must be
    supplied by the implementation (which means both the compiler and the
    standard library).

    But more problematic is the implementation of free() without knowing
    how to compare pointers.

    Once there is a way to get additional memory from whatever underlying environment is there, malloc() and free() can be implemented (and I
    believe most often are implemented) without needing to compare
    pointers. Note: pointers can be tested for equality without having
    to compare them relationally, and testing pointers for equality is
    well-defined between any two pointers (which may need to be converted
    to 'void *' to avoid a type mismatch).
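
    A minimal sketch of that style of free() (hypothetical names; it assumes
    the allocation side placed a small header immediately before every payload
    it handed out): releasing a block is a pure push onto an unordered free
    list, so only equality and null tests are ever needed - address-ordered
    coalescing, which does compare pointers relationally, is a separate design
    choice.

    #include <stddef.h>

    /* Header written by the allocator just before each payload. */
    struct hdr { size_t size; struct hdr *next; };

    static struct hdr *free_list;          /* singly linked, unordered */

    static void toy_free(void *p)
    {
        if (p == NULL)
            return;
        struct hdr *h = (struct hdr *) p - 1;   /* back up to the header */
        h->next = free_list;                    /* push: no <, >, only = */
        free_list = h;
    }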

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Thu Oct 17 01:18:04 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    On Wed, 16 Oct 2024 20:00:27 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    The paragraph with 3 >'s indicates malloc() cannot be written in
    standard C. It used to be written in standard K&R C. I am not
    asking if it is still in the std libraries, I am asking what
    happened to make it impossible to write malloc() in standard C ?!?

    You need to reserve memory by some way from the operating system,
    which is, by necessity, outside of the scope of C (via brk(),
    GETMAIN, mmap() or whatever).

    Agreed, but once you HAVE a way of getting memory (by whatever name)
    you can write malloc in standard C.

    The point is that getting more memory is inherently platform
    specific, which is why malloc() must be defined by each particular implementation, and so was put in the standard library.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Thu Oct 17 02:48:49 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any
    existing functionality that cannot be written using the language
    is a sign of a weakness because it shows that despite being
    "general purpose" it fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc`
    and `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be standard K&R C--what dropped it from the
    standard??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written in
    standard C. It used to be written in standard K&R C.

    No, it didn't. In the original book (my copy is from the third
    printing of the first edition, copyright 1978), on page 175 there
    is a function 'alloc()' that shows how to write a memory allocator.
    The code in alloc() calls 'morecore()', described as follows:

    The function morecore obtains storage from the operating system.
    The details of how this is done of course vary from system to
    system. In UNIX, the system entry sbrk() returns a pointer to n
    more bytes of storage. [...]

    An implementation of morecore() is shown on the next page, and
    it indeed uses sbrk() to get more memory. That makes it UNIX
    specific, not portable standard C. Both alloc() and morecore()
    are part of chapter 8, "The UNIX System Interface".

    Note also that chapter 7, titled "Input and Output" and describing
    the standard library, mentions in section 7.9, "Some Miscellaneous
    Functions", the function calloc() as part of the standard library.
    (There is no mention of malloc().) The point of having a standard
    library is that the functions it contains depend on details of the
    underlying OS and thus cannot be written in platform-agnostic code.
    Being platform portable is the defining property of "standard C".

    (Amusing aside: the entire standard library seems to be covered by
    just #include <stdio.h>.)

    I am not
    asking if it is still in the standard libraries, I am asking what
    happened to make it impossible to write malloc() in standard C ?!?

    Functions such as sbrk() are not part of the C language. Whether
    it's called calloc() or malloc(), memory allocation has always
    needed access to some facilities not provided by the C language
    itself. The function malloc() is not any more writable in standard
    K&R C than it is in standard ISO C (except of course malloc() can
    be implemented by using calloc() internally, but that depends on
    calloc() being part of the standard library).
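
    Spelling out that parenthetical: with calloc() taken as given, a
    conforming malloc() can be a one-liner, at the cost of calloc()'s
    mandatory zero-fill (a sketch, not how real libraries do it):

    #include <stdlib.h>

    /* malloc() in terms of calloc(); calloc(1, 0) has the same latitude
       (NULL or a unique pointer) that malloc(0) does. */
    void *my_malloc(size_t n)
    {
        return calloc(1, n);
    }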

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Thu Oct 17 03:16:13 2024
    George Neuner <gneuner2@comcast.net> writes:

    On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    [...]

    malloc() used to be standard K&R C--what dropped it from the
    standard ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written in
    standard C. It used to be written in standard K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    Yes, but malloc() subdivides an already provided space.

    Not necessarily.

    Because that space can be treated as a single array of char,

    Not always.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Thu Oct 17 03:17:33 2024
    mitchalsup@aol.com (MitchAlsup1) writes:

    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    That is a foolish statement.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to John Levine on Thu Oct 17 14:41:06 2024
    On 17/10/2024 04:35, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to
    duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    AWS provides a database service that does most of that. You can spin
    up databases, read-only mirrors, failover from one region to another,
    staging environments to test upgrades. They offer MySQL and
    PostgreSQL, as well as Oracle and DB2.

    It's still a fair amount of work, but way less than doing it all yourself.


    That's an additional service they provide - it's not an inherent part of
    a cloud infrastructure. Still, it sounds like a useful service, and one
    that I might find useful in the future.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to George Neuner on Thu Oct 17 14:39:45 2024
    On 17/10/2024 03:19, George Neuner wrote:
    On Wed, 16 Oct 2024 09:17:03 +0200, David Brown
    <david.brown@hesbynett.no> wrote:

    On 16/10/2024 07:36, Terje Mathisen wrote:
    George Neuner wrote:
    On Tue, 15 Oct 2024 18:40 +0100 (BST), jgd@cix.co.uk (John Dallman)
    wrote:

    In article <vekb2f$1co97$6@dont-email.me>, ldo@nz.invalid (Lawrence
    D'Oliveiro) wrote:

    On the other hand, some stubborn holdouts are still fond of
    microkernels -- you just have to say the whole idea is pointless,
    and they come out of the woodwork in a futile attempt to disagree

    The idea is impractical, not pointless. A hybrid kernel gives most of the
    advantages of a microkernel to its developers, and avoids the need for
    lots of context switches. It doesn't let you easily replace low-level OS
    components, but not many people actually want that.

    Actually, I think there are a whole lot of people who can't afford
    non-stop server hardware but would greatly appreciate not having to
    waste time with a shutdown/reboot every time some OS component gets
    updated.

    YMMV.

    This is _exactly_ (one of) the problem(s) cloud infrastructure solves:
    As soon as you have more than a single instance of a particular
    server/service, then you replace them in groups so that the service sees
    zero downtime even though all the servers have been updated/replaced.


    That's fine - /if/ you have a service that can easily be spread across
    multiple systems, and you can justify the cost of that. Setting up a
    database server is simple enough.

    Setting up a database server along with a couple of read-only
    replications is harder. Adding a writeable failover secondary is harder
    still. Making sure that everything works /perfectly/ when the primary
    goes down for maintenance, and that everything is consistent afterwards,
    is even harder. Being sure it still all works even while the different
    parts have different versions during updates typically means you have to
    duplicate the whole thing so you can do test runs. And if the database
    server is not open source, your license costs will be absurd, compared
    to what you actually need to provide the service - usually just one
    server instance.

    Clouds do nothing to help any of that.

    But clouds /do/ mean that your virtual machine can be migrated (with
    zero, or almost zero, downtime) to another physical server if there are
    hardware problems or during hardware maintenance. And if you can do
    easy snapshots with your cloud / VM infrastructure, then you can roll
    back if things go badly wrong. So you have a single server instance,
    you plan a short period of downtime, take a snapshot, stop the service,
    upgrade, restart. That's what almost everyone does, other than the
    /really/ big or /really/ critical service providers.

    For various definitions of "short period of downtime". 8-)

    Yes, indeed.


    Fortunately, Linux installs updates - or stages updates for restart -
    much faster than Windoze. But rebooting to the point that all the
    services are running still can take several minutes.


    My experience is that the updates on Linux servers are usually fast (for desktops they can be slow, but that is usually because you have far more
    and bigger programs). Updates for virtual machines are particularly
    fast because you generally have a minimum of programs in the VM.
    Restarts are also fast for virtual machines - physical servers are often
    slow to restart, sometimes taking many minutes before they get to the
    point of starting the OS boot.

    That can feel like an eternity when it's the only <whatever> server in
    a small business.

    Sure. But for most small businesses, it's not hard to find off-peak
    times when you can have hours of downtime without causing a problem.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to George Neuner on Thu Oct 17 16:16:42 2024
    On 17/10/2024 05:06, George Neuner wrote:
    On Wed, 16 Oct 2024 15:38:47 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 22:05:56 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 15 Oct 2024 21:26:29 +0000, Stefan Monnier wrote:

    There is an advantage to the C approach of separating out some
    facilities and supplying them only in the standard library.

    It goes a bit further: for a general purpose language, any existing
    functionality that cannot be written using the language is a sign of
    a weakness because it shows that despite being "general purpose" it
    fails to cover this specific "purpose".

    One of the key ways C got into the minds of programmers was that
    one could write stuff like printf() in C and NOT need to have it
    entirely built-into the language.

    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    It still is part of the ISO C standard.

    The paragraph with 3 >'s indicates malloc() cannot be written
    in std. C. It used to be written in std. K&R C.

    K&R may have been 'de facto' standard C, but not 'de jure'.

    Unix V6 malloc used the 'brk' system call to allocate space
    for the heap. Later versions used 'sbrk'.

    Those are both kernel system calls.

    Yes, but malloc() subdivides an already provided space. Because that
    space can be treated as a single array of char, and comparing pointers
    to elements of the same array is legal, the only thing I can see that
    prevents writing malloc() in standard C would be the need to somehow
    define the array from the /language's/ POV (not the compiler's) prior
    to using it.


    It is common for malloc() implementations to ask the OS for large chunks
    of memory, then subdivide it and pass it out to the application. When
    the chunk(s) it has run out, it will ask for more from the OS. You
    could reasonably argue that each chunk it gets may be considered a
    single unsigned char array, but that is certainly not true for
    additional chunks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to George Neuner on Thu Oct 17 16:25:01 2024
    On 17/10/2024 05:32, George Neuner wrote:
    On Wed, 16 Oct 2024 09:38:20 +0200, David Brown
    <david.brown@hesbynett.no> wrote:


    It's a very good philosophy in programming language design that the core
    language should only contain what it has to contain - if a desired
    feature can be put in a library and be equally efficient and convenient
    to use, then it should be in the standard library, not the core
    language. It is much easier to develop, implement, enhance, adapt, and
    otherwise change things in libraries than the core language.

    And it is also fine, IMHO, that some things in the standard library need
    non-standard C - the standard library is part of the implementation.

    But it is a problem if the library has to be written using a different compiler. [For this purpose I would consider specifying different
    compiler flags to be using a different compiler.]

    Specifying different flags would technically give you a different /implementation/, but it would not normally be considered a different /compiler/. I see no problem at all if libraries (standard library or otherwise) are compiled with different flags. I can absolutely
    guarantee that the flags I use for compiling my application code are not
    the same as those used for compiling the static libraries that came with
    my toolchains. Using different /compilers/ could be a significant inconvenience, and might mean you lose additional features (such as
    link-time optimisation), but as long as the ABI is consistent then they
    should work fine.


    Why? Because once these things are discovered, many programmers will
    see their advantages and lack the discipline to avoid using them for
    more general application work.


    Really? Have you ever looked at the source code for a library such as
    glibc or newlib? Most developers would look at that and quickly shy
    away from all the macros, additional compiler-specific attributes,
    conditional compilation, and the rest of it. Very, very few would look
    into the details to see if there are any "tricks" or "secret" compiler extensions they can copy. And with very few exceptions, all the compiler-specific features will already be documented and available to programmers enthusiastic enough to RTFM.


    In an ideal world, it would be better if we could define `malloc` and
    `memmove` efficiently in standard C, but at least they can be
    implemented in non-standard C.

    malloc() used to be std. K&R C--what dropped it from the std ??

    The function has always been available in C since the language was
    standardised, and AFAIK it was in K&R C. But no one (in authority) ever
    claimed it could be implemented purely in standard C. What do you think
    has changed?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Oct 18 05:56:03 2024
    On Tue, 15 Oct 2024 11:16:55 +0300, Michael S wrote:

    AFAIK, .net is hugely successful application development technology that
    was never incorporated into lower layers of the OS.

    Look up the infamous “Longhorn reset”. Microsoft had to chuck away a bunch of low-performance, high-overhead code and try again, and Dotnet was the reason. This delayed Windows Vista by about a year and a half, and it was
    still a rush to get it out.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to John Dallman on Fri Oct 18 05:57:14 2024
    On Tue, 15 Oct 2024 18:40 +0100 (BST), John Dallman wrote:

    Windows NT and Apple's XNU, used in all their operating systems, are
    both hybrid kernels, so the idea is somewhat practical.

    The fact that both are regularly outperformed (and outfeatured) by Linux,
    on hardware that is supposedly optimized for those specific proprietary
    OSes, just reinforces my point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Michael S on Fri Oct 18 07:01:08 2024
    On Tue, 15 Oct 2024 11:59:27 +0300, Michael S wrote:

    The question is how slowness of 80286 segments compares to
    contemporaries that used segment-based protected memory.
    Wikipedia lists following machines as examples of segmentation:
    - Burroughs B5000 and following Burroughs Large Systems
    - GE 645 -> Honeywell 6080
    - Prime 400 and successors
    - IBM System/38

    Certainly the first two had “segmentation” that was nothing like the Intel (mis)interpretation of the concept.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Fri Oct 18 12:47:53 2024
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?
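
    As a rough illustration of the 8-digit "S s OOOOOO" pointer layout quoted
    above, a small decoder is sketched below. The field widths follow the
    description (sign digit C or D, one segment digit, six offset digits);
    treating D as the negative sign and passing the digits as a plain array,
    most significant digit first, are assumptions made here rather than
    details taken from the thread.

    #include <stdio.h>

    /* Decoded form of the "S s OOOOOO" pointer described above. */
    struct mcp_ptr {
        int  negative;   /* sign digit: 0xD taken as negative (assumption) */
        int  segment;    /* 0..7 */
        long offset;     /* 0..999999, digit offset within the segment */
    };

    /* digit[0] = sign (0xC or 0xD), digit[1] = segment, digit[2..7] = offset. */
    static struct mcp_ptr decode_ptr(const unsigned char digit[8])
    {
        struct mcp_ptr p;
        p.negative = (digit[0] == 0xD);
        p.segment  = digit[1];
        p.offset   = 0;
        for (int i = 2; i < 8; i++)
            p.offset = p.offset * 10 + digit[i];   /* six decimal digits */
        return p;
    }

    int main(void)
    {
        unsigned char d[8] = { 0xC, 3, 1, 2, 3, 4, 5, 6 };  /* C 3 123456 */
        struct mcp_ptr p = decode_ptr(d);
        printf("seg %d offset %ld%s\n", p.segment, p.offset,
               p.negative ? " (negative)" : "");
        return 0;
    }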

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Fri Oct 18 06:00:54 2024
    Michael S <already5chosen@yahoo.com> writes:

    On Mon, 14 Oct 2024 17:19:40 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    [...]

    My only point of contention is that the existence or lack of such
    instructions does not make any difference to whether or not you can
    write a good implementation of memcpy() or memmove() in portable
    standard C.

    You are moving a goalpost.

    No, he isn't.

    One does not need a "good implementation" in the sense you have in mind.
    All one needs is an implementation that the compiler's pattern-matching
    logic unmistakably recognizes as memmove/memcpy. That is very easily
    done in standard C. For memmove, I had shown how to do it in one of the
    posts below. For memcpy it's very obvious, so there is no need to show it.

    You have misunderstood the meaning of "standard C", which means
    code that does not rely on any implementation-specific behavior.
    "All one needs is an implementation that ..." already invalidates
    the requirement that the code not rely on implementation-specific
    behavior.
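
    As an aside on the claim quoted above: one way to get an overlap-safe
    copy in strictly portable C is to use only pointer *equality*, which the
    standard always permits, to discover whether the destination lies inside
    the source region. The sketch below is my own illustration of that idea,
    not the code referred to in the earlier post, and there is no guarantee
    that any particular compiler will pattern-match it into a memmove call.

    #include <stddef.h>

    /* Overlap-safe copy using only pointer equality tests (always defined
       in standard C). If dst points into the source region at offset k > 0,
       the regions overlap with dst above src, so copy backwards; in every
       other case a forward copy is safe. */
    void *portable_memmove(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t k = n;                      /* n means "dst not inside src" */

        for (size_t i = 0; i < n; i++) {
            if (s + i == d) {
                k = i;
                break;
            }
        }

        if (k == n || k == 0) {
            for (size_t i = 0; i < n; i++)     /* forward copy */
                d[i] = s[i];
        } else {
            for (size_t i = n; i-- > 0; )      /* backward copy */
                d[i] = s[i];
        }
        return dst;
    }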

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to All on Fri Oct 18 05:39:02 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [ISA support for copying possibly overlapping regions of memory]

    [Separately, what is possible to do in portable standard C]

    [...] I really don't think any of us really disagree, it is just
    that we have been discussing two (mostly) orthogonal issues.

    I would summarize the string of conversations as follows.

    It started with talking about what is or is not possible in
    "standard C", by which is meant C that does not rely on any implementation-specific behavior. (Topic A.)

    The discussion shifted after a comment about how to provide
    architectural support for copying one region of memory to
    another, where the areas of memory might overlap. (Topic B.)

    After the introduction of Topic B, most of the subsequent
    conversation either ignored Topic A or conflated the two
    topics.

    The key point is that Topic B has nothing to do with Topic A,
    and vice versa. It's like asking why it's colder in the
    mountains than it is in the summer: both parts have something
    to do with temperature, but in spite of that there is no
    meaningful relationship between them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Fri Oct 18 14:06:17 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/ needing
    it.

    OK, take a segmented memory model with 16-bit pointers and a 24-bit
    virtual address space. How do you actually compare two segmented
    pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    Unisys discontinued that line of systems in 1992.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Fri Oct 18 17:34:16 2024
    On Fri, 18 Oct 2024 14:06:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/
    needing it.

    OK, take a segmented memory model with 16-bit pointers and a
    24-bit virtual address space. How do you actually compare two
    segmented pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    So, can it be said that at least some of the B6500-compatible models
    suffered from the same problem as the 80286 - the segment of maximal size
    didn't cover all of the linear (or physical) address space?
    Or was their index register width increased to accommodate 1e9 digits in
    a single segment?


    Unisys discontinued that line of systems in 1992.

    I thought it lasted longer. My impression was that hardware
    implementations (alongside emulation on Xeons) were still being sold
    up until 15 years ago.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to David Brown on Fri Oct 18 17:38:55 2024
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard C.
    I /do/ see an advantage in being able to do so well in non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require specific
    time constraints on these functions.  In such cases, you are not
    interested in writing fully portable software - it will already contain
    many implementation-specific features or use compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.

    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Fri Oct 18 16:19:08 2024
    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 18 Oct 2024 14:06:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully standard
    way to compare independent pointers (other than just for
    equality). Rarely needing something does not mean /never/
    needing it.

    OK, take a segmented memory model with 16-bit pointers and a
    24-bit virtual address space. How do you actually compare two
    segmented pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    So, can it be said that at least some of the B6500-compatible models

    No. The systems I described above are from the medium
    systems family (B2000/B3000/B4000). The B5000/B6000/B7000
    (large) family systems were a completely different stack based
    architecture with a 48-bit word size. The Small systems (B1000)
    supported task-specific dynamic microcode loading (different
    microcode for a cobol app vs. a fortran app).

    Medium systems evolved from the Electrodata Datatron and 220 (1954) through
    the Burroughs B300 to the Burroughs B3500 by 1965. The B5000
    was also developed at the old Electrodata plant in Pasadena
    (where I worked in the 80s) - eventually large systems moved
    out - the more capable large systems (B7XXX) were designed in Tredyffrin
    Pa, the less capable large systems (B5XXX) were designed in Mission Viejo, Ca.

    suffered from the same problem as the 80286 - the segment of maximal size
    didn't cover all of the linear (or physical) address space?
    Or was their index register width increased to accommodate 1e9 digits in
    a single segment?


    Unisys discontinued that line of systems in 1992.

    I thought it lasted longer. My impression was that hardware
    implementations (alongside emulation on Xeons) were still being sold
    up until 15 years ago.

    Large systems still exist today in emulation[*], as do the
    former Univac (Sperry 2200) systems. The last medium system
    (V380) was retired by the City of Santa Ana in 2010 (almost two
    decades after Unisys cancelled the product line) and was moved
    to the Living Computer Museum.

    City of Santa Ana replaced the single 1980 vintage V380 with
    29 Windows servers.

    After the merger of Burroughs and Sperry in '86 there were six
    different mainframe architectures - by 1990, all but
    two (2200 and large systems) had been terminated.

    [*] Clearpath Libra https://www.unisys.com/client-education/clearpath-forward-libra-servers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Vir Campestris on Fri Oct 18 21:45:37 2024
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard
    C. I /do/ see an advantage in being able to do so well in
    non-standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require
    specific time constraints on these functions.  In such cases, you are
    not interested in writing fully portable software - it will already
    contain many implementation-specific features or use compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C. You are relying on implementation details, or writing code that is only suitable for a
    particular implementation (or set of implementations). It is normal to
    write this kind of thing in C, but it is non-portable C. (Or at least,
    not fully portable C.)

    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.


    It would normally be written in C, and the compiler will generate the
    "rep" assembly. The bit you can't write in fully portable standard C is
    the comparison of the pointers so you know which direction to do the
    copying.
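
    To make that concrete: a common "widely portable but not strictly
    standard" way of choosing the copy direction is to compare the pointers
    after converting them to uintptr_t, relying on the flat-address-space
    behaviour most implementations give that conversion. The sketch below is
    an illustration of that idiom, not the code of any particular C library;
    whether a given compiler turns the loops into rep movs or a memmove call
    is an observation about current compilers, not a guarantee.

    #include <stddef.h>
    #include <stdint.h>

    /* Direction-aware copy: the uintptr_t comparison is
       implementation-defined in general, but behaves as expected on the
       usual flat-address-space targets being discussed. */
    void *impl_memmove(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;

        if ((uintptr_t)d < (uintptr_t)s) {
            for (size_t i = 0; i < n; i++)     /* copy forwards */
                d[i] = s[i];
        } else {
            for (size_t i = n; i-- > 0; )      /* copy backwards */
                d[i] = s[i];
        }
        return dst;
    }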

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Sat Oct 19 19:46:41 2024
    On Fri, 18 Oct 2024 16:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 18 Oct 2024 14:06:17 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 14 Oct 2024 19:39:41 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 14 Oct 2024 15:04:28 +0000, David Brown wrote:

    On 13/10/2024 17:45, Anton Ertl wrote:

    I do think it would be convenient if there were a fully
    standard way to compare independent pointers (other than
    just for equality). Rarely needing something does not mean
    /never/ needing it.

    OK, take a segmented memory model with 16-bit pointers and a
    24-bit virtual address space. How do you actually compare two
    segmented pointers ??

    Depends. On the Burroughs mainframe there could be eight
    active segments and the segment number was part of the pointer.

    Pointers were 32-bits (actually 8 BCD digits)

    S s OOOOOO

    Where 'S' was a sign digit (C or D), 's' was the
    segment number (0-7) and OOOOOO was the six digit
    offset within the segment (500kB/1000kD each).

    A particular task (process) could have up to
    one million "environments", each environment
    could have up to 100 "memory areas" (up to 1000kD)
    of which the first eight were loaded into the
    processor base/limit registers. Index registers
    were 8 digits and were loaded with a pointer as
    described above. Operands could optionally select
    one of the index registers and the operand address
    was treated as an offset to the index register;
    there were 7 index registers.

    Access to memory areas 8-99 use string instructions
    where the pointer was 16 BCD digits:

    EEEEEEMM SsOOOOOO

    Where EEEEEE was the environment number (0-999999);
    environments starting with D00000 were reserved for
    the MCP (Operating System). MM was the memory area
    number and the remaining eight digits described the
    data within the memory area. A subroutine call could
    call within a memory area or switch to a new environment.

    Memory area 1 was the code region for the segment,
    Memory area 0 held the stack and some global variables
    and was typically shared by all environments.
    Memory areas 2-7 were application dependent and could
    be configured to be shared between environments at
    link time.

    What was the size of the physical address space?
    I would suppose more than 1,000,000 words?

    It varied based on the generation. In the
    1960s, a half megabyte (10^6 digits)
    was the limit.

    In the 1970s, the architecture supported
    10^8 digits, the largest B4800 systems
    were shipped with 2 million digits (1MB).
    In 1979, the B4900 was introduced supporting
    up to 10MB (20 MD), later increased to
    20MB/40MD.

    In the 1980s, the largest systems (V500)
    supported up to 10^9 digits. It
    was that generation of machine where the
    environment scheme was introduced.

    Binaries compiled in 1966 ran on all
    generations without recompilation.

    There was room in the segmentation structures
    for up to 10^18 digit physical addresses
    (where the segments were aligned on 10^3
    digit boundaries).

    So, can it be said that at least some of the B6500-compatible models

    No. The systems I described above are from the medium
    systems family (B2000/B3000/B4000).

    I didn't realize that you were not talking about Large Systems.
    I didn't even know that Medium Systems used segmented memory.
    Sorry.

    The B5000/B6000/B7000
    (large) family systems were a completely different stack based
    architecture with a 48-bit word size. The Small systems (B1000)
    supported task-specific dynamic microcode loading (different
    microcode for a cobol app vs. a fortran app).

    Medium systems evolved from the Electrodata Datatron and 220 (1954)
    through the Burroughs B300 to the Burroughs B3500 by 1965. The B5000
    was also developed at the old Electrodata plant in Pasadena
    (where I worked in the 80s) - eventually large systems moved
    out - the more capable large systems (B7XXX) were designed in
    Tredyffrin Pa, the less capable large systems (B5XXX) were designed
    in Mission Viejo, Ca.

    suffered from the same problem as the 80286 - the segment of maximal
    size didn't cover all of the linear (or physical) address space?
    Or was their index register width increased to accommodate 1e9 digits
    in a single segment?


    Unisys discontinued that line of systems in 1992.

    I thought it lasted longer. My impression was that hardware
    implementations (alongside emulation on Xeons) were still being sold
    up until 15 years ago.

    Large systems still exist today in emulation[*], as do the
    former Univac (Sperry 2200) systems. The last medium system
    (V380) was retired by the City of Santa Ana in 2010 (almost two
    decades after Unisys cancelled the product line) and was moved
    to the Living Computer Museum.

    City of Santa Ana replaced the single 1980 vintage V380 with
    29 Windows servers.

    After the merger of Burroughs and Sperry in '86 there were six
    different mainframe architectures - by 1990, all but
    two (2200 and large systems) had been terminated.

    [*] Clearpath Libra https://www.unisys.com/client-education/clearpath-forward-libra-servers/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to David Brown on Sun Oct 20 21:51:30 2024
    On 18/10/2024 20:45, David Brown wrote:
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard
    C. I /do/ see an advantage in being able to do so well in non-
    standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised software.
    For example, you might be making real-time software and require
    specific time constraints on these functions.  In such cases, you are
    not interested in writing fully portable software - it will already
    contain many implementation-specific features or use compiler
    extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C. You are relying on
    implementation details, or writing code that is only suitable for a
    particular implementation (or set of implementations). It is normal to
    write this kind of thing in C, but it is non-portable C. (Or at least,
    not fully portable C.)

    Ah, I see your point. Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.


    It would normally be written in C, and the compiler will generate the
    "rep" assembly.  The bit you can't write in fully portable standard C is
    the comparison of the pointers so you know which direction to do the
    copying.

    It's a long time since I had to mistrust a compiler so much that I was
    pulling the assembler apart. It sounds as though they have got smarter
    in the meantime.

    I just checked BTW, and you are correct.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Vir Campestris on Mon Oct 21 08:58:05 2024
    On 20/10/2024 22:51, Vir Campestris wrote:
    On 18/10/2024 20:45, David Brown wrote:
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in standard
    C. I /do/ see an advantage in being able to do so well in non-
    standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised
    software. For example, you might be making real-time software and
    require specific time constraints on these functions.  In such
    cases, you are not interested in writing fully portable software -
    it will already contain many implementation-specific features or use
    compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C.  You are relying
    on implementation details, or writing code that is only suitable for a
    particular implementation (or set of implementations).  It is normal
    to write this kind of thing in C, but it is non-portable C.  (Or at
    least, not fully portable C.)

    Ah, I see your point. Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    Yes.

    I think /every/ implementation will require communication with the OS,
    if there is an OS - otherwise it will need support from other parts of
    the toolchain (such as symbols created in a linker script to define the
    heap area - that's the typical implementation in small embedded systems).

    The nearest you could get to a portable implementation would be using a
    local unsigned char array as the heap, but I don't believe that would be
    fully correct according to the effective type rules (or the "strict
    aliasing" or type-based aliasing rules, if you prefer those terms). It
    would also not be good enough for the needs of many programs.

    Of course, a fair amount of the code for malloc/free can written in
    fully portable C - and almost all of it can be written in a somewhat
    vaguely defined "widely portable C" where you can mask pointer bits to
    handle alignment, and other such conveniences.
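
    For what it's worth, a minimal sketch of that "array as heap" idea is
    shown below: a bump allocator carved out of a static unsigned char array
    (the 16 KB size and the function name are arbitrary choices for the
    sketch). As noted above, the effective-type question makes this dubious
    as a fully conforming general-purpose allocator, and this toy version has
    no free() at all; it only illustrates why no OS call is needed.

    #include <stddef.h>

    #define HEAP_SIZE (16u * 1024u)        /* arbitrary size for the sketch */

    static unsigned char heap[HEAP_SIZE];
    static size_t heap_used;

    /* Hand out maximally-aligned chunks from the static array; returns NULL
       when the array is exhausted. No free(), no coalescing - just enough
       to show the idea. */
    void *bump_alloc(size_t size)
    {
        size_t align = _Alignof(max_align_t);               /* C11 */
        size_t start = (heap_used + align - 1) & ~(align - 1);

        if (start > HEAP_SIZE || size > HEAP_SIZE - start)
            return NULL;                    /* out of heap space */
        heap_used = start + size;
        return &heap[start];
    }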


    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.


    It would normally be written in C, and the compiler will generate the
    "rep" assembly.  The bit you can't write in fully portable standard C
    is the comparison of the pointers so you know which direction to do
    the copying.

    It's a long time since I had to mistrust a compiler so much that I was pulling the assembler apart. It sounds as though they have got smarter
    in the meantime.

    I just checked BTW, and you are correct.


    Looking at the generated assembly is usually not a matter of mistrusting
    the compiler. One of the reasons I do so is to check that the compiler
    can generate efficient object code from my source code, in cases where I
    need maximal efficiency. I'd rather not write assembly unless I really
    have to!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Mon Oct 21 09:21:42 2024
    David Brown wrote:
    On 20/10/2024 22:51, Vir Campestris wrote:
    On 18/10/2024 20:45, David Brown wrote:
    On 18/10/2024 18:38, Vir Campestris wrote:
    On 16/10/2024 08:21, David Brown wrote:

    I don't see an advantage in being able to implement them in
    standard C. I /do/ see an advantage in being able to do so well in
    non- standard, implementation-specific C.

    The reason why you might want your own special memmove, or your own
    special malloc, is that you are doing niche and specialised
    software. For example, you might be making real-time software and
    require specific time constraints on these functions.  In such
    cases, you are not interested in writing fully portable software -
    it will already contain many implementation-specific features or
    use compiler extensions.

    I have a vague feeling that once upon a time I wrote a malloc for an
    embedded system. Having only one process it had access to the entire
    memory range, and didn't need to talk to the OS. Entirely C is quite
    feasible there.


    Sure - but you are not writing portable standard C. You are relying
    on implementation details, or writing code that is only suitable for
    a particular implementation (or set of implementations). It is
    normal to write this kind of thing in C, but it is non-portable C.
    (Or at least, not fully portable C.)

    Ah, I see your point. Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    Yes.

    I think /every/ implementation will require communication with the OS,
    if there is an OS - otherwise it will need support from other parts of
    the toolchain (such as symbols created in a linker script to define the
    heap area - that's the typical implementation in small embedded systems).

    The nearest you could get to a portable implementation would be using a
    local unsigned char array as the heap, but I don't believe that would be
    fully correct according to the effective type rules (or the "strict
    aliasing" or type-based aliasing rules, if you prefer those terms). It
    would also not be good enough for the needs of many programs.

    Of course, a fair amount of the code for malloc/free can written in
    fully portable C - and almost all of it can be written in a somewhat
    vaguely defined "widely portable C" where you can mask pointer bits to
    handle alignment, and other such conveniences.


    But memmove? On an 80286 it will be using rep movsw, rather than a
    software loop, to copy the memory contents to the new location.

    _That_ does require assembler, or compiler extensions, not standard C.

    It would normally be written in C, and the compiler will generate the
    "rep" assembly.  The bit you can't write in fully portable standard
    C is the comparison of the pointers so you know which direction to do
    the copying.

    It's a long time since I had to mistrust a compiler so much that I was
    pulling the assembler apart. It sounds as though they have got smarter
    in the meantime.

    I just checked BTW, and you are correct.


    Looking at the generated assembly is usually not a matter of mistrusting
    the compiler. One of the reasons I do so is to check that the compiler
    can generate efficient object code from my source code, in cases where I
    need maximal efficiency. I'd rather not write assembly unless I really
    have to!

    For near-light-speed code I used to write it first in C, optimize that,
    then I would translate it into (inline) asm and re-optimize based on
    having the full cpu architecture available, before in the final stage I
    would use the asm experience to tweak the C just enough to let the
    compiler generate machine code quite close (90+%) to my best asm, while
    still being portable to any cpu with more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C version
    was still fast enough that a couple of years later I got a prize in the
    mail: Someone in France had submitted my C code, with my name & address,
    to a similar competition there and it was still faster than anyone else. :-)

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Oct 21 14:04:42 2024
    >>> I don't see an advantage in being able to implement them in standard C.
    >> It means you can likely also implement a related yet different API
    >> without having your code "demoted" to non-standard.
    > That makes no sense to me. We are talking about implementing standard
    > library functions. If you want to implement other functions, go ahead.

    No, I'm talking about a very general principle that applies to
    languages, libraries, etc...

    For example, in Emacs I always try [and don't always succeed] to make
    sure that the default behavior for a given functionality can be
    implemented using the official API entry points of the underlying
    library, because it makes it more likely that whoever wants to replace
    that behavior with something else will be able to do it without having
    to break abstraction barriers.


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Vir Campestris on Mon Oct 21 23:17:10 2024
    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Mon Oct 21 23:52:59 2024
    On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    POSIX is an environment not an OS.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Tue Oct 22 01:09:49 2024
    On Mon, 21 Oct 2024 23:52:59 +0000, MitchAlsup1 wrote:

    On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require communication with the OS
    there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    POSIX is an environment not an OS.

    Guess what the “OS” part of “POSIX” stands for.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Mon Oct 21 18:32:27 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such a book I guarantee I will want to buy one.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Tue Oct 22 08:27:12 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such I book I guarantee I will want to buy one.

    Thank you Tim!

    Probably not a book but I would consider writing a series of blog posts
    similar to that, now that I am about to retire: My wife and I will both
    go on "permanent vacation" starting a week before Christmas. :-)

    I already know that this will give me more time to work on digital
    mapping projects (ref my https://mapant.no/ Norwegian topo map generated
    from ~50 TB of LiDAR), but if there's an interest in optimization I
    might do that as well.

    BTW, I am also open to doing some consulting work, if the problems are interesting enough. :-)

    Regards,
    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to ldo@nz.invalid on Tue Oct 22 17:26:06 2024
    On Tue, 22 Oct 2024 01:09:49 -0000 (UTC), Lawrence D'Oliveiro
    <ldo@nz.invalid> wrote:

    On Mon, 21 Oct 2024 23:52:59 +0000, MitchAlsup1 wrote:

    On Mon, 21 Oct 2024 23:17:10 +0000, Lawrence D'Oliveiro wrote:

    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require communication with the OS
    there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    POSIX is an environment not an OS.

    Guess what the “OS” part of “POSIX” stands for.

    It's still just an environment - POSIX defines only an interface, not
    an implementation.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Terje Mathisen on Wed Oct 23 07:25:42 2024
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such a book I guarantee I will want to buy one.

    Thank you Tim!

    I know from past experience you are good at this. I would love
    to hear what you have to say.

    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    P.S. Is the email address in your message a good way to reach you?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Tim Rentsch on Wed Oct 23 18:11:57 2024
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Oct 23 18:27:06 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    And start working for "HER". (Honeydew list).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Scott Lurndal on Wed Oct 23 21:12:57 2024
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    And start working for "HER". (Honeydew list).

    My wife does have a small list of things that we (i.e. I) could do when we retire...

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Wed Oct 23 21:11:59 2024
    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    I'm still connected to Mill Computing as well.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Wed Oct 23 21:01:01 2024
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    I'm still connected to Mill Computing as well.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Tim Rentsch on Wed Oct 23 21:09:47 2024
    Tim Rentsch wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    [C vs assembly]

    For near-light-speed code I used to write it first in C, optimize
    that, then I would translate it into (inline) asm and re-optimize
    based on having the full cpu architecture available, before in the
    final stage I would use the asm experience to tweak the C just
    enough to let the compiler generate machine code quite close
    (90+%) to my best asm, while still being portable to any cpu with
    more or less the same capabilities.

    One example: When I won an international competition to write the
    fastest Pentomino solver (capable of finding all 2339/1010/368/2
    solutions of the 6x10/5x12/4x15/3x20 layouts), I also included the
    portable C version.

    My asm submission was twice as fast as anyone else, while the C
    version was still fast enough that a couple of years later I got a
    prize in the mail: Someone in France had submitted my C code,
    with my name & address, to a similar competition there and it was
    still faster than anyone else. :-)

    I hope you will consider writing a book, "Writing Fast Code" (or
    something along those lines). The core of the book could be, oh,
    let's say between 8 and 12 case studies, starting with a problem
    statement and tracing through the process that you followed, or
    would follow, with stops along the way showing the code at each
    of the different stages.

    If you do write such a book I guarantee I will want to buy one.

    Thank you Tim!

    I know from past experience you are good at this. I would love
    to hear what you have to say.

    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    I'm sure you're right!

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    P.S. Is the email address in your message a good way to reach you?

    Yes, that is my personal domain, so it won't be affected by my retirement.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Thu Oct 24 07:39:52 2024
    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work". In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    I don't know that usage, I thought quires was a typesetting/printing
    measure?

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Tim Rentsch on Thu Oct 24 06:55:20 2024
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    One thing I have thought of is a wiki of optimization techniques that
    contains descriptions of the techniques and case studies, but I have
    not yet implemented this idea.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Thu Oct 24 10:00:16 2024
    On 24/10/2024 08:55, Anton Ertl wrote:
    Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Probably not a book but I would consider writing a series of blog
    posts similar to that, now that I am about to retire:

    You could try writing one blog post a month on the subject. By
    this time next year you will have plenty of material and be well
    on your way to putting a book together. (First drafts are always
    the hardest part...)

    One thing I have thought of is a wiki of optimization techniques that contains descriptions of the techniques and case studies, but I have
    not yet implemented this idea.


    Would it make sense to start something under Wikibooks on Wikipedia? I
    have no experience with it myself, but it looks to me like a way to have
    a collaborative collection of related knowledge. It could provide the structure and framework, saving you (plural) from having to set up a
    wiki, blog, or whatever.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to David Brown on Thu Oct 24 16:34:45 2024
    David Brown <david.brown@hesbynett.no> writes:
    On 24/10/2024 08:55, Anton Ertl wrote:
    One thing I have thought of is a wiki of optimization techniques that
    contains descriptions of the techniques and case studies, but I have
    not yet implemented this idea.


    Would it make sense to start something under Wikibooks on Wikipedia?

    Yes, I was thinking about that. In the bookshelf on computer
    programming <https://en.wikibooks.org/wiki/Shelf:Computer_programming>
    there are two "Books nearing completion" that have "Opti" in the
    title:

    https://en.wikibooks.org/wiki/Optimizing_Code_for_Speed
    https://en.wikibooks.org/wiki/Optimizing_C%2B%2B

    Looking at the contents of the former, it's rather short and
    high-level, and I don't think it's intended for the kind of project we
    have in mind.

    The latter is more in the direction I have in mind, but the limitation
    to C++ is, well, limiting.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Thu Oct 24 18:32:22 2024
    On Thu, 24 Oct 2024 5:39:52 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    I don't know that usage, I thought quires was a typesetting/printing
    measure?

    Terje


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Lawrence D'Oliveiro on Sun Oct 27 20:42:09 2024
    On 22/10/2024 00:17, Lawrence D'Oliveiro wrote:
    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    One of the other groups I'm following just for the hell of it is
    comp.os.cpm. I'm pretty sure you don't get POSIX in your 64kb (max).

    "cannot be a _truly_ portable" is what I meant. Portable to most machine
    is easy - just write for Windows. POSIX will give you a larger subset -
    but still a subset.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Terje Mathisen on Sun Oct 27 20:45:09 2024
    On 23/10/2024 20:12, Terje Mathisen wrote:

    My wife do have a small list of things that we (i.e. I) could do when we retire...

    Since I retired the garden is looking much better, I've started to win
    the odd trophy sailing, most of the house has been redecorated...

    But best of all - I've lost 5 kg and been able to stop worrying about my
    weight!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to Vir Campestris on Sun Oct 27 21:04:49 2024
    On Sun, 27 Oct 2024 20:42:09 +0000, Vir Campestris wrote:

    I'm pretty sure you don't get POSIX in your 64kb (max).

    <https://news.ycombinator.com/item?id=34981059>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Schultz@21:1/5 to Vir Campestris on Sun Oct 27 17:55:52 2024
    On 10/27/24 3:42 PM, Vir Campestris wrote:
    On 22/10/2024 00:17, Lawrence D'Oliveiro wrote:
    On Sun, 20 Oct 2024 21:51:30 +0100, Vir Campestris wrote:

    Because some implementations will require
    communication with the OS there cannot be a truly portable malloc.

    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    One of the other groups I'm following just for the hell of it is
    comp.os.cpm/ I'm pretty sure you don't get POSIX in your 64kb (max).

    Ignores the 16 bit versions of CP/M: 8086, 68000, Z8000.

    --
    http://davesrocketworks.com
    David Schultz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Oct 27 23:58:05 2024
    According to Vir Campestris <vir.campestris@invalid.invalid>:
    There can if you have a portable OS API. The only serious candidate for
    that is POSIX.

    One of the other groups I'm following just for the hell of it is
    comp.os.cpm/ I'm pretty sure you don't get POSIX in your 64kb (max).

    Mini-Unix got nearly all of v6 Unix in 56K bytes.

    See https://gunkies.org/wiki/MINI-UNIX

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Mon Oct 28 11:39:57 2024
    MitchAlsup1 wrote:
    On Thu, 24 Oct 2024 5:39:52 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 19:11:59 +0000, Terje Mathisen wrote:

    MitchAlsup1 wrote:
    On Wed, 23 Oct 2024 14:25:42 +0000, Tim Rentsch wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:

    My wife and I will both go on "permanent vacation" starting a week
    before Christmas. :-)

    I'm guessing that permanent vacation will be some mixture of actual
    vacation and self-chosen "work".  In any case I hope you both enjoy
    the time.

    Just remember, retirement does not mean you "stop working"
    it means you "stop working for HIM".

    Exactly!

    I have unlimited amounts of potential/available mapping work, and I do
    want to get back to NTP Hackers.

    We recently started (officially) on the 754-2029 revision.

    Are you going to put in something equivalent to quires ??

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    OK, I have seen and used "Super-accumulator" as the term for those. I
    have thought about implementing one in carry-save redundant form, but
    that might be more redundancy than is really needed?

    Having a carry bit for every byte should still make it possible to
    handle several additions/cycle, right?

    I'm assuming the real cost is in the alignment network needed to route incoming addends into the right slice.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Mon Oct 28 16:30:46 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming courses on a Pascal compiler
    which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ;
    a hardware implementation was available on the 4361 as an option:
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Thomas Koenig on Mon Oct 28 10:12:08 2024
    On 10/28/2024 9:30 AM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    Another newer alternative. This came up on my news feed. I haven't
    looked at the details at all, so I can't comment on it.

    https://arxiv.org/abs/2410.03692

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Mon Oct 28 18:14:20 2024
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 10/28/2024 9:30 AM, Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    Another newer alternative. This came up on my news feed. I haven't
    looked at the details at all, so I can't comment on it.

    https://arxiv.org/abs/2410.03692

    That is about another number representation for AI, trying to squeeze
    more AI performance out of a few bits.

    Personally, I like the approach of doing analog calculation for
    the low-accuracy dot products that they do, followed by an A/D
    converter. There is a company doing that, but I forget its name.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Mon Oct 28 15:24:18 2024
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    These would be very large registers. You'd need some way to store and
    load these for register spills, fills and task switches, as well as to
    move and manage them.

    Karlsruhe above has a link to http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Of course, once you have 168-byte registers people are going to
    think of new uses for them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to EricP on Tue Oct 29 06:33:50 2024
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    These would be very large registers. You'd need some way to store and load the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I cannot find the
    number of cycles that the instructions took.

    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.


    Of course, once you have 168-byte registers people are going to
    think of new uses for them.

    SIMD from hell? Pretend that a CPU is a graphics card? :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Tue Oct 29 08:07:50 2024
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility

    These would be very large registers. You'd need some way to store and load >> the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    If I was implementing this I would probably want some redundant storage
    to limit carry propagation, so maybe 48 bits per 64-bit chunk, in which
    case I would need about 2800 bits or 6 of those 512-bit SIMD regs.

    SIMD from hell? Pretend that a CPU is a graphics card? :-)

    Writing this as a throughput task could make it fit better within a GPU?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Tue Oct 29 14:19:13 2024
    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    IIUC you can already implement such a thing with standard IEEE
    operations, based on the "standard" Knuth approach of computing the
    exact result of `a + b` as the sum of x + y where x is the "normal" sum
    of a + b (and hence y holds the remaining bits lost to rounding).

    I wonder how often this is used in practice.

    Intuitively it should be possible to make it reasonably efficient, where
    you first compute the "naive" sum but also keep the N remaining numbers representing the bits lost to each of the N roundings. I.e. you take in
    a vector "as" of N numbers and return a pair of the "naive" sum plus
    a vector of N rounding errors.

    Σ as => (round(Σ As), rs)
    such that round(Σ As) = the naive IEEE sum of as
    and Σ as = round(Σ As) + Σ rs

    You can then recursively compute "Σ rs" in the same way. At each step of
    the recursion you can compute round(Σ |rs|) to estimate an upper bound
    on the remaining error and thus stop when that error is smaller than
    1 ULP or somesuch.

    AFAICT, if your sum is well-conditioned you should need at most 2 steps
    of the recursion, and I suspect you can predict when the next estimated
    error will be too small before you start the last recursion, so the last recursion might skip the generation of the last "rs" vector.
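
    For concreteness, a minimal C sketch of the scheme described above;
    the names (two_sum, sum_with_errors, accurate_sum) and the fixed pass
    limit are illustrative rather than taken from any particular library,
    and the simple "the correction no longer changes the total" test
    stands in for the round(Σ |rs|) error estimate:

        #include <stddef.h>

        /* Knuth's TwoSum: s = fl(a+b) and err is the exact rounding error,
           so a + b == s + err holds exactly for finite doubles
           (assumes round-to-nearest).                                      */
        static void two_sum(double a, double b, double *s, double *err)
        {
            double sum = a + b;
            double bb  = sum - a;
            *err = (a - (sum - bb)) + (b - bb);
            *s = sum;
        }

        /* One pass: return the naive left-to-right sum of as[0..n-1] and
           overwrite as[] with the rounding error of each addition.        */
        static double sum_with_errors(double *as, size_t n)
        {
            double s = 0.0;
            for (size_t i = 0; i < n; i++) {
                double e;
                two_sum(s, as[i], &s, &e);
                as[i] = e;            /* keep the error term for the next pass */
            }
            return s;
        }

        /* Recurse on the error vector until the correction stops mattering. */
        static double accurate_sum(double *as, size_t n)
        {
            double total = 0.0;
            for (int pass = 0; pass < 10; pass++) {   /* well-conditioned: ~2 passes */
                double prev = total;
                total += sum_with_errors(as, n);
                if (total == prev)
                    break;            /* correction fell below 1/2 ULP of the total */
            }
            return total;
        }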


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Thomas Koenig on Tue Oct 29 14:29:28 2024
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.
    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility
    These would be very large registers. You'd need some way to store and load >> the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    Right, something like 2048+52+3 = 2103 bits for data, plus some status bits. For x64 they could overlay it onto AVX-512 register file in groups of 5
    and use existing SIMD instructions for management.
    That would allow them to pack 3 accumulators into registers z0..z14.

    For RISC-V they have the large vector registers, 32 * 256-bits each I think,
    so again 3 accumulators.

    So it's a plausible proposition.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Tue Oct 29 19:57:25 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility >>>
    These would be very large registers. You'd need some way to store and load >>> the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit:

    "A floating-point accumulator occupies a 168-byte storage area that is
    aligned on a 256-byte boundary. An accumulator consists of a four-byte
    status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.
    (Insert fear and loathing for hex float here).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Oct 29 20:30:12 2024
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility >>>>
    These would be very large registers. You'd need some way to store and load >>>> the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit: >>>>
    "A floating-point accumulator occupies a 168-byte storage area that is >>>> aligned on a 256-byte boundary. An accumulator consists of a four-byte >>>> status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    IBM format had one sign bit, seven exponent bits and six or fourteen >hexadecimal digits for single and double precision, respectively.
    (Insert fear and loathing for hex float here).

    Burroughs Medium systems had four exponent sign bits, eight exponent bits,
    four mantissa sign bits, and up to 400 mantissa bits. BCD, so that's an exponent range of -99 to +99 and a 1 to 100 digit mantissa.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Oct 29 20:21:11 2024
    On Tue, 29 Oct 2024 19:57:25 +0000, Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    In posits, a quire is an accumulator with as many binary digits
    as to cover max-exponent to min-exponent; so one can accumulate
    an essentially unbounded number of sums without loss of precision
    --to obtain a sum with a single rounding.

    Not restricted to posits, I believe (but the term may differ).

    At university, I had my programming
    courses on a Pascal compiler which implemented
    https://en.wikipedia.org/wiki/Karlsruhe_Accurate_Arithmetic ,
    a hardware implementation was on the 4361 as an option
    https://en.wikipedia.org/wiki/IBM_4300#High-Accuracy_Arithmetic_Facility >>>>
    These would be very large registers. You'd need some way to store and
    load
    the these for register spills, fills and task switch, as well as move
    and manage them.

    Karlsruhe above has a link to
    http://www.bitsavers.org/pdf/ibm/370/princOps/SA22-7093-0_High_Accuracy_Arithmetic_Jan84.pdf

    which describes their large accumulators as residing in memory, which
    avoids the spill/fill/switch issue but with an obvious performance hit: >>>>
    "A floating-point accumulator occupies a 168-byte storage area that is >>>> aligned on a 256-byte boundary. An accumulator consists of a four-byte >>>> status area on the left, followed by a 164-byte numeric area."

    The operands are specified by virtual address of their in-memory
    accumulator.

    Makes sense, given the time this was implemented. This was also a
    mid-range machine, not a number cruncher. I do not find the
    number of cycles that the instructions took.

    At the time, memory was just a few clock cycles away from the CPU, so
    not really that problematic. Today, such a super-accumulator would stay
    in $L1 most of the time, or at least the central, in-use cache line of
    it, would do so.


    But this was also for hex floating point. A similar scheme for IEEE
    double would need a bit more than 2048 bits, so five AVX-512 registers.

    With 1312 bits of storage, their fp inputs (hex fp?) must have had a
    smaller exponent range than ieee double.

    Terje--IEEE is all capitals.

    IBM format had one sign bit, seven exponent bits and six or fourteen hexadecimal digits for single and double precision, respectively.

    The span of an IEEE double "quire" would be twice the exponent range plus the fraction:
    a) The most significant non-infinity has an exponent of +1023
    b) The least significant non-underflow has an exponent of -1023
    Leaving a span of 2046 bits plus 52 denormalized bits or 2098-bits
    or 262 bytes.

    One note: When left in memory, one indexes the accumulator with
    the (exponent>>6) and fetches 2 doublewords. A carry out requires
    accessing the 3rd doubleword (possibly transitively).
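
    As a rough software illustration of that layout, here is a C sketch
    for non-negative finite doubles: the accumulator is an array of 64-bit
    words, an incoming significand is placed at the word selected by
    (roughly) exponent>>6, and any carry is propagated into the following
    words. The limb count, the bias constant, the names, and the omission
    of signs and of the final rounding back to a double are my
    simplifications, not part of any standard.

        #include <stdint.h>
        #include <string.h>

        #define ACC_LIMBS 36    /* 36*64 = 2304 bits: the ~2098-bit span plus headroom */
        #define ACC_BIAS  1074  /* accumulator bit 0 represents 2^-1074                */

        typedef struct { uint64_t limb[ACC_LIMBS]; } superacc;  /* zero-initialised */

        /* Add a non-negative finite double exactly into the accumulator. */
        static void superacc_add(superacc *acc, double x)
        {
            uint64_t bits;
            memcpy(&bits, &x, sizeof bits);
            uint64_t frac   = bits & 0x000FFFFFFFFFFFFFull;
            int      biased = (int)((bits >> 52) & 0x7FF);
            uint64_t sig    = biased ? (frac | (1ull << 52)) : frac; /* implicit bit */
            int      e      = (biased ? biased : 1) - 1023;          /* unbiased exp */

            int pos   = (e - 52) + ACC_BIAS;  /* bit position of significand bit 0  */
            int limb  = pos >> 6;             /* i.e. indexed by the exponent >> 6  */
            int shift = pos & 63;

            uint64_t lo = sig << shift;                     /* spans at most 2 limbs */
            uint64_t hi = shift ? (sig >> (64 - shift)) : 0;

            acc->limb[limb] += lo;
            uint64_t carry = (acc->limb[limb] < lo);        /* carry out of low limb */
            for (int i = limb + 1; i < ACC_LIMBS && (hi | carry); i++) {
                uint64_t sum = acc->limb[i] + hi;           /* add the high part ... */
                uint64_t c   = (sum < hi);
                sum += carry;                               /* ... then the carry    */
                c   += (sum < carry);
                acc->limb[i] = sum;
                hi = 0;                                     /* then carries only     */
                carry = c;
            }
        }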

    (Insert fear and loathing for hex float here).

    Heck, watching Kahan's notes on FP problems leaves one in fear of
    binary floating point representations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Oct 29 21:27:29 2024
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    (Insert fear and loathing for hex float here).

    Heck, watching Kahan's notes on FP problems leaves one in fear of
    binary floating point representations.

    True, but... hex float is so much worse.

    "Hacker's delight" has some choice words there, and the
    author worked for IBM :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Fri Jan 3 03:37:50 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 19:30, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 04/10/2024 00:17, Lawrence D'Oliveiro wrote:
    Compare this with the pain the x86 world went through, over a much longer >>>>> time, to move to 32-bit.

    The x86 started from 8-bit roots, and increased width over time, which >>>> is a very different path.

    Still, the question is why they did the 286 (released 1982) with its
    protected mode instead of adding IA-32 to the architecture, maybe at
    the start with a 386SX-like package and with real-mode only, or with
    the MMU in a separate chip (like the 68020/68551).


    I can only guess the obvious - it is what some big customer(s) were
    asking for. Maybe Intel didn't see the need for 32-bit computing in the >>markets they were targeting, or at least didn't see it as worth the cost.

    Anyone could see the problems that the PDP-11 had with its 16-bit
    limitation. Intel saw it in the iAPX 432 starting in 1975. It is
    obvious that, as soon as memory grows beyond 64KB (and already the
    8086 catered for that), the protected mode of the 80286 would be more
    of a hindrance than even the real mode of the 8086. I find it hard to believe that many customers would ask Intel for something the 80286
    protected mode with segments limited to 64KB, and even if, that Intel
    would listen to them. This looks much more like an idee fixe to me
    that one or more of the 286 project leaders had, and all customer
    input was made to fit into this idea, or was ignored.

    From my point of view the main drawbacks of the 286 are poor support
    for large arrays and problems for Lisp-like systems which have a lot
    of small data structures and traverse them via pointers.

    However, playing devil's advocate, I can see sense in the 286. IMO
    Intel targeted quite a different market. IIUC the main intended market
    for the 8086 was industrial control and various embedded applications.
    The 286 was probably intended for similar markets, but with a stronger
    emphasis on security. In control applications it is typical to
    have several cooperating processes. The 286 allows separate local
    descriptor tables for each task, so a multitasking program may easily
    have, say, 30000 descriptors. Getting a similar number of separately
    protected objects using paging would require a similar number of
    pages, which with a 16 MB total address space leads to 512-byte
    pages. For smaller paged systems the situation is even worse: with
    512 kB of memory, 512-byte pages give 1024 pages in total, which
    means that access control cannot be very granular and one would get
    significant memory fragmentation for objects smaller than a page.
    I can guess that Intel rejected very small pages as problematic to
    implement. So if the goal is fine-grained access control, then
    segmentation for a machine of the 286's size looks better than paging.

    Concerning code "model", I think that Intel assumend that
    most procedures would fit in a single segment and that
    average procedure will be of order of single kilobytes.
    Using 16-bit offsets for jumps inside procedure and
    segment-offset pair for calls is likely to lead to better
    or similar performance as purely 32-bit machine. For
    control applications it is likely that each procedure
    will access moderate number of segments and total amount
    of accessed data will be moderate. In other words, Intel
    probably considerd "mostly medium" model where procedure
    mainly accesses it data segment using just 16-bit offsets
    and occasionally accesses other segments.

    Compared to PDP-11 this leads to resonably natural
    code that use some hundreds of kilobytes of memory,
    much better than 128 kB limit of PDP-11 with separate
    code and data areas. And segment maniputlation allows
    also bigger programs.

    What went wrong? IIUC there were several control systems
    using 286 features, so there was some success. But PCs
    became the main user of x86 chips, and a significant fraction
    of PCs was used for gaming. Game authors wanted direct
    access to hardware, which in the case of the 286 forced real mode.
    Also, for a long time the 8088 played a major role and PC software
    "had" to run on the 8088. Software vendors theoretically could
    release separate versions for each processor or do some
    runtime switching of critical procedures, but the easiest way
    was to depend on compatibility with the 8088. "Better" OSes
    went the Unix way, depending on paging and not using segmentation.
    But IIUC the first paging Unix appeared _after_ the release of the 286.
    In 286 times Multics was highly regarded and it depended heavily
    on segmentation. MVS was using paging hardware, but was
    talking about segments, except that MVS segmentation
    was flawed because some addresses far outside a segment were
    considered part of a different segment. I think that
    in VMS there was also some talk about segments. So the creators
    of the 286 could believe that they were providing the "right thing"
    and not a fake made possible by paging hardware.

    Concerning the cost, the 80286 has 134,000 transistors, compared to supposedly 68,000 for the 68000, and the 190,000 of the 68020. I am
    sure that Intel could have managed a 32-bit 8086 (maybe even with the
    nice addressing modes that the 386 has in 32-bit mode) with those
    134,000 transistors if Motorola could build the 68000 with half of
    that.

    I think that Intel could have managed to build a "mostly" 32-bit
    processor in the transistor budget of the 8086, that is, with say 7
    registers of 32 bits each, where each register could be treated as a
    pair of 16-bit registers and 32-bit operations would take twice as
    much time as 16-bit operations. But I think that such a processor
    would be slower (say 10% slower) than the 8086, mostly because of the
    need to use longer addresses more often. Similarly, a hypothetical
    32-bit 286 would be slower than the real 286. And I do not think they
    could have made a 32-bit processor with segmentation in the available
    transistor budget, and even if they had managed it, it would have been
    slowed down by too-long addresses (segment + 32-bit offset).

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Fri Jan 3 08:38:49 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    Concerning code "model", I think that Intel assumend that
    most procedures would fit in a single segment and that
    average procedure will be of order of single kilobytes.
    Using 16-bit offsets for jumps inside procedure and
    segment-offset pair for calls is likely to lead to better
    or similar performance as purely 32-bit machine.

    With the 80286's segments and their slowness, that is very doubtful.
    The 8086 has branches with 8-bit offsets and branches and calls with
    16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
    offsets and branches and calls with 32-bit offsets; if 16-bit offsets
    for branches would be useful enough for performance, they could
    instead have designed the longer branch length to be 16 bits, and
    maybe a prefix for 32-bit branch offsets. That would be faster than
    what you outline, as soon as one call happens. But apparently 16-bit
    branches are not that beneficial, or they would have gone that way on
    the 386.

    Another usage of segments for code would be to put the code segment of
    a shared object (known as DLL among Windowsheads) in a segment, and
    use far calls to call functions in other shared objects, while using
    near calls within a shared object. This allows to share the code
    segments between different programs, and to locate them anywhere in
    physical memory. However, AFAIK shared objects were not a thing in
    the 80286 timeframe; Unix only got them in the late 1980s.

    I used Xenix on a 286 in 1986 or 1987; my impression is that programs
    were limited to 64KB code and 64KB data size, exactly the PDP-11 model
    you denounce.

    What went wrong? IIUC there were several control systems
    using 286 features, so there was some success. But PC-s
    became main user of x86 chips and significant fraction
    of PC-s was used for gaming. Game authors wanted direct
    access to hardware which in case of 286 forced real mode.

    All successful software used direct access to hardware because of
    performance; the rest waned. Using BIOS calls was just too slow.
    Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster thanks
    to writing directly to video memory.
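
    "Writing directly to video" here means poking character/attribute
    pairs straight into the text-mode frame buffer at segment B800h
    instead of going through INT 10h; a sketch in 16-bit Borland-style C,
    where far and MK_FP() are compiler extensions and the function name
    is illustrative:

        #include <dos.h>   /* Borland/Turbo C: MK_FP() builds a far pointer */

        /* Store one character and its colour attribute straight into the
           80x25 colour text-mode buffer at B800:0000, bypassing the BIOS. */
        static void put_cell(int row, int col, char ch, unsigned char attr)
        {
            unsigned int far *vram = (unsigned int far *) MK_FP(0xB800, 0);
            vram[row * 80 + col] = ((unsigned int) attr << 8) | (unsigned char) ch;
        }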

    But IIUC first paging Unix appeared _after_ release of 286.

    From <https://en.wikipedia.org/wiki/History_of_the_Berkeley_Software_Distribution#3BSD>:

    |The kernel of 32V was largely rewritten by Berkeley graduate student
    |Özalp Babaoğlu to include a virtual memory implementation, and a
    |complete operating system including the new kernel, ports of the 2BSD
    |utilities to the VAX, and the utilities from 32V was released as 3BSD
    |at the end of 1979.

    The 80286 was introduced on February 1, 1982.

    In 286 time Multics was highly regarded and it heavily depended
    on segmentaion. MVS was using paging hardware, but was
    talking about segments, except for that MVS segmentation
    was flawed because some addresses far outside a segment were
    considered as part of different segment. I think that also
    in VMS there was some taliking about segments. So creators
    of 286 could believe that they are providing "right thing"
    and not a fake possible with paging hardware.

    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competitive in protected mode compared to real mode.

    So if they really had wanted protected mode to succeed, they should
    have designed in 32-bit data segments (and maybe also 32-bit code
    segments). Alternatively, if protected mode and the 32-bit addresses
    do not fit in the 286 transistor budget, a CPU that implements the
    32-bit feature and leaves away protected mode would have been more
    popular than the 80286; and (depending on how the 32-bit extension was implemented) it might have been a better stepping stone towards the
    kind of CPU with protected mode that they imagined; but the alt-386
    designers probably would not have designed in this kind of protected
    mode that they did.

    Concerning paging, all these scenarios are without paging. Paging was primarily a virtual-memory feature, not a memory-protection feature.
    It acquired memory protection only as far as it was easy with pages
    (i.e., at page granularity). So paging was not designed as a
    competition to segments as far as protection was concerned. If
    computer architects had managed to design segmentation with
    competitive performance, we would be seeing hardware with both paging
    and segmentation nowadays. Or maybe even without paging, now that
    memories tend to be big enough to make virtual memory mostly
    unnecessary.

    And I do not think thay could make
    32-bit processor with segmentation in available transistor
    buget,

    Maybe not.

    and even it they managed it would be slowed down by too
    long addresses (segment + 32-bit offset).

    On the contrary, every program that does not fit in the medium memory
    model on the 80286 would run at least as fast on such a CPU in real
    mode and significantly faster in protected mode.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Jan 3 18:11:53 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    But IIUC first paging Unix appeared _after_ release of 286.

    From ><https://en.wikipedia.org/wiki/History_of_the_Berkeley_Software_Distribution#3BSD>:

    |The kernel of 32V was largely rewritten by Berkeley graduate student
    |Özalp Babaoğlu to include a virtual memory implementation, and a
    |complete operating system including the new kernel, ports of the 2BSD >|utilities to the VAX, and the utilities from 32V was released as 3BSD
    |at the end of 1979.

    There was also a version of Western Electric unix that ran on the VAX in that time frame.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Sat Jan 4 22:40:51 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    In the second case one can pack several objects into a single
    segment, so apart from the lost security properties this is not
    a big problem. But there is a lot of segment-register loading,
    and slow loading is a problem.

    Concerning code "model", I think that Intel assumend that
    most procedures would fit in a single segment and that
    average procedure will be of order of single kilobytes.
    Using 16-bit offsets for jumps inside procedure and
    segment-offset pair for calls is likely to lead to better
    or similar performance as purely 32-bit machine.

    With the 80286's segments and their slowness, that is very doubtful.
    The 8086 has branches with 8-bit offsets and branches and calls with
    16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
    offsets and branches and calls with 32-bit offsets; if 16-bit offsets
    for branches would be useful enough for performance, they could
    instead have designed the longer branch length to be 16 bits, and
    maybe a prefix for 32-bit branch offsets.

    At that time Intel apparently wanted to avoid having too many
    instructions.

    That would be faster than
    what you outline, as soon as one call happens. But apparently 16-bit branches are not that beneficial, or they would have gone that way on
    the 386.

    For a machine with a 32-bit bus the benefit is much smaller.

    Another usage of segments for code would be to put the code segment of
    a shared object (known as DLL among Windowsheads) in a segment, and
    use far calls to call functions in other shared objects, while using
    near calls within a shared object. This allows to share the code
    segments between different programs, and to locate them anywhere in
    physical memory. However, AFAIK shared objects were not a thing in
    the 80286 timeframe; Unix only got them in the late 1980s.

    IIUC shared segments were widely used on Multics.

    I used Xenix on a 286 in 1986 or 1987; my impression is that programs
    were limited to 64KB code and 64KB data size, exactly the PDP-11 model
    you denounce.

    Maybe. I have seen many cases where software essentially "wastes"
    good things offered by the hardware.

    What went wrong? IIUC there were several control systems
    using 286 features, so there was some success. But PC-s
    became main user of x86 chips and significant fraction
    of PC-s was used for gaming. Game authors wanted direct
    access to hardware which in case of 286 forced real mode.

    Every successful software used direct access to hardware because of performance; the rest waned. Using BIOS calls was just too slow.
    Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster from
    writing directly to video.

    For most early graphics cards direct screen access could have been
    allowed just by allocating an appropriate segment. And most non-games
    could have gained good performance with a better system interface.
    I think that the variety of tricks used in games and their
    popularity made protected-mode systems much less appealing
    to vendors. And that discouraged work on better interfaces
    for non-games.

    More generally, vendors could have released separate versions of
    programs for the 8086 and the 286, but few did so. And users having
    only binaries wanted to run their 8086 programs on their new systems,
    which led to heroic efforts like the OS/2 DOS box and later Linux
    dosemu. But integration of 8086 programs with protected mode was
    solved too late for the 286 model to gain traction (and on the 286 a
    "DOS box" had to run in real mode, breaking normal system protection).

    But IIUC first paging Unix appeared _after_ release of 286.

    From <https://en.wikipedia.org/wiki/History_of_the_Berkeley_Software_Distribution#3BSD>:

    |The kernel of 32V was largely rewritten by Berkeley graduate student
    |Özalp Babaoğlu to include a virtual memory implementation, and a
    |complete operating system including the new kernel, ports of the 2BSD
    |utilities to the VAX, and the utilities from 32V was released as 3BSD
    |at the end of 1979.

    The 80286 was introduced on February 1, 1982.

    OK

    In 286 time Multics was highly regarded and it heavily depended
    on segmentaion. MVS was using paging hardware, but was
    talking about segments, except for that MVS segmentation
    was flawed because some addresses far outside a segment were
    considered as part of different segment. I think that also
    in VMS there was some taliking about segments. So creators
    of 286 could believe that they are providing "right thing"
    and not a fake possible with paging hardware.

    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competetive in protected mode compared to real mode.

    AFAICS that covered the vast majority of programs during the eighties.
    Turbo Pascal offered only the medium memory model and was quite
    popular. Its code generator produced mediocre output, but
    real Turbo Pascal programs used a lot of inline assembly
    and performance was not bad.

    Intel apparently assumed that programmers were willing to spend
    extra work to get good performance, and IMO this was right
    as a general statement. Intel probably did not realize that
    programmers would be very reluctant to spend work on security
    features, and in particular to spend work on making programs
    fast in 286 protected mode.

    So if they really had wanted protected mode to succeed, they should
    have designed in 32-bit data segments (and maybe also 32-bit code
    segments). Alternatively, if protected mode and the 32-bit addresses
    do not fit in the 286 transistor budget, a CPU that implements the
    32-bit feature and leaves away protected mode would have been more
    popular than the 80286; and (depending on how the 32-bit extension was implemented) it might have been a better stepping stone towards the
    kind of CPU with protected mode that they imagined; but the alt-386
    designers probably would not have designed in this kind of protected
    mode that they did.

    Intel probably assumed that the 286 would cover most needs, especially
    given that most systems had much less memory than the 16 MB theoretically
    allowed by the 286. And for bigger systems they released the 386.

    Concerning paging, all these scenarios are without paging. Paging was primarily a virtual-memory feature, not a memory-protection feature.

    Yes, exactly.

    It acquired memory protection only as far as it was easy with pages
    (i.e., at page granularity). So paging was not designed as a
    competition to segments as far as protection was concerned. If
    computer architects had managed to design segmentation with
    competetive performance, we would be seeing hardware with both paging
    and segmentation nowadays. Or maybe even without paging, now that
    memories tend to be big enough to make virtual memory mostly
    unnecessary.

    And I do not think thay could make
    32-bit processor with segmentation in available transistor
    buget,

    Maybe not.

    and even it they managed it would be slowed down by too
    long addresses (segment + 32-bit offset).

    On the contrary, every program that does not fit in the medium memory
    model on the 80286 would run at least as fast on such a CPU in real
    mode and significantly faster in protected mode.

    I think that Intel considered "programs that do not fit in the medium
    memory model" a tiny minority. IMO this is partially true: there
    is a class of programs which with some work fit into the medium
    model, but using a flat address space is easier. I think that
    on the 286 (that is, with a 16-bit bus) those programs (assuming enough
    tuning) run faster than a flat 32-bit version. But naive compilation
    in the large (or huge) model leads to worse speed than flat mode.

    In a somewhat different spirit, for programs that do not fit in
    64kB but are not too large, there is a natural temptation to
    have some "compression" scheme for pointers and use mostly
    16-bit pointers. That can be done without special hardware
    support. OTOH Intel segmentation is a specific proposal
    in that direction with hardware support. Clearly it is
    less flexible than software schemes based on native 32-bit
    addressing. But I think that Intel segmentation had some
    attractive features during the eighties.
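
    A minimal sketch of such a software compression scheme, assuming all
    compressible objects live in a single arena; the arena size, the
    4-byte granule and the names are illustrative, not anything period
    compilers actually provided:

        #include <stdint.h>

        /* Data structures store 16-bit offsets (scaled by 4) instead of
           full pointers, so a stored pointer costs 2 bytes yet can
           address up to 256 KiB worth of objects in the arena.           */
        static unsigned char arena[256u * 1024]; /* objects carved out at 4-byte granularity */

        typedef uint16_t cptr;                   /* compressed pointer */

        static cptr compress(void *p)
        {
            return (cptr)(((unsigned char *)p - arena) >> 2);
        }

        static void *decompress(cptr c)
        {
            return arena + ((uint32_t)c << 2);
        }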

    Another thing is the 386. I think that the designers of the 286 thought
    that the 386 would remove some limitations. And the 386 allowed
    bigger segments, removing one major limitation. OTOH
    for a 32-bit processor with segmentation it would be natural
    to have 32-bit segment registers. It is not clear to
    me whether the 16-bit segment registers in the 386 were deemed necessary
    for backward compatibility, or whether by the 386 period the flat-memory
    faction in Intel had won and they kept segmentation mostly
    for compatibility.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jan 5 02:56:08 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    >antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    Intel clearly had some strong opinions about how people would program
    the 286, which turned out to bear almost no relation to the way we
    actually wanted to program it.

    Some of the stuff they did was just perverse, like putting flag
    bits in the low part of the segment number rather than the high
    bit. If you had objects bigger than 64K, you had to shift
    the segment number three bits to the left when computing
    addresses.
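
    In code, that looks roughly like the following hypothetical
    huge-pointer increment for 286 protected mode, assuming the OS hands
    out consecutive descriptors that each map the next 64 KiB chunk of a
    large object (the types and names are mine):

        #include <stdint.h>

        /* A 286 selector is index:TI:RPL, with the table index in bits
           15..3, so "the next segment" means selector + 8, not + 1.      */
        typedef struct { uint16_t sel; uint16_t off; } farptr;

        static farptr huge_advance(farptr p, uint32_t delta)
        {
            uint32_t linear = (uint32_t)p.off + delta;  /* may cross 64K chunks       */
            p.sel += (uint16_t)((linear >> 16) << 3);   /* segment number shifted by 3 */
            p.off  = (uint16_t)linear;
            return p;
        }

    In real mode the corresponding step is a shift by 12 instead, since
    consecutive 64 KiB chunks are 4096 paragraphs apart.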

    They also apparently didn't expect people to switch segments much.
    If you loaded a segment register with the value it already contained,
    it still fetched all of the stuff from memory. How many gates would
    it have taken to check for the same value and bypass the loads? If
    they had done that, we could have used large model calls everywhere
    since long and short calls would be about the same speed, and not
    had to screw around deciding what was a long call and what was short
    and writing glue code to allow both kinds.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Sun Jan 5 03:55:39 2025
    On Sun, 5 Jan 2025 2:56:08 +0000, John Levine wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view main drawbacks of 286 is poor support for
    large arrays and problem for Lisp-like system which have a lot
    of small data structures and traverse then via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    Intel clearly had some strong opinions about how people would program
    the 286, which turned out to bear almost no relation to the way we
    actually wanted to program it.

    Clearly, Intel thought that .text, .data, .stack, and .heap were
    about all anyone would ever need.

    Some of the stuff they did was just perverse, like putting flag
    bits in the low part of the segment number rather than the high
    bit. If you had objects bigger than 64K, you had to shift
    the segment number three bits to the left when computing
    addresses.

    Let us just agree that whatever they were thinking, they blew it.

    They also apparently didn't expect people to switch segments much.

    Clearly. They expected segments to be essentially stagnant--unlike
    the people trying to do things with x86s...

    If you loaded a segment register with the value it already contained,
    it still fetched all of the stuff from memory.

    If segment LD was 1 cycle, the number of segment changes would be
    fine. But since LDing a segment was so expensive, they probably
    could not afford the transistors and wires to do what you suggest.

    How many gates would
    it have taken to check for the same value and bypass the loads?

    286:: 180 gates per segment register
    386:: 360 gates per segment register

    If
    they had done that, we could have used large model calls everywhere
    since long and short calls would be about the same speed, and not
    had to screw around deciding what was a long call and what was short
    and writing glue codes to allow both kinds.

    If you had 32 segment registers, it probably would not have mattered
    that segment LD was slow. And if you had 32 pointing registers, you
    would not have needed a LD-OP ISA.

    Sigh ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Waldek Hebisch on Sun Jan 5 08:54:29 2025
    Waldek Hebisch wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competitive in protected mode compared to real mode.

    AFAICS that covered the vast majority of programs during the eighties.
    Turbo Pascal offered only medium memory model and was quite
    popular. Its code generator produced mediocre output, but
    real Turbo Pascal programs used a lot of inline assembly
    and performance was not bad.

    As someone who wrote megabytes of that asm, I feel qualified to comment:

    Turbo Pascal 1.0 itself ran in the Small model (64kB code & data) AFAIR, but
    since the compiler/editor/linker/loader/debugger only used 35 kB (37 kB
    if you also loaded the text error messages), it had enough room left
    over for the source code it compiled.

    From the very beginning it supported Medium as you state, with separate
    code in the CS reg and data+stack (DS+SS) sharing a single segment.

    This way you had to use ES for all cross-segment operations,
    particularly REP MOVSB block moves.

    Later versions supported the Large model where all addresses were segment+offset pairs, as well as Huge where the segment was pointing at
    the object, rounded down to the nearest 16-byte boundary, and the offset (typically BX) was always [0-15].
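
    [Aside: a sketch, in portable C, of the real-mode "huge" normalization
    described above: fold the offset into the 16-byte-granular segment so the
    remaining offset is always 0-15. The function name is invented; an actual
    huge pointer would of course live in a 16-bit compiler.]

        #include <stdint.h>
        #include <stdio.h>

        /* Fold a 20-bit linear address into segment:offset with offset 0..15,
         * so objects can span more than 64K without the offset wrapping. */
        static void huge_normalize(uint32_t linear, uint16_t *seg, uint16_t *off)
        {
            *seg = (uint16_t)(linear >> 4);   /* paragraph (16-byte) granularity */
            *off = (uint16_t)(linear & 0xF);  /* what remains: always 0..15 */
        }

        int main(void)
        {
            uint16_t seg, off;
            huge_normalize(0x12345u, &seg, &off);
            printf("%04X:%04X\n", seg, off);  /* prints 1234:0005 */
            return 0;
        }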

    Intel apparently assumed that programmers are willing to spend
    extra work to get good performance and IMO this was right
    as a general statement. Intel probably did not realize that
    programmers will be very reluctant to spend work on security
    features and in particular to spend work on making programs
    fast in 286 protected mode.

    Protected mode could only be fast if segment reloads were rare; in my own
    code I would allocate arrays of largish objects, as many as would fit
    in 64K, then grab the next segment.
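
    [Aside: a rough sketch in C of that allocation pattern, with invented
    names; the point is simply that every object in a chunk shares one 64K
    segment, so a segment register reload is only paid when crossing to the
    next chunk.]

        #include <stdlib.h>

        #define SEG_BYTES 65536u                    /* one 64K segment per chunk */

        typedef struct { char payload[512]; } Obj;  /* some "largish" object */
        #define OBJS_PER_CHUNK (SEG_BYTES / sizeof(Obj))

        typedef struct Chunk {
            struct Chunk *next;
            size_t        used;
            Obj           objs[OBJS_PER_CHUNK];
        } Chunk;

        /* Allocate from the current chunk; only when it fills up do we "grab
         * the next" -- i.e. pay for another segment (here just another malloc). */
        static Obj *alloc_obj(Chunk **head)
        {
            if (*head == NULL || (*head)->used == OBJS_PER_CHUNK) {
                Chunk *c = malloc(sizeof *c);
                if (!c) return NULL;
                c->next = *head;
                c->used = 0;
                *head = c;
            }
            return &(*head)->objs[(*head)->used++];
        }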

    Terje
    PS. Happy New Year!

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Anton Ertl on Sun Jan 5 14:48:00 2025
    In article <2025Jan3.093849@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    The 8086 has branches with 8-bit offsets and branches and calls
    with 16-bit offsets. The 386 in 32-bit mode has branches with
    8-bit offsets and branches and calls with 32-bit offsets; if
    16-bit offsets for branches would be useful enough for performance,
    they could instead have designed the longer branch length to be
    16 bits, and maybe a prefix for 32-bit branch offsets. That would
    be faster than what you outline, as soon as one call happens.
    But apparently 16-bit branches are not that beneficial, or they
    would have gone that way on the 386.

    Don't assume that Intel of the early 1980s would have done enough
    simulation to explore those possibilities thoroughly. Given the mistakes
    they made in the 1970s with iAPX 432 and in the 1990s with Itanium, both through lack of simulation with varying workloads, they may well have
    been working by rules of thumb and engineering "intuition."

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Sun Jan 5 11:10:28 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view the main drawbacks of the 286 are poor support for
    large arrays and problems for Lisp-like systems which have a lot
    of small data structures and traverse them via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    In the second case one can pack several objects into a single
    segment, so except for lost security properties this is not
    a big problem.

    If you go that way, you lose all the benefits of segments, and run
    into the "segments too small" problem. Which you then want to
    circumvent by using segment and offset in your addressing of the small
    data structures, which leads to:

    But there is a lot of loading segment registers
    and slow loading is a problem.

    ...
    Using 16-bit offsets for jumps inside a procedure and
    segment-offset pairs for calls is likely to lead to better
    or similar performance compared to a purely 32-bit machine.

    With the 80286's segments and their slowness, that is very doubtful.
    The 8086 has branches with 8-bit offsets and branches and calls with
    16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
    offsets and branches and calls with 32-bit offsets; if 16-bit offsets
    for branches would be useful enough for performance, they could
    instead have designed the longer branch length to be 16 bits, and
    maybe a prefix for 32-bit branch offsets.

    At that time Intel apparently wanted to avoid having too many
    instructions.

    Looking in my Pentium manual, the section on CALL has 20 lines for
    "call intersegment", "call gate" (with privilege variants) and "call
    to task" instructions, 10 of which probably already existed on the 286
    (compared to 2 lines for "call near" instructions that existed on the
    286), and the "Operation" section (the specification in pseudocode)
    consumes about 4 pages, followed by a 1.5 page "Description" section.

    9 of these 10 far call variants deal with protected-mode things, so
    Intel obviously had no qualms about adding instruction variants. If
    they instead had no protected mode, but some 32-bit support, including
    the near call with 32-bit offset that I suggest, that would have
    reduced the number of instruction variants.

    I used Xenix on a 286 in 1986 or 1987; my impression is that programs
    were limited to 64KB code and 64KB data size, exactly the PDP-11 model
    you denounce.

    Maybe. I have seen many cases where software essentially "wastes"
    good things offered by hardware.

    Which "good things offered by hardware" do you see "wasted" by this
    usage in Xenix? To me this seems to be the only workable way to use
    the 286 protected mode. Ok, the medium model (near data, far code)
    may also have been somewhat workable, but looking at the cycle counts
    for the protected-mode far calls on the Pentium (and on the 286 they
    were probably even more costly), which start at 22 cycles for a "call
    gate, same privilege" (compared to 1 cycle on the Pentium for a
    direct call near), one would strongly prefer the small model.

    Every successful piece of software used direct access to hardware because of
    performance; the rest waned. Using BIOS calls was just too slow.
    Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster from
    writing directly to video.

    For most early graphics cards direct screen access could be allowed
    just by allocating an appropriate segment. And most non-games
    could gain good performance with a better system interface.
    I think that the variety of tricks used in games and their
    popularity made protected-mode systems much less appealing
    to vendors. And that discouraged work on better interfaces
    for non-games.
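
    [Aside: for the record, the kind of direct screen access under discussion
    looked roughly like this (a sketch for a 16-bit DOS compiler such as Turbo
    C; 'far' is a compiler extension and B800h is the colour text-mode segment,
    so this will not build with a modern 32/64-bit compiler). Under a
    protected-mode OS the same code could work if the system handed the program
    a selector mapping the B8000h region, which is the "allocating an
    appropriate segment" mentioned above.]

        /* Write a character and attribute straight into colour text-mode video
         * memory at B800:0000 -- what Lotus 1-2-3 did instead of BIOS calls.
         * 16-bit DOS C only: casting a long to a far pointer puts the high
         * word into the segment. */
        static void put_cell(int row, int col, char ch, unsigned char attr)
        {
            unsigned int far *video = (unsigned int far *)0xB8000000UL;
            video[row * 80 + col] = ((unsigned int)attr << 8) | (unsigned char)ch;
        }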

    MicroSoft and IBM invested lots of work in a 286 protected-mode
    interface: OS/2 1.x. It was limited to the 286 at the insistence of
    IBM, even though work started in August 1985, when they already knew
    that the 386 was coming soon. OS/2 1.0 was released in April 1987,
    1.5 years after the 386.

    OS/2 1.x flopped, and by the time OS/2 was adjusted to the 386, it was
    too late, so the 286 killed OS/2; here we have a case of a software
    project being death-marched by tying itself to "good things offered by hardware" (except that Microsoft defected from the death march after a
    few years).

    Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
    addition to the base (8086) variant of Windows 2.0, which was released
    in December 1987), which used 386 protected mode and virtual 8086 mode
    (which was missing in the "brain-damaged" (Bill Gates) 286). So
    Windows completely ignored 286 protected mode. Windows eventually
    became a big success.

    Also, Microsoft started NT OS/2 in November 1988 to target the 386
    while IBM was still working on 286 OS/2. Eventually Microsoft and IBM
    parted ways, NT OS/2 became Windows NT, which is the starting point of
    all remaining Windowses from Windows XP onwards.

    Xenix, apart from OS/2 the only other notable protected-mode OS for
    the 286, was ported to the 386 in 1987, after SCO secured "knowledge
    from Microsoft insiders that Microsoft was no longer developing
    Xenix", so SCO (or Microsoft) might have done it even earlier if the
    commercial situation had been less muddled; in any case, Xenix jumped
    the 286 ship ASAP.

    The verdict is: The only good use of the 286 is as a faster 8086;
    small memory model multi-tasking use is possible, but the 64KB
    segments are so limiting that everybody who understood software either
    decided to skip this twist (MicroSoft, except on their OS/2 death
    march), or jumped ship ASAP (SCO).

    More generally, vendors could release separate versions of
    programs for 8086 and 286 but few did so.

    Were there any who released software both as 8086 and a protected-mode
    80286 variants? Microsoft/SCO with Xenix, anyone else?

    And users having
    only binaries wanted to use 8086 on their new systems which
    led to heroic efforts like OS/2 DOS box and later Linux
    dosemu. But integration of 8086 programs with protected
    mode was solved too late for 286 model to gain traction
    (and on 286 "DOS box" had to run in real mode, breaking
    normal system protection).

    Linux never ran on a 80286, and DOSemu uses the virtual 8086 mode,
    which does not require heroic efforts AFAIK.

    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competitive in protected mode compared to real mode.

    AFAICS that covered the vast majority of programs during the eighties.

    The "vast majority" is not enough; if a key application like Lotus
    1-2-3 or Wordperfect did not work on the DOS alternative, the DOS
    alternative was not used. And Lotus 1-2-3 and Wordperfect certainly
    did not limit themselves to 64KB of data.

    Turbo Pascal offered only medium memory model

    According to Terje Mathisen, it also offered the large memory model.
    On its Wikipedia page, I find: "Besides allowing applications larger
    than 64 KB, Byte in 1988 reported ... for version 4.0". So apparently
    Turbo Pascal 4.0 introduced support for the large memory model in
    1988.

    Intel apparently assumed that programmers are willing to spend
    extra work to get good performance and IMO this was right
    as a general statement. Intel probably did not realize that
    programmers will be very reluctant to spend work on security
    features and in particular to spend work on making programs
    fast in 286 protected mode.

    80286 protected mode is never faster than real mode on the same CPU,
    so the way to make programs fast on the 286 is to stick with real
    mode; using the small memory model is an alternative, but as
    mentioned, the memory limits are too restrictive.

    Intel probably assumed that the 286 would cover most needs,

    As far as protected mode was concerned, they hardly could have been
    more wrong.

    especially
    given that most systems had much less memory than the 16 MB theoretically
    allowed by the 286.

    They provided 24 address pins, so they obviously assumed that there
    would be 80286 systems with >8MB. 64KB segments are already too
    limiting on systems with 1MB (which was supported by the 8086),
    probably even for anything beyond 128KB.

    IMO this is partially true: there
    is a class of programs which with some work fit into medium
    model, but using flat address space is easier. I think that
    on 286 (that is with 16 bit bus) those programs (assuming enough
    tuning) run faster than flat 32-bit version.

    Maybe in real mode. Certainly not in protected mode. Just run your
    tuned large-model protected-mode program against a 32-bit small-model
    program for the same task on a 386SX (which is reported as having a
    very similar speed to the 80286 on 16-bit programs). And even if you
    find one case where the protected-mode program wins, nobody found it
    worth their time to do this nonsense. And so OS/2 flopped despite
    being backed by IBM and, until 1990, Microsoft.

    But I think that Intel segmentation had some
    attractive features during eighties.

    You are one of a tiny minority. Even Intel finally saw the light, as
    did everybody else, and nowadays segments are just a bad memory.

    Another thing is the 386. I think that the designers of the 286 thought
    that the 386 would remove some limitations. And the 386 allowed
    bigger segments, removing one major limitation. OTOH
    for a 32-bit processor with segmentation it would be natural
    to have 32-bit segment registers. It is not clear to
    me if 16-bit segment registers in the 386 were deemed necessary
    for backward compatibility, or maybe by the 386 period the flat-memory
    faction in Intel had won and they kept segmentation mostly
    for compatibility.

    The latter (read the 386 oral history). The 386 designers knew that
    segments have no future, and they were right, so they kept them at a
    minimum.

    If they had gone for 32-bit segment registers (and 64-bit segment
    registers for AMD64), would segments have fared any better? I doubt
    it. Using segments would have stayed slow, and would have been
    ignored by nearly all programmers.

    These days we see segment-like things in security extensions of
    instruction sets, but slowness still plagues these extensions, and
    security researchers often find ways to subvert the promised security
    (and sometimes even more).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Waldek Hebisch on Sun Jan 5 15:20:31 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    Another usage of segments for code would be to put the code segment of
    a shared object (known as DLL among Windowsheads) in a segment, and
    use far calls to call functions in other shared objects, while using
    near calls within a shared object. This allows to share the code
    segments between different programs, and to locate them anywhere in
    physical memory. However, AFAIK shared objects were not a thing in
    the 80286 timeframe; Unix only got them in the late 1980s.

    IIUC shared segments were widely used on Multics.

    They were widely used on both the Burroughs large systems
    and the HP-3000 as well, both exemplars of segmentation
    done right, in so far as it can be.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Dallman on Sun Jan 5 15:23:38 2025
    jgd@cix.co.uk (John Dallman) writes:
    In article <6d5fa21e63e14491948ffb6a9d08485a@www.novabbs.org>,
    mitchalsup@aol.com (MitchAlsup1) wrote:
    On Sun, 5 Jan 2025 2:56:08 +0000, John Levine wrote:
    They also apparently didn't expect people to switch segments much.

    Clearly. They expected segments to be essentially stagnant--unlike
    the people trying to do things with x86s...

    An idea: The target markets for the 8080 and 8085 were clearly embedded
    systems. The Z80 and 6502 rapidly became popular in the micro-computer
    market, but the 808[05] did not. Intel may still have been thinking in
    terms of embedded systems when designing the 80286.

    I suspect that we don't, today, recall all the constraints that
    were facing Intel with respect to processor ASIC development in the
    late 70's and early 80's.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to mitchalsup@aol.com on Sun Jan 5 15:15:00 2025
    In article <6d5fa21e63e14491948ffb6a9d08485a@www.novabbs.org>, mitchalsup@aol.com (MitchAlsup1) wrote:
    On Sun, 5 Jan 2025 2:56:08 +0000, John Levine wrote:
    They also apparently didn't expect people to switch segments much.

    Clearly. They expected segments to be essentially stagnant--unlike
    the people trying to do things with x86s...

    An idea: The target markets for the 8080 and 8085 were clearly embedded systems. The Z80 and 6502 rapidly became popular in the micro-computer
    market, but the 808[05] did not. Intel may still have been thinking in
    terms of embedded systems when designing the 80286.

    The IBM PC was launched in August 1981 and around a year passed before it became clear that this machine was having a huge and lasting effect on
    the market. The 80286 was released on February 1st 1982, although it
    wasn't used much in PCs until the IBM PC/AT in August 1984.

    The 80386 sampled in 1985 and was mass-produced in 1986. That would seem
    to have been the first version of x86 where it was obvious at the start
    of design that use in general-purpose computers would be important. It
    was far more suitable for the job than the 80286, and things developed
    from there.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robert Swindells@21:1/5 to Anton Ertl on Sun Jan 5 18:30:41 2025
    On Sun, 05 Jan 2025 11:10:28 GMT, Anton Ertl wrote:

    Xenix, apart from OS/2 the only other notable protected-mode OS for the
    286, was ported to the 386 in 1987, after SCO secured "knowledge from Microsoft insiders that Microsoft was no longer developing Xenix", so
    SCO (or Microsoft) might have done it even earlier if the commercial situation had been less muddled; in any case, Xenix jumped the 286 ship
    ASAP.

    Microport Systems had UNIX System V for the 286.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Dallman on Sun Jan 5 17:51:34 2025
    jgd@cix.co.uk (John Dallman) writes:
    An idea: The target markets for the 8080 and 8085 were clearly embedded
    systems. The Z80 and 6502 rapidly became popular in the micro-computer
    market, but the 808[05] did not.

    The 8080 was used in the first microcomputers, e.g., the 1974 Altair
    8800 and the IMSAI 8080. It was important for all the CP/M machines,
    because the CP/M software (both the OS and the programs running on it)
    were written to use the 8080 instruction set, not the Z80 instruction
    set. And CP/M was the professional microcomputer OS before the IBM PC compatible market took off, despite the fact that the most popular microcomputers of the time (such as the Apple II, TRS-80 and PET) did
    not use it; there was an add-on card for the Apple II with a Z80 for
    running CP/M, though, which shows the importance of CP/M.

    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there. It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    Intel released the MCS-51 (aka 8051) in 1980 for embedded systems, and
    it's very successful there, and before that came the MCS-48 (8048) in
    1976.

    Intel may still have been thinking in
    terms of embedded systems when designing the 80286.

    I very much doubt that the segments and the 24 address bits were
    designed for embedded systems. The segments look more like an echo of
    the iAPX432 than of anything designed for embedded systems.

    The idea of some static allocation of memory for which segments might
    work may come from old mainframe systems, where programs were (in my
    impression) more static than PC programs and modern computing. Even
    Unix programs, which were more dynamic than mainframe programs, had
    quite a bit of static allocation in the early days; this is reflected
    in the paragraph in the GNU coding standards:

    |Avoid arbitrary limits on the length or number of any data structure,
    |including file names, lines, files, and symbols, by allocating all
    |data structures dynamically. In most Unix utilities, "long lines are
    |silently truncated". This is not acceptable in a GNU utility.

    The IBM PC was launched in August 1981 and around a year passed before it
    became clear that this machine was having a huge and lasting effect on
    the market. The 80286 was released on February 1st 1982, although it
    wasn't used much in PCs until the IBM PC/AT in August 1984.

    The 80286 project was started in 1978, before any use of the 8086. <https://timeline.intel.com/1978/kicking-off-the-80286> claims that
    they "spent six months on field research into customers' needs alone";
    Judging by the results, maybe the customers were clueless, or maybe
    Intel asked the wrong questions.

    The 80386 sampled in 1985 and was mass-produced in 1986. That would seem
    to have been the first version of x86 where it was obvious at the start
    of design that use in general-purpose computers would be important.

    Actually, reading the oral history of the 386, at the start the 386
    project was just an unimportant followon of the 286, while the main
    action was expected to be on the BiiN project (from which the i960
    came). Only sometime during that project the IBM PC market exploded
    and the 386 became the most important project of the company.

    But yes, they were very much aware of the needs of programmers in the
    386 project, and would probably have done something with just paging
    and no segments if they did not have the 8086 and 80286 legacy.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sun Jan 5 20:01:25 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there. It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    I have heard that the IBM PC was originally designed with a Z80, and
    fairly late in the process someone decided (not unreasonably) that it
    wouldn't be different enough from all the other Z80 boxes to be an
    interesting product. They wanted a 16 bit processor but for time and
    money reasons they stayed with the 8 bit bus they already had. The
    options were 68008 and 8088. Moto was only shipping samples of the
    68008 while Intel could provide 8088 in quantity, so they went with
    the 8088.

    If Moto had been a little farther along, the history of the PC industry
    could have been quite different.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Jan 5 19:40:42 2025
    On Sun, 5 Jan 2025 17:51:34 +0000, Anton Ertl wrote:

    jgd@cix.co.uk (John Dallman) writes:
    An idea: The target markets for the 8080 and 8085 were clearly embedded
    systems. The Z80 and 6502 rapidly became popular in the micro-computer
    market, but the 808[05] did not.

    The 8080 was used in the first microcomputers, e.g., the 1974 Altair
    8800 and the IMSAI 8080. It was important for all the CP/M machines,
    because the CP/M software (both the OS and the programs running on it)
    were written to use the 8080 instruction set, not the Z80 instruction
    set. And CP/M was the professional microcomputer OS before the IBM PC compatible market took off, despite the fact that the most popular microcomputers of the time (such as the Apple II, TRS-80 and PET) did
    not use it; there was an add-on card for the Apple II with a Z80 for
    running CP/M, though, which shows the importance of CP/M.

    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there. It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    Intel released the MCS-51 (aka 8051) in 1980 for embedded systems, and
    it's very successful there, and before that came the MCS-48 (8048) in
    1976.

    Intel may still have been thinking in
    terms of embedded systems when designing the 80286.

    I very much doubt that the segments and the 24 address bits were
    designed for embedded systems. The segments look more like an echo of
    the iAPX432 than of anything designed for embedded systems.

    The idea of some static allocation of memory for which segments might
    work may come from old mainframe systems, where programs were (in my
    impression) more static than PC programs and modern computing. Even
    Unix programs, which were more dynamic than mainframe programs, had
    quite a bit of static allocation in the early days; this is reflected
    in the paragraph in the GNU coding standards:

    |Avoid arbitrary limits on the length or number of any data structure,
    |including file names, lines, files, and symbols, by allocating all
    |data structures dynamically. In most Unix utilities, "long lines are
    |silently truncated". This is not acceptable in a GNU utility.

    The IBM PC was launched in August 1981 and around a year passed before it
    became clear that this machine was having a huge and lasting effect on
    the market. The 80286 was released on February 1st 1982, although it
    wasn't used much in PCs until the IBM PC/AT in August 1984.

    The 80286 project was started in 1978, before any use of the 8086. <https://timeline.intel.com/1978/kicking-off-the-80286> claims that
    they "spent six months on field research into customers' needs alone"; Judging by the results, maybe the customers were clueless, or maybe
    Intel asked the wrong questions.

    Or perhaps what Intel thought they heard was not closely related
    to what the customers were actually saying.

    The 80386 sampled in 1985 and was mass-produced in 1986. That would seem
    to have been the first version of x86 where it was obvious at the start
    of design that use in general-purpose computers would be important.

    Actually, reading the oral history of the 386, at the start the 386
    project was just an unimportant followon of the 286, while the main
    action was expected to be on the BiiN project (from which the i960
    came). Only sometime during that project the IBM PC market exploded
    and the 386 became the most important project of the company.

    But yes, they were very much aware of the needs of programmers in the
    386 project, and would probably have done something with just paging
    and no segments if they did not have the 8086 and 80286 legacy.

    Oh well ...

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Sun Jan 5 20:55:20 2025
    On Sun, 5 Jan 2025 20:01:25 +0000, John Levine wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the >>iAPX432 clearly showed that they wanted to be dominant there. It's an >>irony of history that the 8086/8088 actually went where the action
    was.

    I have heard that the IBM PC was originally designed with a Z80, and
    fairly late in the process someone decided (not unreasonably) that it
    wouldn't be different enough from all the other Z80 boxes to be an
    interesting product. They wanted a 16 bit processor but for time and
    money reasons they stayed with the 8 bit bus they already had. The
    options were 68008 and 8088. Moto was only shipping samples of the
    68008 while Intel could provide 8088 in quantity, so they went with
    the 8088.

    If Moto had been a little farther along, the history of the PC industry
    could have been quite different.

    If Moto had done 68008 first, it may very well have turned out
    differently.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Brett@21:1/5 to John Levine on Sun Jan 5 20:46:43 2025
    John Levine <johnl@taugh.com> wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there. It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    I have heard that the IBM PC was originally designed with a Z80, and
    fairly late in the process someone decided (not unreasonably) that it
    wouldn't be different enough from all the other Z80 boxes to be an
    interesting product. They wanted a 16 bit processor but for time and
    money reasons they stayed with the 8 bit bus they already had. The
    options were 68008 and 8088. Moto was only shipping samples of the
    68008 while Intel could provide 8088 in quantity, so they went with
    the 8088.

    If Moto had been a little farther along, the history of the PC industry
    could have been quite different.

    The 8088 was not a threat to any of IBM’s existing products, the 68x00 was.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Sun Jan 5 22:01:36 2025
    MitchAlsup1 wrote:
    On Sun, 5 Jan 2025 20:01:25 +0000, John Levine wrote:

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Anyway, while Zilog may have taken their sales, I very much believe
    that Intel was aware of the general-purpose computing market, and the
    iAPX432 clearly showed that they wanted to be dominant there.  It's an
    irony of history that the 8086/8088 actually went where the action
    was.

    I have heard that the IBM PC was originally designed with a Z80, and
    fairly late in the process someone decided (not unreasonably) that it
    wouldn't be different enough from all the other Z80 boxes to be an
    interesting product. They wanted a 16 bit processor but for time and
    money reasons they stayed with the 8 bit bus they already had. The
    options were 68008 and 8088. Moto was only shipping samples of the
    68008 while Intel could provide 8088 in quantity, so they went with
    the 8088.

    If Moto had been a little farther along, the history of the PC industry
    could have been quite different.

    If Moto had done 68008 first, it may very well have turned out
    differently.

    But neither of these were possible (i.e. available) when IBM picked
    their CPU.

    I do believe that IBM did seriously consider the risk of making the PC
    too good, so that it would compete directly with their low-end systems (8100?).

    At least, that's what I assumed when the PC-AT only ran at 6MHz on a CPU
    which was designed for 8 MHz. I fondly remember a bunch of overclocking
    hacks on various 286 machines, most of them ran at 9 MHz, and I don't
    think I saw any that didn't handle 8 MHz.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Sun Jan 5 21:49:20 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    From my point of view the main drawbacks of the 286 are poor support for
    large arrays and problems for Lisp-like systems which have a lot
    of small data structures and traverse them via pointers.

    Yes. In the first case the segments are too small, in the latter case
    there are too few segments (if you have one segment per object).

    In the second case one can pack several objects into a single
    segment, so except for lost security properties this is not
    a big problem.

    If you go that way, you lose all the benefits of segments, and run
    into the "segments too small" problem. Which you then want to
    circumvent by using segment and offset in your addressing of the small
    data structures, which leads to:

    But there is a lot of loading segment registers
    and slow loading is a problem.

    ...
    Using 16-bit offsets for jumps inside a procedure and
    segment-offset pairs for calls is likely to lead to better
    or similar performance compared to a purely 32-bit machine.

    With the 80286's segments and their slowness, that is very doubtful.
    The 8086 has branches with 8-bit offsets and branches and calls with
    16-bit offsets. The 386 in 32-bit mode has branches with 8-bit
    offsets and branches and calls with 32-bit offsets; if 16-bit offsets
    for branches would be useful enough for performance, they could
    instead have designed the longer branch length to be 16 bits, and
    maybe a prefix for 32-bit branch offsets.

    At that time Intel apparently wanted to avoid having too many
    instructions.

    Looking in my Pentium manual, the section on CALL has 20 lines for
    "call intersegment", "call gate" (with privilege variants) and "call
    to task" instructions, 10 of which probably already existed on the 286
    (compared to 2 lines for "call near" instructions that existed on the
    286), and the "Operation" section (the specification in pseudocode)
    consumes about 4 pages, followed by a 1.5 page "Description" section.

    9 of these 10 far call variants deal with protected-mode things, so
    Intel obviously had no qualms about adding instruction variants. If
    they instead had no protected mode, but some 32-bit support, including
    the near call with 32-bit offset that I suggest, that would have
    reduced the number of instruction variants.

    I wrote "instructions". Intel clearly used modes and variants,
    but different call would lead to new opcode.

    I used Xenix on a 286 in 1986 or 1987; my impression is that programs
    were limited to 64KB code and 64KB data size, exactly the PDP-11 model
    you denounce.

    Maybe. I have seen many cases where software essentially "wastes"
    good things offered by hardware.

    Which "good things offered by hardware" do you see "wasted" by this
    usage in Xenix?

    Medium model and shared segments. Plus an escape for programs needing
    more memory (traditional Unix programs by necessity fit in 64kB
    limits).

    To me this seems to be the only workable way to use
    the 286 protected mode. Ok, the medium model (near data, far code)
    may also have been somewhat workable, but looking at the cycle counts
    for the protected-mode far calls on the Pentium (and on the 286 they
    were probably even more costly), which start at 22 cycles for a "call
    gate, same privilege" (compared to 1 cycle on the Pentium for a
    direct call near), one would strongly prefer the small model.

    I have found an instruction list on the web which claims 26 + m cycles,
    where m is the "length of next instruction" (whatever that means), for a
    protected-mode call using a segment. A real-mode call using a segment
    is 13 + m cycles. A near call is 7 + m cycles.

    Intel clearly expected that segment-changing calls would be infrequent.
    AFAICS this was better than the system conventions on IBM mainframes,
    where a "standard" call normally called a memory-allocation function
    to allocate the stack frame. I do not have data for the VAX handy, but
    VAX calls were quite complex, so probably also not fast.

    And modern data at least partially confirms Intel's beliefs. When
    AMD introduced 64-bit mode they also introduced a complex calling
    convention intended to optimize the speed of calls. Later there
    was a paper by Intel folks essentially claiming that this
    calling convention does not matter: C compilers inline small
    routines, so the cost of calls relative to other things is quite
    small. I think that what was inlined in 2010 would have been called
    using near calls in 1982.

    Every successful piece of software used direct access to hardware because of
    performance; the rest waned. Using BIOS calls was just too slow.
    Lotus 1-2-3 won out over VisiCalc and Multiplan by being faster from
    writing directly to video.

    For most early graphics cards direct screen access could be allowed
    just by allocating an appropriate segment. And most non-games
    could gain good performance with a better system interface.
    I think that the variety of tricks used in games and their
    popularity made protected-mode systems much less appealing
    to vendors. And that discouraged work on better interfaces
    for non-games.

    MicroSoft and IBM invested lots of work in a 286 protected-mode
    interface: OS/2 1.x. It was limited to the 286 at the insistence of
    IBM, even though work started in August 1985, when they already knew
    that the 386 was coming soon. OS/2 1.0 was released in April 1987,
    1.5 years after the 386.

    OS/2 1.x flopped, and by the time OS/2 was adjusted to the 386, it was
    too late, so the 286 killed OS/2; here we have a case of a software
    project being death-marched by tying itself to "good things offered by hardware" (except that Microsoft defected from the death march after a
    few years).

    Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
    addition to the base (8086) variant of Windows 2.0, which was released
    in December 1987), which used 386 protected mode and virtual 8086 mode
    (which was missing in the "brain-damaged" (Bill Gates) 286). So
    Windows completely ignored 286 protected mode. Windows eventually
    became a big success.

    What I recall is a bit different. IIRC the first successful version of
    Windows, that is Windows 3.0, had 3 modes of operation: 8086 compatible,
    286 protected mode and 386 protected mode. Only later did Microsoft
    drop the requirement for 8086 compatibility. I think still later
    it dropped 286 support. Windows 95 was supposed to be 32-bit,
    but contained quite a lot of 16-bit code. IIRC the system interface
    to Windows 3.0 and 3.1 was 16-bit and only later did Microsoft
    release an extension allowing 32-bit system calls.

    I have no information about Windows internals except for some
    public statements by Microsoft and other people, but I think
    it reasonable to assume that Windows was actually a successful
    example of 8086/286/386 compatibility. That is, their 16-bit
    code could use real-mode segmentation or protected-mode
    segmentation, the latter both on the 286 and the 386. For the 32-bit
    version they added a translation layer to transform arguments
    between the 16-bit world and the 32-bit world. It is possible
    that this translation layer involved a lot of effort. IIUC
    DEC, when porting VMS to Alpha, essentially gave up using
    32-bit pointers as the main interface.

    Anyway, it seems that Windows was at least as tied to the 286
    as OS/2 when it became successful, and dropped 286 support
    later. And for a long time after dropping 286 support
    Windows massively used 16-bit segments.

    Also, Microsoft started NT OS/2 in November 1988 to target the 386
    while IBM was still working on 286 OS/2. Eventually Microsoft and IBM
    parted ways, NT OS/2 became Windows NT, which is the starting point of
    all remaining Windowses from Windows XP onwards.

    Xenix, apart from OS/2 the only other notable protected-mode OS for
    the 286, was ported to the 386 in 1987, after SCO secured "knowledge
    from Microsoft insiders that Microsoft was no longer developing
    Xenix", so SCO (or Microsoft) might have done it even earlier if the commercial situation had been less muddled; in any case, Xenix jumped
    the 286 ship ASAP.

    The verdict is: The only good use of the 286 is as a faster 8086;
    small memory model multi-tasking use is possible, but the 64KB
    segments are so limiting that everybody who understood software either decided to skip this twist (MicroSoft, except on their OS/2 death
    march), or jumped ship ASAP (SCO).

    As I mentioned above, I do not believe your claim about Microsoft.
    There were DOS extenders which allowed use of 286 protected mode
    under DOS. They were used by several software vendors. Clearly,
    programming for flat 32-bit mode is easier, and on the software market
    that matters more than other factors.

    I think that 286 protected mode is good for its intended use, that
    is protected multitasking systems having more than 64 kB but less
    than, say, 4 MB. Of course, if you have a lot of hardware resources,
    then a 32-bit system using paging may be easier to create. Also,
    speed is tricky: on the 486 (and possibly the 386) the hardware task
    switch was easy to use, but slower than a tuned purely software
    implementation. In other parts reloading of segment registers
    could slow down things quite a lot, so 16-bit protected mode
    required a lot of tuning to minimize the number of times when
    segment registers were reloaded.

    I do not know if people used the 286 in this way, but a natural use
    of the 286 is as a debugger for 8086 programs. That is, use segment
    protection to catch stray accesses. Once the program works OK,
    deliver it as a real-mode program on the 8086, gaining speed and
    a bigger market.

    AFAIK Linux started out using 32-bit mode but heavily depending on
    386 segmentation. Rather quickly the dependence on segments was
    limited, and what remained was well isolated. But I think that
    Linux shows that _creating_ a simple multitasking system is
    easier using the hardware properties that come along with 286
    segmentation.

    Intel misjudged what is typical in programs. But they were not
    alone in this. I have a translation of Tanenbaum's book on computer
    architecture from 1976 (the original; the translation is from 1983).
    Tanenbaum is very positive about segmentation, descriptors and
    "high level machines". He gave simple examples where descriptors
    and a microprogrammed "high level machine" are supposed to give
    better performance than a more conventional machine.

    And as I already wrote, Intel misjudged the market for the 286. They
    could guess that a 286 system would be too expensive for the home
    market for a long time. They probably did not expect that
    the 286 would find its way into PCs.

    More generally, vendors could release separate versions of
    programs for 8086 and 286 but few did so.

    Were there any who released software both as 8086 and a protected-mode
    80286 variants? Microsoft/SCO with Xenix, anyone else?

    IIUC Microsoft with Windows up to 3.0, and probably everybody who wanted
    to say "supported on Windows". That is, Windows 3.0 on a 286 almost
    surely used 286 protected mode and probably ran "Windows" programs
    in protected mode. But Windows also supported the 8086, and Microsoft
    guidelines insisted that a proper "Windows program" should run on
    the 8086.

    On DOS I do not remember the names of specific programs. I remember
    Phar Lap, who provided a 286 DOS extender, and quite a few programs
    used it. Browsing through binaries on machines that I used I saw
    the name several times. Sometimes a program using a DOS extender
    would clearly say that it requires a 286, but I vaguely remember
    cases with separate 286 binaries and 8086 binaries where startup
    code loaded the right binary. Probably there were also cases where
    the needed switching was hidden inside a single binary.

    And users having
    only binaries wanted to use 8086 on their new systems which
    led to heroic efforts like OS/2 DOS box and later Linux
    dosemu. But integration of 8086 programs with protected
    mode was solved too late for 286 model to gain traction
    (and on 286 "DOS box" had to run in real mode, breaking
    normal system protection).

    Linux never ran on a 80286, and DOSemu uses the virtual 8086 mode,
    which does not require heroic efforts AFAIK.

    Well, besides virtual 8086 mode there is tricky code to get
    the right effect. A lot of late "DOS" programs depended on DOS
    extenders, and a significant fraction of such programs run fine
    under dosemu. I do not know if Windows ever got its DOS box
    to the level of dosemu, but when I used dosemu I heard that
    various things did not work in the Windows DOS box.

    There was various segmented hardware around, first and foremost (for
    the designers of the 80286), the iAPX432. And as you write, all the
    good reasons that resulted in segments on the iAPX432 also persisted
    in the 80286. However, given the slowness of segmentation, only the
    tiny (all in one segment), small (one segment for code and one for
    data), and maybe medium memory models (one data segment) are
    competitive in protected mode compared to real mode.

    AFAICS that covered the vast majority of programs during the eighties.

    The "vast majority" is not enough; if a key application like Lotus
    1-2-3 or Wordperfect did not work on the DOS alternative, the DOS
    alternative was not used. And Lotus 1-2-3 and Wordperfect certainly
    did not limit themselves to 64KB of data.

    I do not know if they offered protected-mode versions. But it
    is likely that they did once machines with more than 640 kB
    formed a reasonable fraction of the PC market.

    Turbo Pascal offered only medium memory model

    According to Terje Mathisen, it also offered the large memory model.
    On its Wikipedia page, I find: "Besides allowing applications larger
    than 64 KB, Byte in 1988 reported ... for version 4.0". So apparently
    Turbo Pascal 4.0 introduced support for the large memory model in
    1988.

    I am not entirely sure, but probably I used 4.0. I certainly used
    5.0 and later versions. AFAIR all versions that I used limited
    "static" data to 64 kB, which together with no such limit for code
    I take as the definition of the "medium" model. I do not remember the
    explicit model switches which were common on PC C compilers. PC compilers
    allowed far/near qualifiers on pointers and I do not remember
    significant restrictions on this (but other folks reported that
    some combinations did not work): for data the model set defaults,
    but the programmer could override them. So in Turbo Pascal one could
    use large pointers if desired (or maybe even by default), but
    static data was in a single 64 kB segment.

    Intel apparently assumed that programmers are willing to spend
    extra work to get good performance and IMO this was right
    as a general statement. Intel probably did not realize that
    programmers will be very reluctant to spend work on security
    features and in particular to spend work on making programs
    fast in 286 protected mode.

    80286 protected mode is never faster than real mode on the same CPU,
    so the way to make programs fast on the 286 is to stick with real
    mode; using the small memory model is an alternative, but as
    mentioned, the memory limits are too restrictive.

    Well, if a program needs more than 1 MB total, workarounds on the 286
    may be more expensive than the cost of protected mode. More to
    the point, if one needs security features, then doing them
    in real mode via software is likely to take more time than a 286
    version. Intel clearly did not anticipate that a large fraction
    of 286s would be used in PCs and that in PCs vendors/developers
    would prefer the speed gain (modest when the protected-mode version
    has enough tuning) to protection.

    Intel probably assumed that the 286 would cover most needs,

    As far as protected mode was concerned, they hardly could have been
    more wrong.

    especially
    given that most systems had much less memory than the 16 MB theoretically
    allowed by the 286.

    They provided 24 address pins, so they obviously assumed that there
    would be 80286 systems with >8MB. 64KB segments are already too
    limiting on systems with 1MB (which was supported by the 8086),
    probably even for anything beyond 128KB.

    IMO this is partially true: there
    is a class of programs which with some work fit into medium
    model, but using flat address space is easier. I think that
    on 286 (that is with 16 bit bus) those programs (assuming enough
    tuning) run faster than flat 32-bit version.

    Maybe in real mode. Certainly not in protected mode. Just run your
    tuned large-model protected-mode program against a 32-bit small-model
    program for the same task on a 386SX (which is reported as having a
    very similar speed to the 80286 on 16-bit programs).

    My instruction table shows _longer_ times for several instructions
    on the 386 compared to the 286. For example a real-mode far call on the
    286 has 13 clocks + penalty, on the 386 17 clocks + the same penalty;
    a protected-mode call on the 286 has 26 clocks + penalty, on the 386
    34 clocks + penalty. A near call on both is 7 clocks + penalty.

    Anyway, if a program consists of several procedures (or clusters
    of closely related procedures) each having a few kilobytes, then
    it can easily happen that there are thousands of instructions
    between far calls, so the cost of far calls is going to be
    negligible (19 clocks per thousands of instructions). If the
    program manages to do its work in a single 64 kB data segment (not
    unreasonable for 1 MB of code), then it will be faster than a
    program using 32-bit addresses. More relevantly, in a multitasking
    situation with each task having its own data segment there
    will be reloading of segment registers on task switch,
    which is likely to be negligible. Again, each task will
    gain due to smaller pointers. With an OS present there will
    be segment reloading due to system calls and this may
    be more significant. However, this is mostly due to protection
    and not segmentation.
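
    [Aside: a back-of-the-envelope check of that claim in C, using the cycle
    counts quoted earlier in the thread; the instruction count between far
    calls and the 2-clocks-per-instruction average are assumptions for
    illustration only.]

        #include <stdio.h>

        int main(void)
        {
            /* 286 numbers quoted above: near call 7 + m clocks, protected-mode
             * far call 26 + m clocks, so a far call costs about 19 extra clocks.
             * Assume a few thousand instructions between far calls at roughly
             * 2 clocks each (assumed average, not a measurement). */
            const double extra_clocks    = 26.0 - 7.0;
            const double insns_between   = 2000.0;
            const double clocks_per_insn = 2.0;

            double overhead = extra_clocks / (insns_between * clocks_per_insn);
            printf("far-call overhead: about %.2f%% of run time\n", overhead * 100.0);
            return 0;
        }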

    And even if you
    find one case where the protected-mode program wins, nobody found it
    worth their time to do this nonsense.

    That is largely true. I wonder what will happen with the x32 mode
    on x86_64. AFAIK x32 mode showed measurable gains,
    20-30% smaller programs and similar speed gains. In principle
    it should be cheap to support, as it is "just another 32-bit
    target". But some (for me important) programs do not work
    in this mode and there are voices calling to drop it completely.

    And so OS/2 flopped despite
    being backed by IBM and, until 1990, Microsoft.

    But I think that Intel segmentation had some
    attractive features during eighties.

    You are one of a tiny minority. Even Intel finally saw the light, as
    did everybody else, and nowadays segments are just a bad memory.

    Well, 16-bit segments clearly are too limited when one has several
    megabytes of memory. And a consistently 32-bit segmented system
    increases memory use, which is a nontrivial cost. OTOH there is the
    question of how much customers are going to pay for security
    features. I think recent times show that security has significant
    costs. But lack of security may lead to big losses. So
    there is no easy choice.

    Now people talk more about capabilities. AFAICS capabilities
    offer more than segments, but are going to have a higher cost.
    So abstractly, for some systems segments may still look
    attractive. OTOH we now understand that the software ecosystem
    is much more varied than the prevalent view in the seventies
    allowed, so systems that fit segments well are a tiny part.

    And speaking of bad memories, do you remember PAE? That had
    a similar spirit to 8086 segmentation. I guess that due
    to the bad feeling about segments among programmers (and possibly,
    more relevantly, compatibility troubles) Intel did not extend
    this to segments, but the spirit was still there.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Mathisen on Mon Jan 6 00:35:00 2025
    In article <vlervh$174vb$1@dont-email.me>, terje.mathisen@tmsw.no (Terje Mathisen) wrote:

    I do believe that IBM did seriously consider the risk of making the
    PC too good, so that it would compete directly with their low-end
    systems (8100?).

    I recall reading back in the 1980s that the PC was intended to be
    incapable of competing with the System/36 minis, and the previous
    System/34 and /32 machines. It rather failed at that.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Dallman on Mon Jan 6 03:02:22 2025
    On Thu, 1 Jan 1970 0:00:00 +0000, John Dallman wrote:

    In article <vlervh$174vb$1@dont-email.me>, terje.mathisen@tmsw.no (Terje Mathisen) wrote:

    I do believe that IBM did seriously consider the risk of making the
    PC too good, so that it would compete directly with their low-end
    systems (8100?).

    I recall reading back in the 1980s that the PC was intended to be
    incapable of competing with the System/36 minis, and the previous
    System/34 and /32 machines. It rather failed at that.

    Perhaps IBM should have made them more performant !?!

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to Hebisch on Sun Jan 5 23:01:29 2025
    On Sun, 5 Jan 2025 21:49:20 -0000 (UTC), antispam@fricas.org (Waldek
    Hebisch) wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    Meanwhile, Microsoft introduced Windows/386 in September 1987 (in
    addition to the base (8086) variant of Windows 2.0, which was released
    in December 1987), which used 386 protected mode and virtual 8086 mode
    (which was missing in the "brain-damaged" (Bill Gates) 286). So
    Windows completely ignored 286 protected mode. Windows eventually
    became a big success.

    What I recall is a bit different. IIRC first successful version of
    Windows, that is Windows 3.0 had 3 modes of operation: 8086 compatible,
    286 protected mode and 386 protected mode. Only later Microsoft
    dropped requirement for 8086 compatiblity.

    They didn't drop the 8086 so much as require the 386. Windows and "DOS box"
    required the CPU to have "virtual 8086" mode.

    I think still later it dropped 286 support.

    I know 286 protected mode support continued at least through NT. Not
    sure about 2K.


    Windows 95 was supposed to be 32-bit, but contained quite a lot
    of 16-bit code.

    The GUI was 32-bit, the kernel and drivers were 16-bit. Weird, but it
    made some hardware interfacing easier.

    IIRC the system interface to Windows 3.0 and 3.1 was 16-bit, and only
    later did Microsoft release an extension allowing 32-bit system calls.

    Never programmed 3.0.

    3.1 and 3.11 (WfW) had a combination 16/32-bit kernel in which most
    device drivers were 16-bit, but the disk driver could be either 16 or
    32 bit. In WfW the network stack also was 32-bit and the NIC driver
    could be either.

    However the GUI in all 3.x versions was 16-bit 286 protected mode.

    You could run 32-bit "Win32s" programs (Win32s being a subset of
    Win32), but Win32s programs could not use graphics.


    I have no information about Windows internals except for some
    public statements by Microsoft and other people, but I think
    it reasonable to assume that Windows was actually a successful
    example of 8086/286/386 compatibility. That is, their 16-bit
    code could use real-mode segmentation or protected-mode
    segmentation, the latter on both the 286 and the 386. For the
    32-bit version they added a translation layer to transform
    arguments between the 16-bit world and the 32-bit world. It is
    possible that this translation layer involved a lot of effort.

    For a number of years I worked on Windows based image processing
    systems that used OTS ISA-bus acceleration hardware. The drivers were
    16-bit DLLs, and /non-reentrant/. There was one "general" purpose
    board and several special purpose boards that could be combined with
    the general board in "stacks" that communicated via a private high
    speed bus. There could be multiple stacks of boards in the same
    system.

    [Our most complicated system had 7 boards in 2 stacks, one with 5
    boards and the other with 2. Our biggest system had 18 boards: 6
    stacks of 3 boards each. Ever see a 20 slot ISA backplane?]

    The non-reentrant driver made it difficult to simultaneously control
    separate stacks to do different tasks. We created a (reentrant)
    16 bit dispatching "thunk" DLL to translate calls for every
    function of every board that we might possibly want to use ...
    hundreds in all ... and then dynamically loaded multiple instances of
    the driver as required. PITA !!! Worked fine but very hard to debug, particularly when doing several different operations simultaneously.

    On 3.x we simulated threading in the shared 16-bit application space
    using multiple processes, messaging with hidden windows, and "far
    call" IPC using the main program as a kind of "shared library". Having
    real threads on 95 and later allowed actually consolidating everything
    into the same program and (at least initially) made everything easier.
    But then NT forced dealing with protected mode interrupts, while at
    the same time still using 16-bit drivers for everything else - and
    that became yet another PITA.

    We continued to use the image hardware until SIMD became fast enough
    to compete (circa GHz Pentium4 being available on SBC). Excepting
    NT3.x we had systems based on every Windows from 3.1 to NT4.


    Anyway, it seems that Windows was at least as tied to the 286
    as OS/2 when it became successful, and dropped 286 support
    later. And for a long time after dropping 286 support
    Windows still made massive use of 16-bit segments.

    I don't know exactly when 286 protected mode was dropped. I do know
    that, at least through NT4, 16-bit DOS mode and GUI applications would
    run so long as they relied on system calls and didn't directly try to
    touch hardware.

    I occasionally needed to run 16-bit VC++ on my NT4 machine.


    IIUC Microsoft supported the 8086 in Windows up to 3.0, and probably
    so did everybody who wanted to say "supported on Windows". That is,
    Windows 3.0 on a 286 almost surely used 286 protected mode and
    probably ran "Windows" programs in protected mode. But Windows also
    supported the 8086, and Microsoft guidelines insisted that a proper
    "Windows program" should run on the 8086.

    Yes. I used - but never programmed - 3.0 on a V20 (8086 clone). It
    was painfully slow even with 1MB of RAM.


    ... Even Intel finally saw the light, as
    did everybody else, and nowadays segments are just a bad memory.

    Well, 16-bit segments clearly are too limited when one has several
    megabytes of memory. And a consistently 32-bit segmented system
    increases memory use, which is a nontrivial cost. OTOH there is
    the question of how much customers are going to pay for security
    features. I think recent times show that security has significant
    costs. But lack of security may lead to big losses. So
    there is no easy choice.

    Now people talk more about capabilities. AFAICS capabilities
    offer more than segments, but are going to have a higher cost.
    So, abstractly, for some systems segments may still look
    attractive. OTOH we now understand that the software ecosystem
    is much more varied than the prevalent view in the seventies
    assumed, so systems that fit segments well are a tiny part.

    And speaking of bad memories, do you remember PAE? That had
    a similar spirit to 8086 segmentation. I guess that due
    to programmers' bad feelings about segments (and possibly
    more relevant compatibility troubles) Intel did not extend
    this to segments, but the spirit was still there.

    The bad taste of segments is from exposure to Intel's half-assed
    implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    Intel had a chance to do it right with the 386, but instead they
    doubled down and expanded the existing poor implementation to support
    larger segments.

    I realize that transistor counts at the time might have made an
    on-chip SMU impossible, but ISTM the SMU would have been a very small
    component that (if necessary) could have been implemented on-die as a coprocessor.
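    For concreteness, here is a minimal software model of that idea: flat
    32-bit addresses, with every access checked against a sorted "segment
    database" of base/limit/permission entries. All names are made up for
    the illustration; this is a sketch of the lookup, not a real design.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    struct seg_entry {                     /* one entry in the segment "database" */
        uint32_t base;                     /* first byte covered by the segment   */
        uint32_t limit;                    /* last byte covered by the segment    */
        unsigned writable : 1;
    };

    /* The hypothetical SMU would do this lookup in hardware on every access. */
    static const struct seg_entry *
    smu_lookup(const struct seg_entry *tab, size_t n, uint32_t addr)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {                  /* binary search, entries sorted by base */
            size_t mid = lo + (hi - lo) / 2;
            if (addr < tab[mid].base)
                hi = mid;
            else if (addr > tab[mid].limit)
                lo = mid + 1;
            else
                return &tab[mid];          /* address falls inside this segment */
        }
        return NULL;                       /* no segment covers the address: fault */
    }

    int main(void)
    {
        static const struct seg_entry tab[] = {
            { 0x00001000u, 0x00001fffu, 0 },   /* "code" */
            { 0x00100000u, 0x0017ffffu, 1 },   /* "heap" */
        };
        uint32_t probe[] = { 0x00001234u, 0x00180000u };
        for (size_t i = 0; i < 2; i++) {
            const struct seg_entry *s = smu_lookup(tab, 2, probe[i]);
            printf("0x%08lx -> %s\n", (unsigned long)probe[i],
                   s ? "ok" : "segment fault");
        }
        return 0;
    }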

    <grin>Maybe my de-deuces are wild ...</grin>
    but there they are nonetheless.

    YMMV.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to George Neuner on Mon Jan 6 08:24:43 2025
    George Neuner <gneuner2@comcast.net> writes:
    The bad taste of segments is from exposure to Intel's half-assed implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    What benefits do you expect from segments? One benefit usually
    mentioned is security, in particular, protection against out-of-bounds
    accesses (e.g., buffer overflow exploits).

    If the program uses 32-bit (or nowadays 64-bit) addresses, and the
    segment number is just part of that, you don't get that protection: An out-of-bounds access could not be distinguished from a valid access to
    a different segment. There might be some addresses that are in no
    segment, and that would lead to a segment violation, but the same is
    true for paging-based "security" now; the important part is that there
    would be (and is) no guarantee that an out-of-bounds access is caught.

    The 286 segments catch out-of-segment accesses. The size granularity
    of the 386's 32-bit segments is coarse, but at least out-of-bounds
    accesses do not intrude into other segments.

    On the 286 and 386 segment numbers are stored in memory just like any
    other data, so an attacker may be able to change the segment number in
    addition to (or instead of) the offset, and thus gain access to
    sensitive data, so the security provided by 286/386 segments is
    limited. I have not looked closely into CHERI, but I dimly remember
    some claims that they protect against manipulation of the extra data
    (what would be the segment number in the 286) in the 128-bit address.

    Intel had a chance to do it right with the 386, but instead they
    doubled down and expanded the existing poor implementation to support
    larger segments.

    It looks to me that they took the right choices: Support 286 protected
    mode, add virtual 8086 mode, support a flat memory model like
    everybody else has done in modern computers (S/360, PDP-11); to
    combine these requirements, they added support for segments up to 4GB
    in size, so people wanting to use flat 32-bit addressing could just
    use the tiny memory model (CS=DS=SS) and forget about segments.

    I realize that transistor counts at the time might have made an
    on-chip SMU impossible, but ISTM the SMU would have been a very small component that (if necessary) could have been implemented on-die as a coprocessor.

    How would the addresses be divided into segment and offset in your
    model? What kind of addresses would you have used on the 286? What
    would the SMU have to do? Would a PC have used such an SMU if it was
    a separate chip?

    If they had made the 286 a kind of real-mode-only 386SX-like CPU, I
    think that PCs would have been designed without SMU. And one problem
    would have been that you probably would want 32 address bits to flow
    from the CPU to the SMU, but the 286 and 386SX only have 24 address
    pins, and additional pins are expensive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Mon Jan 6 14:41:22 2025
    On Mon, 06 Jan 2025 08:24:43 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    How would the addresses be divided into segment and offset in your
    model? What would the SMU have to do?


    - anton

    Those are the sort of questions that I asked Nick Maclaren several times in the past, when he was still active on c.a. I never got an answer that I
    was able to understand.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Mon Jan 6 15:19:32 2025
    On Mon, 6 Jan 2025 03:02:22 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Thu, 1 Jan 1970 0:00:00 +0000, John Dallman wrote:

    In article <vlervh$174vb$1@dont-email.me>, terje.mathisen@tmsw.no
    (Terje Mathisen) wrote:

    I do believe that IBM did seriously consider the risk of making the
    PC too good, so that it would compete directly with their low-end
    systems (8100?).

    I recall reading back in the 1980s that the PC was intended to be
    incapable of competing with the System/36 minis, and the previous
    System/34 and /32 machines. It rather failed at that.

    Perhaps IBM should have made them more performant !?!


    Impossible. A more performant S/36 would undermine the S/38.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Jan 6 16:05:02 2025
    Anton Ertl wrote:
    George Neuner <gneuner2@comcast.net> writes:
    The bad taste of segments is from exposure to Intel's half-assed
    implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    What benefits do you expect from segments? One benefit usually
    mentioned is security, in particular, protection against out-of-bounds accesses (e.g., buffer overflow exploits).

    The best idea I have seen to help detect out-of-bounds accesses is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last)
    allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)
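
    For concreteness, a minimal sketch of that skewed-pointer guard-page
    allocator (in the spirit of tools like Electric Fence), assuming POSIX
    mmap/mprotect; the function name is made up, and error handling and
    freeing are omitted:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *guard_alloc(size_t size)
    {
        size_t page   = (size_t)sysconf(_SC_PAGESIZE);
        size_t usable = (size + page - 1) & ~(page - 1);  /* round up to pages   */
        size_t total  = usable + page;                    /* plus one guard page */

        unsigned char *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        /* Make the page just past the usable region inaccessible. */
        if (mprotect(p + usable, page, PROT_NONE) != 0)
            return NULL;
        /* Skew the pointer so that byte size-1 is the last byte before the
           guard page.  Note the skewed pointer is not necessarily aligned
           for all types. */
        return p + (usable - size);
    }

    int main(void)
    {
        char *buf = guard_alloc(100);
        memset(buf, 'x', 100);     /* within bounds: fine                 */
        buf[100] = 'y';            /* one past the end: traps immediately */
        puts("not reached");
        return 0;
    }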

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Mon Jan 6 16:36:41 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last) allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    It also does not help for out-of-bounds accesses that are not just
    adjacent to an earlier in-bounds access. That may also be a less
    common vulnerability than adjacent positive-stride buffer overflows.
    But if we throw hardware on the problem, do we want to spend hardware
    on something that does not catch all out-of-bounds accesses?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Jan 6 18:58:16 2025
    According to George Neuner <gneuner2@comcast.net>:
    The bad taste of segments is from exposure to Intel's half-assed implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    The whole point of a segmented architecture is that the segments are visible and
    meaningful. You put a thing (for some definition of thing) in a segment to
    control access to the thing. So if it's an array, all of the address
    calculations are relative to the segment and out-of-bounds references fail
    because they point to a non-existent part of the segment. Similarly, if it's
    code, a jump outside the segment's boundaries fails.

    Multics and the Burroughs machines had (still have, I suppose, for emulated
    Burroughs) visible segments and programmers liked them just fine. The problems
    were that the segment sizes were too small as memories got bigger, and that
    they weren't byte-addressed, which these days is practically mandatory. The 286
    added the additional flaws that there weren't enough segment registers and that
    segment loads were very slow.

    What you're describing is multi-level page tables. Every virtual memory system has them. Sometimes the operating systems make the higher level tables visible to applications, sometimes they don't. For example, in IBM mainframes the second
    level page table entries, which they call segments, can be shared between applications.




    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jan 6 19:49:34 2025
    On Mon, 6 Jan 2025 16:36:41 +0000, Anton Ertl wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    The best idea I have seen to help detect out of bounds accesses, is to round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that the end of the requested region coincides with the end of the (last) allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    It also does not help for out-of-bounds accesses that are not just
    adjacent to an earlier in-bounds access. That may also be a less
    common vulnerability than adjacent positive-stride buffer overflows.
    But if we throw hardware on the problem, do we want to spend hardware
    on something that does not catch all out-of-bounds accesses?

    An IBM guy once told me::

    "If you are going to put it in HW, put it in in such a way that you
    never have to change the definition of what you put in.

    So, to answer the above question:: you want to check absolutely
    all boundaries on all multi-container data objects, including
    array bounds within a structure::

    struct { integer  a,b,c,d;
             double   l[max],m[max],n[max][max]; } k;

    Any access to m[] is checked to be within the substructure
    of m[*], so you cannot touch l[] or n[][], or a,b,c, or d.

    Try doing that with segmentation bounds checking...or
    capabilities...
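
    A purely software illustration of what that per-member check would
    mean (made-up helper name, plain C with an assert standing in for the
    hardware check):

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX 4

    struct k_t {
        int a, b, c, d;
        double l[MAX], m[MAX], n[MAX][MAX];
    };

    /* Checked accessor for k.m: the index is checked against the bound of
       the member itself, so an access cannot spill into l[] or n[][] or
       the scalars, even though they live in the same structure. */
    static double *k_m(struct k_t *k, size_t i)
    {
        assert(i < MAX);
        return &k->m[i];
    }

    int main(void)
    {
        struct k_t k = {0};
        *k_m(&k, 2) = 1.5;         /* fine             */
        printf("%g\n", k.m[2]);
        *k_m(&k, MAX) = 2.0;       /* trips the assert */
        return 0;
    }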

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Mon Jan 6 19:45:43 2025
    John Levine <johnl@taugh.com> writes:
    According to George Neuner <gneuner2@comcast.net>:
    The bad taste of segments is from exposure to Intel's half-assed implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    The whole point of a segmented architecture is that the segments are visible and
    meaningful. You put a thing (for some definition of thing) in a segment to control access to the thing. So if it's an array, all of the address calculations are relative to the segment and out of bounds references fail because they point to a non-existent part of the segment. Similarly, if it's code, a jump outside the segment's boundaries fails.

    Multics and the Burroughs machines had (still have, I suppose for emulated

    The original HP-3000 also had segments.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Mon Jan 6 19:41:49 2025
    On Mon, 6 Jan 2025 15:05:02 +0000, Terje Mathisen wrote:

    Anton Ertl wrote:
    George Neuner <gneuner2@comcast.net> writes:
    The bad taste of segments is from exposure to Intel's half-assed
    implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    What benefits do you expect from segments? One benefit usually
    mentioned is security, in particular, protection against out-of-bounds
    accesses (e.g., buffer overflow exploits).

    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last) allocated page.
    This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    You allocate no more actual memory, but you do consume an additional
    virtual address PTE on those pages marked no-access. If, later, you
    expand that allocated area, you can THEN allocate a page and update
    the PTE.

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    Use an unallocated page prior to the buffer, too.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to John Levine on Mon Jan 6 19:48:46 2025
    John Levine <johnl@taugh.com> writes:
    According to George Neuner <gneuner2@comcast.net>:
    The bad taste of segments is from exposure to Intel's half-assed implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    The whole point of a segmented architecture is that the segments are visible and
    meaningful. You put a thing (for some definition of thing) in a segment to control access to the thing. So if it's an array, all of the address calculations are relative to the segment and out of bounds references fail because they point to a non-existent part of the segment. Similarly, if it's code, a jump outside the segment's boundaries fails.

    Multics and the Burroughs machines had (still have, I suppose for emulated
    Burroughs) visible segments and programmers liked them just fine. The problems
    were that the segment sizes were too small as memories got bigger, and that they
    weren't byte addressed which these days is practically mandatory. The 286 added
    additional flaws that there weren't enough segment registers and segment loads
    were very slow.

    What you're describing is multi-level page tables. Every virtual memory system
    has them. Sometimes the operating systems make the higher level tables visible
    to applications, sometimes they don't. For example, in IBM mainframes the second
    level page table entries, which they call segments, can be shared between
    applications.

    There have been a number of attempts to use capabilities to describe
    individual data items (the aforementioned Burrougsh systems are the
    canonical examples).

    There are investigations into adapting such schemes to modern
    microprocessors, one of which is CHERI which uses 128-bit
    pointers to encode various attributes, including the size
    of the object.

    https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Mon Jan 6 22:02:30 2025
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last) allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    It is also problematic to allocate 8K (or more) for a small entity, or
    on the stack.

    Bounds checking should ideally impart minimum overhead so that it
    can be enabled in production code.

    Hmm... a beginning of an idea (for which I am ready to be shot
    down, this is comp.arch :-)

    This would work best for languages which explicitly pass
    array bounds or sizes (like Fortran's assumed size arrays,
    or, if I read this correctly, Rust's slices).

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    Memory access is to base + index, with one additional point:
    If index > ubound, then the instruction raises an exception.

    This works less well with C's pointers, for which you would have
    to pass some sort of fat pointer. Compilers would have to make
    sure that the address of the base object is passed.
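
    A plain-C rendering of what such a checked load would do (names made
    up; the "exception" is just a trap function here):

    #include <stdio.h>
    #include <stdlib.h>

    static void bounds_trap(size_t index, size_t ubound)
    {
        fprintf(stderr, "bounds violation: index %zu > ubound %zu\n",
                index, ubound);
        abort();
    }

    /* Access goes to base + index; index > ubound raises the exception. */
    static double checked_load(const double *base, size_t index, size_t ubound)
    {
        if (index > ubound)
            bounds_trap(index, ubound);
        return base[index];
    }

    int main(void)
    {
        double a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        printf("%g\n", checked_load(a, 3, 7));   /* ok    */
        printf("%g\n", checked_load(a, 8, 7));   /* traps */
        return 0;
    }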

    Comments?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Mon Jan 6 22:57:11 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last)
    allocated page. This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    It is also problematic to allocate 8K (or more) for a small entity, or
    on the stack.

    Bounds checking should ideally impart minimum overhead so that it
    can be enabled in production code.

    Hmm... a beginning of an idea (for which I am ready to be shot
    down, this is comp.arch :-)

    This would work best for languages which explicitly pass
    array bounds or sizes (like Fortran's assumed size arrays,
    or, if I read this correctly, Rust's slices).

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Mon Jan 6 23:41:41 2025
    On Mon, 6 Jan 2025 22:02:30 +0000, Thomas Koenig wrote:

    Hmm... a beginning of an idea (for which I am ready to be shot
    down, this is comp.arch :-)

    This would work best for languages which explicitly pass
    array bounds or sizes (like Fortran's assumed size arrays,
    or, if I read this correctly, Rust's slices).

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    Memory access is to base + index, with one additional point:
    If index > ubound, then the instruction raises an exception.

    Now, you are only checking the ubound and not the lbound; so,
    you only stumble over ½ the bound errors.

    Where you should START is with a data structure that defines
    the memory region::

    First Byte accessible     Possibly lbound
    Last  Byte accessible     Possibly ubound
    other stuff as needed

    Then figure out how to efficiently perform the checks in ISA
    of choice (or add to ISA).

    This works less well with C's pointers, for which you would have
    to pass some sort of fat pointer. Compilers would have to make
    sure that the address of the base object is passed.

    I blame the programmers for not using FAT pointers (and then
    teaching the compilers how to get rid of most of the checks).
    Nothing is preventing C programmers from using FAT pointers,
    and thereby avoiding all those buffer overruns.
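
    One possible shape of such a fat pointer in plain C (struct and helper
    names invented for the sketch; a compiler that knew about the layout
    could hoist or elide most of the checks):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        double *ptr;       /* current position          */
        double *base;      /* first accessible element  */
        double *limit;     /* one past the last element */
    } fat_double_ptr;

    static double fat_load(fat_double_ptr p)
    {
        if (p.ptr < p.base || p.ptr >= p.limit) {
            fprintf(stderr, "fat pointer out of bounds\n");
            abort();
        }
        return *p.ptr;
    }

    int main(void)
    {
        double a[4] = { 10, 20, 30, 40 };
        fat_double_ptr p = { a, a, a + 4 };

        p.ptr = a + 2;
        printf("%g\n", fat_load(p));   /* ok    */
        p.ptr = a + 4;
        printf("%g\n", fat_load(p));   /* traps */
        return 0;
    }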

    Comments?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lynn Wheeler@21:1/5 to John Levine on Mon Jan 6 17:28:11 2025
    John Levine <johnl@taugh.com> writes:
    What you're describing is multi-level page tables. Every virtual
    memory system has them. Sometimes the operating systems make the
    higher level tables visible to applications, sometimes they don't. For example, in IBM mainframes the second level page table entries, which
    they call segments, can be shared between applications.

    initial adding virtual memory to all IBM 370s was similar to 24bit
    360/67 but had options for 16 1mbyte segments or 256 64kbyte segments
    and either 4kbyte or 2kbyte pages. Initial mapping of 360 MVT to VS2/SVS
    was single 16mbyte address space ... very similar to running MVT in a
    CP/67 16mbyte virtual machine.

    The upgrade to VS2/MVS gave each region its own 16mbyte virtual address
    space. However, OS/360 MVT API heritage was pointer passing API ... so
    they mapped a common 8mbyte image of the "MVS" kernel into every 16mbyte virtual address space (leaving 8mbytes for application code), kernel API
    call code could still directly access user code API parameters
    (basically same code from MVT days).

    However, MVT subsystems were also moved into their separate 16mbyte
    virtual address space ... making it harder to access application API
    calling parameters. So they defined a common segment area (CSA), 1mbyte
    segment mapped into every 16mbyte virtual address space, application
    code would get space in the CSA for API parameter information when calling subsystems.

    The problem was that the requirement for subsystem API parameter (CSA) space was proportional to the number of concurrent applications plus the number of
    subsystems and quickly exceeded 1mbyte ... and it morphed into the
    multi-megabyte common system area. By the end of the 70s, CSAs were
    running 5-6mbytes (leaving 2-3mbytes for programs) and threatening to
    become 8mbytes (leaving zero mbytes for programs)... part of the mad
    rush to XA/370 and 31-bit virtual addressing (as well as access
    registers, and multiple concurrent virtual address spaces ... "Program
    Call" instruction had a table of MVS/XA address space pointers for
    subsystems; the PC instruction would move the caller's address space
    pointer to secondary and load the subsystem address space pointer into
    primary ... the program return instruction reversed the process and moved
    the secondary pointer back to primary).

    --
    virtualization experience starting Jan1968, online at home since Mar1970

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 7 11:05:20 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERI targets C, which, on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating-point-style size encoding is weird (and also does not
    catch all errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
       do i=1,11
          a(i,j) = 42.
       end do
    end do

    interact with CHERI if a were a 10*10 array? Would it be
    necessary to create a capability for a(:,j)?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to All on Tue Jan 7 11:45:25 2025
    MitchAlsup1 wrote:
    On Mon, 6 Jan 2025 15:05:02 +0000, Terje Mathisen wrote:

    Anton Ertl wrote:
    George Neuner <gneuner2@comcast.net> writes:
    The bad taste of segments is from exposure to Intel's half-assed
    implementation which exposed the segment selector as part of the
    address.

    Segments /should/ have been implemented similar to the way paging is
    done: the program using flat 32-bit addresses and the MMU (SMU?)
    consulting some kind of segment "database" [using the term loosely].

    What benefits do you expect from segments?  One benefit usually
    mentioned is security, in particular, protection against out-of-bounds
    accesses (e.g., buffer overflow exploits).

    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last)
    allocated page.
                   This does require at least 8kB for every allocation, but
    I guess they can all share a single trapping segment?

    You allocate no more actual memory, but you do consume an additional
    virtual address PTE on those pages marked no-access. If, later, you
    expand that allocated area, you can THEN allocate a page and update
    the PTE.

    (This idea does not help locate negative buffer overruns (underruns?)
    but they seem to be much less common?)

    Use an unallocated page prior to the buffer, too.

    Yeah, of course, but you do lose the exact trap ability.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Jan 7 10:53:17 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Mon, 6 Jan 2025 22:02:30 +0000, Thomas Koenig wrote:

    Hmm... a beginning of an idea (for which I am ready to be shot
    down, this is comp.arch :-)

    This would work best for languages which explicitly pass
    array bounds or sizes (like Fortran's assumed size arrays,
    or, if I read this correctly, Rust's slices).

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    Memory access is to base + index, with one additional point:
    If index > ubound, then the instruction raises an exception.

    Now, you are only checking the ubound and not the lbound; so,
    you only stumble over ½ the bound errors.

    If the base register does point to the start of the entity,
    that would also be covered, at least for one-dimensional
    arrays.


    Where you should START is with a data structure that defines
    the memory region::

    First Byte accessible Possibly lbound
    Last Byte accessible Possibly ubound
    other stuff as needed

    Then figure out how to efficiently perform the checks in ISA
    of choice (or add to ISA).

    One such example is defined in the Fortran standard, in the C
    descriptors from ISO_Fortran_binding.h. There are two data
    structures: the CFI_dim_t structure, which describes (in integer
    variables) the lower bound, the extent (a.k.a. the number of elements)
    and the stride. The CFI_cdesc_t structure then describes the
    base address (a void *), the length of an individual element,
    the version, the rank, the type, several attributes (is it
    allocatable or a pointer) and the number of dimensions.

    You can see an example at

    https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=libgfortran/ISO_Fortran_binding.h;hb=refs/heads/master

    (Unfortunately, for historical reasons, gfortran uses another
    format internally for array descriptors).
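
    For concreteness, a small C sketch that uses such a descriptor to
    bounds-check an element access. It assumes the standard
    ISO_Fortran_binding.h interface (CFI_CDESC_T, CFI_establish, base_addr,
    dim[].lower_bound / extent / sm) and a compiler installation that ships
    the header; checked_get2 itself is just an illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include "ISO_Fortran_binding.h"

    /* Fetch a(i,j) from a rank-2 float descriptor, checking both
       subscripts against the bounds recorded in the descriptor. */
    static float checked_get2(const CFI_cdesc_t *a, CFI_index_t i, CFI_index_t j)
    {
        CFI_index_t sub[2] = { i, j };
        for (int d = 0; d < 2; d++) {
            CFI_index_t lb = a->dim[d].lower_bound;
            CFI_index_t ub = lb + a->dim[d].extent - 1;
            if (sub[d] < lb || sub[d] > ub) {
                fprintf(stderr, "subscript %d out of range\n", d + 1);
                abort();
            }
        }
        const char *p = (const char *)a->base_addr
                        + (i - a->dim[0].lower_bound) * a->dim[0].sm
                        + (j - a->dim[1].lower_bound) * a->dim[1].sm;
        return *(const float *)p;
    }

    int main(void)
    {
        float data[6] = { 42.0f };         /* storage for a 2x3 array */
        CFI_CDESC_T(2) desc;
        CFI_index_t extents[2] = { 2, 3 };

        /* CFI_establish sets the lower bounds to 0 for CFI_attribute_other. */
        CFI_establish((CFI_cdesc_t *)&desc, data, CFI_attribute_other,
                      CFI_type_float, sizeof(float), 2, extents);
        printf("%g\n", checked_get2((CFI_cdesc_t *)&desc, 0, 0));  /* 42    */
        printf("%g\n", checked_get2((CFI_cdesc_t *)&desc, 2, 0));  /* traps */
        return 0;
    }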

    Hmm... let's look at a simplified example.

    void bounds_error (const char *fmt, ...) __attribute__ ((format (printf, 1,2)))
                                             __attribute__ ((noreturn));

    void set_element (double *a, unsigned long lower, unsigned long upper,
                      unsigned long n)
    {
      if (n < lower || n > upper)
        bounds_error ("Error: %lu not between %lu and %lu", n, lower, upper);

      a[n - lower] = 1.0;
    }

    it is hard to avoid two comparisons and branches without having
    some sort of range comparison, something like

        cmpr  Rdst,Rsrc,Rlow,Rhigh

    which would then set condition bits according to the different
    ranges that Rsrc can be in.
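
    One standard software idiom gets close to this with a single
    compare-and-branch: subtract the lower bound and compare unsigned, so
    that a value below the lower bound wraps around to a huge number. A
    minimal variant of the set_element example above (assuming
    upper >= lower):

    #include <stdio.h>
    #include <stdlib.h>

    static void set_element (double *a, unsigned long lower, unsigned long upper,
                             unsigned long n)
    {
      /* If n < lower, n - lower wraps to a huge value, so one unsigned
         comparison covers both out-of-range cases. */
      if (n - lower > upper - lower)
        {
          fprintf (stderr, "Error: %lu not between %lu and %lu\n",
                   n, lower, upper);
          exit (EXIT_FAILURE);
        }
      a[n - lower] = 1.0;
    }

    int main (void)
    {
      double a[10];
      set_element (a, 5, 14, 7);    /* in range: sets a[2]   */
      set_element (a, 5, 14, 3);    /* below lower: rejected */
      return 0;
    }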

    This works less well with C's pointers, for which you would have
    to pass some sort of fat pointer. Compilers would have to make
    sure that the address of the base object is passed.

    I blame the programmers for not using FAT pointers (and then
    teaching the compilers how to get rid of most of the checks.)
    Nothing is preventing C programmers from using FAT pointers,
    and thereby avoid all those buffer overruns.

    They still have to do it by hand, it is much easier to do if
    the language they use would offer it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Tue Jan 7 17:04:29 2025
    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then an out-of-bounds access like a(9,11) would not be caught.
    I don't know whether it has to be caught by Fortran rules. It is certainly
    caught in both Matlab and Octave. And Matlab has Fortran roots.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Jan 7 14:43:02 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Tue Jan 7 15:28:07 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly
    caught both in Matlab and in Octave. And Matlab has Fortran roots.

    A compiler is free to create row or column capabilities for C or
    FORTRAN if the goal is more than just memory safety.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 7 16:41:36 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly caught both in Matlab and in Octave. And Matlab has Fortran roots.

    And vice versa, Fortran 90 borrowed heavily from Matlab :-)

    Out-of-bounds accesses are an error in Fortran, but the language
    does not require that they be detected.

    A compiler is free to create row or column capabilities for C or
    FORTRAN if the goal is more than just memory safety.

    Would this be reasonably efficient?

    Looking at an extreme case, the straightforward matrix
    multiplication below (including some initialization so it's
    self-contained)

    program main
      implicit none
      real a(0:2,0:2), b(0:2,0:2), c(0:2,0:2)
      integer i,j,k
      do i=0,2
         do j=0,2
            a(i,j) = 2*i + j
            b(i,j) = i - j
            c(i,j) = 0
         end do
      end do
      do j = 0, 2
         do k = 0,2
            do i = 0, 2
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
         end do
      end do
      print *,c
    end

    gives us, in f2c translation to C of the three nested matmul loops,

    for (j = 0; j <= 2; ++j) {
        for (k = 0; k <= 2; ++k) {
            for (i__ = 0; i__ <= 2; ++i__) {
                c__[i__ + j * 3] += a[i__ + k * 3] * b[k + j * 3];
            }
        }
    }

    (yes, any bounds checking should have been moved outside the loops :-)
    how could capabilities be used to detect bounds violations for all
    indices on each access?

    And what would the effort be? Can they be created in the time
    for a simple pointer addition, or is something from the OS required?
    Would this require something like a "pointer to pointer"?

    (I have to admit that I haven't read a lot about CHERI, but what I
    have read makes me suspect that the designers didn't really have
    multi-dimensional arrays in mind; but then neither did the
    designers of C, and CHERI is certainly C-centered.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Jan 7 20:16:57 2025
    On Tue, 7 Jan 2025 15:04:29 +0000, Michael S wrote:

    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly
    caught both in Matlab and in Octave. And Matlab has Fortran roots.

    WATFIV would catch a(9,11)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jan 7 21:26:11 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 7 Jan 2025 15:04:29 +0000, Michael S wrote:

    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly
    caught both in Matlab and in Octave. And Matlab has Fortran roots.

    WATFIV would catch a(9,11)

    So would any compiler that generates bounds checking code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Tue Jan 7 22:01:55 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 7 Jan 2025 15:04:29 +0000, Michael S wrote:

    On Tue, 07 Jan 2025 14:43:02 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Assume a class of load and store instructions containing

    - One source or destination register
    - One base register
    - One index register
    - One ubound register

    See aforementioned CHERI.

    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.

    The floating point size is weird (and also does not catch all
    errors, such as writing one element past a huge array).

    I haven't seen any consideration of how CHERI would integrate
    with languages which have multidimensional arrays. How would

    do j=1,10
    do i=1,11
    a(i,j) = 42.
    end do
    end do

    interact with Cheri if a was a 10*10 array ? Would it be
    necessary to create a capability for a(:,j)?

    A multidimensional array is a single contiguous
    blob of memory, the capability would encompass the
    entire region of memory containing the array.

    Then out of bound access like a(9,11) would not be caught.
    I don't know if it has to be caught by Fortran rules. It is certainly
    caught both in Matlab and in Octave. And Matlab has Fortran roots.

    WATFIV would catch a(9,11)

    Every Fortran compiler I know has bounds checking, but it is
    optional for all of them. People still leave it off for production
    code because it is too slow.

    It would really be great if the performance degradation was
    small enough so people would simply leave on the checks
    for production code. (Side question: What does CHERI
    do with SIMD-vectorized code?)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Jan 7 23:16:07 2025
    On Tue, 7 Jan 2025 22:01:55 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    -----------------
    WATFIV would catch a(9,11)

    Every Fortran compiler I know has a bounds checking, but it is
    optional for all of them. People still leave it off for production
    code because it is too slow.

    It would really be great if the performance degradation was
    small enough so people would simply leave on the checks
    for production code. (Side question: What does CHERI
    do with SIMD-vectorized code?)

    How long does it take SW to construct the kinds of slices
    a FORTRAN subroutine can hand off to another subroutine ??
    That is, a CHERI capability that allows access to every
    even byte in a structure but no odd byte in the same structure ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Jan 8 11:53:51 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 7 Jan 2025 22:01:55 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    -----------------
    WATFIV would catch a(9,11)

    Every Fortran compiler I know has a bounds checking, but it is
    optional for all of them. People still leave it off for production
    code because it is too slow.

    It would really be great if the performance degradation was
    small enough so people would simply leave on the checks
    for production code. (Side question: What does CHERI
    do with SIMD-vectorized code?)

    How long does it take SW to construct the kinds of slices
    a FORTRAN subroutine can hand off to another subroutine ??

    For the snippet

    subroutine bar
      interface
         subroutine foo(a)
           real, intent(in), dimension(:,:) :: a
         end subroutine foo
      end interface
      real, dimension(10,10) :: a
      call foo(a)
    end

    (which calls foo with an assumed-shape array) gfortran hands this
    to the middle end:

    __attribute__((fn spec (". ")))
    void bar ()
    {
      real(kind=4) a[100];

      {
        struct array02_real(kind=4) parm.0;

        parm.0.span = 4;
        parm.0.dtype = {.elem_len=4, .version=0, .rank=2, .type=3};
        parm.0.dim[0].lbound = 1;
        parm.0.dim[0].ubound = 10;
        parm.0.dim[0].stride = 1;
        parm.0.dim[1].lbound = 1;
        parm.0.dim[1].ubound = 10;
        parm.0.dim[1].stride = 10;
        parm.0.data = (void *) &a[0];
        parm.0.offset = -11;
        foo (&parm.0);
      }
    }


    The middle and back ends are then free to optimize.

    That is a CHERRI capability that allows for access to every
    even byte in a structure but no odd byte in the same structure ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Valencia@21:1/5 to Terje Mathisen on Sat Jan 11 13:59:21 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    The best idea I have seen to help detect out of bounds accesses, is to
    round all requested memory blocks up to the next 4K boundary and mark
    the next page as unavailable, then return a skewed pointer back, so that
    the end of the requested region coincides with the end of the (last) allocated page.

    I think I've mentioned this once before, but I did precisely this during my time at Sequent, and the C library blew up. Turned out the C string support routines were pulling in cache line lengths at a time, and it was such a win they didn't want to observe "strict" C string access rules. I assume they padded things such that no "real life" string could end up against a page boundary abutted to an invalid page address, but since they weren't
    interested in fixing it, I stopped worrying about it.

    A kinder, gentler time. I wonder if such things still lurk out there.

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    Fediverse: @vandys@goto.vsta.org

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Sat Jan 11 22:31:15 2025
    On Wed, 8 Jan 2025 11:53:51 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Tue, 7 Jan 2025 22:01:55 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    -----------------
    WATFIV would catch a(9,11)

    Every Fortran compiler I know has a bounds checking, but it is
    optional for all of them. People still leave it off for production
    code because it is too slow.

    It would really be great if the performance degradation was
    small enough so people would simply leave on the checks
    for production code. (Side question: What does CHERI
    do with SIMD-vectorized code?)

    How long does it take SW to construct the kinds of slices
    a FORTRAN subroutine can hand off to another subroutine ??

    For the snippet

    subroutine bar
    interface
    subroutine foo(a)
    real, intent(in), dimension(:,:) :: a
    end subroutine foo
    end interface
    real, dimension(10,10) :: a
    call foo(a)
    end

    (which calls foo with an assumed-shape array) gfortran hands this
    to the middle end:

    __attribute__((fn spec (". ")))
    void bar ()
    {
    real(kind=4) a[100];

    {
    struct array02_real(kind=4) parm.0;

    parm.0.span = 4;
    parm.0.dtype = {.elem_len=4, .version=0, .rank=2, .type=3};
    parm.0.dim[0].lbound = 1;
    parm.0.dim[0].ubound = 10;
    parm.0.dim[0].stride = 1;
    parm.0.dim[1].lbound = 1;
    parm.0.dim[1].ubound = 10;
    parm.0.dim[1].stride = 10;
    parm.0.data = (void *) &a[0];
    parm.0.offset = -11;
    foo (&parm.0);
    }
    }


    The middle and back ends are then free to optimize.

    Thank you for this example.

    That is a CHERRI capability that allows for access to every
    even byte in a structure but no odd byte in the same structure ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Keith Thompson on Wed Jan 15 07:09:38 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Wed Jan 15 14:00:26 2025
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    The C language has always had multi-dimensional arrays, with the
    limitation that the dimensions have to be known at compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 made that lifting optional.
    Now C23 makes part of the lifting (variably-modified types)
    mandatory again.
    Relative to F90, support for multi-dimensional arrays in C23 is
    primitive. There are no array descriptors generated automatically by
    the compiler. But saying that there is no support is incorrect.
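
    For readers who only know the fixed-size form, a small sketch of a
    C99/C23 variably-modified parameter type; the function names are
    illustrative, and as noted there is no descriptor and no bounds check:

    #include <stdio.h>

    /* the dimensions travel as ordinary parameters; a[i][j] indexing works,
       but nothing checks i or j at run time */
    static double sum(int rows, int cols, double a[rows][cols])
    {
        double s = 0.0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                s += a[i][j];
        return s;
    }

    int main(void)
    {
        double m[3][4] = { {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12} };
        printf("%g\n", sum(3, 4, m));   /* prints 78 */
        return 0;
    }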

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Wed Jan 15 18:00:34 2025
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 has now reached the level of Fortran II, released in
    1958. Only a bit more than six decades - can't complain
    about that.

    There are no array descriptors generated automatically by
    compiler. But saying that there is no support is incorrect.

    What happens for mismatched array bounds between caller
    and callee? Nothing, I guess?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Wed Jan 15 22:28:24 2025
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike
    C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not, the
    array stuff in C has not changed (AFAIK) since C99. So, more like four
    decades. Or 33 years since Fortran got its first standard.


    There are no array descriptors generated automatically by
    compiler. But saying that there is no support is incorrect.

    What happens for mismatched array bounds between caller
    and callee? Nothing, I guess?

    I don't know. I didn't read this part of the standard. Or any part of
    any C standard past C89.

    Never used them, either. For me, multi-dimensional arrays look mostly
    like a source of confusion rather than a useful feature, at least as
    long as there are no automatically generated descriptors - with the
    exception of VERY conservative cases like array fields in a structure,
    with all dimensions fixed at compile time.

    I don't know, but I can guess. And in case I am wrong Keith Thompson
    will correct me.
    Most likely the standard says that mismatched array bounds between
    caller and callee are UB.
    And most likely in practice it works as expected. I.e. if the caller
    defined the matrix as X[M][N] and the callee is treating it as
    Y[P][Q], then an access to Y[i][j] will go to X[k/N][k%N] as long as
    k = i*Q + j < M*N.

    However, note that in practice something like that happening by
    mistake is far less likely with variably-modified types than it is
    with classic C multi-dimensional arrays.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Michael S on Wed Jan 15 20:59:15 2025
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike
    C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that
    dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not, arrays
    stuff in C had not changed (AFAIK) since C99.

    It's not mandatory, so compilers are free to ignore it (and a
    major compiler, from a certain company in Redmond, does
    not support it). That's as good as saying that it does not
    exist in the language.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Keith Thompson on Wed Jan 15 22:39:57 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    In most but not all contexts. For example, `sizeof arr` yields the size
    of the array, not the size of a pointer.

    Jep.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    In C, multidimensional arrays are nothing more or less than arrays of
    arrays. You can also build data structures using pointers that are
    accessed using the same a[i][j] syntax as is used for a multidimensional array. And yes, they can be difficult to work with.

    A pointer forest is also Not Good (TM) for efficiency...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Thu Jan 16 10:11:36 2025
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header
    which contains the starting point and current length, along with
    allocated size. For multidimensional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication
    to get the actual position:

    array[y][x] -> array[y*width + x]
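
    A minimal C sketch of the same flattening, assuming row-major layout
    and one contiguous allocation; names are illustrative:

    #include <stdlib.h>

    /* one contiguous block instead of a vector of row vectors */
    double *make_grid(size_t height, size_t width)
    {
        return malloc(height * width * sizeof(double));
    }

    /* row-major: the element formerly written array[y][x] */
    static inline double *cell(double *grid, size_t width, size_t y, size_t x)
    {
        return &grid[y * width + x];
    }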

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Thu Jan 16 11:43:48 2025
    On 15/01/2025 21:28, Michael S wrote:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike
    C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that
    dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    It's not a big thing. VLA's were added in C99, but one big and
    influential compiler supplier didn't want to bother supporting them
    (there's lots in C99 that they didn't bother supporting) so they argued
    for it to be optional in C11. By the time C23 was in planning, they had finally got around to supporting most of C99, so it is no longer
    optional for standards compliance. But basically the situation is the
    same as it always has been - if you use a solid C compiler like gcc,
    clang, icc, etc., you can freely use VLA's. If you use MS's half-done
    effort, you can't. (MS's compiler has much better support for newer C++ standards - they just seem determined to be useless at C support.)


    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not, arrays
    stuff in C had not changed (AFAIK) since C99. So, more like four
    decades. Or 33 years since Fortran got its first standard.


    Yes.


    There are no array descriptors generated automatically by
    compiler. But saying that there is no support is incorrect.

    What happens for mismatched array bounds between caller
    and callee? Nothing, I guess?

    Bad things /might/ happen. But they might not - it's undefined behaviour.


    I don't know. I didn't read this part of the standard. Or any part of
    any C standard past C89.

    Never used them, too. For me, multi-dimensional arrays look mostly like source of confusion rather than useful feature. At least as long as
    there are no automatically generated descriptors. With exception for
    VERY conservative cases like array fields in structure, with all
    dimensions fixed at compile time.

    I don't know, but I can guess. And in case I am wrong Keith Thompson
    will correct me.
    Most likely the standard says that mismatched array bounds between
    caller and callee is UB.

    Yes.

    If you have:

    int x[4][6];

    then the expression "x[i]" is evaluated by converting "x" to a pointer
    to an array of 6 ints. Thus x[0][6] would be an out-of-bounds access to
    the first array of 6 ints in x - it is /not/ defined to work like
    x[1][0], even though you'd get the same bit of memory if you worked out
    the array address by hand.
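
    A short sketch of that point in C - the second access is the undefined
    one, even though the flat addresses coincide:

    void demo(void)
    {
        int x[4][6] = { 0 };

        int ok  = x[1][0];   /* well defined: second row, first element */
        int bad = x[0][6];   /* undefined behaviour: one past the end of row 0,
                                even though &x[1][0] is numerically the same
                                address */
        (void)ok; (void)bad;
    }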

    In practice, it might work fine. When you declare an array type, the
    compiler will believe you - C is a trusting language. But if you have
    given the compiler conflicting information, things can go badly wrong.
    So if you declare an array somewhere with one format that the compiler
    can see, and then access it through an lvalue (such as a pointer) with a different format that the compiler also can see, the compiler might
    generate code that assumes one format or the other, or a mix of them.
    Or it might assume that the pointer can't refer to the declared array
    because they are not the same format, and keep values cached in
    registers that don't match up.

    I expect you'd see problems most often if the compiler is able to make
    use of SIMD or vector registers to handle blocks of the data at a time.
    And you are more likely to see trouble with cross-module optimisations
    (LTO in gcc terms) since it leads to greater sharing of information over
    wider ranges of the code.

    As always, the advice is not to lie to your compiler - it might not bite
    you now, but it may well do in the future when you least expect it.


    And most likely in practice it works as expected. I.e. if caller
    defined the matrix as X[M][N] and caller is treating it as Y[P][Q] then access to Y[i][j] for as long as k=i*Q+j < M*N will go to X[k/N][k%N].


    Remember that in C (and all other programming languages), if you try to
    do something that is not defined behaviour, there isn't any concept of
    "works as expected" as far as the language is concerned. What the
    /programmer/ expected is a different matter - but if the language (or additional information from the compiler) does not define the behaviour,
    then the programmer's expectations are based on a misunderstanding.

    However, you have to pay attention that in practice something like that happening by mistake with variably-modified types is far less likely
    than it is with classic C multi-dimensional arrays.


    I'm not sure why you'd say that.

    The rule for getting array code right is quite simple - don't use arrays without knowing the bounds for each dimension. You can get these by
    passing bounds as parameters, or using fixed constants, or wrapping
    fixed-size arrays in a struct and using sizeof - however you do it, make
    sure you know the bounds and keep them consistent.
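
    One hedged sketch of that advice: wrap a fixed-size array in a struct
    so the bounds travel with the object and sizeof still works; the type
    and function names are illustrative:

    #include <stddef.h>

    struct grid {                        /* bounds travel with the object */
        double cells[8][8];
    };

    static double sum_grid(const struct grid *g)
    {
        size_t rows = sizeof g->cells    / sizeof g->cells[0];
        size_t cols = sizeof g->cells[0] / sizeof g->cells[0][0];
        double s = 0.0;
        for (size_t i = 0; i < rows; i++)
            for (size_t j = 0; j < cols; j++)
                s += g->cells[i][j];
        return s;
    }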

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Thu Jan 16 12:36:45 2025
    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike
    C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation that
    dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for multi-dimensional
    arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not, arrays
    stuff in C had not changed (AFAIK) since C99.

    It's not mandatory, so compilers are free to ignore it (and a
    major compiler, from a certain company in Redmond, does
    not support it). That's as good as sayhing that it does not
    exist in the language.

    Not really, no. The world of C programmers can be divided into those
    that work with C compilers and can freely use pretty much anything in C,
    and those that have to contend with limited, non-standard or otherwise problematic compilers and write code that works for them. Such
    compilers include embedded toolchains for very small microcontrollers or
    DSPs, and MS's C compiler.

    Some C code needs to be written in a way that works on MS's C compiler
    as well as other tools, but most is free from such limitations. Even
    for those targeting Windows, it's common to use clang or gcc for serious
    C coding.

    MS used to have a long-term policy of specifically not supporting C well because that might make it easier for people to write cross-platform C
    code for Windows and Linux. Instead, they preferred to push developers
    towards C# and Windows-only programming - or if that failed, C++ which
    was not as commonly used on *nix. Now, I think, they just don't care
    much about C - they don't see many people using their tools for C and
    haven't bothered supporting any feature that needs much effort. They
    know that they can't catch up with other C compilers, so have made it
    easier to integrate clang with their IDE's and development tools.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Terje Mathisen on Thu Jan 16 13:11:56 2025
    On 16/01/2025 10:11, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication
    to get the actual position:

      array[y][x] -> array[y*width + x]


    That does not surprise me. Vec<> in Rust is very similar to
    std::vector<> in C++, as far as I know (correct me if that's wrong). So
    a vector of vectors of int is not contiguous or consistent - each
    subvector can have a different current size and capacity. Doing a
    bounds check for accessing xs[i][j] (or in C++ syntax, xs.at(i).at(j)
    when you want bounds checking) means first reading the current size
    member of the outer vector, and checking "i" against that. Then xs[i]
    is found (by adding "i * sizeof(vector)" to the data pointer stored in
    the outer vector). That is looked up to find the current size of this
    inner vector for bounds checking, then the actual data can be found.

    This is /completely/ different from classic C multi-dimensional arrays.
    It is more akin to a one-dimensional C array of pointers to individually allocated one-dimensional C arrays - but even less efficient due to an
    extra layer of indirection.

    If you know the size of the data at compile time, then in C++ you have std::array<> where the information about size is carried in the type,
    with no run-time cost. A nested std::array<> is a perfectly good and
    efficient multi-dimensional array with runtime bounds checking if you
    want to use it, as well as having value semantics (no decay to pointer
    types in expressions). I would guess there is something equivalent in
    Rust ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Thu Jan 16 13:59:55 2025
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation
    that dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for
    multi-dimensional arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not,
    arrays stuff in C had not changed (AFAIK) since C99.

    It's not mandatory, so compilers are free to ignore it (and a
    major compiler, from a certain company in Redmond, does
    not support it). That's as good as sayhing that it does not
    exist in the language.

    Not really, no. The world of C programmers can be divided into those
    that work with C compilers and can freely use pretty much anything in
    C, and those that have to content with limited, non-standard or
    otherwise problematic compilers and write code that works for them.
    Such compilers include embedded toolchains for very small
    microcontrollers or DSPs, and MS's C compiler.

    Some C code needs to be written in a way that works on MS's C
    compiler as well as other tools, but most is free from such
    limitations. Even for those targeting Windows, it's common to use
    clang or gcc for serious C coding.

    MS used to have a long-term policy of specifically not supporting C
    well because that might make it easier for people to write
    cross-platform C code for Windows and Linux. Instead, they preferred
    to push developers towards C# and Windows-only programming - or if
    that failed, C++ which was not as commonly used on *nix. Now, I
    think, they just don't care much about C - they don't see many people
    using their tools for C and haven't bothered supporting any feature
    that needs much effort. They know that they can't catch up with
    other C compilers, so have made it easier to integrate clang with
    their IDE's and development tools.


    Microsoft does care about C, but only in one specific area - kernel programming.

    OK. That's not an area I have been involved in at all, so I will take
    your word for it. Does that also extend to device drivers?

    The only other language officially allowed for Windows
    kernel programming is C++, but coding kernel drivers in C++ is
    discouraged.

    C++ is absolutely fine for low-level programming, but you need to know
    how to write low-level C++ code. Someone used to writing application
    code in C++ can write really bad low-level C++ code very, very quickly -
    it takes more effort to get things totally wrong in C!

    I suppose that driver written in C++ would have major
    difficulties passing Windows HLK tests and getting WHQL signing.


    I once took a brief look at that process many years ago, and decided it
    was not for me!

    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VMTs are, may
    be, tolerable (I wonder what is current policy of Linux and BSD
    kernels), but hardly desirable.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Thu Jan 16 14:35:32 2025
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 07:09:38 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's
    a ton of C code out there), but trying to retrofit a safe
    memory model onto C seems a bit awkward - it might have been
    better to target a language which has arrays in the first
    place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    C language always had multi-dimensional arrays, with limitation
    that dimensions have to be known in compile time.
    C99 lifted that limitation, making C support for
    multi-dimensional arrays comparable to that in old Fortran.
    C11 said that lifting is optional.
    Now C23 makes part of the lifting (variably-modified types) again
    mandatory.

    I'd missed that one.

    Relatively to F90, support for multi-dimensional arrays in C23 is
    primitive.

    From what you describe, support for multi-dimensional arrays
    in C23 now reached the level of Fortran II, released in
    1958. Only a bit more than six decades, can't complain
    about that.

    Well, apart from playing with what is mandatory and what is not,
    arrays stuff in C had not changed (AFAIK) since C99.

    It's not mandatory, so compilers are free to ignore it (and a
    major compiler, from a certain company in Redmond, does
    not support it). That's as good as sayhing that it does not
    exist in the language.

    Not really, no. The world of C programmers can be divided into those
    that work with C compilers and can freely use pretty much anything in
    C, and those that have to content with limited, non-standard or
    otherwise problematic compilers and write code that works for them.
    Such compilers include embedded toolchains for very small
    microcontrollers or DSPs, and MS's C compiler.

    Some C code needs to be written in a way that works on MS's C
    compiler as well as other tools, but most is free from such
    limitations. Even for those targeting Windows, it's common to use
    clang or gcc for serious C coding.

    MS used to have a long-term policy of specifically not supporting C
    well because that might make it easier for people to write
    cross-platform C code for Windows and Linux. Instead, they preferred
    to push developers towards C# and Windows-only programming - or if
    that failed, C++ which was not as commonly used on *nix. Now, I
    think, they just don't care much about C - they don't see many people
    using their tools for C and haven't bothered supporting any feature
    that needs much effort. They know that they can't catch up with
    other C compilers, so have made it easier to integrate clang with
    their IDE's and development tools.


    Microsoft does care about C, but only in one specific area - kernel programming. The only other language officially allowed for Windows
    kernel programming is C++, but coding kernel drivers in C++ is
    discouraged. I suppose that a driver written in C++ would have major
    difficulties passing Windows HLK tests and getting WHQL signing.

    As you can guess, in kernel drivers VLAs are unwelcome. VMTs are maybe
    tolerable (I wonder what the current policy of the Linux and BSD
    kernels is), but hardly desirable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to David Brown on Thu Jan 16 16:46:17 2025
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLAs normally allocate on the stack, which at first glance looks
    great. But once one realizes how small stacks are in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use a VLA one needs a rather small bound on the maximal
    size of the array. Given such a bound, always allocating the maximal
    size is simpler. Without a _small_ bound on the size, the heap is
    safer, as it is designed to handle big allocations as well.
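
    A sketch of the compromise this suggests, assuming a small known
    bound: take the fixed worst case on the stack for the common path and
    fall back to the heap otherwise (the 512-element threshold is an
    arbitrary illustration, and process is a made-up name):

    #include <stdlib.h>
    #include <string.h>

    #define SMALL 512                    /* assumed small, known bound */

    void process(const double *src, size_t n)
    {
        double small[SMALL];             /* fixed worst case, on the stack */
        double *buf = (n <= SMALL) ? small : malloc(n * sizeof *buf);
        if (buf == NULL)
            return;
        memcpy(buf, src, n * sizeof *buf);
        /* ... work on buf ... */
        if (buf != small)
            free(buf);
    }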

    In the past I was a fan of VLAs and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.

    I do not know about Windows, but IIUC for some period the Linux limit
    for the kernel stack was something like 2 kB (a single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate-size arrays are
    unwelcome on the kernel stack due to size limits.

    VMTs are, may
    be, tolerable (I wonder what is current policy of Linux and BSD
    kernels), but hardly desirable.

    IMO VMT-s are vastly superior to raw pointers, but to fully
    get their advantages one would need better tools. Also, the
    kernel needs to deal with variable-size arrays embedded in
    various data structures. This is possible using pointers,
    but current VMT-s are too weak for many such uses.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Jan 16 17:24:58 2025
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header
    which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication
    to get the actual position:

      array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Terje Mathisen on Thu Jan 16 09:15:55 2025
    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication
    to get the actual position:

      array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to All on Thu Jan 16 09:55:50 2025
    On 1/16/2025 9:24 AM, MitchAlsup1 wrote:
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication to get the actual position:

       array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    OK, so Terje's observation of it being faster doing the calculation
    himself is due to him not doing these additional checks?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Waldek Hebisch on Thu Jan 16 18:12:46 2025
    Waldek Hebisch <antispam@fricas.org> schrieb:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack.

    You can pass them as VLAs (which Fortran has had since 1958)
    or you can declare them. It is the latter which would need
    to allocate on the stack.

    But allocating them on the stack is an implementation detail.
    Since Fortran 90, you can also do

    subroutine foo(n,m)
    integer, intent(in) :: n,m
    real, dimension(n,m) :: a

    which will declare the array a with the bounds of n and m.
    (Fortran can also do dynamic memory allocation, so

    subroutine foo(n,m)
    integer, intent(in) :: n,m
    real, dimension(:,:), allocatable :: c
    allocate (c(n,m))

    would also work, and also automatically release the memory).

    Because Fortran users are used to large arrays, any good Fortran
    compiler will also allocate the array a on the heap if it is too large.


    Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Basically, to use VLA one needs rather small bound on maximal
    size of array. Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is desined to handle also big allocations.

    Allocating data on the stack promotes cache locality, which can
    increase performance by quite a lot.

    If you have a memory allocation pattern like

    p1 = malloc(chunk_1); /* Fill it */
    p2 = malloc(chunk_2);
    /* Use it */
    free (p2);
    p3 = malloc(chunk_3);
    /* Use it */
    free (p3);
    /* Use p1 */

    There is a chance that p2 still pollutes the cache and parts of
    p1 may have been removed unnecessarily. This would not have been
    the case p2 and p3 had been allocated on the stack.
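
    For contrast, a sketch of the same pattern with the scratch buffers as
    automatic (stack) arrays; sizes are illustrative and assumed small:

    void work(void)
    {
        double p1[256];                  /* long-lived data, fill it */
        {
            double p2[512];              /* scratch buffer on the stack */
            /* use p2 ... */
            (void)p2;
        }
        {
            double p3[512];              /* typically lands on the same stack
                                            addresses p2 just vacated, so it
                                            reuses those cache lines instead
                                            of evicting parts of p1 */
            /* use p3 ... */
            (void)p3;
        }
        /* use p1 ... */
        (void)p1;
    }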

    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.

    Stack limits are artificial, but


    I do not know about Windows, but IIUC in some period Linux limit
    for kernel stack was something like 2 kB (single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate size arrays are
    unwelcame on kernel stack due to size limits.

    ... for kernels maybe less so.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Jan 16 18:23:28 2025
    On Thu, 16 Jan 2025 17:55:50 +0000, Stephen Fuld wrote:

    On 1/16/2025 9:24 AM, MitchAlsup1 wrote:
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a header which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication to get the actual position:

       array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    OK, so Terje's observation of it being faster doing the calculation
    himself is due to him not doing these additional checks?

    Most likely.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Thu Jan 16 18:30:35 2025
    On Thu, 16 Jan 2025 18:12:46 +0000, Thomas Koenig wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    David Brown <david.brown@hesbynett.no> wrote:


    Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    Algol 60 only had stack allocation for dynamically sized arrays,
    so stacks had to be as big as the data are.

    Basically, to use VLA one needs rather small bound on maximal
    size of array. Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is desined to handle also big allocations.

    Allocating data on the stack promotes cache locality, which can
    increase performance by quite a lot.

    If/when HW can track the deallocation of stack storage, the core
    does not have to write back modified lines to memory at cache-line
    replacement. {{Look, once the SP moves out of that part of
    the stack, nobody is allowed to dereference it anymore, so
    nobody cares about the values in those containers.}}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Thu Jan 16 20:46:04 2025
    On Thu, 16 Jan 2025 18:12:46 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why.
    I've never understood why people think there is something
    "dangerous" about VLAs, or why they think using heap allocations
    is somehow "safer".

    VLA normally allocate on the stack.

    You can pass them as VLAs (which Fortran has had since 1958)
    or you can declare them. It is the latter which would need
    to allocate on the stack.


    The part about passing, including dynamic allocation, is what in C
    is called VM types.

    But allocating them on the stack is an implementation detail.
    Since Fortran 90, you can also do

    subroutine foo(n,m)
    integer, intent(in) :: n,m
    real, dimension(n,m) :: a

    which will delcare the array a with the bounds of n and m.
    (Fortran can also do dynamic memory allocation, so

    subroutine foo(n,m)
    integer, intent(in) :: n,m
    real, dimension(:,:), allocatable :: c
    allocate (c(n,m))

    would also work, and also automatically release the memory).

    Because Fortran users are used to large arrays, any good Fortran
    compiler will also allocate a on the heap if it is too large.


    Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.

    In user space it is just unfortunate tradition. Not in all languages,
    BTW. In Go, for example, the default stack limit is 1 GB, which is
    still small, but not as ridiculously small as the 1 to 8 MB that are
    typical in C, C++, Rust and, I suppose, Fortran.
    However, the original point of discussion was kernel programming. In
    the kernel there are pretty good reasons why the default stack is very
    small - 8-32 KB, I think; maybe a few times bigger on Apple, I didn't
    check. The reason is that in many kernel contexts page faults are not
    allowed, so you have to allocate physical memory rather than just
    reserve address space.

    "To avoid infinite recursion" is not a valid reason, IMHO.

    Basically, to use VLA one needs rather small bound on maximal
    size of array. Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is desined to handle also big allocations.

    Allocating data on the stack promotes cache locality, which can
    increase performance by quite a lot.

    If you have a memory allocation pattern like

    p1 = malloc(chunk_1); /* Fill it */
    p2 = malloc(chunk_2);
    /* Use it */
    free (p2);
    p3 = malloc(chunk_3);
    /* Use it */
    free (p3)
    /* Use p1 */

    There is a chance that p2 still pollutes the cache and parts of
    p1 may have been removed unnecessarily. This would not have been
    the case p2 and p3 had been allocated on the stack.

    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.

    Stack limits are artificial, but


    I do not know about Windows, but IIUC in some period Linux limit
    for kernel stack was something like 2 kB (single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate size arrays are
    unwelcame on kernel stack due to size limits.

    ... for kernels maybe less so.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Thu Jan 16 20:12:03 2025
    Stephen Fuld wrote:
    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a
    header which contains the starting point and current length, along
    with allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication to get the actual position:

    array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?


    Because Rust really doesn't have multi-dim vectors, instead using
    vectors of pointers to vectors?

    OTOH, it is perfectly OK to create your own multi-dim data structure,
    and using macros you can probably get the compiler to generate
    near-optimal code as well, but afaik, nothing like that is part of the
    core language.

    I do know that several people have created fast string libraries, where
    any string that is short enough ends up entirely inside the dope vector,
    so no heap allocation.
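
    A rough C sketch of that "short string lives inside the header" idea
    (small-string optimisation); the field names and the inline-buffer
    size are illustrative, not any particular library's layout:

    #include <stddef.h>

    struct sso_string {
        size_t len;
        union {
            char inline_buf[24];         /* short strings live right here  */
            struct {
                char  *ptr;              /* long strings spill to the heap */
                size_t cap;
            } heap;
        } u;
    };

    static const char *sso_cstr(const struct sso_string *s)
    {
        return s->len < sizeof s->u.inline_buf ? s->u.inline_buf
                                               : s->u.heap.ptr;
    }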

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Thu Jan 16 19:14:08 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit multiplication to get the actual position:

      array[y][x] -> array[y*width + x]


    That is what any Fortran compiler does under the hood, of course.

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    It should.


    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    Depends on the relevant flag for bounds checking (at least for Fortran).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Thu Jan 16 20:22:35 2025
    Stephen Fuld wrote:
    On 1/16/2025 9:24 AM, MitchAlsup1 wrote:
    On Thu, 16 Jan 2025 17:15:55 +0000, Stephen Fuld wrote:

    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERY targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a
    header
    which contains the starting point and current length, along with
    allocated size. For multidimendional work, the natural mapping is
    Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication
    to get the actual position:

       array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler generate
    code for that itself?

    The compiler does; but with a dope-vector in view, the compiler
    inserts additional checks on the arithmetic and addressing.

    OK, so Terje's observation of it being faster doing the calculation
    himself is due to him not doing these additional checks?

    No, more like one of the Advent of Code problems that naively looked
    like a nice little hash table problem, with strings as the keys:

    "-4,0,3,6"

    I.e. 4 integers, all in the -9 to 9 range, used to verify that this was
    the first time this particular combination was seen.

    The first speedup (compared to my original Perl code) was from
    converting this to 4 signed byte values all packed into a u32 variable,
    then on each iteration I would shift the key up by 8 (getting rid of the
    oldest delta) and add in the new delta as the new bottom byte, then use
    that u32 as the hash table key.

    My code became an order of magnitude faster when I instead allocated a
    single vector with 19*19*19*19 elements, then biased each of those four
    delta values by +9 so that they would all be in the [0..18] range
    instead of [-9..9], and do the addressing as ((d3*19+d2)*19+d1)*19+d0.

    Rust would still verify that the final value was in range, but this
    becomes a single (never taken) CMP/JA combination.
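
    (A rough C sketch of that flattening idea, for illustration only - the
    actual code is Rust, and the puzzle logic around it is omitted:)

    #include <stdbool.h>
    #include <stddef.h>

    #define RANGE 19                    /* deltas biased from [-9..9] to [0..18] */

    static bool seen[RANGE * RANGE * RANGE * RANGE];   /* flat "hash table" */

    /* Returns true the first time this 4-delta combination is encountered. */
    static bool first_time(int d3, int d2, int d1, int d0)
    {
        size_t idx = (((size_t)(d3 + 9) * RANGE + (d2 + 9)) * RANGE
                      + (d1 + 9)) * RANGE + (d0 + 9);
        if (seen[idx])
            return false;
        seen[idx] = true;
        return true;
    }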

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Waldek Hebisch on Thu Jan 16 21:02:28 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    IIUC popular current processors are still quite far from having
    64-bit virtual address space, so there is still reason to limit
    stack size, simply limit can be much bigger than on 32-bit
    systems.

    The ARMv8/ARMv9 architecture supports up to 52 bits of
    VA space (and up to 52-bits of PA space). Most implementations
    typically provide 48/48; I know of one that does 52/52
    and another that supports 48/52.

    Going larger would require more levels of translation table.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Thomas Koenig on Thu Jan 16 20:34:51 2025
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack.

    You can pass them as VLAs (which Fortran has had since 1958)
    or you can declare them.

    As explained in another post, in C a VLA means an allocation; passing
    is done via VMTs (variably modified types).

    Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.

    Stacks are small because OS people make them small, not because of
    a valid technical reason that has ever been explained to me.
    "To avoid infinite recursion" is not a valid reason, IMHO.

    On a multiuser machine there is some point to it: you do not
    want a buggy student program to cause thrashing. In other
    words, you need a stack limit that is some smallish fraction
    of real memory. With virtual memory, heap allocations bigger
    than RAM work fine.

    There is a good reason for small kernel stacks: they are used to
    handle interrupts, including page faults, so they must be real
    memory. Since each thread needs its own kernel stack, a bigger
    stack would mean quite a lot of memory use.

    In the 32-bit era there was also a valid reason for small user stacks:
    one needs to pre-allocate address space for each stack, and
    with lots of threads there is not enough address space to give a
    sizeable stack to each thread.

    IIUC popular current processors are still quite far from having a
    full 64-bit virtual address space, so there is still reason to limit
    stack size; the limit can simply be much bigger than on 32-bit
    systems.

    There is also another issue: stack allocations become invalid
    when the routine doing the allocation returns, which, depending
    on the application, may be unacceptable. So reuse of code doing
    stack allocation is tricky, while for heap allocation a simple
    reference count may be enough to ensure that the allocation is
    freed when nobody uses the given area. Consequently, there is a
    tendency to use heap allocation to allow more flexible use
    patterns. With more use of heap allocation there is less
    use of stack allocation, and big stacks are considered
    unnecessary.

    Basically, to use VLA one needs rather small bound on maximal
    size of array. Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    Allocating data on the stack promotes cache locality, which can
    increase performance by quite a lot.

    Sure.

    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.

    Stack limits are artificial, but


    I do not know about Windows, but IIUC for some period the Linux limit
    for a kernel stack was something like 2 kB (a single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate-size arrays are
    unwelcome on the kernel stack due to size limits.

    ... for kernels maybe less so.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Thu Jan 16 22:23:38 2025
    On 16/01/2025 22:10, Keith Thompson wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 16/01/2025 10:11, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERI targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:
    It has Vec<> which is always implemented as a dope vector, i.e. a
    header which contains the starting point and current length, along
    with allocated size. For multidimensional work, the natural mapping
    is Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.
    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication to get the actual position:
      array[y][x] -> array[y*width + x]

    Note that this will inhibit bounds checking on the inner dimension.
    That might be part of the reason for the improvement in speed.

    For example, given int array[10][10], array[0][11] is out of bounds,
    even if it logically refers to the same location as array[1][1]. This results in undefined behavior in C, and perhaps some kind of exception
    in a language that requires bounds checking. If you do this manually by defining a 1d array, any checking applies only to the entire array.

    That does not surprise me. Vec<> in Rust is very similar to
    std::vector<> in C++, as far as I know (correct me if that's wrong).
    So a vector of vectors of int is not contiguous or consistent - each
    subvector can have a different current size and capacity. Doing a
    bounds check for accessing xs[i][j] (or in C++ syntax, xs.at(i).at(j)
    when you want bounds checking) means first reading the current size
    member of the outer vector, and checking "i" against that. Then xs[i]
    is found (by adding "i * sizeof(vector)" to the data pointer stored in
    the outer vector). That is looked up to find the current size of this
    inner vector for bounds checking, then the actual data can be found.

    I'm not familiar with Rust's Vec<>, but C++'s std::vector<> guarantees
    that the elements are stored contiguously. But the std::vector<> object itself doesn't contain those elements; it's a fixed-size chunk of data (basically a struct in C terms) whose size doesn't change regardless of
    the number of elements (and typically regardless of the element type).
    So a std::vector<std::vector<int>> will result in the data for each row
    being stored contiguously, but the rows themselves will be allocated dynamically.


    Yes, exactly.

    Of course you could do as Terje did in Rust - make a std::vector<> of
    size N x M and do the "i * N + j" calculation manually. Now that C++23
    has a multi-parameter subscript operator, you can do that quite neatly
    in a little wrapper class around a std::vector<> with a nice access
    operator. But it's still more efficient to use a std::array<> if you
    know the sizes at compile time.

    This is /completely/ different from classic C multi-dimensional
    arrays. It is more akin to a one-dimensional C array of pointers to
    individually allocated one-dimensional C arrays - but even less
    efficient due to an extra layer of indirection.
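
    (A rough C illustration of the two layouts being contrasted here -
    hypothetical helper names, error handling omitted:)

    #include <stdlib.h>

    /* "Jagged" layout: an array of pointers, each row allocated separately.
       Every access xs[i][j] costs an extra pointer load. */
    int **make_jagged(size_t rows, size_t cols) {
        int **xs = malloc(rows * sizeof *xs);
        for (size_t i = 0; i < rows; i++)
            xs[i] = calloc(cols, sizeof **xs);
        return xs;
    }

    /* Flat layout: one contiguous block; element (i, j) is flat[i*cols + j]. */
    int *make_flat(size_t rows, size_t cols) {
        return calloc(rows * cols, sizeof(int));
    }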

    If you know the size of the data at compile time, then in C++ you have
    std::array<> where the information about size is carried in the type,
    with no run-time cost. A nested std::array<> is a perfectly good and
    efficient multi-dimensional array with runtime bounds checking if you
    want to use it, as well as having value semantics (no decay to pointer
    types in expressions). I would guess there is something equivalent in
    Rust ?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Waldek Hebisch on Thu Jan 16 22:16:43 2025
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows.
    That's a /lot/ if you are working with fairly small but non-constant
    sizes. So if you are working with a selection of short-lived
    medium-sized bits of data - say, parts of strings for some formatting
    work - putting them on the stack is safe and can be significantly faster
    than using the heap.

    Using VLAs (or the older but related technique, alloca) means you don't
    waste space. Maybe you are working with file paths, and want to support
    up to 4096 characters per path - but in reality most paths are less than
    100 characters. With fixed size arrays, allocating 16 of these and initialising them will use up your entire level 1 cache - with VLAs, it
    will use only a tiny fraction. These things can make a big difference
    to code that aims to be fast.
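
    (A hedged C sketch of that kind of use - the helper name and the 4096
    limit are illustrative, not taken from any particular code base:)

    #include <stdio.h>
    #include <string.h>

    #define PATH_LIMIT 4096

    /* Builds "<dir>/<name>" in a stack buffer sized to fit, assuming the
       inputs have already been sanity-checked against PATH_LIMIT. */
    void with_joined_path(const char *dir, const char *name,
                          void (*use)(const char *path)) {
        size_t need = strlen(dir) + 1 + strlen(name) + 1;
        if (need > PATH_LIMIT)
            return;                   /* refuse oversized input */
        char path[need];              /* VLA: typically ~100 bytes, not 4096 */
        snprintf(path, need, "%s/%s", dir, name);
        use(path);
    }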

    Fixed size arrays are certainly easier to analyse and are often a good
    choice, but VLAs definitely have their advantages in some situations,
    and they are perfectly safe and reliable if you use them appropriately
    and correctly.


    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.


    Other people might have bad uses of VLAs - it doesn't mean /you/ have to
    use them badly too!

    I do not know about Windows, but IIUC in some period Linux limit
    for kernel stack was something like 2 kB (single page shared
    with some other per-process data structures). I think it
    was increased later, but even moderate size arrays are
    unwelcome on kernel stack due to size limits.

    If a kernel stack is that small (or you are working on an embedded
    system with very small stacks), then clearly you have to take that into account. I've used them a couple of times in embedded systems with
    small stacks - obviously the size of the VLA was also small. (On such
    systems, heap allocations are very much unwelcome - though not quite as unwelcome as overflowing the stack :-) )


    Far and away my most common use of VLAs is, however, not variable length
    at all. It's more like :

    const int no_of_whatsits = 20;
    const int size_of_whatsit = 4;

    uint8_t whatsits_data[no_of_whatsits * size_of_whatsit];

    Technically in C, that is a VLA because the size expression is not a
    constant expression according to the rules of the language. But of
    course it is a size that is known at compile-time, and the compiler
    generates exactly the same code as if it was a constant expression. It
    is equally amenable to analysis and testing. (In C++, it is considered
    a normal array - C++ does not support VLAs, but is happy with code like
    that.) With C23, these const variables can now be constexpr, and the
    array will then be a normal array and not a VLA - without that making
    the slightest difference to the actual generated code.
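
    (For illustration, the C23 spelling of the same thing - the array is
    then an ordinary array rather than a technical VLA:)

    #include <stdint.h>

    void example(void) {
        constexpr int no_of_whatsits  = 20;
        constexpr int size_of_whatsit = 4;
        uint8_t whatsits_data[no_of_whatsits * size_of_whatsit];  /* not a VLA in C23 */
        (void)whatsits_data;   /* placeholder use */
    }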



    VMTs are maybe
    tolerable (I wonder what the current policy of Linux and BSD
    kernels is), but hardly desirable.

    IMO VMT-s are vastly superior to raw pointers, but to fully
    get their advantages one would need better tools. Also, the
    kernel needs to deal with variable-size arrays embedded in
    various data structures. This is possible using pointers,
    but current VMT-s are too weak for many such uses.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Thu Jan 16 21:40:39 2025
    David Brown <david.brown@hesbynett.no> writes:
    On 16/01/2025 17:46, Waldek Hebisch wrote:

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows.

    On all the linux systems I use, the stack limit defaults to 8192KB.

    That includes RHEL, Fedora, CentOS, Scientific Linux and Ubuntu.

    Now, that's for the primary thread stack, for which the OS
    manages the growth. For other threads in the process,
    the size varies based on the threads library in use
    and whether the application is compiled for 32-bit or
    64-bit systems.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Keith Thompson on Thu Jan 16 23:39:13 2025
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to David Brown on Fri Jan 17 02:22:54 2025
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate anything on the heap without knowing the bounds and being sure it is appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    Well, AFAICS VLAs may get allocated on function entry. In such a
    case the caller has to check the allocation size, which spreads
    allocation-related code between the caller and the called function.
    In the case of 'malloc' one can simply check the return value. In fact,
    in many programs a simple wrapper that exits in case of allocation
    failure is enough (if the application cannot do its work without
    memory and there is no memory, then there is no point in continuing
    execution).
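
    (A minimal sketch of such a wrapper - the name xmalloc is a common
    convention, not a reference to any specific program here:)

    #include <stdio.h>
    #include <stdlib.h>

    /* Allocate or exit: callers never see a null pointer. */
    void *xmalloc(size_t size) {
        void *p = malloc(size);
        if (p == NULL) {
            fprintf(stderr, "out of memory (requested %zu bytes)\n", size);
            exit(EXIT_FAILURE);
        }
        return p;
    }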

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows. That's a /lot/ if you are working with fairly small but non-constant
    sizes. So if you are working with a selection of short-lived
    medium-sized bits of data - say, parts of strings for some formatting
    work - putting them on the stack is safe and can be significantly faster
    than using the heap.

    IME this is a relatively rare case. For formatting, a single result
    buffer (possibly expanded when needed), with the other pieces of
    data appended to it, frequently gave me good performance. Intermediate
    strings appeared as return values of called functions. Without
    reorganizing the code this does not respect stack discipline. Once
    reorganized, I get the best results without materializing intermediate
    strings.

    Using VLAs (or the older but related technique, alloca) means you don't
    waste space. Maybe you are working with file paths, and want to support
    up to 4096 characters per path - but in reality most paths are less than
    100 characters. With fixed size arrays, allocating 16 of these and initialising them will use up your entire level 1 cache - with VLAs, it
    will use only a tiny fraction.

    One can initialize only the used part, or simply use uninitialized
    arrays (that is what I do normally). It is rather hard to give a
    meaningful initialization when the size of the payload varies.

    These things can make a big difference
    to code that aims to be fast.

    Fixed size arrays are certainly easier to analyse and are often a good choice, but VLA's definitely have their advantages in some situations,
    and they are perfectly safe and reliable if you use them appropriately
    and correctly.


    In the past I was a fan of VLA and stack allocation in general.
    But I saw enough bug reports due to programs exceeding their
    stack limits that I changed my view.


    Other people might have bad uses of VLAs - it doesn't mean /you/ have to
    use them badly too!

    Well, for me the typical case is work arrays where the needed size
    may vary widely. Using 'malloc' is simpler in such use given a
    small stack limit. With a small stack limit a VLA would be a
    micro-optimization, not worth the extra effort.

    <snip>

    Far and away my most common use of VLAs is, however, not variable length
    at all. It's more like :

    const int no_of_whatsits = 20;
    const int size_of_whatsit = 4;

    uint8_t whatsits_data[no_of_whatsits * size_of_whatsit];

    Technically in C, that is a VLA because the size expression is not a
    constant expression according to the rules of the language. But of
    course it is a size that is known at compile-time, and the compiler
    generates exactly the same code as if it was a constant expression.

    OK, that is a useful case (but in spirit this is not a VLA).

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Keith Thompson on Fri Jan 17 02:10:52 2025
    On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU
    implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!

    I don't understand. What pointer are you referring to?

    The pointer which would have had to point elsewhere had the string
    not been contained within.

    In the implementation I'm referring to, std::string happens to be 32
    bytes in size. If the string has a length of 15 or less, the string
    data is stored directly in the std::string object (in the last 16 bytes
    as it happens). If the string is longer than that it's stored
    elsewhere, and those 16 bytes are presumably used to manage the
    heap-allocated data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Fri Jan 17 10:20:43 2025
    On 16/01/2025 22:40, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 16/01/2025 17:46, Waldek Hebisch wrote:

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows.

    On all the linux systems I use, the stack limit defaults to 8192KB.

    OK. The details don't matter much here. (Of course, if you are
    intending to put large objects on the stack, then the details /do/
    matter, and you probably want to specify a minimum stack size explicitly.)


    That includes RHEL, Fedora, CentOS, Scientific Linux and Ubuntu.

    Now, that's for the primary thread stack, for which the OS
    manages the growth. For other threads in the process,
    the size varies based on the threads library in use
    and whether the application is compiled for 32-bit or
    64-bit systems.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Keith Thompson on Fri Jan 17 15:52:48 2025
    On 17/01/2025 04:52, Keith Thompson wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    [...]
    Well, AFAICS VLA-s may get allocated on function entry.
    [...]

    That would rarely be possible for objects with automatic storage
    duration (local variables). For example:

    void func(void) {
        do_this();
        do_that();
        int vla[rand() % 10 + 1];
    }

    Memory for `vla` can't be allocated until its size is known,
    and it can't be known until the definition is reached. For most automatically allocated objects, the lifetime begins when execution
    reaches the `{` of the enclosing block; the lifetime of `vla`
    begins at its definition.

    Or did you have something else in mind?

    I'm guessing he was thinking of something like :

    void func(int n) {
        if (n < 1000) {
            int vla[n];
            do_stuff(vla);
        } else {
            int * p = malloc(n * sizeof(int));
            do_stuff(p);
            free(p);
        }
    }

    Although the lifetime of vla[n] is limited to the block that is in that
    one branch, the compiler could certainly handle the allocation with a
    single stack-pointer change at the entry to the function. It is common
    for optimised code to try to have just one stack frame allocation at
    code entry, and a deallocation at exit, rather than re-arranging the
    stack within blocks of code. But it is not common to do so when the
    sizes are not known at compile time and the VLA (or alloca) is not on
    all paths - precisely because the programmer might be doing such checks.


    (Should this part of the discussion migrate to comp.lang.c, or is it
    still sufficiently relevant to computer architecture?)


    Some of the "arch" folks here have compared to other languages, which is
    nice. But if regulars here think the thread branch has become too
    bogged down in details of C, we can stop.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Waldek Hebisch on Fri Jan 17 15:30:24 2025
    On 17/01/2025 03:22, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    Well, AFAICS VLA-s may get allocated on function entry.

    It could be allocated as soon as the size is known, yes.

    In such
    case caller have to check for allocation size, which spreads
    allocation related code between caller and called function.

    It is not about allocation sizes - it's about knowing the data you are
    dealing with, and sanitising unknown data.

    In the very rough example I gave of string formatting or manipulation,
    you might be getting the strings in from outside - command line
    parameters, database entries, wildcard directory searches, etc. You
    sanity check the data when it comes in - regardless of whether or not
    you plan to allocate memory (stack or heap) for copying them. Now you
    know that the sizes are reasonable, you can allocate VLAs (or use
    alloca, or use malloc) without extra worries.

    I am not suggesting you should have some kind of rule to check sizes
    just before every VLA declaration - I am suggesting that when you know
    the size is reasonable and safe, then using a VLA is reasonable and safe.


    In case of 'malloc' one can simply check return value.

    Drivel.

    That's a myth that originated in the days of K&R C.

    It is certainly true that if malloc returns 0, your allocation has
    failed. There are a few - but only a very few - circumstances where
    that is something that can realistically happen in code that is doing
    its job properly. Typically that would be in resource-constrained
    systems where you might have some unusual circumstances causing overload.

    But generally (and this means there will be exceptions), checking for
    null returns from malloc is :

    a) Never properly tested, and often results in leaked resources or other problems;

    b) Totally unrealistic in any real-world use of the code;

    c) Treated as though it is a divine duty that must always be done
    ritually and religiously;

    d) Treated as though it magically makes the code safe, correct and reliable.

    Hopefully you can see that these points are self-contradictory.


    If you try to call malloc with a size that is unreasonable for the circumstances, all kinds of bad things can happen /despite/ a non-null
    return value. What goes wrong can depend on many factors, including the
    OS, the malloc library, the size, the system setup, and what you do with
    the returned pointer. Simply /trying/ to run malloc with a bad size
    may, on some systems, lead to the OS trying to free up as much memory as
    it can in order to accommodate your request - whether malloc ends up
    returning null or not. Or maybe the request is done with lazy
    allocations - you asked for 100 TB of memory and you got a pointer back,
    and things will only go wrong when you start using the virtual space.

    Remember, from the point of view of people using the computer, having
    the OS push lots of stuff out of memory is tantamount to a broken
    system. A program that has runaway memory usage causes great
    frustration, and often leads to users doing a hard reset. And all the
    time, the malloc() calls have returned a non-null value.


    So what does all that mean? It means you do /not/ blindly call
    malloc(), check for a null result, and think that's all good. It means
    you be sure you know what sizes you are asking for /before/ you call
    malloc - probably long before you get to the bit of code that actually
    calls malloc(). It means you look /before/ you leap - you don't "just
    go for it" and hope that you can figure out what went wrong from the
    debris left at the crash site.

    And if you are in doubt - maybe you are pushing the target system to the limits, or have a program that demands more memory than many systems
    might have - you check in advance to see if the memory will be easily available. Such checks will be OS specific, of course.

    (I'm sure some people will now be thinking "you should have used
    ulimit", or "don't enable swap", or "that's the fault of over-commit".
    That would all be missing the point. You can of course use such tools
    as a way of making sure your sizes are reasonable - it's up to the
    developer to decide how to handle such checks and controls. But
    checking the return of malloc is so far from being sufficient that it is basically useless in most circumstances.)


    It is /exactly/ the same for VLAs (or alloca).


    The limits for what sizes are "reasonable" will, of course, be smaller
    for stack allocations than for heap allocations. But that's all target dependent anyway - for the systems I typically work with, the limit for "reasonable" heap allocations is orders of magnitude smaller than
    "reasonable" stack allocations on desktops.

    In fact,
    in many programs simple wrapper that exits in case of allocation
    failure is enough (if application can not do its work without
    memory and there is no memory, then there is no point in continuing execution).

    Have you ever seen that happening in real life? Have you ever even
    known such code to be properly tested?

    Don't get me wrong - a wrapper like this can be a good idea. But it's
    like an electrical fuse - it's a last resort, and only triggers if
    something has gone badly wrong. When you see a great music system with
    a 10 kW amplifier, you check if your house electrical system can handle
    that /before/ you buy it. You don't buy it, plug it in and rely on the
    fusebox to keep your house from burning down - even though you want the
    fuse there as a failsafe. For the most part, if malloc ever returns 0,
    the problem lies before malloc is called.


    (Sorry for the rant - "my code is safe because I check the result of
    malloc" is one of these misconceptions that really annoy me.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Brian G. Lucas on Fri Jan 17 15:17:29 2025
    "Brian G. Lucas" <bagel99@gmail.com> writes:
    On 1/17/25 4:20 AM, David Brown wrote:
    On 16/01/2025 22:40, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 16/01/2025 17:46, Waldek Hebisch wrote:

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack. You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    The stack on Linux is 10 MB by default, and 1 MB by default on Windows.
    On all the linux systems I use, the stack limit defaults to 8192KB.

    On Linux, one can call the routine setrlimit(RLIMIT_STACK, ...) to change
    the stack size.

    Yes, as a unix/linux kernel engineer, I've implemented that system call
    and the supporting kernel infrastructure in a version of unix a few
    decades ago.

    I'll point out that the implementation provides both HARD and SOFT
    limits for the stack (and all other resources), and the user can
    only affect the SOFT limit, and the user may not raise the SOFT
    limit above the HARD limit, unless running with the appropriate
    capabilities (e.g. UID == 0).
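
    (A small C sketch of that interface - raising the soft stack limit as
    far as the hard limit allows; error handling kept minimal:)

    #include <sys/resource.h>

    /* Returns 0 on success, -1 on failure. */
    int raise_stack_soft_limit(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_STACK, &rl) != 0)
            return -1;
        rl.rlim_cur = rl.rlim_max;   /* soft limit may not exceed the hard limit */
        return setrlimit(RLIMIT_STACK, &rl);
    }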

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Fri Jan 17 16:15:36 2025
    On 17/01/2025 03:10, MitchAlsup1 wrote:
    On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU
    implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!

    I don't understand.  What pointer are you referring to?

    The pointer which would have had to point elsewhere had the string
    not been contained within.


    There are a couple of ways you can do "small string optimisation". One
    would be to have a structure something like this :

    struct String1 {
        size_t capacity;
        char * data;
        char small_string[16];
    };

    Then "data" would point to "small_string" for a capacity of 16, and if
    that's not enough, use malloc to allocate more space.


    An alternative would be to have something like this (I'm being /really/
    sloppy with alignments, rules for unions, and so on - this is
    illustrative only, not real code!) :

    struct String2 {
        bool is_small;
        union {
            char small_string[31];
            struct {
                size_t capacity;
                char * data;
            };
        };
    };

    This second version lets you put more characters in the local
    small_string area, reusing space that would otherwise be used for the
    pointer and capacity. But it has more runtime overhead when using the
    string :

    void print1(String1 s) {
        printf(s.data);
    }

    void print2(String2 s) {
        if (s.is_small) {
            printf(s.small_string);
        } else {
            printf(s.data);
        }
    }

    There are, of course, many other ways to make string types (such as
    supporting copy-on-write), but I suspect that Mitch is thinking of style String2 while Keith is thinking of style String1.



    In the implementation I'm referring to, std::string happens to be 32
    bytes in size.  If the string has a length of 15 or less, the string
    data is stored directly in the std::string object (in the last 16 bytes
    as it happens).  If the string is longer than that it's stored
    elsewhere, and those 16 bytes are presumably used to manage the
    heap-allocated data.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Fri Jan 17 16:42:17 2025
    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    These may be points that you are looking at for your embedded work,
    but the average programmer does not.

    An example, Fortran-specific: Fortran 2018 made all procedures
    recursive by default. This means that some Fortran codes will start
    crashing because of stack overruns when this is implemented :-(

    You don't allocate
    anything on the heap without knowing the bounds and being sure it is appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    What would you recommend as a limit? (See -fmax-stack-var-size=N
    in gfortran).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to David Brown on Fri Jan 17 18:02:28 2025
    David Brown wrote:
    On 17/01/2025 03:10, MitchAlsup1 wrote:
    On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the
    dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU
    implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!
    I don't understand.  What pointer are you referring to?

    The pointer which would have had to point elsewhere had the string
    not been contained within.


    There are a couple of ways you can do "small string optimisation".  One would be to have a structure something like this :

    struct String1 {
        size_t capacity;
        char * data;
        char small_string[16];
    }

    Then "data" would point to "small_string" for a capacity of 16, and if
    that's not enough, use malloc to allocate more space.


    An alternative would be to have something like this (I'm being /really/ sloppy with alignments, rules for unions, and so on - this is
    illustrative only, not real code!) :

    struct String2 {
        bool is_small;
        union {
            char small_string[31];
            struct {
                size_t capacity;
                char * data;
            }
        }
    }

    This second version lets you put more characters in the local
    small_string area, reusing space that would otherwise be used for the pointer and capacity.  But it has more runtime overhead when using the string :

        void print1(String1 s) {
            printf(s.data);
        }

        void print2(String2 s) {
            if (s.is_small) {
                printf(s.small_string);
            } else {
                printf(s.data);
            }
        }

    There are, of course, many other ways to make string types (such as supporting copy-on-write), but I suspect that Mitch is thinking of style String2 while Keith is thinking of style String1.

    All Vec<> types have a 3-word descriptor, with the first and second word
    being a pointer to the data and the current length, while the allocated
    size is stored in the third word.

    This is a total of 24 bytes, so quite a bit of overhead if you just need
    a few bytes.

    In the Rust Fast/Small vector type, they could use the top bit of the
    size field (no Vec<> object can be larger than or equal to 2^63 bytes), then
    they need 4 or 5 bits for the actual length (but using 7 is easier),
    leaving 23 bytes for the embedded data. With little-endian storage this corresponds to the last byte of the 24-byte block.
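
    (A C sketch of that layout idea - 24 bytes total, with the tag kept in
    the last byte so the inline form leaves 23 bytes of payload; this is an
    illustration of the scheme, not the actual Rust implementation:)

    #include <stdint.h>

    union small_vec {                 /* 24 bytes on a typical 64-bit target */
        struct {                      /* heap form */
            uint8_t  *data;
            uint64_t  len;
            uint64_t  cap;            /* top bit clear: cap < 2^63 */
        } heap;
        struct {                      /* inline form */
            uint8_t bytes[23];        /* embedded data */
            uint8_t tag_and_len;      /* top bit set, low bits = inline length */
        } small;
    };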

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Fri Jan 17 18:21:26 2025
    On 17/01/2025 17:42, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.


    That would be another way of saying you have no idea when your program
    is going to blow up from lack of stack space. You don't need VLAs to
    cause such problems.

    In reality, you /do/ know a fair amount. Often your knowledge is
    approximate - you know you are not going to need anything like as much
    stack as the system provides, and you don't worry about it. In other situations (such as in small embedded systems), you think about it all
    the time - again, regardless of any VLAs.

    If you are in a position where you suspect you might be pushing close to
    the limits of your stack, "standard" doesn't come into it - you are
    dealing with a particular target, and you can use whatever functions or
    support that target provides.

    These may be points that you are looking at for your embedded work,
    but the average programmer does not.


    The average programmer can think "I've got megabytes of stack. There's
    no problem with VLAs of several KB." That's often fine - all you need
    to do is be sure that your VLAs are no more than a few KB in size.
    Your code is as safe (in this aspect) as pretty much any other piece of
    code on the platform.

    An example, Fortran-specific: Fortran 2018 made all procedures
    recursive by default. This means that some Fortran codes will start
    crashing because of stack overruns when this is implemented :-(

    You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    What would you recommend as a limit? (See -fmax-stack-var-size=N
    in gfortran).

    Maybe 50 KB ? It's going to depend highly on the code. Obviously if
    you have a recursive function, you are not going to want a big stack
    frame. But for occasional one-off use, big stack frames are fine -
    VLA's, fixed arrays, or anything else. Once you are getting bigger than
    that, the overhead of malloc is probably negligible in measurable
    performance. (There are others here who could do a far better job than
    I at estimating that accurately - I am more concerned with what
    influences code reliability.)

    (gcc has stack frame limit flags for C and C++ too. And it can generate reports on function stack usage. In all cases, it can only know about
    limits if they are fixed, or at least limited, at compile time.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Fri Jan 17 20:08:30 2025
    David Brown <david.brown@hesbynett.no> schrieb:
    On 17/01/2025 17:42, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle big allocations as well.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.


    That would be another way of saying you have no idea when your program
    is going to blow up from lack of stack space. You don't need VLAs to
    cause such problems.

    I may know a few things (or I can find out), but the general user,
    especially somebody who writes scientific software in Fortran,
    in general does not.

    In reality, you /do/ know a fair amount. Often your knowledge is
    approximate - you know you are not going to need anything like as much
    stack as the system provides, and you don't worry about it. In other situations (such as in small embedded systems), you think about it all
    the time - again, regardless of any VLAs.

    If you are in a position where you suspect you might be pushing close to
    the limits of your stack, "standard" doesn't come into it - you are
    dealing with a particular target, and you can use whatever functions or support that target provides.

    Again, try look at it from the viewpoint of somebody who writes
    scientific or technical code, and for whom such code should
    "just work". Also look at it from the viewpoint of somebody who
    co-maintains a compiler for such people.

    gfortran has the -fstack-arrays option, which can bring very
    large performance improvements - 50% in some real-world code.
    Do I know what code users are writing? Not in the least,
    unless they provide bug reports.

    And a stack overflow has the most unfriendly user interface of all.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Keith Thompson on Fri Jan 17 19:27:25 2025
    On Fri, 17 Jan 2025 1:04:03 +0000, Keith Thompson wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Thu, 16 Jan 2025 23:18:22 +0000, Keith Thompson wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    [...]
    I do know that several people have created fast string libraries,
    where any string that is short enough ends up entirely inside the dope
    vector, so no heap allocation.

    Some implementations of C++ std::string do this. For example, the GNU
    implementation appears to store up to 16 characters (including the
    trailing null character) in the std::string object.

    Why use an 8-byte pointer to store a string 16 or fewer bytes long ? !!

    I don't understand. What pointer are you referring to?

    In the implementation I'm referring to, std::string happens to be 32
    bytes in size. If the string has a length of 15 or less, the string
    data is stored directly in the std::string object (in the last 16 bytes
    as it happens). If the string is longer than that it's stored
    elsewhere, and those 16 bytes are presumably used to manage the
    heap-allocated data.

    So, when it is stored elsewhere, how do you get from the struct to the
    string ??
    You use a pointer !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Dallman@21:1/5 to Lurndal on Sun Jan 19 18:49:00 2025
    In article <r1fiP.189541$FOb4.58758@fx15.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    On all the linux systems I use, the stack limit defaults to 8192KB.

    That includes RHEL, Fedora, CentOS, Scientific Linux and Ubuntu.

    Now, that's for the primary thread stack, for which the OS
    manages the growth. For other threads in the process,
    the size varies based on the threads library in use
    and whether the application is compiled for 32-bit or
    64-bit systems.

    The library I work on documents the required stack sizes for threads that
    enter it, and for the threads it creates. Just another of the details one
    has to take care of. We didn't think of it when the project was started,
    but that was forty years ago this year.

    John

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Terje Mathisen on Mon Jan 20 12:29:13 2025
    On 1/16/2025 11:12 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERI targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a
    header which contains the starting point and current length, along
    with allocated size. For multidimensional work, the natural mapping
    is Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication to get the actual position:

       array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler
    generate code for that itself?


    Because Rust really doesn't have multi-dim vectors, instead using
    vectors of pointers to vectors?

    OTOH, it is perfectly OK to create your own multi-dim data structure,
    and using macros you can probably get the compiler to generate near-
    optimal code as well, but afaik, nothing like that is part of the core language.

    That surprised me. So I did a search for "Rust Multi dimensional
    arrays", and got several hits. It seems there are various ways to do
    this depending upon whether you want an array of arrays or a
    "traditional" multi-dimensional array. There is a crate for the latter.

    I don't know enough Rust to get all the details in the various search
    results, but it seems there are options.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tkoenig@netcologne.de on Tue Jan 21 20:30:47 2025
    On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle also big allocations.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    Not standard compliant for sure, but you certainly can approximate
    stack use in C: just store (as byte*) the address of a local in your
    top level function, and check the (absolute value of) the difference
    to the address of a local in the current function.

    The bigger problem is knowing how much stack is available to use -
    there may be no way (or no easy way) to find the actual size ... or
    the limit if the stack expands ... and circumstances beyond the
    program may have limited it to be smaller than the program requested.
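
    A minimal C sketch of that approximation, assuming a downward-growing
    stack (it is not strictly conforming, since it compares pointers to
    different objects): record a reference address in the top-level
    function, then compare it against a local further down the call chain:

    #include <stdio.h>
    #include <stddef.h>

    static char *stack_ref;                 /* address of a local in the top-level function */

    static size_t approx_stack_used(void)
    {
        char probe;
        ptrdiff_t d = stack_ref - &probe;   /* positive if the stack grows downward */
        return (size_t)(d < 0 ? -d : d);
    }

    static void work(int depth)
    {
        char pad[256];                      /* burn some stack in each frame */
        pad[0] = (char)depth;
        if (depth > 0)
            work(depth - 1);
        else
            printf("approx. stack in use: %zu bytes (%d)\n",
                   approx_stack_used(), pad[0]);
    }

    int main(void)
    {
        char top;
        stack_ref = &top;
        work(100);
        return 0;
    }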


    These may be points that you are looking at for your embedded work,
    but the average programmer does not.

    An example, Fortran-specific: Fortran 2018 made all procedures
    recursive by default. This means that some Fortran codes will start
    crashing because of stack overruns when this is implemented :-(

    You don't allocate
    anything on the heap without knowing the bounds and being sure it is
    appropriate. There's no fundamental difference - it's just the cut-off
    point that is different.

    What would you recommend as a limit? (See fmax-stack-var-size=N
    in gfortran).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to George Neuner on Wed Jan 22 02:19:57 2025
    On Wed, 22 Jan 2025 1:30:47 +0000, George Neuner wrote:

    On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle also big allocations.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    Not standard compliant for sure, but you certainly can approximate
    stack use in C: just store (as byte*) the address of a local in your
    top level function, and check the (absolute value of) the difference
    to the address of a local in the current function.

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.
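
    A rough, Linux-specific sketch of that idea (illustrative only; it is
    an approximation, since other process-startup data also sits above the
    last envp entry):

    #include <stdio.h>
    #include <stddef.h>

    int main(int argc, char **argv, char **envp)
    {
        char **p = envp;
        while (*p != NULL)          /* walk to the last envp[] entry */
            p++;

        char local;                 /* stand-in for the current stack pointer */
        ptrdiff_t span = (char *)p - &local;

        printf("argc=%d, approx. stack span: %td bytes\n", argc, span);
        (void)argv;
        return 0;
    }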

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Wed Jan 22 14:15:56 2025
    Stephen Fuld wrote:
    On 1/16/2025 11:12 AM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 1/16/2025 1:11 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    CHERI targets C, which on the one hand, I understand (there's a
    ton of C code out there), but trying to retrofit a safe memory
    model onto C seems a bit awkward - it might have been better to
    target a language which has arrays in the first place, unlike C.
    [...]

    C does have arrays.

    Sort of - they decay into pointers at first sight.

    But what I should have written was "multi-dimensional arrays",
    with a reasonable way of handling them.

    Rust provides an interesting data point here:

    It has Vec<> which is always implemented as a dope vector, i.e. a
    header which contains the starting point and current length, along
    with allocated size. For multidimensional work, the natural mapping
    is Vec<Vec<>>, i.e. similar to classic C arrays of arrays, but with
    boundary checking.

    However, in my own testing I have found that it is often faster to
    flatten those multi-dim vectors, and instead use explicit
    multiplication to get the actual position:

        array[y][x] -> array[y*width + x]

    Terje

    I am obviously missing something, but why doesn't the compiler
    generate code for that itself?


    Because Rust really doesn't have multi-dim vectors, instead using
    vectors of pointers to vectors?

    OTOH, it is perfectly OK to create your own multi-dim data structure,
    and using macros you can probably get the compiler to generate near-
    optimal code as well, but afaik, nothing like that is part of the core
    language.

    That surprised me.  So I did a search for "Rust Multi dimensional
    arrays", and got several hits.  It seems there are various ways to do
    this depending upon whether you want an array of arrays or a
    "traditional" multi-dimensional array. There is a crate for the latter.

    I don't know enough Rust to get all the details in the various search results, but it seems there are options.

    Notice what I wrote above, Rust allows for compile-time code generation
    in the form of macros which are in some ways even more powerful than
    C++ templates, so I'm not surprised to learn that there already exist
    public crate(s) to handle this. :-)

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jan 22 14:58:04 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 1:30:47 +0000, George Neuner wrote:

    On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".

    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle also big allocations.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    Not standard compliant for sure, but you certainly can approximate
    stack use in C: just store (as byte*) the address of a local in your
    top level function, and check the (absolute value of) the difference
    to the address of a local in the current function.

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jan 22 17:45:47 2025
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 1:30:47 +0000, George Neuner wrote:

    On Fri, 17 Jan 2025 16:42:17 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    David Brown <david.brown@hesbynett.no> schrieb:
    On 16/01/2025 17:46, Waldek Hebisch wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    On 16/01/2025 13:35, Michael S wrote:
    On Thu, 16 Jan 2025 12:36:45 +0100
    David Brown <david.brown@hesbynett.no> wrote:

    On 15/01/2025 21:59, Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Wed, 15 Jan 2025 18:00:34 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    As you can guess, in kernel drivers VLA are unwelcome.

    I can imagine that they are - but I really don't understand why. I've
    never understood why people think there is something "dangerous" about
    VLAs, or why they think using heap allocations is somehow "safer".
    VLA normally allocate on the stack. Which at first glance look
    great. But once one realize how small are stacks in modern
    systems (compared to whole memory), this no longer looks good.
    Basically, to use VLA one needs rather small bound on maximal
    size of array.

    Sure.

    Given such bound always allocating maximal
    size is simpler. Without _small_ bound on size heap is
    safer, as it is designed to handle also big allocations.

    You don't allocate anything in a VLA without knowing the bounds and
    being sure it is appropriate to put on the stack.

    In general, that is a hard thing to know - there is no standard
    way to enquire the size of the stack, how much you have already
    used, how deep you are going to recurse, or how much stack
    a function will use.

    Not standard compliant for sure, but you certainly can approximate
    stack use in C: just store (as byte*) the address of a local in your
    top level function, and check the (absolute value of) the difference
    to the address of a local in the current function.

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Wed Jan 22 18:44:14 2025
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    Notice what I wrote above, Rust allows for compile-time code generation
    in the form of macros which are in some ways even more powerful than
    C++ templates, so I'm not surprised to learn that there already exist
    public crate(s) to handle this. :-)

    That sounds scary; C++ templates are already Turing-complete...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jan 22 20:00:30 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.
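
    On Linux with glibc there is also a more direct route, sketched below:
    pthread_getattr_np() (a GNU extension, not POSIX) reports the current
    thread's stack region, including the main thread's:

    /* Build with -pthread.  Illustrative only. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <pthread.h>

    int main(void)
    {
        pthread_attr_t attr;
        void *stack_addr;
        size_t stack_size;

        pthread_getattr_np(pthread_self(), &attr);
        pthread_attr_getstack(&attr, &stack_addr, &stack_size);
        pthread_attr_destroy(&attr);

        printf("stack: %p .. %p (%zu bytes)\n",
               stack_addr, (char *)stack_addr + stack_size, stack_size);
        return 0;
    }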

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jan 22 22:25:33 2025
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jan 22 22:44:45 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.
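
    A POSIX-only sketch of that probing approach, assuming a
    downward-growing stack and recovery via sigsetjmp()/siglongjmp();
    illustrative, not portable:

    #include <stdio.h>
    #include <signal.h>
    #include <setjmp.h>
    #include <unistd.h>
    #include <stdint.h>

    static sigjmp_buf jb;

    static void on_fault(int sig)
    {
        (void)sig;
        siglongjmp(jb, 1);          /* unwind out of the faulting access */
    }

    int main(void)
    {
        struct sigaction sa;
        sa.sa_handler = on_fault;
        sa.sa_flags = 0;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGBUS, &sa, NULL);       /* some systems fault with SIGBUS */

        long pagesz = sysconf(_SC_PAGESIZE);
        char local;
        /* first page boundary above a current local (downward-growing stack) */
        volatile char *probe =
            (volatile char *)(((uintptr_t)&local + (uintptr_t)pagesz)
                              & ~((uintptr_t)pagesz - 1));

        if (sigsetjmp(jb, 1) == 0) {
            for (;;) {
                (void)*probe;               /* read; eventually hits the guard page */
                probe += pagesz;
            }
        }
        printf("approx. stack base just below %p\n", (void *)probe);
        return 0;
    }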

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Thu Jan 23 01:39:29 2025
    On Wed, 22 Jan 2025 22:44:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and
    subtract SP from it.

    I would discourage programmers from relying on that for any
    reason whatsoever. The aux vectors are pushed before the envp
    entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.


    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.
    Of those that do support signals, not every one supports catching
    SIGSEGV.
    Of those that do support catching SIGSEGV, not every one can recover
    after that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Thu Jan 23 01:00:49 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 22 Jan 2025 22:44:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and
    subtract SP from it.

    I would discourage programmers from relying on that for any
    reason whatsoever. The aux vectors are pushed before the envp
    entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.


    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    Linux and MAC can't touch windows in terms of volume - but I'd
    argue that in the universe of programmers, they're close to
    if not equal to windows. The vast vast majority of windows
    users don't have a compiler. Those that do are working at
    a higher level where the knowledge of the stack base address
    would not be a useful value to know.

    Unix (bsd/sysv) and linux support the ucontext argument
    on the signal handler which provides processor state so
    the signal handler can recover from the fault in whatever
    fashion makes sense then transfer control to a known
    starting point (either siglongjmp or by manipulating the
    return context provided to the signal handler). This is
    clearly going to be processor and implementation specific.

    Yes, Windows is an aberration. I offered a solution, not
    "the" solution. I haven't seen any valid reason for a program[*]
    to need to know the base address of the process stack; if there
    were a need, the implementation would provide. I believe windows
    does have a functional equivalent to SIGSEGV, no? A quick search
    shows "EXCEPTION_ACCESS_VIOLATION" for windows.

    [*] Leaving aside the rare system utility or diagnostic
    utility or library (e.g. valgrind, et alia may find
    that a useful datum).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Thu Jan 23 08:14:52 2025
    Michael S <already5chosen@yahoo.com> writes:
    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    "man raise" tells me that raise() is C99. "man signal" tells me that
    signal() is C99.


    Of those that do support signals, not every one supports catching
    SIGSEGV.

    "man 7 signal" tells me that SIGSEGV is P1990, i.e., 'the original
    POSIX.1-1990 standard'. I.e., there were even some Windows systems
    that support it.

    Of those that do support catching SIGSEGV, not every one can recover
    after that.

    Gforth catches and recovers from SIGSEGV in order to return to
    Gforth's command line rather than terminating the process; in
    snapshots from recent years that's also used for determining whether
    some number is probably an address (try to read from that address; if
    there's a signal, it's not an address). I tried building Gforth on a
    number of Unix systems, and even the most rudimentary ones (e.g.,
    Ultrix), supported catching SIGSEGV. There is a port to Windows with
    Cygwin done by Bernd Paysan. I don't know if that could catch
    SIGSEGV, but I am sure that it's possible in Windows in some way, even
    if that way is not available through Cygwin.
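
    A minimal sketch of that probe pattern (not Gforth's actual code, just
    the shape of it): install a SIGSEGV handler that siglongjmp()s, attempt
    the read, and report whether it faulted:

    #include <stdio.h>
    #include <signal.h>
    #include <setjmp.h>

    static sigjmp_buf probe_jb;

    static void probe_handler(int sig)
    {
        (void)sig;
        siglongjmp(probe_jb, 1);
    }

    /* Returns 1 if reading one byte from p did not fault, 0 otherwise. */
    static int looks_like_address(const void *p)
    {
        struct sigaction sa, old;
        int ok;

        sa.sa_handler = probe_handler;
        sa.sa_flags = 0;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, &old);

        if (sigsetjmp(probe_jb, 1) == 0) {
            volatile const char *c = p;
            (void)*c;                        /* may fault */
            ok = 1;
        } else {
            ok = 0;                          /* faulted: probably not a valid address */
        }
        sigaction(SIGSEGV, &old, NULL);      /* restore the previous handler */
        return ok;
    }

    int main(void)
    {
        int x = 42;
        printf("&x:           %d\n", looks_like_address(&x));
        printf("(void *)4095: %d\n", looks_like_address((void *)4095));
        return 0;
    }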

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Thu Jan 23 11:52:32 2025
    On Thu, 23 Jan 2025 01:00:49 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 22 Jan 2025 22:44:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and
    subtract SP from it.

    I would discourage programmers from relying on that for any
    reason whatsoever. The aux vectors are pushed before the
    envp entries.

    This brings into question what is "on" the stack ?? to be
    included in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??


    It's not something that a programmer generally would need, or want
    to do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.


    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    Linux and MAC can't touch windows in terms of volume - but I'd
    argue that in the universe of programmers, they're close to
    if not equal to windows. The vast vast majority of windows
    users don't have a compiler. Those that do are working at
    a higher level where the knowledge of the stack base address
    would not be a useful value to know.


    I did not have "big" computers in mind. In fact, if we only look at
    "big" things then Android dwarfs anything else. And while Android is not
    POSIX compliant, it is probably similar enough for your method to work.

    I had in mind smaller things.
    All but one of very many embedded environments that I touched in
    last 3 decades had no signals. The exceptional one was running
    Linux.

    Unix (bsd/sysv) and linux support the ucontext argument
    on the signal handler which provides processor state so
    the signal handler can recover from the fault in whatever
    fashion makes sense then transfer control to a known
    starting point (either siglongjmp or by manipulating the
    return context provided to the signal handler). This is
    clearly going to be processor and implementation specific.

    Yes, Windows is an aberration. I offered a solution, not
    "the" solution. I haven't seen any valid reason for a program[*]
    to need to know the base address of the process stack; if there
    were a need, the implementation would provide. I believe windows
    does have a functional equivalent to SIGSEGV, no? A quick search
    shows "EXCEPTION_ACCESS_VIOLATION" for windows.


    But then one would have to use SEH which is not the same as signals.
    Although a specific case of SIGSEGV is the one where the SEH and
    signals happen to be rather similar.
    I can try it one day, but not today.

    [*] Leaving aside the rare system utility or diagnostic
    utility or library (e.g. valgrind, et alia may find
    that a useful datum).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Anton Ertl on Thu Jan 23 12:23:37 2025
    On Thu, 23 Jan 2025 08:14:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    "man raise" tells me that raise() is C99. "man signal" tells me that signal() is C99.


    I would guess that it belongs to the part of the standard that defines requirements for hosted implementation. My use of C for "real work"
    (as opposed to hobby) is almost exclusively in freestanding
    implementations.

    Even for hosted implementations, the signal handler is only guaranteed
    to be invoked when the signal is raised by raise(). It is not our case.


    Of those that do support signals, not every one supports catching
    SIGSEGV.

    "man 7 signal" tells me that SIGSEGV is P1990, i.e., 'the original POSIX.1-1990 standard'. I.e., there were even some Windows systems
    that support it.

    Of those that do support catching SIGSEGV, not every one can recover
    after that.

    Gforth catches and recovers from SIGSEGV in order to return to
    Gforth's command line rather than terminating the process; in
    snapshots from recent years that's also used for determining whether
    some number is probably an address (try to read from that address; if
    there's a signal, it's not an address). I tried building Gforth on a
    number of Unix systems, and even the most rudimentary ones (e.g.,
    Ultrix), supported catching SIGSEGV.

    From cppreference: https://en.cppreference.com/w/c/program/signal
    "If the user defined function returns when handling SIGFPE, SIGILL or
    SIGSEGV, the behavior is undefined."

    There is a port to Windows with
    Cygwin done by Bernd Paysan. I don't know if that could catch
    SIGSEGV, but I am sure that it's possible in Windows in some way, even
    if that way is not available through Cygwin.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Thu Jan 23 12:39:14 2025
    Michael S <already5chosen@yahoo.com> writes:
    From cppreference: https://en.cppreference.com/w/c/program/signal
    "If the user defined function returns when handling SIGFPE, SIGILL or
    SIGSEGV, the behavior is undefined."

    As is almost everything else occurring in production code. So such
    references are not particularly relevant for production code; what
    actual (in this case) operating system kernels and libraries do is
    relevant.

    And my experience from three decades across a wide variety of Unix
    systems on a wide variety of hardware is that what we do in our
    SIGSEGV handler works. But our signal handlers don't return, they
    longjmp() (in the cases that do not terminate the process).

    AFAIK returning would usually try to reexecute the segfaulting
    instruction, which would be the right thing if we had eliminated the
    cause for the SIGSEGV in the signal handler, but we don't do that in
    Gforth. Continuing with the next instruction would not be very useful
    for us, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Thu Jan 23 14:04:24 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    From cppreference: https://en.cppreference.com/w/c/program/signal
    "If the user defined function returns when handling SIGFPE, SIGILL or
    SIGSEGV, the behavior is undefined."

    As is almost everything else occurring in production code. So such
    references are not particularly relevant for production code; what
    actual (in this case) operating system kernels and libraries do is
    relevant.

    Indeed. And 'behavior is undefined' applies to the C specification;
    an implementation may certainly "define" that behavior and
    programmers using that implementation may rely on that definition.


    And my experience from three decades across a wide variety of Unix
    systems on a wide variety of hardware is that what we do in our
    SIGSEGV handler works. But our signal handlers don't return, they
    longjmp() (in the cases that do not terminate the process).

    Or siglongjmp(). Or using implementation-defined (or POSIX defined) capabilities (e.g. manipulating the process/thread context supplied to POSIX signal handlers).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Michael S on Thu Jan 23 14:31:22 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 23 Jan 2025 08:14:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    "man raise" tells me that raise() is C99. "man signal" tells me that
    signal() is C99.

    I would guess that it belongs to the part of the standard that defines
    requirements for hosted implementation. My use of C for "real work"
    (as opposed to hobby) is almost exclusively in freestanding
    implementations.

    In free-standing implementations, you must set the stack
    pointer yourself[*], so you implicitly know the stack start
    and stack bounds. You don't need to use the SIGSEGV
    technique that was described for hosted programs
    to find the base address of the stack (if there is no
    implementation-defined API that will provide the data).

    [*] As well as providing all the other needed state that a hosted
    implementation might provide.

    I've written a fair amount of non-hosted code myself (hypervisors,
    Operating Systems, standalone diagnostics) - the programmer needs
    to initialize the machine state (often in assembler) then establish
    the state required by the code (stack, initial register state,
    establishing protected mode, paging and long-mode on x86/x86_64
    systems, etc). None of this is using facilities defined by the
    C standard. Both hypervisors (at SGI and 3Leaf) were actually
    written in C++ - our platform initialization code also needed
    to ensure that static constructors were invoked prior to invoking
    the C++ code amongst other initializations (such as clearing
    the BSS region).

    Similar work needs to be done (either by you, or by the framework
    provided by the toolset provider such as greenhills et alia).


    Even for hosted implementations, the signal handler is only guaranteed
    to be invoked when the signal is raised by raise(). It is not our case.

    POSIX hosted implementations have guarantees beyond those provided by
    the C standard, including related to signal delivery and handling,
    and any implementation can provide guarantees beyond those described
    in the C standard.

    Standalone code is almost by definition non-portable.

    e.g.
    SS_DATA=0x18

            .text
            .global dvmmstart
    dvmmstart:
            #
            # Get processor into known state.
            #
            cld
            cli
            movl    %eax, %esi
            movl    $SS_DATA, %eax
            movl    %eax, %ds
            movl    %eax, %es
            movl    %eax, %fs
            movl    %eax, %gs

            lss     stack_segdesc, %esp

            #
            # Clear BSS
            #
            xorl    %eax, %eax
            movl    $_edata, %edi
            movl    $_end, %ecx
            subl    %edi, %ecx
            rep     stosb

            #
            # Invoke C++ code. Pass begin and end address of memory map.
            #
            movl    $512, %eax          # Starting with 512 bytes
            subl    %esi, %eax          # Subtract remaining
            addl    $0x90000, %eax      # e820 data map address
            pushl   %eax                # Push arg to main
            pushl   $0x90000            # Push arg to main
            call    dvmm_main           # Invoke main

            #
            # Should never return.
            #
            hlt
    stack:
            .space  4096, 0
    stacktop:

    stack_segdesc:
            .long   stacktop
            .word   SS_DATA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Michael S on Thu Jan 23 17:41:16 2025
    On Thu, 23 Jan 2025 11:52:32 +0200
    Michael S <already5chosen@yahoo.com> wrote:

    On Thu, 23 Jan 2025 01:00:49 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 22 Jan 2025 22:44:45 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and
    subtract SP from it.

    I would discourage programmers from relying on that for any
    reason whatsoever. The aux vectors are pushed before the
    envp entries.

    This brings into question what is "on" the stack ?? to be
    included in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??


    It's not something that a programmer generally would need, or
    want to do.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.


    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    Linux and MAC can't touch windows in terms of volume - but I'd
    argue that in the universe of programmers, they're close to
    if not equal to windows. The vast vast majority of windows
    users don't have a compiler. Those that do are working at
    a higher level where the knowledge of the stack base address
    would not be a useful value to know.


    I did not have "big" computers in mind. In fact, if we only look at
    "big" things then Android dwarfs anything else. And while Android is
    not POSIX complaint, it is probably similar enough for your method to
    work.

    I had in mind smaller things.
    All but one of very many embedded environments that I touched in
    last 3 decades had no signals. The exceptional one was running
    Linux.

    Unix (bsd/sysv) and linux support the ucontext argument
    on the signal handler which provides processor state so
    the signal handler can recover from the fault in whatever
    fashion makes sense then transfer control to a known
    starting point (either siglongjmp or by manipulating the
    return context provided to the signal handler). This is
    clearly going to be processor and implementation specific.

    Yes, Windows is an aberration. I offered a solution, not
    "the" solution. I haven't seen any valid reason for a program[*]
    to need to know the base address of the process stack; if there
    were a need, the implementation would provide. I believe windows
    does have a functional equivalent to SIGSEGV, no? A quick search
    shows "EXCEPTION_ACCESS_VIOLATION" for windows.


    But then one would have to use SEH which is not the same as signals.
    Although a specific case of SIGSEGV is the one where the SEH and
    signals happen to be rather similar.
    I can try it one day, but not today.

    [*] Leaving aside the rare system utility or diagnostic
    utility or library (e.g. valgrind, et alia may find
    that a useful datum).



    In the end, I could not resist and did it today, wasting an hour
    and a half during which I was supposed to do real work.
    With Microsoft's language extensions it was trivial.
    But I don't know how to do it (on Windows) with gcc.

    Code:

    #include <stdio.h>

    /* Recurse until the stack guard page is hit, recording the address
       of a local variable at each depth reached. */
    static void test(char** res, int depth) {
        *res = (char*)&res;
        if (depth > 0)
            test(res, depth-1);
    }

    int main()
    {
        char* res = (char*)&res;
        __try { test(&res, 1000000); }
        __except(1) { // 1==EXCEPTION_EXECUTE_HANDLER
            printf("SEH __except block\n");
        }
        printf("%p - %p = %zd\n", &res, res, (char*)&res - res);
        return 0;
    }


    It prints:
    SEH __except block
    000000000029F990 - 00000000001A4020 = 1030512
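
    If I recall correctly, there is also a compiler-agnostic route that
    avoids SEH entirely: Windows 8 and later expose
    GetCurrentThreadStackLimits(), which reports the reserved stack range
    of the current thread. A hedged sketch:

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        ULONG_PTR lo, hi;
        char local;

        /* Reports the reserved stack range of the current thread (Win8+). */
        GetCurrentThreadStackLimits(&lo, &hi);

        printf("stack range %p .. %p (%llu bytes reserved)\n",
               (void *)lo, (void *)hi, (unsigned long long)(hi - lo));
        printf("current frame is %llu bytes below the top\n",
               (unsigned long long)(hi - (ULONG_PTR)&local));
        return 0;
    }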

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Thu Jan 23 11:50:22 2025
    On Wed, 22 Jan 2025 22:44:45 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 20:00:30 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 22 Jan 2025 14:58:04 +0000, Scott Lurndal wrote:
    (MitchAlsup1)

    On a Linux machine, you can find the last envp[*] entry and subtract
    SP from it.

    I would discourage programmers from relying on that for any reason
    whatsoever. The aux vectors are pushed before the envp entries.

    This brings into question what is "on" the stack ?? to be included
    in the measurement of stack size.

    Only user data ??
    Data that is present when control arrives ??
    Could <equivalent> CRT0 store SP at arrival ??

    I think we have an ill-defined measurement !!

    Everything between the base address of the stack
    and the limit address of the stack. The kernel exec(2)
    family system calls will allocate the initial
    stack region (with guard pages to handle extension)
    and populate it with the AUX, ENVP and ARG vectors
    before invoking the CRT in usermode.

    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    https://plover.com/~mjd/misc/hbaker-archive/CheneyMTA.html

    or any problem requiring potentially unbounded recursion.

    However, if the OS they're using has a guard page to prevent
    stack underflow, one could write a subroutine which accesses
    page-aligned addresses towards the beginning of the stack
    region (anti the direction of growth) until a
    SIGSEGV is delivered.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to George Neuner on Thu Jan 23 17:18:32 2025
    George Neuner <gneuner2@comcast.net> writes:
    On Wed, 22 Jan 2025 22:44:45 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:


    So, how does one find the base (highest address on the stack) ??
    in a way that works on every system capable of running C-code ??

    It's not something that a programmer generally would need, or want to
    do.

    Note the word "generally".


    https://plover.com/~mjd/misc/hbaker-archive/CheneyMTA.html

    or any problem requiring potentially unbounded recursion.

    For which the standard unix resource limits are usually
    sufficient.
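
    For reference, a minimal POSIX sketch of checking that limit with
    getrlimit(RLIMIT_STACK):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_STACK, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        /* rlim_cur is the soft limit; RLIM_INFINITY means "unlimited". */
        printf("stack soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
        return 0;
    }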

    Henry's 'scheme' is not typical of the vast majority of
    programs. Even Henry notes that the macro for his
    scheme (pun intended) is machine dependent.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Michael S on Thu Jan 23 14:22:24 2025
    Michael S wrote:

    In the end, I could not resist and did it today, wasting an hour
    and a half during which I was supposed to do real work.
    With Microsoft's language extensions it was trivial.
    But I don't know how to do it (on Windows) with gcc.

    Code:

    static void test(char** res, int depth) {
        *res = (char*)&res;
        if (depth > 0)
            test(res, depth-1);
    }

    int main()
    {
        char* res = (char*)&res;
        __try { test(&res, 1000000); }
        __except(1) { // 1==EXCEPTION_EXECUTE_HANDLER
            printf("SEH __except block\n");
        }
        printf("%p - %p = %zd\n", &res, res, (char*)&res - res);
        return 0;
    }


    It prints:
    SEH __except block
    000000000029F990 - 00000000001A4020 = 1030512

    To get the top of stack, just after main() is called you might be able to
    use a varargs routine to read the stack pointer. Round this up to the top
    of a 4KB page and that is likely the top of stack or near it.
    This could be stashed in a TLS variable for later.

    Something like...

    #include <stdio.h>
    #include <stdarg.h>
    #include <inttypes.h>
    #include <threads.h>

    thread_local uintptr_t stkTopPtr;

    /* The va_list set up by va_start refers to storage in or just above
       the current frame, so its numeric value approximates the stack
       pointer at the call site. */
    static uintptr_t GetStackPtr (int junk, ...)
    {
        va_list argptr;
        va_start(argptr, junk);
        uintptr_t sp = (uintptr_t) argptr;
        va_end(argptr);
        return sp;
    }

    int main (int argc, char *argv[])
    {
        uintptr_t stkPtr;

        stkPtr = GetStackPtr (1);
        stkTopPtr = stkPtr | 0xFFF;   /* round up to the last byte of the 4KB page */
        printf ("Stack top: 0x%08" PRIXPTR "\n", stkTopPtr);
        return 0;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Mon Jan 27 17:18:16 2025
    Michael S <already5chosen@yahoo.com> writes:

    On Thu, 23 Jan 2025 08:14:52 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:

    Not every system capable of running C supports signals. I would
    think that those that support signals are not even majority.

    "man raise" tells me that raise() is C99. "man signal" tells me
    that signal() is C99.

    I would guess that it belongs to the part of the standard that
    defines requirements for hosted implementation. [...]

    Right. Almost all of the standard library is not required
    for freestanding implementations, and <signal.h> is not
    among the required set.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)