• Calling conventions (particularly 32-bit ARM)

    From David Brown@21:1/5 to All on Mon Jan 6 14:57:51 2025
    I'm trying to understand the reasoning behind some of the calling
    conventions used with 32-bit ARM. I work primarily with small embedded systems, so the efficiency of code on 32-bit Cortex-M devices is very
    important to me - good calling conventions make a big difference.

    No doubt most people here know this already, but in summary these
    devices are a 32-bit load/store RISC architecture with 16 registers.
    R0-R3 and R12 are scratch/volatile registers, R4-R11 are preserved
    registers, R13 is the stack pointer, R14 is the link register and R15 is
    the program counter. For most Cortex-M cores, there is no
    super-scaling, out-of-order execution, speculative execution, etc., but instructions are pipelined.

    The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as
    32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.
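
    As a concrete sketch of the asymmetry (following the rules described
    above - the 64-bit scalar comes back in R0:R1, while the equivalent
    two-word struct does not):

        #include <stdint.h>

        typedef struct { uint32_t lo; uint32_t hi; } pair32;

        /* Scalar 64-bit result: returned in R0:R1. */
        uint64_t as_u64(uint32_t lo, uint32_t hi)
        {
            return ((uint64_t)hi << 32) | lo;
        }

        /* The same two words as a struct: a composite larger than 32 bits,
           so the caller allocates a slot and passes its address in R0. */
        pair32 as_pair(uint32_t lo, uint32_t hi)
        {
            return (pair32){ lo, hi };
        }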

    Is there any good reason why the ABI is designed with such limited
    register usage for returns? Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a
    /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument passing?

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than
    using C-style error codes or passing manual pointers to return value
    slots. But the limited return registers add significant overhead to
    small functions.
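
    In C terms, the pattern amounts to a small status-plus-value struct
    like the sketch below; under the convention above, this 8-byte result
    has to come back through caller-allocated memory rather than R0:R1.

        #include <stdbool.h>
        #include <stdint.h>

        /* A C analogue of the optional/expected pattern: a flag plus a
           value, returned together as one small struct. */
        typedef struct {
            bool     ok;
            uint32_t value;
        } maybe_u32;

        maybe_u32 parse_digit(char c)
        {
            if (c >= '0' && c <= '9')
                return (maybe_u32){ true, (uint32_t)(c - '0') };
            return (maybe_u32){ false, 0 };
        }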


    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?


    Thanks for any pointers or explanations here.

  • From Theo@21:1/5 to David Brown on Mon Jan 6 15:23:40 2025
    David Brown <david.brown@hesbynett.no> wrote:
    The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as 32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    According to EABI, it's also possible to return a 128-bit vector in R0-3: https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst#result-return

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns? Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument passing?

    The 'composite type' return value, where a pointer is passed in as the first argument to the function and a struct at that pointer is filled in with the return values, has existed since the first ARM ABI - APCS-R: http://www.riscos.com/support/developers/dde/appf.html

    That dates from the mid 1980s before 'modern compilers', and I'm guessing
    that has stuck around. A lot of early ARM code was in assembler. The
    original ARMCC was good but fairly basic - GCC didn't support ARM until
    about 1993.

    [*] technically APCS-R was the second ARM ABI, APCS-A was the first: https://heyrick.eu/assembler/apcsintro.html
    but I don't think return value handling was any different.

    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?

    Probably the latter. Also that AArch64 was an opportunity to throw all this stuff away and start again, with a much richer calling convention: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#result-return

    but obviously that's no help to the microcontroller folks. At this stage, a change of calling convention might be a fairly big ask.

    Theo

  • From Anton Ertl@21:1/5 to David Brown on Mon Jan 6 15:32:04 2025
    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Most calling conventions on RISCs are oriented towards C (if you want
    calling conventions that try to be more cross-language (and slower),
    look at VAX) and its properties and limitations at the time when the
    calling convention was designed, in particular, the PCC
    implementation, which was the de-facto standard Unix C compiler at the
    time. C compilers in the 1980s did not allocate structs to registers,
    so passing structs in registers was foreign to them, so the solution
    is that the caller passes the target struct as an additional
    parameter.

    And passing the return value in registers might not have saved
    anything on a compiler that does not deal with structs in registers.
    E.g., if you have

    mystruct = myfunc(arg1, arg2);

    you would see stores to mystruct behind the call. With the PCC
    calling convention, the same stores would happen in the callee
    (possibly resulting in smaller code if there are several calls to
    myfunc()).
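
    Roughly, the lowering being described looks like this (a sketch with
    invented names; the hidden result pointer becomes an extra first
    parameter and the result stores end up in the callee):

        struct S { long a, b, c; };

        /* mystruct = myfunc(arg1, arg2); is lowered roughly to a call of: */
        void myfunc_lowered(struct S *result, long arg1, long arg2)
        {
            /* the stores go through the result pointer, in the callee */
            result->a = arg1;
            result->b = arg2;
            result->c = arg1 + arg2;
        }

        void caller(long arg1, long arg2)
        {
            struct S mystruct;
            myfunc_lowered(&mystruct, arg1, arg2);
            (void)mystruct;
        }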

    I wonder, though, how things look for

    mystruct = foo(&mystruct);

    Does PCC perform the return stores to mystruct only after performing
    all other memory accesses in foo? Probably yes, anything else would
    complicate the compiler. In that case the caller could pass &mystruct
    for the return value (a slight complication). But is that restriction reflected in the calling convention?
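
    The restriction in question can be made concrete with a sketch like
    this (invented field names):

        struct S2 { long a, b; };

        struct S2 foo(const struct S2 *p)
        {
            struct S2 r;
            r.a = p->b;     /* reads the argument first ...        */
            r.b = p->a;
            return r;       /* ... and stores the result only here */
        }

        /* For  mystruct = foo(&mystruct);  the caller may only pass
           &mystruct as the hidden result pointer if foo is guaranteed to
           finish reading *p before it writes the result; otherwise the
           argument would be clobbered mid-computation. */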

    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    gcc has an option -freg-struct-return, which does what you want. Of
    course, if you use this option on ARM A32/T32, you are not following
    the calling convention, so you should only use it when all sides of a
    struct return are compiled with that option.

    Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
    of memory.

    ARM A32 is from 1985, and its calling convention is probably not much
    younger.

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions. IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers. Not sure why they were so
    reluctant to use more registers earlier.

    In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than using C-style error codes or passing manual pointers to return value
    slots.

    The ARM calling convention is certainly much older than "modern C++ programming".

    But the limited return registers add significant overhead to
    small functions.

    C++ programmers think they know what C programming is about (and
    unfortunately they dominate not just C++ compiler writers, but they
    also damage C compilers while they are at it), so my sympathy for your
    problem is very limited.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to David Brown on Mon Jan 6 20:10:13 2025
    On Mon, 6 Jan 2025 13:57:51 +0000, David Brown wrote:

    I'm trying to understand the reasoning behind some of the calling
    conventions used with 32-bit ARM. I work primarily with small embedded systems, so the efficiency of code on 32-bit Cortex-M devices is very important to me - good calling conventions make a big difference.

    No doubt most people here know this already, but in summary these
    devices are a 32-bit load/store RISC architecture with 16 registers.
    R0-R3 and R12 are scratch/volatile registers, R4-R11 are preserved
    registers, R13 is the stack pointer, R14 is the link register and R15 is
    the program counter. For most Cortex-M cores, there is no
    super-scaling,
    SuperScalar
    out-of-order execution, speculative execution, etc., but instructions are pipelined.

    The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as 32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    Someone above mentioned a trick to pass back a 128-bit value.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    I have seen subroutines that returned structures where the point
    in the subroutine that puts values in the returned structure is
    such that putting the structure in registers is less efficient
    than returning the struct in memory--it all depends on how
    the struct is laid out in memory. Doing the struct field
    assignments in the middle of the subroutine (long path to return)
    is often enough to sway which is more efficient.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Vogue in 1980 was to have 1 result passed back from subroutines.

    Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values.

    My 66000 can pass up to 8 registers back as an aggregate result.

    Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument
    passing?

    In My 66000 ABI they can and do.

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, an FP, and an SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered; I just need a good reason to change them.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than using C-style error codes or passing manual pointers to return value
    slots. But the limited return registers add significant overhead to
    small functions.

    C++ also has:
    the try-throw-catch exception model, which requires new-and-fun stuff
    to be thrown onto the stack
    constructors and destructors
    new
    atomic stuff

    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?

    At the time, there were good technical rationales--which may have
    faded in importance as the years have gone by.


    Thanks for any pointers or explanations here.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jan 6 20:19:15 2025
    On Mon, 6 Jan 2025 15:32:04 +0000, Anton Ertl wrote:

    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Most calling conventions on RISCs are oriented towards C (if you want
    calling conventions that try to be more cross-language (and slower),
    look at VAX) and its properties and limitations at the time when the
    calling convention was designed, in particular, the PCC
    implementation, which was the de-facto standard Unix C compiler at the
    time. C compilers in the 1980s did not allocate structs to registers,
    so passing structs in registers was foreign to them, so the solution
    is that the caller passes the target struct as an additional
    parameter.

    And passing the return value in registers might not have saved
    anything on a compiler that does not deal with structs in registers.
    E.g., if you have

    mystruct = myfunc(arg1, arg2);

    you would see stores to mystruct behind the call. With the PCC
    calling convention, the same stores would happen in the callee
    (possibly resulting in smaller code if there are several calls to
    myfunc()).

    I wonder, though, how things look for

    mystruct = foo(&mystruct);

    Does PCC perform the return stores to mystruct only after performing
    all other memory accesses in foo? Probably yes, anything else would complicate the compiler. In that case the caller could pass &mystruct
    for the return value (a slight complication). But is that restriction reflected in the calling convention?

    For VERY MANY circumstances passing a struct by address is more
    efficient than passing it by value, AND especially when the
    compiler does not optimize heavily.

    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    In addition, the programmer has the choice of changing into pointer
    form (&struct) from value form (struct) which is what we learned
    was better style way back then.

    --------------------------

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions. IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers. Not sure why they were so
    reluctant to use more registers earlier.

    Compiler people were telling us that more callee saved registers would
    be higher performing than more argument registers. It did not turn out
    to be that way.

    Oh and BTW, lack of argument registers leads to an increased
    desire for the linker to perform inline folding. ...



    - anton

  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Tue Jan 7 02:11:45 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there were more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi-randomly
    into Lapack. The first routine that I looked at had 8 arguments,
    so it is within your limit. The second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    A large number of arguments is typical in old-style Fortran numeric
    code. It also appears in functional-style code, where to get
    around the lack of destructive modification one frequently has to
    double the number of arguments. Another source is closures: when
    looking at the source, captured values are not visible as arguments,
    but the implementation has to pass them behind the scenes.

    More generally, large numbers of arguments tend to appear in
    hand-optimized code, where they may lead to faster code than
    using structures in memory. In C, structures in memory are
    not that expensive, so the scope for gain is limited, but several
    languages dynamically allocate all structures (and pass them
    via address). In such cases avoiding dynamic allocation can
    give a substantial gain. Programmers now are much less
    inclined to do micro-optimizations of this sort, but they may
    appear in machine-generated sources.

    --
    Waldek Hebisch

  • From Lawrence D'Oliveiro@21:1/5 to Waldek Hebisch on Tue Jan 7 06:53:44 2025
    On Tue, 7 Jan 2025 02:11:45 -0000 (UTC), Waldek Hebisch wrote:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    That kind of thing just cries out for passing arguments by keyword.

  • From David Brown@21:1/5 to Anton Ertl on Tue Jan 7 09:49:16 2025
    On 06/01/2025 16:32, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Most calling conventions on RISCs are oriented towards C (if you want
    calling conventions that try to be more cross-language (and slower),
    look at VAX) and its properties and limitations at the time when the
    calling convention was designed, in particular, the PCC
    implementation, which was the de-facto standard Unix C compiler at the
    time. C compilers in the 1980s did not allocate structs to registers,
    so passing structs in registers was foreign to them, so the solution
    is that the caller passes the target struct as an additional
    parameter.

    And passing the return value in registers might not have saved
    anything on a compiler that does not deal with structs in registers.

    Agreed.

    This is all as I suspected - but it's nice to have it confirmed by others.

    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.


    I use struct returns sometimes in my C code, but they are (naturally
    enough) a far smaller proportion of return types than in C++ code.

    gcc has an option -freg-struct-return, which does what you want. Of
    course, if you use this option on ARM A32/T32, you are not following
    the calling convention, so you should only use it when all sides of a
    struct return are compiled with that option.


    I know about the -freg-struct-return option (and the requirements for
    using it), but it only has an effect for 32-bit x86 as far as I know. It certainly makes no difference for 32-bit ARM/Thumb. (clang specifically
    says it does not support that option for 32-bit ARM/Thumb.) I think
    part of this is that the calling convention already returns structs in registers - just as long as the struct fits in the single 32-bit register.
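
    For example (a sketch, following the rule just described):

        #include <stdint.h>

        struct s4 { uint16_t lo, hi; };  /* 4 bytes: fits in R0, returned in a register */
        struct s8 { uint32_t lo, hi; };  /* 8 bytes: returned via caller-allocated memory */

        struct s4 make_s4(void) { return (struct s4){ 1, 2 }; }
        struct s8 make_s8(void) { return (struct s8){ 1, 2 }; }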

    Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a
    /long/ time since they insisted that structs occupied a contiguous block
    of memory.

    ARM A32 is from 1985, and its calling convention is probably not much younger.


    I first used ARM assembly in the late 1980s, but that was mixed BBC
    BASIC and assembly, all with almost no documentation, so I don't know
    what calling conventions there were at that time. (But the Acorn
    Archimedes was /really/ cool :-) )

    I also think code would be a bit more efficient if there were more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions. IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers. Not sure why they were so
    reluctant to use more registers earlier.


    Passing all parameters on the stack and returning a single int in a
    register was a perfect fit for old-style C where functions were often
    used without declarations. It would certainly be a lot easier for
    variadic functions. But once you start passing some parameters in
    registers, it seems strange to use so few. Perhaps it was to make life
    easier for earlier compiler writers? Things like lifetime analysis and register allocation algorithms were not as sophisticated as they are now
    - it used to be that if a variable used a register (via the C "register" qualifier), the register was dedicated to the variable throughout the
    function. Too many registers for parameter passing might have left too
    few registers for function implementation, or at least made the compiler
    more complex.

    In more modern C++ programming, it's very practical to use types like
    std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than
    using C-style error codes or passing manual pointers to return value
    slots.

    The ARM calling convention is certainly much older than "modern C++ programming".


    Yes.

    But the limited return registers add significant overhead to
    small functions.

    C++ programmers think they know what C programming is about (and unfortunately they dominate not just C++ compiler writers, but they
    also damage C compilers while they are at it), so my sympathy for your problem is very limited.


    I program in C and C++, and in the past did a lot of assembly (mostly on
    8-bit or 16-bit microcontrollers). I am fully aware that C and C++ are different languages, and I write code in different styles for each.

    For this issue, improving the calling convention would make the biggest difference for C++, but would also be a positive benefit for C.

  • From David Brown@21:1/5 to Theo on Tue Jan 7 09:22:15 2025
    On 06/01/2025 16:23, Theo wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    The big problem I see is the registers used for returning values from
    functions. R0-R3 can all be used for passing arguments to functions, as
    32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    According to EABI, it's also possible to return a 128 bit vector in R0-3: https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst#result-return

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns? Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a
    /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument passing?

    The 'composite type' return value, where a pointer is passed in as the first argument to the function and a struct at that pointer is filled in with the return values, has existed since the first ARM ABI - APCS-R: http://www.riscos.com/support/developers/dde/appf.html

    That dates from the mid 1980s before 'modern compilers', and I'm guessing that has stuck around. A lot of early ARM code was in assembler. The original ARMCC was good but fairly basic - GCC didn't support ARM until
    about 1993.

    [*] technically APCS-R was the second ARM ABI, APCS-A was the first: https://heyrick.eu/assembler/apcsintro.html
    but I don't think return value handling was any different.

    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?

    Probably the latter.

    It certainly seems that way to me. But there was always the possibility
    that there were technical reasons that I had not thought of.

    Also that AArch64 was an opportunity to throw all this
    stuff away and start again, with a much richer calling convention: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#result-return

    but obviously that's no help to the microcontroller folks. At this stage, a change of calling convention might be a fairly big ask.


    Actually, I disagree on that one. In the microcontroller world,
    changing calling conventions should not be nearly as difficult as it
    would be on hosted systems because you are rarely dealing with
    pre-compiled object code. And there are already many variations on
    calling conventions for 32-bit ARM devices - for thumb or ARM code, and
    for all the different combinations of floating point registers which may
    or may not be used.

    The pre-compiled object code you always have is basic C libraries and
    compiler support libraries (things like software floating point
    routines). For a typical 32-bit embedded gcc ARM toolchain there are
    already 30+ builds for libraries for all the different variants of the architecture and calling conventions - a few more won't be a problem.

    Then there are some RTOSes and other commercial libraries that are only available in binary form. Most of these are written in crappy ancient
    C90 - they won't return structs or other bigger data anyway, and would thus be unaffected by such changes. And it would not be difficult for these
    suppliers to re-compile with new options either.

  • From David Brown@21:1/5 to All on Tue Jan 7 10:09:20 2025
    On 06/01/2025 21:19, MitchAlsup1 wrote:
    On Mon, 6 Jan 2025 15:32:04 +0000, Anton Ertl wrote:

    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1.  If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs that are made up of two 32-bit parts.



    I wonder, though, how things look for

    mystruct = foo(&mystruct);

    Does PCC perform the return stores to mystruct only after performing
    all other memory accesses in foo?  Probably yes, anything else would
    complicate the compiler.  In that case the caller could pass &mystruct
    for the return value (a slight complication).  But is that restriction
    reflected in the calling convention?

    For VERY MANY circumstances passing a struct by address is more
    efficient than passing it by value, AND especially when the
    compiler does not optimize heavily.

    For /some/ circumstances it is certainly true that passing by reference
    (or by pointer, or by hidden pointer on the stack) is more efficient, especially for larger aggregates. For others - especially smaller
    aggregates - using registers is vastly more efficient.

    Both C and C++ provide perfectly good ways to pass data around by
    address when that's what you want to do. My problem is that the calling convention won't let me pass around data in registers when I want to do
    that.

    I don't care what the compiler does when not optimising heavily - or for compilers that can't optimise heavily. When I am looking for efficient
    code, I use optimisation - caring about inefficiencies in the calling convention without heavy optimisation is like caring about how fast your
    car goes when you keep it in first gear.


    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    In addition, the programmer has the choice of changing into pointer
    form (&struct) from value form (struct) which is what we learned
    was better style way back then.


    I already know when it is best to pass a struct via a pointer, and when
    it is best to pass it as a struct value. (The 32-bit ARM calling
    convention happily uses registers to pass structs by value, using up to
    4 registers. It's the return via registers that is missing.) I also
    know when it is best for a struct return to be via an address or in
    registers - but C has no way to let me choose that.
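
    That is, the two spellings available today look like this (a sketch;
    which way the by-value return actually travels is fixed by the
    calling convention, not by anything in the source):

        struct pair { unsigned value; unsigned status; };

        /* Return by value: the ABI decides whether this comes back in
           registers or through a hidden result pointer. */
        struct pair read_pair_value(void)
        {
            return (struct pair){ 42u, 0u };
        }

        /* Out-parameter: always goes through the address the caller
           supplies. */
        void read_pair_out(struct pair *out)
        {
            out->value  = 42u;
            out->status = 0u;
        }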

    --------------------------

    I also think code would be a bit more efficient if there were more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions.  IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers.  Not sure why they were so
    reluctant to use more registers earlier.

    Compiler people were telling us that more callee saved registers would
    be higher performing than more argument registers. It did not turn out
    to be that way.


    The trouble with that kind of thing is that people write different kinds
    of code. The balance that works best for - say - PC desktop application programming is not necessarily the best for small-systems embedded
    programming. And the balance that works best for C is not necessarily
    the best for C++, or Rust, or D, or OCaml or any other language.

    I am not looking for perfection here - I don't think such a thing as a "perfect" calling convention could exist. I am just looking for an
    obvious improvement that would help in many languages and for a lot of
    code, with zero cost for code that doesn't need it - or for some good
    technical reason why it /would/ be costly.

    Oh and BTW, lack of argument registers leads to an increased
    desire for the linker to perform inline folding. ...


    Certainly a way out of this is to look to link-time optimisation and
    more inline code. But that leads to a lot of additional issues.

  • From George Neuner@21:1/5 to All on Tue Jan 7 16:52:27 2025
    On Mon, 6 Jan 2025 20:10:13 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting. These are FP language
    transformations that are analogous to, but potentially more complex
    than, the rewriting of object methods and their call sites to pass the
    current object in an OO language.

    [The difference between closure conversion and lambda lifting is the
    scope of the tranformation: conversion limits code transformations to
    within the defining call chain, whereas lifting pulls the closure to
    top level making it (at least potentially) globally available.]

    In either case the original function is rewritten such that non-local
    variables can be passed as parameters. The function's code must be
    altered to access the non-locals - either directly as explicit
    individual parameters, or by indexing from a pointer to an environment
    data structure.

    While in a simple case this could look exactly like the OO method transformation, recall that a general closure may require access to
    non-local variables spread through multiple environments. Even if
    whole environments are passed via single pointers, there still may
    need to be multiple parameters added.

    Where exactly the line is drawn between passing individual variables
    from an environment vs passing the whole environment is a heuristic that
    is tied to the CPU's argument passing convention.
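
    A small C sketch of the two rewritten shapes mentioned above -
    explicit individual parameters versus an environment pointer (the
    function and the captured variables are invented for illustration):

        /* Original (pseudo-code): inner(z) captures x and y from outer(). */

        /* Rewrite 1: the captured variables become explicit parameters. */
        static long inner_params(long x, long y, long z)
        {
            return x + y + z;
        }

        /* Rewrite 2: the captured variables travel in an environment
           record, added as one extra pointer parameter. */
        struct env { long x, y; };

        static long inner_envptr(const struct env *e, long z)
        {
            return e->x + e->y + z;
        }

        long outer(long x, long y)
        {
            struct env e = { x, y };
            return inner_params(x, y, 1) + inner_envptr(&e, 2);
        }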

    YMMV.

  • From MitchAlsup1@21:1/5 to David Brown on Tue Jan 7 23:23:28 2025
    On Tue, 7 Jan 2025 9:09:20 +0000, David Brown wrote:

    On 06/01/2025 21:19, MitchAlsup1 wrote:
    ------------------------
    Both C and C++ provide perfectly good ways to pass data around by
    address when that's what you want to do. My problem is that the calling convention won't let me pass around data in registers when I want to do
    that.

    I don't care what the compiler does when not optimising heavily - or for compilers that can't optimise heavily. When I am looking for efficient
    code, I use optimisation - caring about inefficiencies in the calling convention without heavy optimisation is like caring about how fast your
    car goes when you keep it in first gear.


    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    In addition, the programmer has the choice of changing into pointer
    form (&struct) from value form (struct) which is what we learned
    was better style way back then.


    I already know when it is best to pass a struct via a pointer, and when
    it is best to pass it as a struct value. (The 32-bit ARM calling
    convention happily uses registers to pass structs by value, using up to
    4 registers. It's the return via registers that is missing.) I also
    know when it is best for a struct return to be via an address or in
    registers - but C has no way to let me choose that.

    My 66000 ABI passes structs up to 8 doublewords in size as
    arguments and as results.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jan 7 23:35:31 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 7 Jan 2025 9:09:20 +0000, David Brown wrote:

    On 06/01/2025 21:19, MitchAlsup1 wrote:
    ------------------------
    Both C and C++ provide perfectly good ways to pass data around by
    address when that's what you want to do. My problem is that the calling
    convention won't let me pass around data in registers when I want to do
    that.

    I don't care what the compiler does when not optimising heavily - or for
    compilers that can't optimise heavily. When I am looking for efficient
    code, I use optimisation - caring about inefficiencies in the calling
    convention without heavy optimisation is like caring about how fast your
    car goes when you keep it in first gear.


    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    In addition, the programmer has the choice of changing into pointer
    form (&struct) from value form (struct) which is what we learned
    was better style way back then.


    I already know when it is best to pass a struct via a pointer, and when
    it is best to pass it as a struct value. (The 32-bit ARM calling
    convention happily uses registers to pass structs by value, using up to
    4 registers. It's the return via registers that is missing.) I also
    know when it is best for a struct return to be via an address or in
    registers - but C has no way to let me choose that.

    My 66000 ABI passes structs up to 8 doublewords in size as
    arguments and as results.

    What is a doubleword in your architecture? In Intel vernacular
    it's 32 bits, but that's not universal.

    Both x86_64 and ARM64 support passing eight 64-bit quantities
    as arguments and as results architecturally without using
    the SIMD registers.

    Now, ABI conventions may be otherwise, but they're important
    for interoperability, not basic functionality.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jan 8 01:38:00 2025
    On Tue, 7 Jan 2025 23:35:31 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    My 66000 ABI passes structs up to 8 doublewords in size as
    arguments and as results.

    What is a doubleword in your architecture? In intel vernacular
    it's 32-bits, but that's not universal.

    Intel is wrong; IBM defined the term before Intel existed
    (1963 or earlier).

    Byte   8 bits
    Half   16 bits
    Word   32 bits
    DW     64 bits
    QW     128 bits
    OW     256 bits
    Line   512 bits

    Oh, and BTW:: DEI stands for Dale Earnhardt Incorporated...

    Both x86_64 and ARM64 support passing eight 64-bit quantities
    as arguments and as results architecturally without using
    the SIMD registers.

    Now, ABI conventions may be otherwise, but they're important
    for interoperability, not basic functionality.

    Done wrong (or weak) they add overhead.

  • From Stefan Monnier@21:1/5 to All on Wed Jan 8 12:20:51 2025
    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.
    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting.

    AFAIK in these cases the same compiler generates the code for the
    function and for the calls, so it should be pretty much free to use any
    calling convention it likes.


    Stefan

  • From Stefan Monnier@21:1/5 to All on Wed Jan 8 12:34:30 2025
    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    For languages where the type system ensures that the max number of
    arguments is known (and the same) when compiling the function and when compiling the calls to it, you could adjust the number of caller-saved
    argument registers according to the actual number of arguments of the
    function, thus making it "cheap" to allow, say, 13 argument registers
    for those functions that take 13 arguments, since it doesn't impact the
    other functions.

    But in any case, I suspect there are also diminishing returns at some
    point: how much faster is it in practice to pass/return 13 values in
    registers instead of 8 of them in registers and the remaining 5 on
    the stack? I expect a 13-arg function to perform an amount
    of work that will dwarf the extra work of going through the stack.


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Jan 8 20:19:40 2025
    On Wed, 8 Jan 2025 17:34:30 +0000, Stefan Monnier wrote:

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    For languages where the type system ensures that the max number of
    arguments is known (and the same) when compiling the function and when compiling the calls to it, you could adjust the number of caller-saved argument registers according to the actual number of arguments of the function, thus making it "cheap" to allow, say, 13 argument registers
    for those functions that take 13 arguments, since it doesn't impact the
    other functions.

    The counter argument is that there are too few subroutines wanting
    this amount of register argument passing. So, even if you allowed
    for this, it probably does not show up on the bottom line.

    But in any case, I suspect there are also diminishing returns at some
    point: how much faster is it in practice to pass/return 13 values in registers instead of 8 of them in registers and the remaining 5 on
    the stack? I expect a 13-arg function to perform an amount
    of work that will dwarf the extra work of going through the stack.

    Then there is the issue of what is IN the structure passed in
    registers??

    If it is a series of bytes, then it is better passed by reference
    so the bytes can be LDed (1 instruction) rather than extracted
    (2 instructions in most ISAs); or STed (1 instruction) rather
    than inserted (3 instructions in most ISAs).

    If, instead, the structure is comprised of bit-fields, then it is
    almost always wise to pass in registers--since extraction and
    insertion are always reg->reg.
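
    A small C illustration of the two cases (a sketch only; the exact
    instruction counts depend on the ISA and ABI, as noted above):

        struct bytes8 { unsigned char b[8]; };              /* a series of bytes */
        struct fields { unsigned a : 5, b : 11, c : 16; };  /* packed bit-fields */

        /* By reference, picking out one byte is a single load. */
        unsigned byte_by_ref(const struct bytes8 *s) { return s->b[3]; }

        /* By value in registers, the same byte needs a shift-and-mask
           extract from the register holding that part of the struct. */
        unsigned byte_by_val(struct bytes8 s) { return s.b[3]; }

        /* Bit-fields need extract/insert in either case, so keeping the
           struct in registers avoids the extra loads and stores. */
        unsigned field_by_val(struct fields f) { return f.b; }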

    Also note: if the structure is written deep within the subroutine,
    many (many) instructions before the return, then it is often wiser
    to perform those stores into a memory area and reload just prior
    to return.



    Stefan

  • From Anton Ertl@21:1/5 to Stefan Monnier on Wed Jan 8 22:08:46 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    For languages where the type system ensures that the max number of
    arguments is known (and the same) when compiling the function and when compiling the calls to it, you could adjust the number of caller-saved argument registers according to the actual number of arguments of the function, thus making it "cheap" to allow, say, 13 argument registers
    for those functions that take 13 arguments, since it doesn't impact the
    other functions.

    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    Language-private calling conventions can be a good idea, but then, if
    you want to call C code (or be called by C code), you need to handle
    ABI calling conventions in addition.

    But in any case, I suspect there are also diminishing returns at some
    point: how much faster is it in practice to pass/return 13 values in registers instead of 8 of them in registers and the remaining 5 on
    the stack? I expect a 13-arg function to perform an amount
    of work that will dwarf the extra work of going through the stack.

    I certainly have a use for as many arguments as the ABI provides, for
    functions that typically contain only a few payload instructions: You
    can implement a direct-threaded VM interpreter using tail-call
    optimization, along the lines of

    void add(VMinst *ip, long *sp, long sp_top)
    {
        /* payload start */
        sp_top += *sp++;
        /* payload end */
        /* invoke the next VM instruction */
        (*ip)(ip+1,sp,sp_top);
    }
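
    The fragment above relies on a self-referential function-pointer
    type, so a compilable version needs a small workaround; here is one
    possible self-contained sketch (the union-based instruction cell and
    the lit/done handlers are additions for illustration):

        #include <stdio.h>

        typedef union vminst VMinst;
        typedef void (*VMfn)(VMinst *ip, long *sp, long sp_top);
        union vminst { VMfn fn; long imm; };  /* a cell is a handler or an operand */

        static void lit(VMinst *ip, long *sp, long sp_top)
        {
            *--sp = sp_top;               /* push the old top of stack      */
            sp_top = ip->imm;             /* the operand follows the opcode */
            ip++;
            ip->fn(ip + 1, sp, sp_top);   /* dispatch the next instruction  */
        }

        static void add(VMinst *ip, long *sp, long sp_top)
        {
            sp_top += *sp++;              /* payload                        */
            ip->fn(ip + 1, sp, sp_top);   /* invoke the next VM instruction */
        }

        static void done(VMinst *ip, long *sp, long sp_top)
        {
            (void)ip; (void)sp;
            printf("%ld\n", sp_top);
        }

        int main(void)
        {
            long stack[16];
            VMinst prog[] = { {.fn = lit}, {.imm = 2}, {.fn = lit}, {.imm = 3},
                              {.fn = add}, {.fn = done} };
            prog[0].fn(prog + 1, stack + 16, 0);   /* prints 5 */
            return 0;
        }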

    30 years ago gcc could not tail-call-optimize this; in the meantime it
    can (and clang can do it, too). However, typical VMs have more than
    just these three VM registers (Gforth has ip, sp, rp, fp, lp, up,
    fp_top (usually mapped to a real-machine FP register) and registers
    for as many sp stack items as practical; we intend to cache rp_top in
    a register, too), and ideally you can pass them all as arguments; so
    we could make good use of 10+ arguments. If there are not enough
    arguments in registers, you have to use explicit register vars (a GNU
    C extension) in addition, but that is more architecture-specific.
    Some preliminary testing on AMD64 resulted in gcc apparently
    supporting a lot of explicit registers on AMD64, and clang/LLVM only
    one.
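
    The explicit register variables meant here look roughly like this (a
    sketch of the GNU C global register variable extension; the register
    name is illustrative and has to be legal and unreserved on the
    target):

        /* Pin a VM register to a hardware register for the whole
           translation unit (GNU C extension; x86-64 register name shown). */
        register long *vm_sp asm("r13");

        long vm_pop(void)
        {
            return *vm_sp++;
        }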

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Stefan Monnier@21:1/5 to All on Wed Jan 8 18:20:43 2025
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs
    obviously, mismatched arg numbers less so), but I think the focus of optimization of the ABI should be calls to functions known to take the
    exact same number of arguments (after all, even in C we normally know
    the prototype of the called function; only sloppy ancient C calls
    functions without proper declarations), even if it comes at the cost of
    using different calling conventions for the two cases.

    But in any case, I suspect there are also diminishing returns at some point: how much faster is it in practice to pass/return 13 values in registers instead of 8 of them in registers and the remaining 5 on
    the stack? I expect a 13-arg function to perform an amount
    of work that will dwarf the extra work of going through the stack.
    I certainly have a use for as many arguments as the ABI provides,

    Ah, yes, machine-generated code can always defy intuitions about what
    is "typical". 🙂


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Jan 9 00:11:08 2025
    On Wed, 8 Jan 2025 23:20:43 +0000, Stefan Monnier wrote:

    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    My 66000 ABI was designed for C, but is compatible with Fortran and
    C++ {and I suspect most languages--under the assumption that those
    languages have to clean up their own messes*}.

    (*) C++ has to drop "stuff" on the stack so that it can properly
    deallocate new structures when Try-Throw-Catch is performing walk
    backs, and to utilize that "stack stuff" when searching for the
    right exception block.

    When C calls Fortran and Fortran is expecting an array, C has
    to build the dope vector used by Fortran in accessing said array.
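
    As a rough illustration of what such a dope vector carries (the
    layout below is invented; real descriptors are specific to the
    Fortran implementation being called):

        /* Hypothetical array descriptor ("dope vector") the C side would
           have to build before the call. */
        struct dope_vector {
            void *base;                              /* first element        */
            long  elem_size;                         /* element size, bytes  */
            int   rank;                              /* number of dimensions */
            struct { long lower, extent, stride; } dim[7];
        };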

    Any calling convention is pressed on both sides--more argument registers
    and more callee-saved registers--but the number of registers is fixed.

    I can agree that it's important to support those use-cases (varargs obviously, mismatched arg numbers less so), but I think the focus of optimization of the ABI should be calls to functions known to take the
    exact same number of arguments (after all, even in C we normally know
    the prototype of the called function; only sloppy ancient C calls
    functions without proper declarations), even if it comes at the cost of
    using different calling conventions for the two cases.

    In My 66000 ABI varargs takes one more prologue instruction than
    a non-varargs subroutine and creates a vector of DW arguments
    which can be picked off with va_list = SP; va_start = 0,
    and va_arg(va_list,arg) = LD Rd,[va_list,Rarg<<3];

    One of the key reasons to have a unified register model.

    But in any case, I suspect there are also diminishing returns at some point: how much faster is it in practice to pass/return 13 values in registers instead of 8 of them in registers and the remaining 5 on
    the stack?

    Back when we looked at this in the mid-1990s, using more registers for
    arguments (than the 8 we were using) was "well down" the list of
    low-hanging fruit.

  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jan 9 08:38:32 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    AFAIK in these cases the same compiler generates the code for the
    function and for the calls, so it should be pretty much free to use any calling convention it likes.

    With separate compilation, the compiler does not know which other
    compiler generates the code for the caller of a function or the callee
    of a function. ABI calling conventions exist in order to make code from different compilers (whether for the same language or a different one) interoperable.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to monnier@iro.umontreal.ca on Thu Jan 9 07:23:57 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    only sloppy ancient C calls
    functions without proper declarations)

    You find it ok to design a calling convention such that ancient C
    programs do not work?

    What benefit do you expect from such a calling convention? To allow
    registers to be used as arguments (and not callee-saved) that would
    otherwise preferably be used as callee-saved registers?

    However, I wonder why, e.g., RISC-V does not allow the use of all
    caller-saved registers as arguments. In addition to the 8 argument
    registers (a0-a7=x10-x17), RISC-V has 7 additional caller-saved
    registers: t0-t6(=x5-x7,x28-x31); for FP registers it's even more
    extreme: 8 argument registers fa0-fa7=f10-f17, and 12 additional
    caller-saved registers ft0-ft11=f0-f7,f28-f31.

    even if it comes at the cost of
    using different calling conventions for the two cases.

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture. Looking at <https://pdos.csail.mit.edu/6.828/2023/readings/riscv-calling.pdf>,
    the RISC-V people find that acceptable:

    |If argument i < 8 is a floating-point type, it is passed in
    |floating-point register fai; [...] Additionally, floating-point
    |arguments to variadic functions (except those that are explicitly
    |named in the parameter list) are passed in integer registers.

    So if I 'printf("%f",1.0)' without first declaring printf, the program
    won't work. I just tried out compiling the following program on
    RISC-V with gcc 10.3.1:

    int main()
    {
      printf("%f\n",1.0);
    }

    int xxx()
    {
      yyy("%f\n",1.0,2);
    }

    Note that there is no "#include <stdio.h>" or any declaration of
    printf() or yyy(). Yet 1.0 is passed to printf() in a1, while it is
    passed to yyy() in fa0, and 2 is passed to yyy() in a1.

    And gcc works around the varargs decision by using the varargs calling convention for some well-known vararg functions like printf, while
    other undeclared functions use the non-varargs calling convention.
    Apparently the fallout of that decision by the RISC-V people hit a
    "relevant" program.

    [1] Apparently they stuck with the decision to deal differently with
    varargs, and then decided to change the rest of the calling convention
    to benefit from that decision by not leaving holes in the FP argument
    registers for integers and vice versa. I don't find this clearly
    expressed in <https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc>.
    The only thing that points in that direction is:

    |Values are passed in floating-point registers whenever possible,
    |whether or not the integer registers have been exhausted.

    But this does not talk about how the integer argument register
    numbering is changed by the "Hardware Floating-point Calling
    Convention".

    I certainly have a use for as many arguments as the ABI provides,

    Ah, yes, machine-generated code can always defy intuitions about what
    is "typical".

    While I use a generator for my interpreter engines, many other people
    hand-code them. They would probably use macros for the function
    declaration and the tail-call, though. Or maybe a macro that wraps
    the whole payload so that one can easily switch between this technique
    and one of the others.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Thu Jan 9 10:07:36 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture.

    Interestingly, tail call optimization (which I believe you like)
    can cause bugs with mismatched arguments when different functions
    disagree about the stack size. Here is a nasty case with sibling
    calls:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90329

    So, if you want to allow mismatched declarations, better
    disable tail calls, to be on the safe side.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jan 9 20:48:07 2025
    On Thu, 9 Jan 2025 7:23:57 +0000, Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    No, I would stand my ground and mandate that they do work.

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs?

    One of the salient points that allowed C to overtake Pascal is that
    you can write printf() in C while you cannot write write() in Pascal.
    Do not break this assumption on any architecture.

    How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    only sloppy ancient C calls
    functions without proper declarations)

    You find it ok to design a calling convention such that ancient C
    programs do not work?

    I went the other way: I made an ABI that made varargs EASY !!
    and in such a way that the caller does not need to know that the
    callee is varargs.

    What benefit do you expect from such a calling convention? To allow
    to use registers as arguments (and not callee-saved) that would
    otherwise be preferably used as callee-saved registers?

    I found no particular problem in passing a fixed number of arguments
    in registers and the rest on a stack. va_start dumps the registers
    onto the stack to form a vector of arguments in memory and
    initializes the pointer to where the registers got stuck on the
    stack; va_arg then walks through that vector.
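
    For illustration, here is a minimal, hedged sketch (standard C stdarg
    usage, not My 66000 code) of the model being described: the callee can
    treat the spilled register arguments and any stack arguments as one
    contiguous vector, which is all that va_start/va_arg need.

    #include <stdarg.h>
    #include <stdio.h>

    /* Sum 'count' long arguments.  Under an ABI like the one described,
       va_start conceptually spills the argument registers next to any
       stack-passed arguments, and va_arg just walks that contiguous
       memory - the caller never needs to know the callee is varargs. */
    static long sum_longs(int count, ...)
    {
      va_list ap;
      long total = 0;

      va_start(ap, count);
      for (int i = 0; i < count; i++)
        total += va_arg(ap, long);
      va_end(ap);
      return total;
    }

    int main(void)
    {
      printf("%ld\n", sum_longs(3, 1L, 2L, 3L));  /* prints 6 */
      return 0;
    }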

    However, I wonder why, e.g., RISC-V does not allow the use of all caller-saved registers as arguments.

    A) we need some registers for passing of arguments, and some
    ..for returning results.
    B) we need some temporary registers so short leaf subroutines
    ..do not need stack space in order to compute with the given
    ..arguments
    C) we need some registers for holding onto caller's state while
    ..processing callee operations
    D) there is generally a register holding the return address.

    Generally (A) and (B) have a sliding window. The fewer arguments
    and results, the more temporary registers.

    (C) includes FP and SP as callee preserved registers--that is
    ..when control returns to caller R16..R31 contain the same
    ..values as when the CALL was performed.

    In looking at code out of My 66000 LLVM compiler, there are so
    few subroutines with "that many" arguments and results, that
    mandating more than 8 arguments or results go through memory
    is not really a performance burden.

    Also: more callee-saved (preserved) registers cause more stack
    space to be allocated for the 'temporary' registers. Say you want
    all the registers (except the return address register and the
    return result register) to be preserved across a subroutine call:
    a small subroutine needing 3 registers to perform its calculations
    now has 3 STs and 3 LDs to preserve caller registers, whereas
    with temporary registers there is no overhead.

    In addition to the 8 argument
    registers (a0-a7=x10-x17), RISC-V has 7 additional caller-saved
    registers: t0-t6(=x5-x7,x28-x31); for FP registers it's even more
    extreme: 8 argument registers fa0-fa7=f10-f17, and 12 additional
    caller-saved registers ft0-ft11=f0-f7,f28-f31.

    even if it comes at the cost of
    using different calling conventions for the two cases.

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture. Looking at <https://pdos.csail.mit.edu/6.828/2023/readings/riscv-calling.pdf>,
    the RISC-V people find that acceptable:

    |If argument i < 8 is a floating-point type, it is passed in
    |floating-point register fai; [...] Additionally, floating-point
    |arguments to variadic functions (except those that are explicitly
    |named in the parameter list) are passed in integer registers.

    So if I 'printf("%f",1.0)' without first declaring printf, the program
    won't work. I just tried out compiling the following program on
    RISC-V with gcc 10.3.1:

    int main()
    {
      printf("%f\n",1.0);
    }

    int xxx()
    {
      yyy("%f\n",1.0,2);
    }

    Note that there is no "#include <stdio.h>" or any declaration of
    printf() or yyy(). Yet 1.0 is passed to printf() in a1, while it is
    passed to yyy() in fa0, and 2 is passed to yyy() in a1.

    This is bad...not horrible, but bad.

    And gcc works around the varargs decision by using the varargs calling convention for some well-known vararg functions like printf, while
    other undeclared functions use the non-varargs calling convention.
    Apparently the fallout of that decision by the RISC-V people hit a
    "relevant" program.

    A good ABI does not need these distinctions.

    It also leaves open the possibility that code compiled partially by
    GCC and linked with code compiled by LLVM will have interoperability
    issues on mundane calls.

    [1] Apparently they stuck with the decision to deal differently with
    varargs, and then decided to change the rest of the calling convention
    to benefit from that decision by not leaving holes in the FP argument registers for integers and vice versa. I don't find this clearly
    expressed in <https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc>.
    The only thing that points in that direction is:

    |Values are passed in floating-point registers whenever possible,
    |whether or not the integer registers have been exhausted.

    But this does not talk about how the integer argument register
    numbering is changed by the "Hardware Floating-point Calling
    Convention".

    I certainly have a use for as many arguments as the ABI provides,

    Ah, yes, machine-generated code can always defy intuitions about what
    is "typical".

    While I use a generator for my interpreter engines, many other people hand-code them. They would probably use macros for the function
    declaration and the tail-call, though. Or maybe a macro that wraps
    the whole payload so that one can easily switch between this technique
    and one of the others.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Thu Jan 9 21:23:30 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 9 Jan 2025 7:23:57 +0000, Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    No, I would stand my ground and mandate that they do work.

    That can be tricky. You can read

    https://blog.r-project.org/2019/05/15/gfortran-issues-with-lapack/index.html

    and its sequel

    https://blog.r-project.org/2019/09/25/gfortran-issues-with-lapack-ii/

    as a cautionary tale.

    To cut this a little shorter: Assume eight arguments are passed in
    registers, like for My 66000.

    Caller calls

    foo (a1, a2, a3, a4, a5, a6, a7, a8);

    Callee side:

    foo (a1, a2, a3, a4, a5, a6, a7, a8, a9)

    Foo ends with

    bar (b1, b2, b3, b4, b5, b6, b7, b8, b9);

    and wants to save stack space, so it stores the value of b9 into
    the space where it was supposed to be, and then branches to bar.
    Result: Stack corruption.
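
    A hedged C rendition of this scenario (file layout and names are
    invented; it assumes an ABI where the ninth argument travels on the
    stack and a compiler that performs sibling-call optimization):

    /* caller.c - the caller believes foo takes 8 arguments, so it
       never reserves a stack slot for a 9th. */
    long foo(long, long, long, long, long, long, long, long);

    long use(void)
    {
      return foo(1, 2, 3, 4, 5, 6, 7, 8);
    }

    /* foo.c - foo is really defined with 9 parameters and ends in a
       call that the compiler may emit as a sibling call.  If it reuses
       its own incoming argument area to pass b9, it writes into a stack
       slot the caller never allocated: the corruption described above. */
    long bar(long, long, long, long, long, long, long, long, long);

    long foo(long a1, long a2, long a3, long a4, long a5,
             long a6, long a7, long a8, long a9)
    {
      return bar(a1, a2, a3, a4, a5, a6, a7, a8, a9 + 1);
    }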

    What would you tell your ABI designer in that case? Don't do tail
    calls, it is better to use more stack space, with all effect on
    stack sizes and locality that would have?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jan 10 01:08:16 2025
    On Thu, 9 Jan 2025 21:23:30 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 9 Jan 2025 7:23:57 +0000, Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the >>>>> number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>>>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    No, I would stand my ground and mandate that they do work.

    That can be tricky. You can read

    https://blog.r-project.org/2019/05/15/gfortran-issues-with-lapack/index.html

    and its sequel

    https://blog.r-project.org/2019/09/25/gfortran-issues-with-lapack-ii/

    as a cautionary tale.

    Yes, I had to make a nasty ABI work on the HEP (Denelcor)

    To cut this a little shorter: Assume eight arguments are passed in registers, like for My 66000.

    Caller calls

    foo (a1, a2, a3, a4, a5, a6, a7, a8);

    Callee side:

    foo (a1, a2, a3, a4, a5, a6, a7, a8, a9)

    Foo ends with

    bar (b1, b2, b3, b4, b5, b6, b7, b8, b9);

    and wants to save stack space, so it stores the value of b9 into
    the space where it was supposed to be, and then branches to bar.
    Result: Stack corruption.

    What would you tell your ABI designer in that case? Don't do tail
    calls, it is better to use more stack space, with all effect on
    stack sizes and locality that would have?

    Same response I would give to::

    printf( "%d %d %d %d %d/r", a[i] );

    "They deserve what they get".

    You will notice that no ISA has ever had a "go jump in the lake"
    instruction. For had there been, computers would not have survived
    to the present--they would all be in the lake...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Fri Jan 10 08:33:19 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happened many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.
    I remember issue with writing to strings: early C compilers
    put literal strings in writable memory and programs assumed that
    they can change strings. C 'errno' was made more abstract due
    to multithreading, it broke some programs. Concerning varargs,
    Power PC and later AMD-64 used calling convention incompatible
    with popular expectations.

    Concerning customers, they will tolerate a lot of things, as long
    as there are benefits (faster or cheaper machines, better security,
    etc.) and fixes require reasonable amount of work. So that
    really is question of cost/benefit ratio.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Fri Jan 10 08:24:30 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture.

    Interestingly, tail call optimization (which I believe you like)
    can cause bugs with mismatched arguments when different functions
    disagree about the stack size.

    I have a use case for tail-call optimization. When I first looked
    into that around 1994, I found that gcc does not perform tail-call optimization, and I was surprised, because it had been written by a
    Lisp programmer.

    When I looked into the reasons, I found that in C calling conventions
    typically the caller is responsible for allocating stack space for
    arguments and for deallocating that stack space. The reason for that
    is varargs and the fact that in old C there was no requirement to
    define a prototype of a function (including vararg functions). If,
    for a call just before a return, the function needs to put
    deallocating code between the call and the return, the call is not a
    tail call and therefore cannot be tail-call optimized.

    So I thought that with C calling conventions (necessitated by the
    properties of the C language), tail-call optimization is not possible,
    but Mark Probst, a student in our group, actually managed to deal with
    the tail-recursion case
    <https://www.complang.tuwien.ac.at/schani/diplarb.ps>.

    A few years later sibling call optimization (more restrictive than
    general tail-call optimization, but less restrictive than
    tail-recursion elimination) appeared in gcc. The gcc manual
    apparently does not say what a sibling call is, but <https://stackoverflow.com/questions/22037261/what-does-sibling-calls-mean> says "where caller function and callee function do not need to be
    same, but they have compatible stack footprint.". Given the bug you
    point out, that's obviously not restrictive enough to be correct in
    all cases.

    Concerning my use case, for me it's good enough if tail-calls are
    optimized when the caller and the callee have the same argument types
    and return type, and the arguments fit in registers. So if in your
    buggy case gcc decided not to optimize the call as sibling call, my
    use case would not be affected.

    Moreover, I need a guarantee that a call is actually
    tail-call-optimized (and if not, compilation should ideally error out,
    saving me the need to validate that property afterwards), and I would
    be willing to put some text in the source code that indicates that
    intent. E.g., something along the lines of

    void add(VMinst *ip, long *sp, long sp_top)
    {
      /* payload start */
      sp_top += *sp++;
      /* payload end */
      /* invoke the next VM instruction */
      (*ip)(ip+1,sp,sp_top) __attribute__("tail-call optimized");
    }

    Existing code would be unaffected by such an approach to tail-call optimization.
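
    For comparison: clang (and, more recently, gcc) provide a musttail
    attribute on return statements that gives roughly the guarantee asked
    for above - compilation fails if the call cannot be emitted as a tail
    call.  A minimal sketch along those lines (the generic_fn/vminst_fn
    typedefs are assumptions made here for illustration, not Anton's
    actual code):

    typedef void (*generic_fn)(void);
    typedef void vminst_fn(generic_fn *ip, long *sp, long sp_top);

    void vm_add(generic_fn *ip, long *sp, long sp_top)
    {
      /* payload */
      sp_top += *sp++;
      /* invoke the next VM instruction as a mandatory tail call;
         the compiler errors out if it cannot honour the request */
      vminst_fn *next = (vminst_fn *)*ip;
      __attribute__((musttail)) return next(ip + 1, sp, sp_top);
    }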

    Beyond my use case: tail-call optimization has to be applied like
    every other optimization: it must preserve the behaviour of existing,
    working programs, i.e., the result must be equivalent. In the bug you
    mention, this obviously was not the case, and one way out would be not
    to apply tail-call optimization in this case and similar cases (maybe
    in all cases where arguments are in memory). That looks like a simple
    way to fix the bug. Maybe there's a less restrictive one.

    Sure one can wish that C was different (e.g., like the fantasy that
    all C programs are strictly conforming to some particular C standard
    that turns some desired transformation into a correct optimization),
    but existing, working programs are far more relevant than the wishes
    for some transformation IMO; there are a lot of people who see this differently, but it seems to me that these people not only wish that
    the old C programs vanish, but they don't care much about new C
    programs (apart from a few benchmarks), either. After all, they don't
    program in C, but in C++, Fortran, Rust, or something else.

    Actually, concerning the fantasy mentioned above, gcc already offers
    options such as -std=c23 and -pedantic which would allow the user to
    tell gcc that the compiled program actually lives in this fantasy
    world, but if the user did not ask for pain, a compiler should not
    provide it.

    So, if you want to allow mismatched declarations, better
    disable tail calls, to be on the safe side.

    That would be a way of dealing with the problem. It matches the
    general pattern of people defending transformations that do not
    preserve program equivalence (i.e., are buggy when intended as
    optimizations) by putting up a straw man that disables correct
    optimizations in addition to transformations that do not preserve
    program equivalence.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Fri Jan 10 09:19:27 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 9 Jan 2025 21:23:30 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 9 Jan 2025 7:23:57 +0000, Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C, >>>>>> including varargs and often also tolerant of differences between the >>>>>> number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>>>>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V). >>>> Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need >>>> to work. Would you tell him that they don't need to work?

    No, I would stand my ground and mandate that they do work.

    That can be tricky. You can read

    https://blog.r-project.org/2019/05/15/gfortran-issues-with-lapack/index.html >>
    and its sequel

    https://blog.r-project.org/2019/09/25/gfortran-issues-with-lapack-ii/

    as a cautionary tale.

    Yes, I had to make a nasty ABI work on the HEP (Denelcor)

    To cut this a little shorter: Assume eight arguments are passed in
    registers, like for My 66000.

    Caller calls

    foo (a1, a2, a3, a4, a5, a6, a7, a8);

    Callee side:

    foo (a1, a2, a3, a4, a5, a6, a7, a8, a9)

    Foo ends with

    bar (b1, b2, b3, b4, b5, b6, b7, b8, b9);

    and wants to save stack space, so it stores the value of b9 into
    the space where it was supposed to be, and then branches to bar.
    Result: Stack corruption.

    What would you tell your ABI designer in that case? Don't do tail
    calls, it is better to use more stack space, with all effect on
    stack sizes and locality that would have?

    Same response I would give to::

    printf( "%d %d %d %d %d/r", a[i] );

    "They deserve what they get".

    So, mismatched arguments don't need to work? We're in agreement, then.

    You will notice that no ISA has ever had a "go jump in the lake"
    instruction. For had there been, computers would not have survived
    to the present--they would all be in the lake...

    I don't find it in

    https://paws.kettering.edu/~jhuggins/humor/opcodes.html so I guess
    it does not exist. (That list is old; it was floating around when
    /pub directories were still open on ftp servers...)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Fri Jan 10 10:25:23 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happened many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.

    That case is interesting. It's certainly a benefit to programmers if
    most uses of NULL produce a SIGSEGV, but for existing programs a
    mapping that allows accessible memory in page 0 is an advantage. So how
    did we get from there to where we are now?

    First, my guess is that the VAX is only called out because it was so
    popular, and it was one of the first Unix machines where doing it
    differently was possible. I am sure that earlier Unix targets without
    virtual memory used memory starting with address 1 because they would
    otherwise have wasted precious memory.

    Anyway, once we had virtual memory, whether to use the start of the
    address space is not an issue of the ABI (which is hard to change),
    but could be determined by programmers on linking. I guess that at
    first they used explicit options for making the first page
    inaccessible, and these options soon became the defaults. By the time
    I started with Unix in the later 1980s, that battle was over; I
    certainly never experienced it as an issue, and only read about it in
    papers on VAXocentrism.

    I remember issue with writing to strings: early C compilers
    put literal strings in writable memory and programs assumed that
    they can change strings.

    gcc definitely had an option for that. Again not an ABI issue, but
    one that can be controlled by programmers on compilation.

    C 'errno' was made more abstract due
    to multithreading, it broke some programs.

    That's pretty similar to an ABI issue (not sure if errno is in the
    ABIs or not). And the really perverse thing is that raw Unix and
    Linux system calls have been thread-safe from the start. It's only
    the limitation of the C language in early times (no struct returns,
    bringing us back to the topic of the thread) that gave us the errno
    variable in the C wrappers of these system calls that turned out not
    to be thread-safe and led to problems later.
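
    A hedged sketch of the alternative alluded to here (raw_read and the
    Linux-style convention of returning a negative error code in a
    register are assumptions made for illustration): with small struct
    returns, a system-call wrapper can hand back the value and the error
    together, and no shared errno is needed at all.

    struct read_result {
      long count;   /* bytes read, or -1 on error */
      int  error;   /* errno-style code, 0 on success */
    };

    /* assumed primitive: returns byte count, or a negative error code */
    extern long raw_read(int fd, void *buf, unsigned long len);

    struct read_result my_read(int fd, void *buf, unsigned long len)
    {
      long r = raw_read(fd, buf, len);
      struct read_result res;

      if (r < 0) { res.count = -1; res.error = (int)-r; }
      else       { res.count = r;  res.error = 0; }
      return res;
    }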

    Concerning varargs,
    Power PC and later AMD-64 used calling convention incompatible
    with popular expectations.

    I did not experience calling convention problems on PowerPC in my
    software, so apparently it was compatible with my expectations.

    Still, Power(PC) is very niche. I recently talked to someone who
    worked a lot on Power while he was at IBM (now he no longer works for
    IBM); I asked him why people are buying Power, and he said something
    along the lines that IBM is satisfying a base of established
    customers. Maybe Power would be more popular if it had had a calling convention compatible with popular expectations, but probably not.

    As for AMD64, whatever popular expectation they may have been
    incompatible with (again I experienced no problems), the user could
    fall back to the IA-32 calling convention (i.e., compile the program
    as a 32-bit program, or just run the existing 32-bit binary),
    providing an easy workaround for ABI problems for existing, working
    programs.

    Concerning customers, they will tolerate a lot of things, as long
    as there are benefits (faster

    Didn't work out for Alpha.

    or cheaper machines,

    People are abandoning PCs in favour of Raspis? Does not look that way
    to me.

    better security,

    Oh, really? Which machine became a success because of better security?

    etc.) and fixes require reasonable amount of work.

    Many customers expect a machine that's compatible with their legacy
    software, and are not willing (or at all able) to "fix" it. Many even
    require machines that are officially supported by the software vendor.
    And for a software vendor, the need for one fix is probably a sign
    that the platform is not as compatible as they would like, and that
    qualifying that platform requires more work, and they will charge that
    work to the platform's customers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Waldek Hebisch on Fri Jan 10 14:43:31 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happened many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.

    That was a BSD thing. USL spent a fair bit of time fixing
    BSD utilities that relied on the BSD behaviour when porting
    to System V release 4.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Fri Jan 10 15:17:29 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Such things happened many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.

    That case is interesting. It's certainly a benefit to programmers if
    most uses of NULL produce a SIGSEGV, but for existing programs mapping >allowing to have accessible memory in page 0 is an advantage. So how
    did we get from there to where we are now?

    First, my guess is that the VAX is only called out because it was so
    popular, and it was one of the first Unix machines where doing it
    differently was possible. I am sure that earlier Unix tragets without >virtual memory used memory starting with address 1 because they would >otherwise have wasted precious memory.

    It was a bug. As I recall, the first thing in the address space in Berkeley Unix
    was a register save mask where the low byte happened to be zero, and a lot of sloppy programs written by students accidentally depended on it, e.g.

    if(*p == 0) /* no string */

    For a while ports to 68K and other architectures ensured there was a zero byte at
    location zero so the Berkeley programs wouldn't crash, but eventually people fixed
    the code.

    Location 0 on the PDP-11 had nothing memorable and we did our string tests correctly.
    You could dereference a null pointer but you got a string of junk.




    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Fri Jan 10 18:39:12 2025
    On 09/01/2025 08:23, Anton Ertl wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    Why should an ABI be tolerant of such differences? In C, calling a
    function with an unexpected number (or type) of arguments has always
    been undefined behaviour, and always been something that programmers
    have strived to avoid. For variadic functions (including old
    pre-standard functions), the code does not declare the number or types
    of arguments, but you still have to match up the caller and callee.
    Call printf() with a mismatch between the format string and the
    arguments, and you can expect nasal daemons.

    I am all in favour of things like ABIs not intentionally making things significantly worse - no one wants a system that turns code bugs into
    something like an exploitable security hole from stack corruption.

    However, I see no good reason to try to make things "work" with broken
    code. An ABI should be designed with an emphasis on being efficient for correct code - not for being tolerant of hopelessly incorrect code.

    C /does/ require support for variadic functions, so that has to be in
    any ABI usable with C.


    I can agree that it's important to support those use-cases (varargs
    obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?


    I would, yes. The efficiency of good code should not suffer because of
    the existence of bad code.

    I'd still try to avoid making results that are more dangerous than
    necessary. Maybe you do that by making it clear to compiler writers
    that the ABI should not be used with a compiler that supports implicit
    function declarations - that would block most risky or broken code at
    compile time. Perhaps you say that object files using this ABI need
    extra sections holding basic information about the function's parameters
    with its definition, and about the arguments when calling the function,
    and encourage linkers to check for mismatches. There would surely be
    cases where you can't check - casts of function pointer types, dynamic
    linking, etc., - but you would again eliminate a large proportion of errors.

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    How many people actually want to use code where some functions are
    called with an incorrect number of parameters? Such code is /broken/.
    If it ever gave results that the original users were happy with, it is
    by luck - no matter what ABI you have for your new architecture and new
    tools, it's pure luck whether things work or not in any sense.

    So the best you can do for your prospective customers is tell them that
    you prioritise the results for correct code and help them with tools to
    find mistakes in their ancient broken code.

    Accepting the unfortunate reality that most code of a significant size
    has /some/ bugs in it does not mean it is a good idea to reduce the
    efficiency of good code in a vain attempt at replicating the luck of old undefined behaviour on other platforms! That is especially true for a
    class of error that only exists due to very sloppy development
    practices, and should be identifiable by automatic linting and static
    checking.

    It never ceases to disappoint me how lax C is at fixing things that were
    design flaws from day one of the language. Backwards compatibility is
    very important, but allowing such crappy coding to be accepted simply encourages more people to write crappy code for longer. C compilers are
    even worse, as they usually support crappy code for longer than the C standards. Implicit function declarations were removed from the C
    language in C99, and non-prototype declarations were made obsolescent in
    C90, yet not removed from the language until C23.


    only sloppy ancient C calls
    functions without proper declarations)

    You find it ok to design a calling convention such that ancient C
    programs do not work?


    My original post was about an ABI for microcontroller programming. For
    that use, my answer is a definite "yes".

    For more general use, my answer would also be "yes" for a new
    architecture and ABI. I don't see why anyone should pander to ancient
    sloppy code. If there really is a significant body of C code that does
    not use function prototypes, and that code really is still useful and
    relevant, then it should not be much of a challenge to write a little
    utility program that converts the old code to something more modern.
    Maybe clang-format can already do that.

    What benefit do you expect from such a calling convention? To allow
    to use registers as arguments (and not callee-saved) that would
    otherwise be preferably used as callee-saved registers?


    That sounds like a benefit to me.


    even if it comes at the cost of
    using different calling conventions for the two cases.

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture.

    I would certainly be OK with that. I can understand that some people
    will disagree, but I really think there are better ways to handle old
    and/or broken code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Fri Jan 10 18:39:19 2025
    David Brown <david.brown@hesbynett.no> writes:
    On 09/01/2025 08:23, Anton Ertl wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    Why should an ABI be tolerant of such differences? In C, calling a
    function with an unexpected number (or type) of arguments has always
    been undefined behaviour, and always been something that programmers
    have strived to avoid. For variadic functions (including old
    pre-standard functions), the code does not declare the number or types
    of arguments, but you still have to match up the caller and callee.

    I'm not sure that's completely true. Consider, for example,
    main(). It's sort of variadic, but most applications only declare
    the standard C argc/argv arguments. POSIX systems supply
    a third parameter (envp) and most unix/linux implementations
    supply a fourth parameter (auxv).

    I should think so long as the caller provides at least enough
    parameters to match the callee, there shouldn't be any
    issues.

    Call printf() with a mismatch between the format string and the
    arguments, and you can expect nasal daemons.

    Not if you provide _more_ parameters than the format string
    requires, which can happen with e.g. i18n error message strings.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Fri Jan 10 19:19:08 2025
    David Brown <david.brown@hesbynett.no> schrieb:

    How many people actually want to use code where some functions are
    called with an incorrect number of parameters? Such code is /broken/.

    Agreed (at least in principle).

    If it ever gave results that the original users were happy with, it is
    by luck - no matter what ABI you have for your new architecture and new tools, it's pure luck whether things work or not in any sense.

    It gets worse when the code in question has been around for decades,
    and is widely used. Some ABIs, such as the x86-64 psABI, are very
    forgiving of errors.

    So the best you can do for your prospective customers is tell them that
    you prioritise the results for correct code and help them with tools to
    find mistakes in their ancient broken code.

    Now, you can also tell them to use LTO for checks for any old
    software.

    Example:

    $ cat main.c
    #include <stdio.h>

    int foo(int);

    int main()
    {
      printf ("%d\n", foo(42));
    }
    $ cat foo.c
    int foo (int a, int b)
    {
      return a + 2;
    }
    $ gcc -O2 -flto main.c foo.c
    main.c:3:5: warning: type of 'foo' does not match original declaration [-Wlto-type-mismatch]
        3 | int foo(int);
          |     ^
    foo.c:1:5: note: type mismatch in parameter 2
        1 | int foo (int a, int b)
          |     ^
    foo.c:1:5: note: type 'int' should match type 'void'
    foo.c:1:5: note: 'foo' was previously declared here

    This also works when the declaration is hidden (for example when
    the violating code is emitted by a compiler for another language
    in the same compiler collection):

    $ cat main.f90
    program main
      implicit none
      interface
        function foo(a) result(ret) bind(c)
          use, intrinsic :: iso_c_binding, only: c_int
          integer(c_int), value :: a
          integer(c_int) :: ret
        end function foo
      end interface
      print *,foo(42)
    end program main
    $ gfortran -O2 -flto main.f90 foo.c
    main.f90:10:17: warning: type of 'foo' does not match original declaration [-Wlto-type-mismatch]
       10 |   print *,foo(42)
          |                 ^
    foo.c:1:5: note: type mismatch in parameter 2
        1 | int foo (int a, int b)
          |     ^
    foo.c:1:5: note: type 'int' should match type 'void'
    foo.c:1:5: note: 'foo' was previously declared here

    Excuses are running out.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Sun Jan 12 14:55:04 2025
    On 10/01/2025 19:39, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 09/01/2025 08:23, Anton Ertl wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the >>>>> number of arguments in the caller and callee.

    Why should an ABI be tolerant of such differences? In C, calling a
    function with an unexpected number (or type) of arguments has always
    been undefined behaviour, and always been something that programmers
    have strived to avoid. For variadic functions (including old
    pre-standard functions), the code does not declare the number or types
    of arguments, but you still have to match up the caller and callee.

    I'm not sure that's completely true. Consider, for example,
    main(). It's sort of variadic, but most applications only declare
    the standard C argc/argv arguments. POSIX systems supply
    a third parameter (envp) and most unix/linux implementations
    supply a fourth parameter (auxv).

    I should think so long as the caller provides at least enough
    parameters to match the callee, there shouldn't be any
    issues.

    main() is a special case in C and C++ - it seems fine to say that it
    takes a particular implementation-defined set of parameters no matter
    how it is declared. If it is defined with fewer parameters than the implementation supports, then the definition should be treated as though
    those parameters were included but not used.


    Call printf() with a mismatch between the format string and the
    arguments, and you can expect nasal daemons.

    Not if you provide _more_ parameters than the format string
    requires, which can happen with e.g. i18n error message strings.


    I've always thought printf was a very unsafe design concept - that usage
    does not help!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Sun Jan 12 14:59:09 2025
    On 10/01/2025 20:19, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    How many people actually want to use code where some functions are
    called with an incorrect number of parameters? Such code is /broken/.

    Agreed (at least in principle).

    If it ever gave results that the original users were happy with, it is
    by luck - no matter what ABI you have for your new architecture and new
    tools, it's pure luck whether things work or not in any sense.

    It gets worse when the code in question has been around for decades,
    and is widely used. Some ABIs, such as the x86-64 psABI, are very
    forgiving of errors.

    So the best you can do for your prospective customers is tell them that
    you prioritise the results for correct code and help them with tools to
    find mistakes in their ancient broken code.

    Now, you can also tell them to use LTO for checks for any old
    software.


    Excuses are running out.

    Yes, exactly.

    I saw somewhere a quotation that backwards compatibility just means
    repeating the same old mistakes. Backwards compatibility /is/
    important, but so is trying to improve coding practices!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Waldek Hebisch on Sun Jan 12 12:10:50 2025
    On 1/6/2025 6:11 PM, Waldek Hebisch wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi
    randomly into Lapack. First routine that I looked at had
    8 arguments, so within your limit. Second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    Large number of arguments is typical in old style Fortran numeric
    code.

    While there has been much discussion down thread relating to Waldek's
    other points, there hasn't been much about these.

    So, some questions. Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles? Or do these subroutines consume so many CPU cycles that the overhead of the large number of parameters is lost in the noise? Or is
    there some other explanation for Mitch not considering their importance?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Sun Jan 12 20:41:23 2025
    On Sun, 12 Jan 2025 20:10:50 +0000, Stephen Fuld wrote:

    On 1/6/2025 6:11 PM, Waldek Hebisch wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there more registers >>>> available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi
    randomly into Lapack. First routine that I looked at had
    8 arguments, so within your limit. Second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    Large number of arguments is typical in old style Fortran numeric
    code.

    While there has been much discussion down thread relating to Waldek's
    other points, there hasn't been much about these.

    So, some questions. Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles? Or do these subroutines consume so many CPU cycles that the overhead of the large number of parameters is lost in the noise? Or is
    there some other explanation for Mitch not considering their importance?

    Almost entirely the latter.

    The 13 cycles of overhead are invisible in a subroutine
    that takes 1B cycles to execute.

    But also note: The way Fortran passes array arguments is just perfect
    for avoiding almost all bounds checks. Arrays are used in loops where
    the initialization and termination are stored in the dope vector--
    which is trusted.
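
    A hedged C rendition of that point (the descriptor layout is invented
    for illustration): when the loop bounds come from the array's own
    descriptor, the loop condition subsumes the bounds check and no
    per-access check is needed.

    struct dope { double *base; long n; };   /* descriptor: data + extent */

    double sum(struct dope a)
    {
      double s = 0.0;
      for (long i = 0; i < a.n; i++)   /* bounds trusted from descriptor */
        s += a.base[i];
      return s;
    }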

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jan 13 02:10:10 2025
    On Fri, 10 Jan 2025 10:25:23 +0000, Anton Ertl wrote:

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    confoming standard programs? How many will be alarmed by your
    admission that you find it ok that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happended many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.

    That case is interesting. It's certainly a benefit to programmers if
    most uses of NULL produce a SIGSEGV, but for existing programs mapping allowing to have accessible memory in page 0 is an advantage. So how
    did we get from there to where we are now?

    The blame goes to defining NULL as a pointer that is not pointing at
    anything. We have no integer that has the property of one value that
    is not an integer--we COULD have had such a value (NEG_MAX on 2's
    complement, -0 on 1's complement), but no..........

    C 'errno' was made more abstract due
    to multithreading, it broke some programs.

    errno is an atrocity all by itself; single-handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Lobbing errno over into Thread Local Store just makes the problems
    worse.

    That's pretty similar to an ABI issue (not sure if errno is in the
    ABIs or not).

    errno is not ABI, errno is part of subroutine definitions within
    a library. That errno can be set from different libraries, and
    that errno got dropped in TLS makes it doubly idiotic.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Stephen Fuld on Mon Jan 13 01:20:38 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 1/6/2025 6:11 PM, Waldek Hebisch wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there more registers >>>> available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi
    randomly into Lapack. First routine that I looked at had
    8 arguments, so within your limit. Second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    Large number of arguments is typical in old style Fortran numeric
    code.

    While there has been much discussion down thread relating to Waldek's
    other points, there hasn't been much about these.

    So, some questions. Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles? Or do these subroutines consume so many CPU cycles that the overhead of the large number of parameters is lost in the noise? Or is
    there some other explanation for Mitch not considering their importance?

    Some comments to this:

    You are implicitly assuming that passing a large number of
    arguments is expensive. Of course, if you can do the job with
    a smaller number of arguments, then there may be some saving.
    However, a large number of arguments is partially there to increase
    performance. Let me illustrate this with an example having a
    smaller number of arguments. I have a routine which is briefly
    described below:

    ++ vector_combination(v1, c1, v2, c2, n, delta, p) replaces
    ++ first n + 1 entries of v1 by corresponding entries of
    ++ c1*v1+c2*x^delta*v2 mod p.

    There are 7 arguments here and it only deals with vectors (one-
    dimensional arrays). Instead of the routine above I could use
    5 separate routines: one to extract a subvector, one shifting
    entries, one multiplying a vector by a scalar, one for addition
    and one for replacing a subvector. Using separate routines
    would take roughly 3-5 times more time and require
    intermediate storage. Dynamically allocating this storage
    would decrease performance, and reusing statically allocated
    work vectors would significantly complicate the code. And
    of course having 5 calls instead of a single one also would
    complicate the code. So basically, I can use a routine with a
    large number of arguments which is doing more work and have
    simpler and faster code, or I could use "simpler" routines with
    a small number of arguments and get more complicated and slower
    code.
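
    As a rough C sketch of what such a combined routine does (hypothetical
    types and coefficient handling, not the actual library code, and ignoring
    overflow in the modular products):

    /* Replace the first n+1 entries of v1 by (c1*v1 + c2*x^delta*v2) mod p.
       Multiplying v2 by x^delta shifts its entries up by delta positions;
       shifted-in entries are treated as 0. */
    void vector_combination(long *v1, long c1, const long *v2, long c2,
                            long n, long delta, long p)
    {
        for (long i = 0; i <= n; i++) {
            long t = (c1 % p) * (v1[i] % p) % p;
            if (i >= delta)
                t = (t + (c2 % p) * (v2[i - delta] % p)) % p;
            v1[i] = ((t % p) + p) % p;   /* normalize into [0, p) */
        }
    }

    The point is that one pass over the data does the scaling, the shift and
    the addition at once, where the five separate routines would each traverse
    memory on their own.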

    My routine above was for vectors; a similar routine for arrays
    would have a larger number of parameters, exceeding 8. Actually,
    an already more general routine for vectors would have an extra
    parameter to specify the starting index (which currently is
    assumed to be 0 and is the only case that I need).

    In the case of Lapack, a reasonably typical case is a routine operating
    on a subblock of an array, which means that an array (subblock) is
    described by 4 arguments: a pointer to the first element, the leading
    dimension (that is, the corresponding dimension of the containing array)
    and the 2 dimensions of the subblock. Some dimensions may be shared, but
    clearly even in the simplest case there are several parameters.
    There may be additional numeric parameters, work areas, parameters
    specifying if an array is transposed or not (otherwise there would
    be need for separate routines or the user would be forced into a separate
    call of matrix transposition). There is a convention of
    returning information about possible errors in an 'INFO' variable.
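
    For illustration, a minimal C sketch of that convention (column-major
    storage as in Fortran; the names here are made up):

    /* An m-by-n subblock is described by: a = address of its first element,
       lda = leading dimension of the containing array, plus m and n. */
    void scale_subblock(double *a, int lda, int m, int n, double alpha)
    {
        for (int j = 0; j < n; j++)            /* columns */
            for (int i = 0; i < m; i++)        /* rows    */
                a[i + (long)j * lda] *= alpha; /* element (i,j) of the subblock */
    }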

    Lapack also has inefficiency due to Fortran conventions. Namely,
    in a natural C interface most arguments would be passed by value,
    but Fortran compilers pass arguments by reference. So
    even if all machine-level arguments were passed in registers,
    the values still need to be saved in memory by the caller and
    read back by the called routine.
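
    A hypothetical pair of routines (not from Lapack) illustrates the
    difference:

    /* Natural C interface: n and alpha arrive by value and can live in
       registers across the call. */
    void daxpy_byvalue(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    /* Fortran-style interface: everything by reference, so the caller must
       put n and alpha in memory and the callee must load them back. */
    void daxpy_byref(const int *n, const double *alpha, const double *x,
                     double *y)
    {
        int    nn = *n;        /* read the scalar arguments back from memory */
        double aa = *alpha;
        for (int i = 0; i < nn; i++)
            y[i] += aa * x[i];
    }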

    Modern languages have support for records/structures, so at source
    code level the number of arguments may be smaller. However, passing
    structures by address is efficient when structures are only
    passed down (quite a typical case in modern code, where data goes
    through several layers before doing real work), but it incurs a cost
    when there is actual access. Passing structures by value means
    that the number of parameters is nominally smaller, but there is
    still a need to pass several values.

    Concerning what machine architects do: for a long time the goal was
    high _average_ performance, based on some prediction of load.
    A large number of arguments is reasonably frequent in scientific
    codes. The modern tendency is to pass addresses of aggregates, as
    that is better behaved in OO contexts. I am not aware of any
    publicly available substantial body of realistic COBOL code,
    but a reasonable guess is that COBOL routines do quite a lot of
    work between calls. In a non-OO, non-functional context the compiler
    can inline small routines, effectively leading to a case where
    calls are relatively rare. AFAIR, for the initial AMD-64 gcc port
    the Suse team that did it claimed about 2-3% better performance due
    to a complicated calling convention trying to optimize use
    of registers. In particular they measured the object code size
    of a large body of Linux programs (they had no real hardware,
    so were unable to measure code speed) and optimized the convention
    based on this. Later, an Intel team claimed that due to improved
    inlining calls were rare and the effect of the calling convention was
    of the order of a fraction of a percent. Of course, AMD-64 is limited
    by its 16 general purpose registers; on a machine with more registers
    one can pass more arguments in registers, but I doubt that it
    pays to go above 10-12. OTOH I think that having more
    return registers (I mean a number comparable to the argument-passing
    registers) would improve performance, but probably
    code returning many values is so rare that architects do not
    care much.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Jan 13 14:19:43 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 10 Jan 2025 10:25:23 +0000, Anton Ertl wrote:

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happened many times in the past. AFAIK the standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.

    That case is interesting. It's certainly a benefit to programmers if
    most uses of NULL produce a SIGSEGV, but for existing programs a mapping
    that allows accessible memory in page 0 is an advantage. So how
    did we get from there to where we are now?

    The blame goes to defining NULL as a pointer that is not pointing at
    anything. We have no integer value reserved to mean "not an integer"--we
    COULD have had such a value (NEG_MAX on 2's
    complement, -0 on 1's complement), but no..........

    One of the advantages of BCD systems - we could define a NULL
    pointer value that was non-zero, non-numeric, and didn't point
    to anything.

    (0xc0eeeeee).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Jan 13 10:55:15 2025
    Anton Ertl [2025-01-09 08:38:32] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    AFAIK in these cases the same compiler generates the code for the
    function and for the calls, so it should be pretty much free to use any
    calling convention it likes.
    With separate compilation, the compiler does not know which other
    compiler generates the code for the caller of a function or the callee
    of a function.

    My reply was to:

    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting.

    Can you show me an example where that happens and where the caller and
    the callee can be generated by different compilers?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Mon Jan 13 18:02:10 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Mon Jan 13 19:00:53 2025
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Mon Jan 13 21:33:32 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles?

    It's less than it used to be in the days when supercomputers
    roamed the computer centers, but for these applications where
    it matters, it can be significant.

    Or do these subroutines consume so many CPU cycles that the
    overhead of the large number of parameters is lost in the noise?

    If you have many small matrices to multiply, startup overhead
    can be quite significant. Not on a 2000*2000 matrix, though.

    Or is
    there some other explanation for Mitch not considering their importance?

    I think eight arguments, passed by reference in registers, is not
    too bad.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Jan 13 21:53:55 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Jan 13 22:02:23 2025
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and producing an IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    But with functions that take 754 arguments and produce 754
    results, it seems unnecessary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Jan 13 22:40:02 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Jan 14 02:32:23 2025
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 14 06:20:43 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    That makes SIMD-style vectorization of transcendentals...
    interesting.

    Hmmm... looking around, it seems that C++ has the same requirement
    since C++11. One more reason why Fortran is a better language
    for numerics than C++...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to All on Tue Jan 14 06:48:45 2025
    I wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles?

    It's less than it used to be in the days when supercomputers
    roamed the computer centers, but for these applications where
    it matters, it can be significant.

    Or do these subroutines consume so many CPU cycles that the
    overhead of the large number of parameters is lost in the noise?

    If you have many small matrices to multiply, startup overhead
    can be quite significant. Not on a 2000*2000 matrix, though.

    Or is
    there some other explanation for Mitch not considering their importance?

    I think eight arguments, passed by reference in registers, is not
    too bad.

    ... when the rest can be passed on the stack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Tue Jan 14 15:05:52 2025
    On 13/01/2025 23:40, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.


    You know POSIX better than I do, but AFAIK "math_errhandling" is a fixed
    value set by the implementation, usually as a macro. Certainly with a
    quick check with gcc on Linux, I could not set the bits in math_errhandling.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Tue Jan 14 15:08:03 2025
    On 14/01/2025 03:32, MitchAlsup1 wrote:
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions.  Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide.  If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    Seems crazy to me too.

    gcc at least has a "-fno-math-errno" flag that skips errno setting for
    maths functions that are executed as a single instruction. That makes a
    big difference to things like "sqrt".
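
    A small example of the effect (the generated code is compiler- and
    target-dependent; the comments describe typical x86-64 gcc output and are
    worth verifying with a local compile):

    #include <math.h>

    double root(double x)
    {
        return sqrt(x);
    }

    /* With the default -fmath-errno, gcc typically emits the square-root
       instruction plus a compare and a conditional call to the library
       sqrt() for negative arguments, so that errno can be set.
       With -fno-math-errno the compare and call disappear and only the
       single square-root instruction remains. */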

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jan 14 14:22:19 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    The subroutine clearly does more than "do all the work in a single instruction".

    How does your instruction support all the functionality
    required by the POSIX specification for the sin(3) library function?

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Tue Jan 14 14:39:15 2025
    David Brown <david.brown@hesbynett.no> writes:
    On 13/01/2025 23:40, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.


    You know POSIX better than I do, but AFAIK "math_errhandling" is a fixed
    value set by the implementation, usually as a macro. Certainly with a
    quick check with gcc on Linux, I could not set the bits in math_errhandling.


    Yes, the programmer in this case would instruct the compiler what
    the value of math_errhandling should be, e.g. with -ffast-math.

    https://gcc.gnu.org/wiki/FloatingPointMath

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Tue Jan 14 16:41:28 2025
    On Tue, 14 Jan 2025 14:22:19 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing >>>>>>> direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    The subroutine clearly does more than "do all the work in a single instruction".

    How does your instruction support all the functionality
    required by the POSIX specification for the sin(3) library function?

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html


    I see no problems for as long as (math_errhandling & MATH_ERRNO)==0.
    Which sounds like the more sensible choice regardless of the question of
    instruction vs library.
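
    A minimal sketch of how a portable caller can test which reporting method
    the implementation actually provides (standard C99 macros; the error
    handling here is only illustrative):

    #include <errno.h>
    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    double checked_log(double x)
    {
        errno = 0;
        feclearexcept(FE_ALL_EXCEPT);
        double r = log(x);
        if ((math_errhandling & MATH_ERRNO) && errno == EDOM)
            fprintf(stderr, "log: domain error (errno)\n");
        if ((math_errhandling & MATH_ERREXCEPT) && fetestexcept(FE_INVALID))
            fprintf(stderr, "log: domain error (FE_INVALID)\n");
        return r;
    }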

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Raising of FP exceptions is orthogonal to the question of one instruction
    vs a library call. If anything, when exceptions are enabled, with a
    single-instruction implementation it is probably easier for the exception
    handler to find the reason and generate useful diagnostics.

    As to what POSIX allows, on the manual page that you quoted I see no
    indication that the implementation is required to let the programmer
    select this or that behavior. I read it as saying the implementation is
    allowed to make the choice fully by itself.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Tue Jan 14 16:50:26 2025
    On 14/01/2025 15:39, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 13/01/2025 23:40, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing >>>>>>>> direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.


    You know POSIX better than I do, but AFAIK "math_errhandling" is a fixed
    value set by the implementation, usually as a macro. Certainly with a
    quick check with gcc on Linux, I could not set the bits in math_errhandling.

    Yes, the programmer in this case would instruct the compiler what
    the value of math_errhandling should be, e.g. with -ffast-math.

    https://gcc.gnu.org/wiki/FloatingPointMath

    I would say the key flag here is "-fno-math-errno" (which is included in -ffast-math). While I personally think most floating point code could
    be used just as well with "-ffast-math", and it is certainly appropriate
    for my own code, others have significantly different opinions or
    experiences. (That's fair enough.) That one flag simply disables
    setting errno in maths functions that can (reasonably) be implemented
    inline as instructions, without affecting the results of any other
    floating point operations.

    But to my mind, this is /not/ a case of the POSIX programmer making the
    choice - it is an implementation-specific feature. A C compiler might
    choose to always use errno, or never, or have some other control of the
    use of errno. When you write "POSIX leaves it up to the programmer", I
    take that to mean POSIX specifies a function that lets you change the
    value of "math_errhandling". That is quite different from saying "gcc
    has a flag that lets you choose".

    (For my own use, I like the flag - I don't write POSIX code, I have
    never had any use for errno, and I want the compiler to generate as few instructions as it possibly can.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Waldek Hebisch on Tue Jan 14 09:40:27 2025
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 1/6/2025 6:11 PM, Waldek Hebisch wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there were more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi
    randomly into Lapack. First routine that I looked at had
    8 arguments, so within your limit. Second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    Large number of arguments is typical in old style Fortran numeric
    code.

    While there has been much discussion down thread relating to Waldek's
    other points, there hasn't been much about these.

    So, some questions. Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles? Or do these subroutines consume so many CPU cycles that the
    overhead of the large number of parameters is lost in the noise? Or is
    there some other explanation for Mitch not considering their importance?

    Some comments to this:

    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess. I am actually assuming that passing arguments in memory is
    more expensive than passing them in registers. I don't think that is controversial.


    Of course, if you can do the job with
    smaller number of arguments, then there may be some saving.
    However, large number of arguments is partially to increase
    performance.

    I agree with your example below, which I snipped. My comment was more
    about how the system implements argument passing (i.e. the number of registers used for the purpose) than about source code changes (fewer
    calls with more arguments versus more calls with fewer arguments). Specifically, I was not suggesting changing the source code to reduce
    the number of arguments.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Jan 14 18:03:36 2025
    On Tue, 14 Jan 2025 6:48:45 +0000, Thomas Koenig wrote:

    I wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of >>> CPU cycles?

    It's less than it used to be in the days when supercomputers
    roamed the computer centers, but for these applications where
    it matters, it can be significant.

    Or do these subroutines consume so many CPU cycles that the
    overhead of the large number of parameters is lost in the noise?

    If you have many small matrices to multiply, startup overhead
    can be quite significant. Not on a 2000*2000 matrix, though.

    Or is
    there some other explanation for Mitch not considering their importance?

    I think eight arguments, passed by reference in registers, is not
    too bad.

    .... when the rest can be passed on the stack.

    And those passed in registers can be stored into memory adjacent
    to the memory arguments easily.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Tue Jan 14 18:02:29 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 14 Jan 2025 14:22:19 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:
    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Raising of FP exceptions is orthogonal to question of one instruction
    vs library call. If anything, when exceptions are enabled, with single-instruction implementation it is probably easier for exception
    handler to find the reason and generate useful diagnostics.

    It seems to me that "raise an exception" is in the IEEE 754 sense (by
    default set a sticky flag in an internal register), not in the C sense
    of raising a signal. AFAIK you can tell the system to produce a
    signal for some exceptions, but the default on Linux is not to.

    As to what POSIX allows, on the manual page that you quoted I see no indication that implementation is required to give to programmer to
    select this or that behavior. I read it like implementation is allowed
    to make the choice fully by itself.

    And if it is friendly, it can give the programmer a compiler option to
    select between the variants.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Tue Jan 14 19:18:27 2025
    Stephen Fuld wrote:
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess.  I am actually assuming that passing arguments in memory is
    more expensive than passing them in registers.  I don't think that is controversial.

    Usually true, except for recursive functions where you have to store
    most stuff on the stack anyway, so going directly there can sometimes
    generate more compact code.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 14 18:19:12 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Jan 14 18:15:06 2025
    On Tue, 14 Jan 2025 14:22:19 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing >>>>>>>> direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    The subroutine clearly does more than "do all the work in a single instruction".

    All of the work of computing sin(x) is performed in a single
    instruction.

    So we have a subroutine that looks like::

    double library_sin( double x )
    {
        // the work
        double r = My_66000_sin(x);   // along with setting the flag bits

        // the overhead
        if( FP_Classify( x, NaN | INFINITY | ... ) )
        {
            errno_p tls = TLS();
            if( FP_Classify( x, NaN ) )      tls->errno = errno_NaN;
            if( FP_Classify( x, INFINITY ) ) tls->errno = errno_infinity;
            ...
        }
        return r;
    }

    How does your instruction support all the functionality
    required by the POSIX specification for the sin(3) library function?

    Except for the setting of errno.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Tue Jan 14 19:39:06 2025
    Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    Without exposing any internal discussions, it should be obvious to
    anyone "versed in the field" that the ieee754 standard has some warts
    and mistakes. It has been possible to correct very few of them since 1978.

    OTOH, Kahan & co did an amazingly good job to start with, the fact that
    they didn't really consider the needs of massively parallel
    implementations 40-50 years later cannot be blamed on them.

    It is possible that one or two of the grandfather clauses in 754 can be
    removed in the future, simply because the architectures that made those exceptional choices are going away permanently.

    I do not see any way to support things like "trap and rescale" as a way
    to handle exponent overruns, even though that was a neat idea back then.

    It is much more likely that we will simply switch to quad/f128 (or even arbitrary precision) for those few computations that could need it.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Jan 14 19:08:39 2025
    On Tue, 14 Jan 2025 18:19:12 +0000, Thomas Koenig wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    When the numeric calculation of sin() takes 150 cycles, the over-
    head matters little.

    When the numeric calculation of sin() takes 15 cycles, the over-
    head is more noticeable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Jan 14 19:24:16 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    I spent several years on one of those committees[*] in the 90s. There were math and IEEE
    FP experts who very carefully considered all the consequences of changes
    to the math interfaces.

    [*] X/Open -> The Open Group

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Jan 14 19:14:13 2025
    On Tue, 14 Jan 2025 18:39:06 +0000, Terje Mathisen wrote:

    Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    Without exposing any internal discussions, it should be obvious to
    anyone "versed in the field" that the ieee754 standard has some warts
    and mistakes. It has been possible to correct very few of them since
    1978.

    OTOH, Kahan & co did an amazingly good job to start with, the fact that
    they didn't really consider the needs of massively parallel
    implementations 40-50 years later cannot be blamed on them.

    CDC STAR and CRAY-1 were showing the massive parallelism well before
    754 ever sat down.

    In addition, the fallacy of exception and repair was also known to be
    a failure well before 754 had their first meeting.

    It is possible that one or two of the grandfather clauses in 754 can be removed in the future, simply because the architectures that made those exceptional choices are going away permanently.

    I do not see any way to support things like "trap and rescale" as a way
    to handle exponent overruns, even though that was a neat idea back then.

    And just how many EVER used said feature ???

    It is much more likely that we will simply switch to quad/f128 (or even arbitrary precision) for those few computations that could need it.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Tue Jan 14 20:01:14 2025
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    Without exposing any internal discussions, it should be obvious to
    anyone "versed in the field" that the ieee754 standard has some warts
    and mistakes. It has been possible to correct very few of them since 1978.

    I'm not throwing shade on the IEEE committee; they did quite a good
    job, considering what they did and did not know.

    What I was criticising was the committee(s) which made errno handling
    for functions like sin() and cos() mandatory, and put activating
    it in a global flag.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Tue Jan 14 23:13:40 2025
    On Tue, 14 Jan 2025 20:31:59 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    I spent several years on one of those committees[*] in the 90s.
    There were math and IEEE FP experts who very carefully considered
    all the consequences of changes to the math interfaces.

    Putting in mandatory errno handling for transcendental intrinsics,
    and making this dependent on a global flag, was a huge mistake.


    Except that they didn't make this particular mistake.
    Please read the other messages of the sub-thread.

    Either the people on that particular committee didn't consider
    the consequences, or they (second option to the one above) didn't
    understand the consequences of what they were doing. Vector computers
    had already been in service for a decade when POSIX was released,
    and a question "Would it run well on a Cray" would have answered
    itself.


    The Cray-1 was still small enough for errno-based handling of errors in
    trigs not to be a serious obstacle. That is, not the Cray-1 itself, but an
    imaginary machine with an organization similar to the Cray, but with more
    consistent FP arithmetic.

    OTOH, they can be excused if they thought that C should not
    be used for serious numerical work, and would not be. People had
    FORTRAN for that...

    I would guess that today the majority of numerical work is done from Python
    by calling libraries. Libraries tend to be written in highly
    non-portable dialects of the C language and sometimes in C++. I don't
    expect that a measurable amount of Fortran is used in the creation of the
    libraries.
    Now, whether one considers the overwhelming majority of today's numerical
    work "serious" is a separate question. But it certainly is a very
    serious business.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 14 20:31:59 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    I spent several years on one of those committees[*] in the 90s. There were math and IEEE
    FP experts who very carefully considered all the consequences of changes
    to the math interfaces.

    Putting in mandatory errno handling for transcendental intrinsics,
    and making this dependent on a global flag, was a huge mistake.

    Either the people on that particular committee didn't consider
    the consequences, or they (second option to the one above) didn't
    understand the consequences of what they were doing. Vector computers
    had already been in service for a decade when POSIX was released,
    and a question "Would it run well on a Cray" would have answered
    itself.

    OTOH, they can be excused if they thought that C should not
    be used for serious numerical work, and would not be. People had
    FORTRAN for that...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Tue Jan 14 23:48:19 2025
    On Tue, 14 Jan 2025 19:18:27 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Stephen Fuld wrote:
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess.  I am actually assuming that passing arguments in memory
    is more expensive than passing them in registers.  I don't think
    that is controversial.

    Usually true, except for recursive functions where you have to store
    most stuff on the stack anyway, so going directly there can sometimes generate more compact code.

    Terje


    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time. Maybe, except for
    functions with 0-4 parameters.
    Do common Fortran compilers even bother with passing in registers?
    It would require replacing the natural by-reference "pointer in
    register points to value in memory" calling sequence with something like
    copy-in/copy-out, right?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Jan 14 22:05:20 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    Without exposing any internal discussions, it should be obvious to
    anyone "versed in the field" that the ieee754 standard has some warts
    and mistakes. It has been possible to correct very few of them since 1978.

    I'm not throwing shade on the IEEE committe, they did quite a good
    job, considering what they did and did not know.

    What I was criticising was the comittee(s) which made errno handling
    for functions like sin() and cos() mandatory, and put activating
    it in a globel flag.

    It's not mandatory. It's listed as an optional extension, and
    even when implemented, it's opt-in at compile time.

    "The functionality described is optional. The functionality
    described is mandated by the ISO C standard only for implementations
    that define __STDC_IEC_559__."

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Wed Jan 15 00:09:58 2025
    On Tue, 14 Jan 2025 19:39:06 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:


    It is much more likely that we will simply switch to quad/f128 (or
    even arbitrary precision) for those few computations that could need
    it.

    Terje



    Yesterday/today I had one of those computations that can benefit from quad
    and maybe from octuple precision:
    the design of an equiripple symmetric FIR filter with ~2000 coefficients
    (more commonly called taps) using the Parks-McClellan method. It is
    implemented by a Matlab/Octave function that traditionally was called remez
    and is now called firpm. I suppose the new name was invented because the
    algorithm is only similar to Remez exchange, but differs in details.

    Octave failed to do it, citing the limited precision of arithmetic as the
    reason.

    The Matlab implementation is better, and it was able to create a filter
    with ~1800 taps, which happened to be sufficient for my needs today.
    But even Matlab was unable to cope with 2000 taps.

    If I had more time, I'd try to implement the Parks-McClellan algorithm
    myself, to see the bottlenecks and see whether higher precision helps a
    lot, or just a little. Unfortunately, right now I am too busy with work.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Jan 14 23:27:22 2025
    On Tue, 14 Jan 2025 21:48:19 +0000, Michael S wrote:

    On Tue, 14 Jan 2025 19:18:27 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Stephen Fuld wrote:
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess.  I am actually assuming that passing arguments in memory
    is more expensive than passing them in registers.  I don't think
    that is controversial.

    Usually true, except for recursive functions where you have to store
    most stuff on the stack anyway, so going directly there can sometimes
    generate more compact code.

    Terje


    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time.

    Pass by COMMON block was even faster.

    May be, except for
    functions with 0-4 parameters.

    Do common Fortran compilers even bother with passing in registers?

    Fortran compilers are given an ABI (leaning towards C, C++) and
    are required to "do something reasonable" in mapping Fortran
    conventions into C conventions. C subroutines on the called
    side, then, have to have a data structure identical to what
    Fortran compiler would have produced (Dope Vector). C callers
    will have to use those kinds of structures to successfully
    call Fortran entry points.

    It would require replacement of natural by-reference "pointer in
    register points to value in memory" calling sequence to something like copy-in/copy-out, right?

    No, Fortran will pass dope vectors to called subroutines. The
    called subroutine needs to understand the dope vector.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Keith Thompson on Tue Jan 14 23:39:37 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    What I was criticising was the committee(s) which made errno handling
    for functions like sin() and cos() mandatory, and put activating
    it in a global flag.

    It's not mandatory. It's listed as an optional extension, and
    even when implemented, it's opt-in at compile time.

    "The functionality described is optional. The functionality
    described is mandated by the ISO C standard only for implementations
    that define __STDC_IEC_559__."

    I can't find that anywhere in ISO C or POSIX. What exactly are you
    quoting? ISO C doesn't tie math_errhandling to __STDC_IEC_559__.

    https://pubs.opengroup.org/onlinepubs/9799919799/help/codes.html#MX

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Keith Thompson on Tue Jan 14 23:40:43 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    There's no requirement in ISO C or POSIX for an implementation to let
    users affect the value of math_errhandling, at compile time or
    otherwise. (And POSIX isn't directly relevant; this is all defined by
    ISO C. There might be something in POSIX that goes beyond the ISO C
    requirements.)

    gcc has "-f[no-]fast-math" and "-f[no-]math-errno" options that can
    affect the value of math_errhandling.

    Would not that qualify as at "compile time"?

    That's certainly what I meant.
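
    As a minimal, hedged C sketch of what the above means in practice (the
    values printed depend on the implementation and on flags such as gcc's
    -fno-math-errno):

    #include <errno.h>
    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        errno = 0;
        feclearexcept(FE_ALL_EXCEPT);

        volatile double x = -1.0;   /* volatile so the call isn't folded away */
        double r = sqrt(x);         /* domain error */

        /* math_errhandling says which error channels the library uses */
        if (math_errhandling & MATH_ERRNO)
            printf("errno == EDOM: %d\n", errno == EDOM);
        if (math_errhandling & MATH_ERREXCEPT)
            printf("FE_INVALID raised: %d\n", fetestexcept(FE_INVALID) != 0);

        printf("sqrt(-1.0) = %f\n", r);
        return 0;
    }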

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Wed Jan 15 00:47:59 2025
    On Tue, 14 Jan 2025 21:13:40 +0000, Michael S wrote:


    I would guess that today the majority of numerical work is done from
    Python by calling libraries.

    New software, but many of us are still using FEM code from the 1970s.

    That is the problem with floating point software--once developed
    you can continue using it forever.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Wed Jan 15 03:31:47 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time.

    Pass by COMMON block was even faster.

    Sometimes. On machines that don't have direct addressing, such as S/360,
    the code needs to load a pointer to the data either way so it's a wash.

    Even when you do have direct addressing, if code is compiled to be
    position independent, the common block wouldn't be in the same module
    as the code that references it so it still needs to load a pointer
    from the GOT or whatever its equivalent is.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Wed Jan 15 16:50:58 2025
    On Wed, 15 Jan 2025 3:31:47 +0000, John Levine wrote:

    According to MitchAlsup1 <mitchalsup@aol.com>:
    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time.

    Pass by COMMON block was even faster.

    Sometimes. On machines that don't have direct addressing, such as
    S/360,
    the code needs to load a pointer to the data either way so it's a wash.

    Even when you do have direct addressing, if code is compiled to be
    position independent, the common block wouldn't be in the same module
    as the code that references it so it still needs to load a pointer
    from the GOT or whatever its equivalent is.

    Pass by COMMON block allows one to pass hundreds of data values in a
    single call.

    You are treating the common block as if it had but one data container.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Wed Jan 15 22:03:54 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    On Wed, 15 Jan 2025 3:31:47 +0000, John Levine wrote:

    According to MitchAlsup1 <mitchalsup@aol.com>:
    Pass by COMMON block was even faster.

    Sometimes. On machines that don't have direct addressing, such as
    S/360,
    the code needs to load a pointer to the data either way so it's a wash.

    Even when you do have direct addressing, if code is compiled to be
    position independent, the common block wouldn't be in the same module
    as the code that references it so it still needs to load a pointer
    from the GOT or whatever its equivalent is.

    Pass by COMMON block allows one to pass hundreds of data values in a
    single call.

    You are treating the common block as if it had but one data container.

    If I were that kind of programmer, I could use EQUIVALENCE to glue a
    bunch of local variables and arrays together and pass that as a
    subroutine argument. Also remember that on machines without direct
    addressing there's extra code if the size of a block exceeds the offset
    field of an instruction, 12 bits on S/360 and usually 16 on z.

    It's really a matter of taste and programming style more than efficiency.

    R's,
    John

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Thu Jan 16 03:02:44 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 14 Jan 2025 21:48:19 +0000, Michael S wrote:

    On Tue, 14 Jan 2025 19:18:27 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Stephen Fuld wrote:
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess.  I am actually assuming that passing arguments in memory
    is more expensive than passing them in registers.  I don't think
    that is controversial.

    Usually true, except for recursive functions where you have to store
    most stuff on the stack anyway, so going directly there can sometimes
    generate more compact code.

    Terje


    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time.

    One still needs to pass the _values_ of addresses. Doing it in
    registers (assuming that enough are available) is likely to
    be more efficient than storing addresses in memory and
    re-fetching them later. The _relative_ difference between
    passing in registers and passing in memory is smaller, as
    there are memory references to access arguments, but registers
    are likely to be a plus (unless there is excessive spilling and
    the called routine needs to write addresses to memory and load
    them later).

    Pass by COMMON block was even faster.

    I do not think so. In LAPACK-like cases there are array arguments.
    A normal calling convention needs to store and later read parameters
    and pass addresses. COMMON would force copying of entire arrays,
    which is much less efficient than handling parameters.

    In a complicated program there could be many COMMON blocks, leading
    to worse locality than stack use (not relevant for a cacheless
    machine or one with a very big cache, but it could make a difference
    for machines with small caches).

    It would require replacement of natural by-reference "pointer in
    register points to value in memory" calling sequence to something like
    copy-in/copy-out, right?

    No, Fortran will pass dope vectors to called subroutines. The
    called subroutine needs to understand the dope vector.

    I would not say this. AFAIK in Fortran 77 the caller passes enough
    information so that the called routine can construct its own dope
    vector (if desired). IIUC that is very similar to VMTs (variably
    modified types) in C99.

    I think PL/I, Ada, Extended Pascal and probably Fortran 90 use
    dope vectors.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Waldek Hebisch on Thu Jan 16 15:08:37 2025
    On Thu, 16 Jan 2025 3:02:44 +0000, Waldek Hebisch wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:


    Pass by COMMON block was even faster.

    I do not think so. I LAPACK-like cases there are array arguments.
    Normal calling convention needs to store and later read parameters
    and pass addresses. COMMON would force copying of entire arrays,
    much less efficienct than handling parameters.

    SUBROUTINE FOO
    COMMON /ALPHA/ i,j,k,a(100),b(100),c(100,100)

    See: no arguments, everything passed directly via the common block,
    no copying of data, no dope vectors needed.
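
    A rough C analogue of that style, as a sketch only (types and layout
    are illustrative, not how any particular Fortran compiler maps a
    common block): both caller and callee name the same statically
    allocated block, so the call itself passes nothing.

    struct alpha_common {
        int    i, j, k;
        double a[100], b[100], c[100][100];
    };

    struct alpha_common alpha;       /* one shared definition, like COMMON /ALPHA/ */

    void foo(void)                   /* SUBROUTINE FOO: no parameters */
    {
        for (int n = 0; n < alpha.k; n++)
            alpha.b[n] = 2.0 * alpha.a[n];
    }

    void caller(void)
    {
        alpha.k = 100;
        for (int n = 0; n < alpha.k; n++)
            alpha.a[n] = n;
        foo();                       /* no arguments copied, no dope vectors */
    }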

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Thu Jan 16 16:24:38 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 16 Jan 2025 3:02:44 +0000, Waldek Hebisch wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:


    Pass by COMMON block was even faster.

    I do not think so. I LAPACK-like cases there are array arguments.
    Normal calling convention needs to store and later read parameters
    and pass addresses. COMMON would force copying of entire arrays,
    much less efficienct than handling parameters.

    SUBROUTINE FOO
    COMMON /ALPHA/ i,j,k,a(100),b(100),c(100,100)

    See: no arguments, everything passed directly via the common block,
    no copying of data, no dope vectors needed.

    No copy only if there is a single set of arguments. If there are
    different arguments, then one needs to pass them, that is, copy
    them.
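
    In terms of the hypothetical C analogue sketched earlier, the caveat
    looks like this: calling FOO on a *different* array means copying it
    into the shared block first, which is exactly the copy that per-call
    argument passing avoids.

    extern struct alpha_common alpha;   /* the shared block from the earlier sketch */
    void foo(void);

    void call_on(const double *src, int n)
    {
        alpha.k = n;
        for (int m = 0; m < n; m++)
            alpha.a[m] = src[m];        /* copy-in before the "call" */
        foo();
    }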

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Mon Jan 27 17:09:59 2025
    George Neuner <gneuner2@comcast.net> writes:

    On Mon, 6 Jan 2025 20:10:13 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting. These are FP language
    transformations that are analogous to, but potentially more complex
    than, the rewriting of object methods and their call sites to pass the current object in an OO language.

    [The difference between closure conversion and lambda lifting is the
    scope of the transformation: conversion limits code transformations to
    within the defining call chain, whereas lifting pulls the closure to
    top level making it (at least potentially) globally available.]

    In either case the original function is rewritten such that non-local variables can be passed as parameters. The function's code must be
    altered to access the non-locals - either directly as explicit
    individual parameters, or by indexing from a pointer to an environment
    data structure.

    While in a simple case this could look exactly like the OO method transformation, recall that a general closure may require access to
    non-local variables spread through multiple environments. Even if
    whole environments are passed via single pointers, there still may
    need to be multiple parameters added.

    Isn't it the case that access to all of the enclosing environments
    can be provided by passing a single pointer? I'm pretty sure it
    is.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tr.17687@z991.linuxsc.com on Tue Jan 28 22:53:00 2025
    On Mon, 27 Jan 2025 17:09:59 -0800, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Mon, 6 Jan 2025 20:10:13 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting. These are FP language
    transformations that are analogous to, but potentially more complex
    than, the rewriting of object methods and their call sites to pass the
    current object in an OO language.

    [The difference between closure conversion and lambda lifting is the
    scope of the transformation: conversion limits code transformations to
    within the defining call chain, whereas lifting pulls the closure to
    top level making it (at least potentially) globally available.]

    In either case the original function is rewritten such that non-local
    variables can be passed as parameters. The function's code must be
    altered to access the non-locals - either directly as explicit
    individual parameters, or by indexing from a pointer to an environment
    data structure.

    While in a simple case this could look exactly like the OO method
    transformation, recall that a general closure may require access to
    non-local variables spread through multiple environments. Even if
    whole environments are passed via single pointers, there still may
    need to be multiple parameters added.

    Isn't it the case that access to all of the enclosing environments
    can be provided by passing a single pointer? I'm pretty sure it
    is.

    Certainly, if the enclosing environments somehow are chained together.
    In real code though, in many instances such a chain will not already
    exist when the closure is constructed. The compiler would have to
    install pointers to the needed environments (or, alternatively,
    pointers directly to the needed values) into the new closure's
    immediate environment.
    [essentially this creates a private "display" for the closure.]

    Completely doable: it is simply that, if there are enough registers,
    passing the pointers as parameters will tend to be more performant.
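
    A small C sketch of the two shapes discussed above, with made-up names
    (illustrative only): (1) a single environment pointer with environments
    chained through a static link, and (2) a flattened closure whose
    environment holds just the needed pointers, i.e. a private display.

    #include <stdio.h>

    /* (1) one pointer, environments chained through a parent link */
    struct env {
        struct env *parent;   /* enclosing environment */
        int         x;        /* one captured variable per level, for brevity */
    };

    static int chained_body(struct env *e)    /* closure-converted function */
    {
        return e->x + e->parent->x;           /* outer variable via the static link */
    }

    /* (2) flattened closure: the compiler installs the needed pointers directly */
    struct flat_closure {
        int *inner_x;
        int *outer_x;
    };

    static int flat_body(struct flat_closure *c)
    {
        return *c->inner_x + *c->outer_x;     /* no chain walking */
    }

    int main(void)
    {
        struct env outer = { NULL, 10 };
        struct env inner = { &outer, 32 };
        printf("%d\n", chained_body(&inner));     /* 42 */

        struct flat_closure c = { &inner.x, &outer.x };
        printf("%d\n", flat_body(&c));            /* 42 */
        return 0;
    }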

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)