• Calling conventions (particularly 32-bit ARM)

    From David Brown@21:1/5 to All on Mon Jan 6 14:57:51 2025
    I'm trying to understand the reasoning behind some of the calling
    conventions used with 32-bit ARM. I work primarily with small embedded systems, so the efficiency of code on 32-bit Cortex-M devices is very
    important to me - good calling conventions make a big difference.

    No doubt most people here know this already, but in summary these
    devices are a 32-bit load/store RISC architecture with 16 registers.
    R0-R3 and R12 are scratch/volatile registers, R4-R11 are preserved
    registers, R13 is the stack pointer, R14 is the link register and R15 is
    the program counter. For most Cortex-M cores, there is no
    super-scaling, out-of-order execution, speculative execution, etc., but instructions are pipelined.

    The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as
    32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.
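
    As a concrete sketch of the asymmetry (following the rules described
    above - the 64-bit scalar comes back in R0:R1, while the equivalent
    two-word struct does not):

        #include <stdint.h>

        typedef struct { uint32_t lo; uint32_t hi; } pair32;

        /* Scalar 64-bit result: returned in R0:R1. */
        uint64_t as_u64(uint32_t lo, uint32_t hi)
        {
            return ((uint64_t)hi << 32) | lo;
        }

        /* The same two words as a struct: a composite larger than 32 bits,
           so the caller allocates a slot and passes its address in R0. */
        pair32 as_pair(uint32_t lo, uint32_t hi)
        {
            return (pair32){ lo, hi };
        }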

    Is there any good reason why the ABI is designed with such limited
    register usage for returns? Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a
    /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument passing?

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than
    using C-style error codes or passing manual pointers to return value
    slots. But the limited return registers add significant overhead to
    small functions.
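
    In C terms, the pattern amounts to a small status-plus-value struct
    like the sketch below; under the convention above, this 8-byte result
    has to come back through caller-allocated memory rather than R0:R1.

        #include <stdbool.h>
        #include <stdint.h>

        /* A C analogue of the optional/expected pattern: a flag plus a
           value, returned together as one small struct. */
        typedef struct {
            bool     ok;
            uint32_t value;
        } maybe_u32;

        maybe_u32 parse_digit(char c)
        {
            if (c >= '0' && c <= '9')
                return (maybe_u32){ true, (uint32_t)(c - '0') };
            return (maybe_u32){ false, 0 };
        }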


    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?


    Thanks for any pointers or explanations here.

  • From Theo@21:1/5 to David Brown on Mon Jan 6 15:23:40 2025
    David Brown <david.brown@hesbynett.no> wrote:
    The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as 32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    According to EABI, it's also possible to return a 128-bit vector in R0-3: https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst#result-return

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns? Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument passing?

    The 'composite type' return value, where a pointer is passed in as the first argument to the function and a struct at that pointer is filled in with the return values, has existed since the first ARM ABI - APCS-R: http://www.riscos.com/support/developers/dde/appf.html

    That dates from the mid 1980s before 'modern compilers', and I'm guessing
    that has stuck around. A lot of early ARM code was in assembler. The
    original ARMCC was good but fairly basic - GCC didn't support ARM until
    about 1993.

    [*] technically APCS-R was the second ARM ABI, APCS-A was the first: https://heyrick.eu/assembler/apcsintro.html
    but I don't think return value handling was any different.

    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?

    Probably the latter. Also that AArch64 was an opportunity to throw all this stuff away and start again, with a much richer calling convention: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#result-return

    but obviously that's no help to the microcontroller folks. At this stage, a change of calling convention might be a fairly big ask.

    Theo

  • From Anton Ertl@21:1/5 to David Brown on Mon Jan 6 15:32:04 2025
    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Most calling conventions on RISCs are oriented towards C (if you want
    calling conventions that try to be more cross-language (and slower),
    look at VAX) and its properties and limitations at the time when the
    calling convention was designed, in particular, the PCC
    implementation, which was the de-facto standard Unix C compiler at the
    time. C compilers in the 1980s did not allocate structs to registers,
    so passing structs in registers was foreign to them, so the solution
    is that the caller passes the target struct as an additional
    parameter.

    And passing the return value in registers might not have saved
    anything on a compiler that does not deal with structs in registers.
    E.g., if you have

    mystruct = myfunc(arg1, arg2);

    you would see stores to mystruct behind the call. With the PCC
    calling convention, the same stores would happen in the callee
    (possibly resulting in smaller code if there are several calls to
    myfunc()).
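
    Roughly, the lowering being described looks like this (a sketch with
    invented names; the hidden result pointer becomes an extra first
    parameter and the result stores end up in the callee):

        struct S { long a, b, c; };

        /* mystruct = myfunc(arg1, arg2); is lowered roughly to a call of: */
        void myfunc_lowered(struct S *result, long arg1, long arg2)
        {
            /* the stores go through the result pointer, in the callee */
            result->a = arg1;
            result->b = arg2;
            result->c = arg1 + arg2;
        }

        void caller(long arg1, long arg2)
        {
            struct S mystruct;
            myfunc_lowered(&mystruct, arg1, arg2);
            (void)mystruct;
        }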

    I wonder, though, how things look for

    mystruct = foo(&mystruct);

    Does PCC perform the return stores to mystruct only after performing
    all other memory accesses in foo? Probably yes, anything else would
    complicate the compiler. In that case the caller could pass &mystruct
    for the return value (a slight complication). But is that restriction reflected in the calling convention?
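
    The restriction in question can be made concrete with a sketch like
    this (invented field names):

        struct S2 { long a, b; };

        struct S2 foo(const struct S2 *p)
        {
            struct S2 r;
            r.a = p->b;     /* reads the argument first ...        */
            r.b = p->a;
            return r;       /* ... and stores the result only here */
        }

        /* For  mystruct = foo(&mystruct);  the caller may only pass
           &mystruct as the hidden result pointer if foo is guaranteed to
           finish reading *p before it writes the result; otherwise the
           argument would be clobbered mid-computation. */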

    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    gcc has an option -freg-struct-return, which does what you want. Of
    course, if you use this option on ARM A32/T32, you are not following
    the calling convention, so you should only use it when all sides of a
    struct return are compiled with that option.

    Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
    of memory.

    ARM A32 is from 1985, and its calling convention is probably not much
    younger.

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions. IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers. Not sure why they were so
    reluctant to use more registers earlier.

    In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than using C-style error codes or passing manual pointers to return value
    slots.

    The ARM calling convention is certainly much older than "modern C++ programming".

    But the limited return registers add significant overhead to
    small functions.

    C++ programmers think they know what C programming is about (and
    unfortunately they dominate not just C++ compiler writers, but they
    also damage C compilers while they are at it), so my sympathy for your
    problem is very limited.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to David Brown on Mon Jan 6 20:10:13 2025
    On Mon, 6 Jan 2025 13:57:51 +0000, David Brown wrote:

    I'm trying to understand the reasoning behind some of the calling
    conventions used with 32-bit ARM. I work primarily with small embedded systems, so the efficiency of code on 32-bit Cortex-M devices is very important to me - good calling conventions make a big difference.

    No doubt most people here know this already, but in summary these
    devices are a 32-bit load/store RISC architecture with 16 registers.
    R0-R3 and R12 are scratch/volatile registers, R4-R11 are preserved
    registers, R13 is the stack pointer, R14 is the link register and R15 is
    the program counter. For most Cortex-M cores, there is no
    super-scaling,
    SuperScalar
    out-of-order execution, speculative execution, etc., but instructions are pipelined.

    The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as 32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    Someone above mentioned a trick to pass back a 128-bit value.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    I have seen subroutines that returned structures where the point
    in the subroutine that puts values in the returned structure is
    such that putting the structure in registers is less efficient
    than returning the struct in memory--it all depends on how
    the struct is laid out in memory. Doing the struct field
    assignments in the middle of the subroutine (long path to return)
    is often enough to sway which is more efficient.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Vogue in 1980 was to have 1 result passed back from subroutines.

    Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values.

    My 66000 can pass up to 8 registers back as an aggregate result.

    Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument
    passing?

    In My 66000 ABI they can and do.

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, an FP, and an SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered; I just need a good reason to change them.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than using C-style error codes or passing manual pointers to return value
    slots. But the limited return registers add significant overhead to
    small functions.

    C++ also has:
    the try-throw-catch exception model, which requires new-and-fun stuff
    to be thrown onto the stack
    constructors and destructors
    new
    atomic stuff

    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?

    At the time, there were good technical rationales--which may have
    faded in importance as the years have gone by.


    Thanks for any pointers or explanations here.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jan 6 20:19:15 2025
    On Mon, 6 Jan 2025 15:32:04 +0000, Anton Ertl wrote:

    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Most calling conventions on RISCs are oriented towards C (if you want
    calling conventions that try to be more cross-language (and slower),
    look at VAX) and its properties and limitations at the time when the
    calling convention was designed, in particular, the PCC
    implementation, which was the de-facto standard Unix C compiler at the
    time. C compilers in the 1980s did not allocate structs to registers,
    so passing structs in registers was foreign to them, so the solution
    is that the caller passes the target struct as an additional
    parameter.

    And passing the return value in registers might not have saved
    anything on a compiler that does not deal with structs in registers.
    E.g., if you have

    mystruct = myfunc(arg1, arg2);

    you would see stores to mystruct behind the call. With the PCC
    calling convention, the same stores would happen in the callee
    (possibly resulting in smaller code if there are several calls to
    myfunc()).

    I wonder, though, how things look for

    mystruct = foo(&mystruct);

    Does PCC perform the return stores to mystruct only after performing
    all other memory accesses in foo? Probably yes, anything else would complicate the compiler. In that case the caller could pass &mystruct
    for the return value (a slight complication). But is that restriction reflected in the calling convention?

    For VERY MANY circumstances passing a struct by address is more
    efficient than passing it by value, AND especially when the
    compiler does not optimize heavily.

    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    In addition, the programmer has the choice of changing into pointer
    form (&struct) from value form (struct) which is what we learned
    was better style way back then.

    --------------------------

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions. IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers. Not sure why they were so
    reluctant to use more registers earlier.

    Compiler people were telling us that more callee saved registers would
    be higher performing than more argument registers. It did not turn out
    to be that way.

    Oh and BTW, lack of argument registers leads to an increased
    desire for the linker to perform inline folding. ...



    - anton

  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Tue Jan 7 02:11:45 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there were more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi-randomly
    into Lapack. The first routine that I looked at had 8 arguments,
    so it is within your limit. The second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    A large number of arguments is typical in old-style Fortran numeric
    code. It also appears in functional-style code, where to get
    around the lack of destructive modification one frequently has to
    double the number of arguments. Another source is closures: when
    looking at the source, captured values are not visible as arguments,
    but the implementation has to pass them behind the scenes.

    More generally, large numbers of arguments tend to appear in
    hand-optimized code, where they may lead to faster code than
    using structures in memory. In C, structures in memory are
    not that expensive, so the scope for gain is limited, but several
    languages dynamically allocate all structures (and pass them
    via address). In such cases avoiding dynamic allocation can
    give a substantial gain. Programmers now are much less
    inclined to do micro-optimizations of this sort, but they may
    appear in machine-generated sources.

    --
    Waldek Hebisch

  • From Lawrence D'Oliveiro@21:1/5 to Waldek Hebisch on Tue Jan 7 06:53:44 2025
    On Tue, 7 Jan 2025 02:11:45 -0000 (UTC), Waldek Hebisch wrote:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    That kind of thing just cries out for passing arguments by keyword.

  • From David Brown@21:1/5 to Anton Ertl on Tue Jan 7 09:49:16 2025
    On 06/01/2025 16:32, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Most calling conventions on RISCs are oriented towards C (if you want
    calling conventions that try to be more cross-language (and slower),
    look at VAX) and its properties and limitations at the time when the
    calling convention was designed, in particular, the PCC
    implementation, which was the de-facto standard Unix C compiler at the
    time. C compilers in the 1980s did not allocate structs to registers,
    so passing structs in registers was foreign to them, so the solution
    is that the caller passes the target struct as an additional
    parameter.

    And passing the return value in registers might not have saved
    anything on a compiler that does not deal with structs in registers.

    Agreed.

    This is all as I suspected - but it's nice to have it confirmed by others.

    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.


    I use struct returns sometimes in my C code, but they are (naturally
    enough) a far smaller proportion of return types than in C++ code.

    gcc has an option -freg-struct-return, which does what you want. Of
    course, if you use this option on ARM A32/T32, you are not following
    the calling convention, so you should only use it when all sides of a
    struct return are compiled with that option.


    I know about the -freg-struct-return option (and the requirements for
    using it), but it only has an effect for 32-bit x86 as far as I know. It certainly makes no difference for 32-bit ARM/Thumb. (clang specifically
    says it does not support that option for 32-bit ARM/Thumb.) I think
    part of this is that the calling convention already returns structs in registers - just as long as the struct fits in the single 32-bit register.
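
    For example (a sketch, following the rule just described):

        #include <stdint.h>

        struct s4 { uint16_t lo, hi; };  /* 4 bytes: fits in R0, returned in a register */
        struct s8 { uint32_t lo, hi; };  /* 8 bytes: returned via caller-allocated memory */

        struct s4 make_s4(void) { return (struct s4){ 1, 2 }; }
        struct s8 make_s8(void) { return (struct s8){ 1, 2 }; }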

    Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a
    /long/ time since they insisted that structs occupied a contiguous block
    of memory.

    ARM A32 is from 1985, and its calling convention is probably not much younger.


    I first used ARM assembly in the late 1980s, but that was mixed BBC
    BASIC and assembly, all with almost no documentation, so I don't know
    what calling conventions there were at that time. (But the Acorn
    Archimedes was /really/ cool :-) )

    I also think code would be a bit more efficient if there were more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions. IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers. Not sure why they were so
    reluctant to use more registers earlier.


    Passing all parameters on the stack and returning a single int in a
    register was a perfect fit for old-style C where functions were often
    used without declarations. It would certainly be a lot easier for
    variadic functions. But once you start passing some parameters in
    registers, it seems strange to use so few. Perhaps it was to make life
    easier for earlier compiler writers? Things like lifetime analysis and register allocation algorithms were not as sophisticated as they are now
    - it used to be that if a variable used a register (via the C "register" qualifier), the register was dedicated to the variable throughout the
    function. Too many registers for parameter passing might have left too
    few registers for function implementation, or at least made the compiler
    more complex.

    In more modern C++ programming, it's very practical to use types like
    std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than
    using C-style error codes or passing manual pointers to return value
    slots.

    The ARM calling convention is certainly much older than "modern C++ programming".


    Yes.

    But the limited return registers add significant overhead to
    small functions.

    C++ programmers think they know what C programming is about (and unfortunately they dominate not just C++ compiler writers, but they
    also damage C compilers while they are at it), so my sympathy for your problem is very limited.


    I program in C and C++, and in the past did a lot of assembly (mostly on
    8-bit or 16-bit microcontrollers). I am fully aware that C and C++ are different languages, and I write code in different styles for each.

    For this issue, improving the calling convention would make the biggest difference for C++, but would also be a positive benefit for C.

  • From David Brown@21:1/5 to Theo on Tue Jan 7 09:22:15 2025
    On 06/01/2025 16:23, Theo wrote:
    David Brown <david.brown@hesbynett.no> wrote:
    The big problem I see is the registers used for returning values from
    functions. R0-R3 can all be used for passing arguments to functions, as
    32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    According to EABI, it's also possible to return a 128 bit vector in R0-3: https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst#result-return

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns? Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a
    /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument passing?

    The 'composite type' return value, where a pointer is passed in as the first argument to the function and a struct at that pointer is filled in with the return values, has existed since the first ARM ABI - APCS-R: http://www.riscos.com/support/developers/dde/appf.html

    That dates from the mid 1980s before 'modern compilers', and I'm guessing that has stuck around. A lot of early ARM code was in assembler. The original ARMCC was good but fairly basic - GCC didn't support ARM until
    about 1993.

    [*] technically APCS-R was the second ARM ABI, APCS-A was the first: https://heyrick.eu/assembler/apcsintro.html
    but I don't think return value handling was any different.

    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?

    Probably the latter.

    It certainly seems that way to me. But there was always the possibility
    that there were technical reasons that I had not thought of.

    Also that AArch64 was an opportunity to throw all this
    stuff away and start again, with a much richer calling convention: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#result-return

    but obviously that's no help to the microcontroller folks. At this stage, a change of calling convention might be a fairly big ask.


    Actually, I disagree on that one. In the microcontroller world,
    changing calling conventions should not be nearly as difficult as it
    would be on hosted systems because you are rarely dealing with
    pre-compiled object code. And there are already many variations on
    calling conventions for 32-bit ARM devices - for thumb or ARM code, and
    for all the different combinations of floating point registers which may
    or may not be used.

    The pre-compiled object code you always have is basic C libraries and
    compiler support libraries (things like software floating point
    routines). For a typical 32-bit embedded gcc ARM toolchain there are
    already 30+ builds for libraries for all the different variants of the architecture and calling conventions - a few more won't be a problem.

    Then there are some RTOSes and other commercial libraries that are only available in binary form. Most of these are written in crappy ancient
    C90 - they won't return structs or other bigger data anyway, and would thus be unaffected by such changes. And it would not be difficult for these
    suppliers to re-compile with new options either.

  • From David Brown@21:1/5 to All on Tue Jan 7 10:09:20 2025
    On 06/01/2025 21:19, MitchAlsup1 wrote:
    On Mon, 6 Jan 2025 15:32:04 +0000, Anton Ertl wrote:

    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1.  If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs that are made up of two 32-bit parts.



    I wonder, though, how things look for

    mystruct = foo(&mystruct);

    Does PCC perform the return stores to mystruct only after performing
    all other memory accesses in foo?  Probably yes, anything else would
    complicate the compiler.  In that case the caller could pass &mystruct
    for the return value (a slight complication).  But is that restriction
    reflected in the calling convention?

    For VERY MANY circumstances passing a struct by address is more
    efficient than passing it by value, AND especially when the
    compiler does not optimize heavily.

    For /some/ circumstances it is certainly true that passing by reference
    (or by pointer, or by hidden pointer on the stack) is more efficient, especially for larger aggregates. For others - especially smaller
    aggregates - using registers is vastly more efficient.

    Both C and C++ provide perfectly good ways to pass data around by
    address when that's what you want to do. My problem is that the calling convention won't let me pass around data in registers when I want to do
    that.

    I don't care what the compiler does when not optimising heavily - or for compilers that can't optimise heavily. When I am looking for efficient
    code, I use optimisation - caring about inefficiencies in the calling convention without heavy optimisation is like caring about how fast your
    car goes when you keep it in first gear.


    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    In addition, the programmer has the choice of changing into pointer
    form (&struct) from value form (struct) which is what we learned
    was better style way back then.


    I already know when it is best to pass a struct via a pointer, and when
    it is best to pass it as a struct value. (The 32-bit ARM calling
    convention happily uses registers to pass structs by value, using up to
    4 registers. It's the return via registers that is missing.) I also
    know when it is best for a struct return to be via an address or in
    registers - but C has no way to let me choose that.
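
    That is, the two spellings available today look like this (a sketch;
    which way the by-value return actually travels is fixed by the
    calling convention, not by anything in the source):

        struct pair { unsigned value; unsigned status; };

        /* Return by value: the ABI decides whether this comes back in
           registers or through a hidden result pointer. */
        struct pair read_pair_value(void)
        {
            return (struct pair){ 42u, 0u };
        }

        /* Out-parameter: always goes through the address the caller
           supplies. */
        void read_pair_out(struct pair *out)
        {
            out->value  = 42u;
            out->status = 0u;
        }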

    --------------------------

    I also think code would be a bit more efficient if there were more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions.  IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers.  Not sure why they were so
    reluctant to use more registers earlier.

    Compiler people were telling us that more callee saved registers would
    be higher performing than more argument registers. It did not turn out
    to be that way.


    The trouble with that kind of thing is that people write different kinds
    of code. The balance that works best for - say - PC desktop application programming is not necessarily the best for small-systems embedded
    programming. And the balance that works best for C is not necessarily
    the best for C++, or Rust, or D, or OCaml or any other language.

    I am not looking for perfection here - I don't think such a thing as a "perfect" calling convention could exist. I am just looking for an
    obvious improvement that would help in many languages and for a lot of
    code, with zero cost for code that doesn't need it - or for some good
    technical reason why it /would/ be costly.

    Oh and BTW, lack of argument registers leads to an increased
    desire for the linker to perform inline folding. ...


    Certainly a way out of this is to look to link-time optimisation and
    more inline code. But that leads to a lot of additional issues.

  • From George Neuner@21:1/5 to All on Tue Jan 7 16:52:27 2025
    On Mon, 6 Jan 2025 20:10:13 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting. These are FP language
    transformations that are analogous to, but potentially more complex
    than, the rewriting of object methods and their call sites to pass the
    current object in an OO language.

    [The difference between closure conversion and lambda lifting is the
    scope of the tranformation: conversion limits code transformations to
    within the defining call chain, whereas lifting pulls the closure to
    top level making it (at least potentially) globally available.]

    In either case the original function is rewritten such that non-local
    variables can be passed as parameters. The function's code must be
    altered to access the non-locals - either directly as explicit
    individual parameters, or by indexing from a pointer to an environment
    data structure.

    While in a simple case this could look exactly like the OO method transformation, recall that a general closure may require access to
    non-local variables spread through multiple environments. Even if
    whole environments are passed via single pointers, there still may
    need to be multiple parameters added.

    Where exactly the line is drawn between passing individual variables
    from an environment vs passing the whole environment is a heuristic that
    is tied to the CPU's argument passing convention.
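
    A small C sketch of the two rewritten shapes mentioned above -
    explicit individual parameters versus an environment pointer (the
    function and the captured variables are invented for illustration):

        /* Original (pseudo-code): inner(z) captures x and y from outer(). */

        /* Rewrite 1: the captured variables become explicit parameters. */
        static long inner_params(long x, long y, long z)
        {
            return x + y + z;
        }

        /* Rewrite 2: the captured variables travel in an environment
           record, added as one extra pointer parameter. */
        struct env { long x, y; };

        static long inner_envptr(const struct env *e, long z)
        {
            return e->x + e->y + z;
        }

        long outer(long x, long y)
        {
            struct env e = { x, y };
            return inner_params(x, y, 1) + inner_envptr(&e, 2);
        }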

    YMMV.

  • From MitchAlsup1@21:1/5 to David Brown on Tue Jan 7 23:23:28 2025
    On Tue, 7 Jan 2025 9:09:20 +0000, David Brown wrote:

    On 06/01/2025 21:19, MitchAlsup1 wrote:
    ------------------------
    Both C and C++ provide perfectly good ways to pass data around by
    address when that's what you want to do. My problem is that the calling convention won't let me pass around data in registers when I want to do
    that.

    I don't care what the compiler does when not optimising heavily - or for compilers that can't optimise heavily. When I am looking for efficient
    code, I use optimisation - caring about inefficiencies in the calling convention without heavy optimisation is like caring about how fast your
    car goes when you keep it in first gear.


    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    In addition, the programmer has the choice of changing into pointer
    form (&struct) from value form (struct) which is what we learned
    was better style way back then.


    I already know when it is best to pass a struct via a pointer, and when
    it is best to pass it as a struct value. (The 32-bit ARM calling
    convention happily uses registers to pass structs by value, using up to
    4 registers. It's the return via registers that is missing.) I also
    know when it is best for a struct return to be via an address or in
    registers - but C has no way to let me choose that.

    My 66000 ABI passes structs up to 8 doublewords in size as
    arguments and as results.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jan 7 23:35:31 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 7 Jan 2025 9:09:20 +0000, David Brown wrote:

    On 06/01/2025 21:19, MitchAlsup1 wrote:
    ------------------------
    Both C and C++ provide perfectly good ways to pass data around by
    address when that's what you want to do. My problem is that the calling
    convention won't let me pass around data in registers when I want to do
    that.

    I don't care what the compiler does when not optimising heavily - or for
    compilers that can't optimise heavily. When I am looking for efficient
    code, I use optimisation - caring about inefficiencies in the calling
    convention without heavy optimisation is like caring about how fast your
    car goes when you keep it in first gear.


    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    In addition, the programmer has the choice of changing into pointer
    form (&struct) from value form (struct) which is what we learned
    was better style way back then.


    I already know when it is best to pass a struct via a pointer, and when
    it is best to pass it as a struct value. (The 32-bit ARM calling
    convention happily uses registers to pass structs by value, using up to
    4 registers. It's the return via registers that is missing.) I also
    know when it is best for a struct return to be via an address or in
    registers - but C has no way to let me choose that.

    My 66000 ABI passes structs up to 8 doublewords in size as
    arguments and as results.

    What is a doubleword in your architecture? In Intel vernacular
    it's 32 bits, but that's not universal.

    Both x86_64 and ARM64 support passing eight 64-bit quantities
    as arguments and as results architecturally without using
    the SIMD registers.

    Now, ABI conventions may be otherwise, but they're important
    for interoperability, not basic functionality.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jan 8 01:38:00 2025
    On Tue, 7 Jan 2025 23:35:31 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:

    My 66000 ABI passes structs up to 8 doublewords in size as
    arguments and as results.

    What is a doubleword in your architecture? In intel vernacular
    it's 32-bits, but that's not universal.

    Intel is wrong; IBM defined the term before Intel existed
    (1963 or earlier).

    Byte   8 bits
    Half   16 bits
    Word   32 bits
    DW     64 bits
    QW     128 bits
    OW     256 bits
    Line   512 bits

    Oh, and BTW:: DEI stands for Dale Earnhardt Incorporated...

    Both x86_64 and ARM64 support passing eight 64-bit quantities
    as arguments and as results architecturally without using
    the SIMD registers.

    Now, ABI conventions may be otherwise, but they're important
    for interoperability, not basic functionality.

    Done wrong (or weak) they add overhead.

  • From Stefan Monnier@21:1/5 to All on Wed Jan 8 12:20:51 2025
    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.
    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting.

    AFAIK in these cases the same compiler generates the code for the
    function and for the calls, so it should be pretty much free to use any
    calling convention it likes.


    Stefan

  • From Stefan Monnier@21:1/5 to All on Wed Jan 8 12:34:30 2025
    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    For languages where the type system ensures that the max number of
    arguments is known (and the same) when compiling the function and when compiling the calls to it, you could adjust the number of caller-saved
    argument registers according to the actual number of arguments of the
    function, thus making it "cheap" to allow, say, 13 argument registers
    for those functions that take 13 arguments, since it doesn't impact the
    other functions.

    But in any case, I suspect there are also diminishing returns at some
    point: how much faster is it in practice to pass/return 13 values in
    registers instead of 8 of them in registers and the remaining 5 on
    the stack? I expect a 13-arg function to perform an amount
    of work that will dwarf the extra work of going through the stack.


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Wed Jan 8 20:19:40 2025
    On Wed, 8 Jan 2025 17:34:30 +0000, Stefan Monnier wrote:

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    For languages where the type system ensures that the max number of
    arguments is known (and the same) when compiling the function and when compiling the calls to it, you could adjust the number of caller-saved argument registers according to the actual number of arguments of the function, thus making it "cheap" to allow, say, 13 argument registers
    for those functions that take 13 arguments, since it doesn't impact the
    other functions.

    The counter argument is that there are too few subroutines wanting
    this amount of register argument passing. So, even if you allowed
    for this, it probably does not show up on the bottom line.

    But in any case, I suspect there are also diminishing returns at some
    point: how much faster is it in practice to pass/return 13 values in registers instead of 8 of them in registers and the remaining 5 on
    the stack? I expect a 13-arg function to perform an amount
    of work that will dwarf the extra work of going through the stack.

    Then there is the issue of what is IN the structure passed in
    registers??

    If it is a series of bytes, then it is better passed by reference
    so the bytes can be LDed (1 instruction) rather than extracted
    (2 instructions in most ISAs); or STed (1 instruction) rather
    than inserted (3 instructions in most ISAs).

    If, instead, the structure is comprised of bit-fields, then it is
    almost always wise to pass in registers--since extraction and
    insertion are always reg->reg.
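
    A small C illustration of the two cases (a sketch only; the exact
    instruction counts depend on the ISA and ABI, as noted above):

        struct bytes8 { unsigned char b[8]; };              /* a series of bytes */
        struct fields { unsigned a : 5, b : 11, c : 16; };  /* packed bit-fields */

        /* By reference, picking out one byte is a single load. */
        unsigned byte_by_ref(const struct bytes8 *s) { return s->b[3]; }

        /* By value in registers, the same byte needs a shift-and-mask
           extract from the register holding that part of the struct. */
        unsigned byte_by_val(struct bytes8 s) { return s.b[3]; }

        /* Bit-fields need extract/insert in either case, so keeping the
           struct in registers avoids the extra loads and stores. */
        unsigned field_by_val(struct fields f) { return f.b; }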

    Also note: if the structure is written deep within the subroutine,
    many (many) instructions before the return, then it is often wiser
    to perform those stores into a memory area and reload just prior
    to return.



    Stefan

  • From Anton Ertl@21:1/5 to Stefan Monnier on Wed Jan 8 22:08:46 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    For languages where the type system ensures that the max number of
    arguments is known (and the same) when compiling the function and when compiling the calls to it, you could adjust the number of caller-saved argument registers according to the actual number of arguments of the function, thus making it "cheap" to allow, say, 13 argument registers
    for those functions that take 13 arguments, since it doesn't impact the
    other functions.

    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    Language-private calling conventions can be a good idea, but then, if
    you want to call C code (or be called by C code), you need to handle
    ABI calling conventions in addition.

    But in any case, I suspect there are also diminishing returns at some
    point: how much faster is it in practice to pass/return 13 values in registers instead of 8 of them in registers and the remaining 5 on
    the stack? I expect a 13-arg function to perform an amount
    of work that will dwarf the extra work of going through the stack.

    I certainly have a use for as many arguments as the ABI provides, for
    functions that typically contain only a few payload instructions: You
    can implement a direct-threaded VM interpreter using tail-call
    optimization, along the lines of

    void add(VMinst *ip, long *sp, long sp_top)
    {
        /* payload start */
        sp_top += *sp++;
        /* payload end */
        /* invoke the next VM instruction */
        (*ip)(ip+1,sp,sp_top);
    }
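
    The fragment above relies on a self-referential function-pointer
    type, so a compilable version needs a small workaround; here is one
    possible self-contained sketch (the union-based instruction cell and
    the lit/done handlers are additions for illustration):

        #include <stdio.h>

        typedef union vminst VMinst;
        typedef void (*VMfn)(VMinst *ip, long *sp, long sp_top);
        union vminst { VMfn fn; long imm; };  /* a cell is a handler or an operand */

        static void lit(VMinst *ip, long *sp, long sp_top)
        {
            *--sp = sp_top;               /* push the old top of stack      */
            sp_top = ip->imm;             /* the operand follows the opcode */
            ip++;
            ip->fn(ip + 1, sp, sp_top);   /* dispatch the next instruction  */
        }

        static void add(VMinst *ip, long *sp, long sp_top)
        {
            sp_top += *sp++;              /* payload                        */
            ip->fn(ip + 1, sp, sp_top);   /* invoke the next VM instruction */
        }

        static void done(VMinst *ip, long *sp, long sp_top)
        {
            (void)ip; (void)sp;
            printf("%ld\n", sp_top);
        }

        int main(void)
        {
            long stack[16];
            VMinst prog[] = { {.fn = lit}, {.imm = 2}, {.fn = lit}, {.imm = 3},
                              {.fn = add}, {.fn = done} };
            prog[0].fn(prog + 1, stack + 16, 0);   /* prints 5 */
            return 0;
        }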

    30 years ago gcc could not tail-call-optimize this; in the meantime it
    can (and clang can do it, too). However, typical VMs have more than
    just these three VM registers (Gforth has ip, sp, rp, fp, lp, up,
    fp_top (usually mapped to a real-machine FP register) and registers
    for as many sp stack items as practical; we intend to cache rp_top in
    a register, too), and ideally you can pass them all as arguments; so
    we could make good use of 10+ arguments. If there are not enough
    arguments in registers, you have to use explicit register vars (a GNU
    C extension) in addition, but that is more architecture-specific.
    Some preliminary testing on AMD64 resulted in gcc apparently
    supporting a lot of explicit registers on AMD64, and clang/LLVM only
    one.
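
    The explicit register variables meant here look roughly like this (a
    sketch of the GNU C global register variable extension; the register
    name is illustrative and has to be legal and unreserved on the
    target):

        /* Pin a VM register to a hardware register for the whole
           translation unit (GNU C extension; x86-64 register name shown). */
        register long *vm_sp asm("r13");

        long vm_pop(void)
        {
            return *vm_sp++;
        }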

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Stefan Monnier@21:1/5 to All on Wed Jan 8 18:20:43 2025
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs
    obviously, mismatched arg numbers less so), but I think the focus of optimization of the ABI should be calls to functions known to take the
    exact same number of arguments (after all, even in C we normally know
    the prototype of the called function; only sloppy ancient C calls
    functions without proper declarations), even if it comes at the cost of
    using different calling conventions for the two cases.

    But in any case, I suspect there are also diminishing returns at some point: how much faster is it in practice to pass/return 13 values in registers instead of 8 of them in registers and the remaining 5 on
    the stack? I expect a 13-arg function to perform an amount
    of work that will dwarf the extra work of going through the stack.
    I certainly have a use for as many arguments as the ABI provides,

    Ah, yes, machine-generated code can always defy intuitions about what
    is "typical". 🙂


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Jan 9 00:11:08 2025
    On Wed, 8 Jan 2025 23:20:43 +0000, Stefan Monnier wrote:

    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    My 66000 ABI was designed for C, but is compatible with Fortran and
    C++ {and I suspect most languages--under the assumption that those
    languages have to clean up their own messes*}.

    (*) C++ has to drop "stuff" on the stack so that it can properly
    deallocate new structures when Try-Throw-Catch is performing walk
    backs, and to utilize that "stack stuff" when searching for the
    right exception block.

    When C calls Fortran and Fortran is expecting an array, C has
    to build the dope vector used by Fortran in accessing said array.
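
    As a rough illustration of what such a dope vector carries (the
    layout below is invented; real descriptors are specific to the
    Fortran implementation being called):

        /* Hypothetical array descriptor ("dope vector") the C side would
           have to build before the call. */
        struct dope_vector {
            void *base;                              /* first element        */
            long  elem_size;                         /* element size, bytes  */
            int   rank;                              /* number of dimensions */
            struct { long lower, extent, stride; } dim[7];
        };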

    Any calling convention is pressed on both sides--more argument registers
    and more callee-saved registers--but the number of registers is fixed.

    I can agree that it's important to support those use-cases (varargs obviously, mismatched arg numbers less so), but I think the focus of optimization of the ABI should be calls to functions known to take the
    exact same number of arguments (after all, even in C we normally know
    the prototype of the called function; only sloppy ancient C calls
    functions without proper declarations), even if it comes at the cost of
    using different calling conventions for the two cases.

    In My 66000 ABI varargs takes one more prologue instruction than
    a non-varargs subroutine and creates a vector of DW arguments
    which can be picked off with va_list = SP; va_start = 0,
    and va_arg(va_list,arg) = LD Rd,[va_list,Rarg<<3];

    One of the key reasons to have a unified register model.

    But in any case, I suspect there are also diminishing returns at some point: how much faster is it in practice to pass/return 13 values in registers instead of 8 of them in registers and the remaining 5 on
    the stack?

    Back when we looked at this in the mid-1990s, using more registers for
    arguments (than the 8 we were using) was "well down" the list of
    low-hanging fruit.

  • From Anton Ertl@21:1/5 to Stefan Monnier on Thu Jan 9 08:38:32 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    AFAIK in these cases the same compiler generates the code for the
    function and for the calls, so it should be pretty much free to use any calling convention it likes.

    With separate compilation, the compiler does not know which other
    compiler generates the code for the caller of a function or the callee
    of a function. ABI calling conventions exist in order to make code from different compilers (whether for the same language or a different one) interoperable.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to monnier@iro.umontreal.ca on Thu Jan 9 07:23:57 2025
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    only sloppy ancient C calls
    functions without proper declarations)

    You find it ok to design a calling convention such that ancient C
    programs do not work?

    What benefit do you expect from such a calling convention? To allow
    registers to be used as arguments (and not callee-saved) that would
    otherwise preferably be used as callee-saved registers?

    However, I wonder why, e.g., RISC-V does not allow the use of all
    caller-saved registers as arguments. In addition to the 8 argument
    registers (a0-a7=x10-x17), RISC-V has 7 additional caller-saved
    registers: t0-t6(=x5-x7,x28-x31); for FP registers it's even more
    extreme: 8 argument registers fa0-fa7=f10-f17, and 12 additional
    caller-saved registers ft0-ft11=f0-f7,f28-f31.

    even if it comes at the cost of
    using different calling conventions for the two cases.

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture. Looking at <https://pdos.csail.mit.edu/6.828/2023/readings/riscv-calling.pdf>,
    the RISC-V people find that acceptable:

    |If argument i < 8 is a floating-point type, it is passed in
    |floating-point register fai; [...] Additionally, floating-point
    |arguments to variadic functions (except those that are explicitly
    |named in the parameter list) are passed in integer registers.

    So if I 'printf("%f",1.0)' without first declaring printf, the program
    won't work. I just tried out compiling the following program on
    RISC-V with gcc 10.3.1:

    int main()
    {
      printf("%f\n",1.0);
    }

    int xxx()
    {
      yyy("%f\n",1.0,2);
    }

    Note that there is no "#include <stdio.h>" or any declaration of
    printf() or yyy(). Yet 1.0 is passed to printf() in a1, while it is
    passed to yyy() in fa0, and 2 is passed to yyy() in a1.

    And gcc works around the varargs decision by using the varargs calling convention for some well-known vararg functions like printf, while
    other undeclared functions use the non-varargs calling convention.
    Apparently the fallout of that decision by the RISC-V people hit a
    "relevant" program.

    [1] Apparently they stuck with the decision to deal differently with
    varargs, and then decided to change the rest of the calling convention
    to benefit from that decision by not leaving holes in the FP argument
    registers for integers and vice versa. I don't find this clearly
    expressed in <https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc>.
    The only thing that points in that direction is:

    |Values are passed in floating-point registers whenever possible,
    |whether or not the integer registers have been exhausted.

    But this does not talk about how the integer argument register
    numbering is changed by the "Hardware Floating-point Calling
    Convention".

    I certainly have a use for as many arguments as the ABI provides,

    Ah, yes, machine-generated code can always defy intuitions about what
    is "typical".

    While I use a generator for my interpreter engines, many other people
    hand-code them. They would probably use macros for the function
    declaration and the tail-call, though. Or maybe a macro that wraps
    the whole payload so that one can easily switch between this technique
    and one of the others.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Thu Jan 9 10:07:36 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture.

    Interestingly, tail call optimization (which I believe you like)
    can cause bugs with mismatched arguments when different functions
    disagree about the stack size. Here is a nasty case with sibling
    calls:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90329

    So, if you want to allow mismatched declarations, better
    disable tail calls, to be on the safe side.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jan 9 20:48:07 2025
    On Thu, 9 Jan 2025 7:23:57 +0000, Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    No, I would stand my ground and mandate that they do work.

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs?

    One of the salient points that allowed C to overtake Pascal is that
    you can write printf() in C while you cannot write write() in Pascal.
    Do not break this assumption on any architecture.

    How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    only sloppy ancient C calls
    functions without proper declarations)

    You find it ok to design a calling convention such that ancient C
    programs do not work?

    I went the other way: I made an ABI that made varargs EASY !!
    and in such a way that the caller does not need to know that the
    callee is varargs.

    What benefit do you expect from such a calling convention? To allow
    to use registers as arguments (and not callee-saved) that would
    otherwise be preferably used as callee-saved registers?

    I found no particular problem in passing a fixed number of arguments
    in registers and the rest on a stack. va_start dumps the registers
    onto the stack to form a vector of arguments in memory and
    initializes the pointer to where the registers got stuck on the
    stack; va_arg then walks through that vector.
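
    For illustration, here is a minimal, hedged sketch (standard C stdarg
    usage, not My 66000 code) of the model being described: the callee can
    treat the spilled register arguments and any stack arguments as one
    contiguous vector, which is all that va_start/va_arg need.

    #include <stdarg.h>
    #include <stdio.h>

    /* Sum 'count' long arguments.  Under an ABI like the one described,
       va_start conceptually spills the argument registers next to any
       stack-passed arguments, and va_arg just walks that contiguous
       memory - the caller never needs to know the callee is varargs. */
    static long sum_longs(int count, ...)
    {
      va_list ap;
      long total = 0;

      va_start(ap, count);
      for (int i = 0; i < count; i++)
        total += va_arg(ap, long);
      va_end(ap);
      return total;
    }

    int main(void)
    {
      printf("%ld\n", sum_longs(3, 1L, 2L, 3L));  /* prints 6 */
      return 0;
    }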

    However, I wonder why, e.g., RISC-V does not allow the use of all caller-saved registers as arguments.

    A) we need some registers for passing of arguments, and some
    ..for returning results.
    B) we need some temporary registers so short leaf subroutines
    ..do not need stack space in order to compute with the given
    ..arguments
    C) we need some registers for holding onto caller's state while
    ..processing callee operations
    D) there is generally a register holding the return address.

    Generally (A) and (B) have a sliding window. The fewer arguments
    and results, the more temporary registers.

    (C) includes FP and SP as callee preserved registers--that is
    ..when control returns to caller R16..R31 contain the same
    ..values as when the CALL was performed.

    In looking at code out of My 66000 LLVM compiler, there are so
    few subroutines with "that many" arguments and results, that
    mandating more than 8 arguments or results go through memory
    is not really a performance burden.

    Also: more callee-saved (preserved) registers cause more stack
    space to be allocated for the 'temporary' registers. Say you want
    all the registers (except the return address register and the
    return result register) to be preserved across a subroutine call:
    a small subroutine needing 3 registers to perform its calculations
    now has 3 STs and 3 LDs to preserve caller registers, whereas
    with temporary registers there is no overhead.

    In addition to the 8 argument
    registers (a0-a7=x10-x17), RISC-V has 7 additional caller-saved
    registers: t0-t6(=x5-x7,x28-x31); for FP registers it's even more
    extreme: 8 argument registers fa0-fa7=f10-f17, and 12 additional
    caller-saved registers ft0-ft11=f0-f7,f28-f31.

    even if it comes at the cost of
    using different calling conventions for the two cases.

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture. Looking at <https://pdos.csail.mit.edu/6.828/2023/readings/riscv-calling.pdf>,
    the RISC-V people find that acceptable:

    |If argument i < 8 is a floating-point type, it is passed in
    |floating-point register fai; [...] Additionally, floating-point
    |arguments to variadic functions (except those that are explicitly
    |named in the parameter list) are passed in integer registers.

    So if I 'printf("%f",1.0)' without first declaring printf, the program
    won't work. I just tried out compiling the following program on
    RISC-V with gcc 10.3.1:

    int main()
    {
      printf("%f\n",1.0);
    }

    int xxx()
    {
      yyy("%f\n",1.0,2);
    }

    Note that there is no "#include <stdio.h>" or any declaration of
    printf() or yyy(). Yet 1.0 is passed to printf() in a1, while it is
    passed to yyy() in fa0, and 2 is passed to yyy() in a1.

    This is bad...not horrible, but bad.

    And gcc works around the varargs decision by using the varargs calling convention for some well-known vararg functions like printf, while
    other undeclared functions use the non-varargs calling convention.
    Apparently the fallout of that decision by the RISC-V people hit a
    "relevant" program.

    A good ABI does not need these distinctions.

    It also leaves open the possibility that code compiled partially by
    GCC and linked with code compiled by LLVM will have interoperability
    issues on mundane calls.

    [1] Apparently they stuck with the decision to deal differently with
    varargs, and then decided to change the rest of the calling convention
    to benefit from that decision by not leaving holes in the FP argument registers for integers and vice versa. I don't find this clearly
    expressed in <https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc>.
    The only thing that points in that direction is:

    |Values are passed in floating-point registers whenever possible,
    |whether or not the integer registers have been exhausted.

    But this does not talk about how the integer argument register
    numbering is changed by the "Hardware Floating-point Calling
    Convention".

    I certainly have a use for as many arguments as the ABI provides,

    Ah, yes, machine-generated code can always defy intuitions about what
    is "typical".

    While I use a generator for my interpreter engines, many other people hand-code them. They would probably use macros for the function
    declaration and the tail-call, though. Or maybe a macro that wraps
    the whole payload so that one can easily switch between this technique
    and one of the others.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Thu Jan 9 21:23:30 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 9 Jan 2025 7:23:57 +0000, Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    No, I would stand my ground and mandate that they do work.

    That can be tricky. You can read

    https://blog.r-project.org/2019/05/15/gfortran-issues-with-lapack/index.html

    and its sequel

    https://blog.r-project.org/2019/09/25/gfortran-issues-with-lapack-ii/

    as a cautionary tale.

    To cut this a little shorter: Assume eight arguments are passed in
    registers, like for My 66000.

    Caller calls

    foo (a1, a2, a3, a4, a5, a6, a7, a8);

    Callee side:

    foo (a1, a2, a3, a4, a5, a6, a7, a8, a9)

    Foo ends with

    bar (b1, b2, b3, b4, b5, b6, b7, b8, b9);

    and wants to save stack space, so it stores the value of b9 into
    the space where it was supposed to be, and then branches to bar.
    Result: Stack corruption.
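
    A hedged C rendition of this scenario (file layout and names are
    invented; it assumes an ABI where the ninth argument travels on the
    stack and a compiler that performs sibling-call optimization):

    /* caller.c - the caller believes foo takes 8 arguments, so it
       never reserves a stack slot for a 9th. */
    long foo(long, long, long, long, long, long, long, long);

    long use(void)
    {
      return foo(1, 2, 3, 4, 5, 6, 7, 8);
    }

    /* foo.c - foo is really defined with 9 parameters and ends in a
       call that the compiler may emit as a sibling call.  If it reuses
       its own incoming argument area to pass b9, it writes into a stack
       slot the caller never allocated: the corruption described above. */
    long bar(long, long, long, long, long, long, long, long, long);

    long foo(long a1, long a2, long a3, long a4, long a5,
             long a6, long a7, long a8, long a9)
    {
      return bar(a1, a2, a3, a4, a5, a6, a7, a8, a9 + 1);
    }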

    What would you tell your ABI designer in that case? Don't do tail
    calls, it is better to use more stack space, with all effect on
    stack sizes and locality that would have?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Fri Jan 10 01:08:16 2025
    On Thu, 9 Jan 2025 21:23:30 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 9 Jan 2025 7:23:57 +0000, Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the >>>>> number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>>>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    No, I would stand my ground and mandate that they do work.

    That can be tricky. You can read

    https://blog.r-project.org/2019/05/15/gfortran-issues-with-lapack/index.html

    and its sequel

    https://blog.r-project.org/2019/09/25/gfortran-issues-with-lapack-ii/

    as a cautionary tale.

    Yes, I had to make a nasty ABI work on the HEP (Denelcor)

    To cut this a little shorter: Assume eight arguments are passed in registers, like for My 66000.

    Caller calls

    foo (a1, a2, a3, a4, a5, a6, a7, a8);

    Callee side:

    foo (a1, a2, a3, a4, a5, a6, a7, a8, a9)

    Foo ends with

    bar (b1, b2, b3, b4, b5, b6, b7, b8, b9);

    and wants to save stack space, so it stores the value of b9 into
    the space where it was supposed to be, and then branches to bar.
    Result: Stack corruption.

    What would you tell your ABI designer in that case? Don't do tail
    calls, it is better to use more stack space, with all effect on
    stack sizes and locality that would have?

    Same response I would give to::

    printf( "%d %d %d %d %d/r", a[i] );

    "They deserve what they get".

    You will notice that no ISA has ever had a "go jump in the lake"
    instruction. For had there been, computers would not have survived
    to the present--they would all be in the lake...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Anton Ertl on Fri Jan 10 08:33:19 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happened many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.
    I remember issue with writing to strings: early C compilers
    put literal strings in writable memory and programs assumed that
    they can change strings. C 'errno' was made more abstract due
    to multithreading, it broke some programs. Concerning varargs,
    Power PC and later AMD-64 used calling convention incompatible
    with popular expectations.

    Concerning customers, they will tolerate a lot of things, as long
    as there are benefits (faster or cheaper machines, better security,
    etc.) and fixes require reasonable amount of work. So that
    really is question of cost/benefit ratio.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Fri Jan 10 08:24:30 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture.

    Interestingly, tail call optimization (which I believe you like)
    can cause bugs with mismatched arguments when different functions
    disagree about the stack size.

    I have a use case for tail-call optimization. When I first looked
    into that around 1994, I found that gcc does not perform tail-call optimization, and I was surprised, because it had been written by a
    Lisp programmer.

    When I looked into the reasons, I found that in C calling conventions
    typically the caller is responsible for allocating stack space for
    arguments and for deallocating that stack space. The reason for that
    is varargs and the fact that in old C there was no requirement to
    define a prototype of a function (including vararg functions). If,
    for a call just before a return, the function needs to put
    deallocating code between the call and the return, the call is not a
    tail call and therefore cannot be tail-call optimized.

    So I thought that with C calling conventions (necessitated by the
    properties of the C language), tail-call optimization is not possible,
    but Mark Probst, a student in our group, actually managed to deal with
    the tail-recursion case
    <https://www.complang.tuwien.ac.at/schani/diplarb.ps>.

    A few years later sibling call optimization (more restrictive than
    general tail-call optimization, but less restrictive than
    tail-recursion elimination) appeared in gcc. The gcc manual
    apparently does not say what a sibling call is, but <https://stackoverflow.com/questions/22037261/what-does-sibling-calls-mean> says "where caller function and callee function do not need to be
    same, but they have compatible stack footprint.". Given the bug you
    point out, that's obviously not restrictive enough to be correct in
    all cases.

    Concerning my use case, for me it's good enough if tail-calls are
    optimized when the caller and the callee have the same argument types
    and return type, and the arguments fit in registers. So if in your
    buggy case gcc decided not to optimize the call as sibling call, my
    use case would not be affected.

    Moreover, I need a guarantee that a call is actually
    tail-call-optimized (and if not, compilation should ideally error out,
    saving me the need to validate that property afterwards), and I would
    be willing to put some text in the source code that indicates that
    intent. E.g., something along the lines of

    void add(VMinst *ip, long *sp, long sp_top)
    {
      /* payload start */
      sp_top += *sp++;
      /* payload end */
      /* invoke the next VM instruction */
      (*ip)(ip+1,sp,sp_top) __attribute__("tail-call optimized");
    }

    Existing code would be unaffected by such an approach to tail-call optimization.
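
    For comparison: clang (and, more recently, gcc) provide a musttail
    attribute on return statements that gives roughly the guarantee asked
    for above - compilation fails if the call cannot be emitted as a tail
    call.  A minimal sketch along those lines (the generic_fn/vminst_fn
    typedefs are assumptions made here for illustration, not Anton's
    actual code):

    typedef void (*generic_fn)(void);
    typedef void vminst_fn(generic_fn *ip, long *sp, long sp_top);

    void vm_add(generic_fn *ip, long *sp, long sp_top)
    {
      /* payload */
      sp_top += *sp++;
      /* invoke the next VM instruction as a mandatory tail call;
         the compiler errors out if it cannot honour the request */
      vminst_fn *next = (vminst_fn *)*ip;
      __attribute__((musttail)) return next(ip + 1, sp, sp_top);
    }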

    Beyond my use case: tail-call optimization has to be applied like
    every other optimization: it must preserve the behaviour of existing,
    working programs, i.e., the result must be equivalent. In the bug you
    mention, this obviously was not the case, and one way out would be not
    to apply tail-call optimization in this case and similar cases (maybe
    in all cases where arguments are in memory). That looks like a simple
    way to fix the bug. Maybe there's a less restrictive one.

    Sure one can wish that C was different (e.g., like the fantasy that
    all C programs are strictly conforming to some particular C standard
    that turns some desired transformation into a correct optimization),
    but existing, working programs are far more relevant than the wishes
    for some transformation IMO; there are a lot of people who see this differently, but it seems to me that these people not only wish that
    the old C programs vanish, but they don't care much about new C
    programs (apart from a few benchmarks), either. After all, they don't
    program in C, but in C++, Fortran, Rust, or something else.

    Actually, concerning the fantasy mentioned above, gcc already offers
    options such as -std=c23 and -pedantic which would allow the user to
    tell gcc that the compiled program actually lives in this fantasy
    world, but if the user did not ask for pain, a compiler should not
    provide it.

    So, if you want to allow mismatched declarations, better
    disable tail calls, to be on the safe side.

    That would be a way of dealing with the problem. It matches the
    general pattern of people defending transformations that do not
    preserve program equivalence (i.e., are buggy when intended as
    optimizations) by putting up a straw man that disables correct
    optimizations in addition to transformations that do not preserve
    program equivalence.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Fri Jan 10 09:19:27 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 9 Jan 2025 21:23:30 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Thu, 9 Jan 2025 7:23:57 +0000, Anton Ertl wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C, >>>>>> including varargs and often also tolerant of differences between the >>>>>> number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>>>>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V). >>>> Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need >>>> to work. Would you tell him that they don't need to work?

    No, I would stand my ground and mandate that they do work.

    That can be tricky. You can read

    https://blog.r-project.org/2019/05/15/gfortran-issues-with-lapack/index.html >>
    and its sequel

    https://blog.r-project.org/2019/09/25/gfortran-issues-with-lapack-ii/

    as a cautionary tale.

    Yes, I had to make a nasty ABI work on the HEP (Denelcor)

    To cut this a little shorter: Assume eight arguments are passed in
    registers, like for My 66000.

    Caller calls

    foo (a1, a2, a3, a4, a5, a6, a7, a8);

    Callee side:

    foo (a1, a2, a3, a4, a5, a6, a7, a8, a9)

    Foo ends with

    bar (b1, b2, b3, b4, b5, b6, b7, b8, b9);

    and wants to save stack space, so it stores the value of b9 into
    the space where it was supposed to be, and then branches to bar.
    Result: Stack corruption.

    What would you tell your ABI designer in that case? Don't do tail
    calls, it is better to use more stack space, with all effect on
    stack sizes and locality that would have?

    Same response I would give to::

    printf( "%d %d %d %d %d/r", a[i] );

    "They deserve what they get".

    So, mismatched arguments don't need to work? We're in agreement, then.

    You will notice that no ISA has ever had a "go jump in the lake"
    instruction. For had there been, computers would not have survived
    to the present--they would all be in the lake...

    I don't find it in

    https://paws.kettering.edu/~jhuggins/humor/opcodes.html so I guess
    it does not exist. (That list is old; it was floating around when
    /pub directories were still open on ftp servers...)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Waldek Hebisch on Fri Jan 10 10:25:23 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happened many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.

    That case is interesting. It's certainly a benefit to programmers if
    most uses of NULL produce a SIGSEGV, but for existing programs a
    mapping that allows accessible memory in page 0 is an advantage. So how
    did we get from there to where we are now?

    First, my guess is that the VAX is only called out because it was so
    popular, and it was one of the first Unix machines where doing it
    differently was possible. I am sure that earlier Unix targets without
    virtual memory used memory starting with address 1 because they would
    otherwise have wasted precious memory.

    Anyway, once we had virtual memory, whether to use the start of the
    address space is not an issue of the ABI (which is hard to change),
    but could be determined by programmers on linking. I guess that at
    first they used explicit options for making the first page
    inaccessible, and these options soon became the defaults. By the time
    I started with Unix in the later 1980s, that battle was over; I
    certainly never experienced it as an issue, and only read about it in
    papers on VAXocentrism.

    I remember issue with writing to strings: early C compilers
    put literal strings in writable memory and programs assumed that
    they can change strings.

    gcc definitely had an option for that. Again not an ABI issue, but
    one that can be controlled by programmers on compilation.

    C 'errno' was made more abstract due
    to multithreading, it broke some programs.

    That's pretty similar to an ABI issue (not sure if errno is in the
    ABIs or not). And the really perverse thing is that raw Unix and
    Linux system calls have been thread-safe from the start. It's only
    the limitation of the C language in early times (no struct returns,
    bringing us back to the topic of the thread) that gave us the errno
    variable in the C wrappers of these system calls that turned out not
    to be thread-safe and led to problems later.
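
    A hedged sketch of the alternative alluded to here (raw_read and the
    Linux-style convention of returning a negative error code in a
    register are assumptions made for illustration): with small struct
    returns, a system-call wrapper can hand back the value and the error
    together, and no shared errno is needed at all.

    struct read_result {
      long count;   /* bytes read, or -1 on error */
      int  error;   /* errno-style code, 0 on success */
    };

    /* assumed primitive: returns byte count, or a negative error code */
    extern long raw_read(int fd, void *buf, unsigned long len);

    struct read_result my_read(int fd, void *buf, unsigned long len)
    {
      long r = raw_read(fd, buf, len);
      struct read_result res;

      if (r < 0) { res.count = -1; res.error = (int)-r; }
      else       { res.count = r;  res.error = 0; }
      return res;
    }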

    Concerning varargs,
    Power PC and later AMD-64 used calling convention incompatible
    with popular expectations.

    I did not experience calling convention problems on PowerPC in my
    software, so apparently it was compatible with my expectations.

    Still, Power(PC) is very niche. I recently talked to someone who
    worked a lot on Power while he was at IBM (now he no longer works for
    IBM); I asked him why people are buying Power, and he said something
    along the lines that IBM is satisfying a base of established
    customers. Maybe Power would be more popular if it had had a calling convention compatible with popular expectations, but probably not.

    As for AMD64, whatever popular expectation they may have been
    incompatible with (again I experienced no problems), the user could
    fall back to the IA-32 calling convention (i.e., compile the program
    as a 32-bit program, or just run the existing 32-bit binary),
    providing an easy workaround for ABI problems for existing, working
    programs.

    Concerning customers, they will tolerate a lot of things, as long
    as there are benefits (faster

    Didn't work out for Alpha.

    or cheaper machines,

    People are abandoning PCs in favour of Raspis? Does not look that way
    to me.

    better security,

    Oh, really? Which machine became a success because of better security?

    etc.) and fixes require reasonable amount of work.

    Many customers expect a machine that's compatible with their legacy
    software, and are not willing (or at all able) to "fix" it. Many even
    require machines that are officially supported by the software vendor.
    And for a software vendor, the need for one fix is probably a sign
    that the platform is not as compatible as they would like, and that
    qualifying that platform requires more work, and they will charge that
    work to the platform's customers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Waldek Hebisch on Fri Jan 10 14:43:31 2025
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    I can agree that it's important to support those use-cases (varargs >>>obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happened many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.

    That was a BSD thing. USL spent a fair bit of time fixing
    BSD utilities that relied on the BSD behaviour when porting
    to System V release 4.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Fri Jan 10 15:17:29 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Such things happened many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.

    That case is interesting. It's certainly a benefit to programmers if
    most uses of NULL produce a SIGSEGV, but for existing programs mapping >allowing to have accessible memory in page 0 is an advantage. So how
    did we get from there to where we are now?

    First, my guess is that the VAX is only called out because it was so
    popular, and it was one of the first Unix machines where doing it
    differently was possible. I am sure that earlier Unix tragets without >virtual memory used memory starting with address 1 because they would >otherwise have wasted precious memory.

    It was a bug. As I recall, the first thing in the address space in Berkeley Unix
    was a register save mask where the low byte happened to be zero, and a lot of sloppy programs written by students accidentally depended on it, e.g.

    if(*p == 0) /* no string */

    For a while ports to 68K and other architectures ensured there was a zero byte at
    location zero so the Berkeley programs wouldn't crash, but eventually people fixed
    the code.

    Location 0 on the PDP-11 had nothing memorable and we did our string tests correctly.
    You could dereference a null pointer but you got a string of junk.




    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Anton Ertl on Fri Jan 10 18:39:12 2025
    On 09/01/2025 08:23, Anton Ertl wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    Why should an ABI be tolerant of such differences? In C, calling a
    function with an unexpected number (or type) of arguments has always
    been undefined behaviour, and always been something that programmers
    have strived to avoid. For variadic functions (including old
    pre-standard functions), the code does not declare the number or types
    of arguments, but you still have to match up the caller and callee.
    Call printf() with a mismatch between the format string and the
    arguments, and you can expect nasal daemons.

    I am all in favour of things like ABIs not intentionally making things significantly worse - no one wants a system that turns code bugs into
    something like an exploitable security hole from stack corruption.

    However, I see no good reason to try to make things "work" with broken
    code. An ABI should be designed with an emphasis on being efficient for correct code - not for being tolerant of hopelessly incorrect code.

    C /does/ require support for variadic functions, so that has to be in
    any ABI usable with C.


    I can agree that it's important to support those use-cases (varargs
    obviously, mismatched arg numbers less so),

    You are head of a group of people who design a new architecture (say,
    it's 2010 and you design ARM A64, or it's 2014 and you design RISC-V).
    Your ABI designer comes to you and tells you that his life would be
    easier if it was ok that programs with mismatched arguments don't need
    to work. Would you tell him that they don't need to work?


    I would, yes. The efficiency of good code should not suffer because of
    the existence of bad code.

    I'd still try to avoid making results that are more dangerous than
    necessary. Maybe you do that by making it clear to compiler writers
    that the ABI should not be used with a compiler that supports implicit
    function declarations - that would block most risky or broken code at
    compile time. Perhaps you say that object files using this ABI need
    extra sections holding basic information about the function's parameters
    with its definition, and about the arguments when calling the function,
    and encourage linkers to check for mismatches. There would surely be
    cases where you can't check - casts of function pointer types, dynamic
    linking, etc., - but you would again eliminate a large proportion of errors.

    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    How many people actually want to use code where some functions are
    called with an incorrect number of parameters? Such code is /broken/.
    If it ever gave results that the original users were happy with, it is
    by luck - no matter what ABI you have for your new architecture and new
    tools, it's pure luck whether things work or not in any sense.

    So the best you can do for your prospective customers is tell them that
    you prioritise the results for correct code and help them with tools to
    find mistakes in their ancient broken code.

    Accepting the unfortunate reality that most code of a significant size
    has /some/ bugs in it does not mean it is a good idea to reduce the
    efficiency of good code in a vain attempt at replicating the luck of old undefined behaviour on other platforms! That is especially true for a
    class of error that only exists due to very sloppy development
    practices, and should be identifiable by automatic linting and static
    checking.

    It never ceases to disappoint me how lax C is at fixing things that were
    design flaws from day one of the language. Backwards compatibility is
    very important, but allowing such crappy coding to be accepted simply encourages more people to write crappy code for longer. C compilers are
    even worse, as they usually support crappy code for longer than the C standards. Implicit function declarations were removed from the C
    language in C99, and non-prototype declarations were made obsolescent in
    C90, yet not removed from the language until C23.


    only sloppy ancient C calls
    functions without proper declarations)

    You find it ok to design a calling convention such that ancient C
    programs do not work?


    My original post was about an ABI for microcontroller programming. For
    that use, my answer is a definite "yes".

    For more general use, my answer would also be "yes" for a new
    architecture and ABI. I don't see why anyone should pander to ancient
    sloppy code. If there really is a significant body of C code that does
    not use function prototypes, and that code really is still useful and
    relevant, then it should not be much of a challenge to write a little
    utility program that converts the old code to something more modern.
    Maybe clang-format can already do that.

    What benefit do you expect from such a calling convention? To allow
    to use registers as arguments (and not callee-saved) that would
    otherwise be preferably used as callee-saved registers?


    That sounds like a benefit to me.


    even if it comes at the cost of
    using different calling conventions for the two cases.

    That would mean that you find it ok that existing programs that use
    vararg functions like printf but do not declare them before use don't
    work on your newfangled architecture.

    I would certainly be OK with that. I can understand that some people
    will disagree, but I really think there are better ways to handle old
    and/or broken code.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Fri Jan 10 18:39:19 2025
    David Brown <david.brown@hesbynett.no> writes:
    On 09/01/2025 08:23, Anton Ertl wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the
    number of arguments in the caller and callee.

    Why should an ABI be tolerant of such differences? In C, calling a
    function with an unexpected number (or type) of arguments has always
    been undefined behaviour, and always been something that programmers
    have strived to avoid. For variadic functions (including old
    pre-standard functions), the code does not declare the number or types
    of arguments, but you still have to match up the caller and callee.

    I'm not sure that's completely true. Consider, for example,
    main(). It's sort of variadic, but most applications only declare
    the standard C argc/argv arguments. POSIX systems supply
    a third parameter (envp) and most unix/linux implementations
    supply a fourth parameter (auxv).

    I should think so long as the caller provides at least enough
    parameters to match the callee, there shouldn't be any
    issues.

    Call printf() with a mismatch between the format string and the
    arguments, and you can expect nasal daemons.

    Not if you provide _more_ parameters than the format string
    requires, which can happen with e.g. i18n error message strings.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to David Brown on Fri Jan 10 19:19:08 2025
    David Brown <david.brown@hesbynett.no> schrieb:

    How many people actually want to use code where some functions are
    called with an incorrect number of parameters? Such code is /broken/.

    Agreed (at least in principle).

    If it ever gave results that the original users were happy with, it is
    by luck - no matter what ABI you have for your new architecture and new tools, it's pure luck whether things work or not in any sense.

    It gets worse when the code in question has been around for decades,
    and is widely used. Some ABIs, such as the x86-64 psABI, are very
    forgiving of errors.

    So the best you can do for your prospective customers is tell them that
    you prioritise the results for correct code and help them with tools to
    find mistakes in their ancient broken code.

    Now, you can also tell them to use LTO for checks for any old
    software.

    Example:

    $ cat main.c
    #include <stdio.h>

    int foo(int);

    int main()
    {
      printf ("%d\n", foo(42));
    }
    $ cat foo.c
    int foo (int a, int b)
    {
      return a + 2;
    }
    $ gcc -O2 -flto main.c foo.c
    main.c:3:5: warning: type of 'foo' does not match original declaration [-Wlto-type-mismatch]
        3 | int foo(int);
          |     ^
    foo.c:1:5: note: type mismatch in parameter 2
        1 | int foo (int a, int b)
          |     ^
    foo.c:1:5: note: type 'int' should match type 'void'
    foo.c:1:5: note: 'foo' was previously declared here

    This also works when the declaration is hidden (for example when
    the violating code is emitted by a compiler for another language
    in the same compiler collection):

    $ cat main.f90
    program main
      implicit none
      interface
        function foo(a) result(ret) bind(c)
          use, intrinsic :: iso_c_binding, only: c_int
          integer(c_int), value :: a
          integer(c_int) :: ret
        end function foo
      end interface
      print *,foo(42)
    end program main
    $ gfortran -O2 -flto main.f90 foo.c
    main.f90:10:17: warning: type of 'foo' does not match original declaration [-Wlto-type-mismatch]
       10 |   print *,foo(42)
          |                 ^
    foo.c:1:5: note: type mismatch in parameter 2
        1 | int foo (int a, int b)
          |     ^
    foo.c:1:5: note: type 'int' should match type 'void'
    foo.c:1:5: note: 'foo' was previously declared here

    Excuses are running out.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Sun Jan 12 14:55:04 2025
    On 10/01/2025 19:39, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 09/01/2025 08:23, Anton Ertl wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    [Someone wrote:]
    ABI calling conventions tend to be designed to support at least C,
    including varargs and often also tolerant of differences between the >>>>> number of arguments in the caller and callee.

    Why should an ABI be tolerant of such differences? In C, calling a
    function with an unexpected number (or type) of arguments has always
    been undefined behaviour, and always been something that programmers
    have strived to avoid. For variadic functions (including old
    pre-standard functions), the code does not declare the number or types
    of arguments, but you still have to match up the caller and callee.

    I'm not sure that's completely true. Consider, for example,
    main(). It's sort of variadic, but most applications only declare
    the standard C argc/argv arguments. POSIX systems supply
    a third parameter (envp) and most unix/linux implementations
    supply a fourth parameter (auxv).

    I should think so long as the caller provides at least enough
    parameters to match the callee, there shouldn't be any
    issues.

    main() is a special case in C and C++ - it seems fine to say that it
    takes a particular implementation-defined set of parameters no matter
    how it is declared. If it is defined with fewer parameters than the implementation supports, then the definition should be treated as though
    those parameters were included but not used.


    Call printf() with a mismatch between the format string and the
    arguments, and you can expect nasal daemons.

    Not if you provide _more_ parameters than the format string
    requires, which can happen with e.g. i18n error message strings.


    I've always thought printf was a very unsafe design concept - that usage
    does not help!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Thomas Koenig on Sun Jan 12 14:59:09 2025
    On 10/01/2025 20:19, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    How many people actually want to use code where some functions are
    called with an incorrect number of parameters? Such code is /broken/.

    Agreed (at least in principle).

    If it ever gave results that the original users were happy with, it is
    by luck - no matter what ABI you have for your new architecture and new
    tools, it's pure luck whether things work or not in any sense.

    It gets worse when the code in question has been around for decades,
    and is widely used. Some ABIs, such as the x86-64 psABI, are very
    forgiving of errors.

    So the best you can do for your prospective customers is tell them that
    you prioritise the results for correct code and help them with tools to
    find mistakes in their ancient broken code.

    Now, you can also tell them to use LTO for checks for any old
    software.


    Excuses are running out.

    Yes, exactly.

    I saw somewhere a quotation that backwards compatibility just means
    repeating the same old mistakes. Backwards compatibility /is/
    important, but so is trying to improve coding practices!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Waldek Hebisch on Sun Jan 12 12:10:50 2025
    On 1/6/2025 6:11 PM, Waldek Hebisch wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi
    randomly into Lapack. First routine that I looked at had
    8 arguments, so within your limit. Second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    Large number of arguments is typical in old style Fortran numeric
    code.

    While there has been much discussion down thread relating to Waldek's
    other points, there hasn't been much about these.

    So, some questions. Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles? Or do these subroutines consume so many CPU cycles that the overhead of the large number of parameters is lost in the noise? Or is
    there some other explanation for Mitch not considering their importance?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Sun Jan 12 20:41:23 2025
    On Sun, 12 Jan 2025 20:10:50 +0000, Stephen Fuld wrote:

    On 1/6/2025 6:11 PM, Waldek Hebisch wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there more registers >>>> available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi
    randomly into Lapack. First routine that I looked at had
    8 arguments, so within your limit. Second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    Large number of arguments is typical in old style Fortran numeric
    code.

    While there has been much discussion down thread relating to Waldek's
    other points, there hasn't been much about these.

    So, some questions. Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles? Or do these subroutines consume so many CPU cycles that the overhead of the large number of parameters is lost in the noise? Or is
    there some other explanation for Mitch not considering their importance?

    Almost entirely the latter.

    The 13 cycles of overhead are invisible in a subroutine
    that takes 1B cycles to execute.

    But also note: The way Fortran passes array arguments is just perfect
    for avoiding almost all bounds checks. Arrays are used in loops where
    the initialization and termination are stored in the dope vector--
    which is trusted.
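
    A hedged C rendition of that point (the descriptor layout is invented
    for illustration): when the loop bounds come from the array's own
    descriptor, the loop condition subsumes the bounds check and no
    per-access check is needed.

    struct dope { double *base; long n; };   /* descriptor: data + extent */

    double sum(struct dope a)
    {
      double s = 0.0;
      for (long i = 0; i < a.n; i++)   /* bounds trusted from descriptor */
        s += a.base[i];
      return s;
    }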

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jan 13 02:10:10 2025
    On Fri, 10 Jan 2025 10:25:23 +0000, Anton Ertl wrote:

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    confoming standard programs? How many will be alarmed by your
    admission that you find it ok that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happended many times in the past. AFAIK standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.

    That case is interesting. It's certainly a benefit to programmers if
    most uses of NULL produce a SIGSEGV, but for existing programs mapping allowing to have accessible memory in page 0 is an advantage. So how
    did we get from there to where we are now?

    The blame goes to defining NULL as a pointer that is not pointing at
    anything. We have no integer that has the property of one value that
    is not an integer--we COULD have had such a value (NEG_MAX on 2's
    complement, -0 on 1's complement), but no..........

    C 'errno' was made more abstract due
    to multithreading, it broke some programs.

    errno is an atrocity all by itself; single-handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Lobbing errno over into Thread Local Store just makes the problems
    worse.

    That's pretty similar to an ABI issue (not sure if errno is in the
    ABIs or not).

    errno is not ABI, errno is part of subroutine definitions within
    a library. That errno can be set from different libraries, and
    that errno got dropped in TLS makes it doubly idiotic.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to Stephen Fuld on Mon Jan 13 01:20:38 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 1/6/2025 6:11 PM, Waldek Hebisch wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there more registers >>>> available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi
    randomly into Lapack. First routine that I looked at had
    8 arguments, so within your limit. Second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    Large number of arguments is typical in old style Fortran numeric
    code.

    While there has been much discussion down thread relating to Waldek's
    other points, there hasn't been much about these.

    So, some questions. Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles? Or do these subroutines consume so many CPU cycles that the overhead of the large number of parameters is lost in the noise? Or is
    there some other explanation for Mitch not considering their importance?

    Some comments to this:

    You are implicitly assuming that passing a large number of
    arguments is expensive. Of course, if you can do the job with
    a smaller number of arguments, then there may be some saving.
    However, a large number of arguments is partially there to increase
    performance. Let me illustrate this with an example having a
    smaller number of arguments. I have a routine which is briefly
    described below:

    ++ vector_combination(v1, c1, v2, c2, n, delta, p) replaces
    ++ first n + 1 entries of v1 by corresponding entries of
    ++ c1*v1+c2*x^delta*v2 mod p.

    There are 7 arguments here and it only deals with vectors (one-
    dimensional arrays). Instead of the routine above I could use
    5 separate routines: one to extract a subvector, one shifting
    entries, one multiplying a vector by a scalar, one for addition
    and one for replacing a subvector. Using separate routines
    would take roughly 3-5 times more time and require
    intermediate storage. Dynamically allocating this storage
    would decrease performance, and reusing statically allocated
    work vectors would significantly complicate the code. And
    of course having 5 calls instead of a single one also would
    complicate the code. So basically, I can use a routine with a
    large number of arguments which is doing more work and have
    simpler and faster code, or I could use "simpler" routines with
    a small number of arguments and get more complicated and slower
    code.
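
    As a rough C sketch of what such a combined routine does (hypothetical
    types and coefficient handling, not the actual library code, and ignoring
    overflow in the modular products):

    /* Replace the first n+1 entries of v1 by (c1*v1 + c2*x^delta*v2) mod p.
       Multiplying v2 by x^delta shifts its entries up by delta positions;
       shifted-in entries are treated as 0. */
    void vector_combination(long *v1, long c1, const long *v2, long c2,
                            long n, long delta, long p)
    {
        for (long i = 0; i <= n; i++) {
            long t = (c1 % p) * (v1[i] % p) % p;
            if (i >= delta)
                t = (t + (c2 % p) * (v2[i - delta] % p)) % p;
            v1[i] = ((t % p) + p) % p;   /* normalize into [0, p) */
        }
    }

    The point is that one pass over the data does the scaling, the shift and
    the addition at once, where the five separate routines would each traverse
    memory on their own.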

    My routine above was for vectors; a similar routine for arrays
    would have a larger number of parameters, exceeding 8. Actually,
    an already more general routine for vectors would have an extra
    parameter to specify the starting index (which currently is
    assumed to be 0 and is the only case that I need).

    In the case of Lapack, a reasonably typical case is a routine operating
    on a subblock of an array, which means that an array (subblock) is
    described by 4 arguments: a pointer to the first element, the leading
    dimension (that is, the corresponding dimension of the containing array)
    and the 2 dimensions of the subblock. Some dimensions may be shared, but
    clearly even in the simplest case there are several parameters.
    There may be additional numeric parameters, work areas, parameters
    specifying if an array is transposed or not (otherwise there would
    be need for separate routines or the user would be forced into a separate
    call of matrix transposition). There is a convention of
    returning information about possible errors in an 'INFO' variable.
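
    For illustration, a minimal C sketch of that convention (column-major
    storage as in Fortran; the names here are made up):

    /* An m-by-n subblock is described by: a = address of its first element,
       lda = leading dimension of the containing array, plus m and n. */
    void scale_subblock(double *a, int lda, int m, int n, double alpha)
    {
        for (int j = 0; j < n; j++)            /* columns */
            for (int i = 0; i < m; i++)        /* rows    */
                a[i + (long)j * lda] *= alpha; /* element (i,j) of the subblock */
    }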

    Lapack also has inefficiency due to Fortran conventions. Namely,
    in a natural C interface most arguments would be passed by value,
    but Fortran compilers pass arguments by reference. So
    even if all machine-level arguments were passed in registers,
    the values still need to be saved in memory by the caller and
    read back by the called routine.
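
    A hypothetical pair of routines (not from Lapack) illustrates the
    difference:

    /* Natural C interface: n and alpha arrive by value and can live in
       registers across the call. */
    void daxpy_byvalue(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    /* Fortran-style interface: everything by reference, so the caller must
       put n and alpha in memory and the callee must load them back. */
    void daxpy_byref(const int *n, const double *alpha, const double *x,
                     double *y)
    {
        int    nn = *n;        /* read the scalar arguments back from memory */
        double aa = *alpha;
        for (int i = 0; i < nn; i++)
            y[i] += aa * x[i];
    }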

    Modern languages have support for records/structures, so at source
    code level the number of arguments may be smaller. However, passing
    structures by address is efficient when structures are only
    passed down (quite a typical case in modern code, where data goes
    through several layers before doing real work), but it incurs a cost
    when there is actual access. Passing structures by value means
    that the number of parameters is nominally smaller, but there is
    still a need to pass several values.

    Concerning what machine architects do: for a long time the goal was
    high _average_ performance, based on some prediction of load.
    A large number of arguments is reasonably frequent in scientific
    codes. The modern tendency is to pass addresses of aggregates, as
    that is better behaved in OO contexts. I am not aware of any
    publicly available substantial body of realistic COBOL code,
    but a reasonable guess is that COBOL routines do quite a lot of
    work between calls. In a non-OO, non-functional context the compiler
    can inline small routines, effectively leading to a case where
    calls are relatively rare. AFAIR, for the initial AMD-64 gcc port
    the Suse team that did it claimed about 2-3% better performance due
    to a complicated calling convention trying to optimize use
    of registers. In particular they measured the object code size
    of a large body of Linux programs (they had no real hardware,
    so were unable to measure code speed) and optimized the convention
    based on this. Later, an Intel team claimed that due to improved
    inlining calls were rare and the effect of the calling convention was
    of the order of a fraction of a percent. Of course, AMD-64 is limited
    by its 16 general purpose registers; on a machine with more registers
    one can pass more arguments in registers, but I doubt that it
    pays to go above 10-12. OTOH I think that having more
    return registers (I mean a number comparable to the argument-passing
    registers) would improve performance, but probably
    code returning many values is so rare that architects do not
    care much.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Jan 13 14:19:43 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Fri, 10 Jan 2025 10:25:23 +0000, Anton Ertl wrote:

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    If yes, a few years down the road your prospective customers have to
    decide whether to go for your newfangled architecture or one of the
    established ones. They learn that a number of programs work
    everywhere else, but not on your architecture. How many of them will
    be placated by your reasoning that these programs are not strictly
    conforming standard programs? How many will be alarmed by your
    admission that you find it ok that such programs
    don't work on your architecture? After all, hardly any program is a
    strictly conforming standard program.

    Such things happened many times in the past. AFAIK the standard
    setup on a VAX was that accessing data at address 0 gave you 0.
    A lot of VAX programs needed fixes to run on different machines.

    That case is interesting. It's certainly a benefit to programmers if
    most uses of NULL produce a SIGSEGV, but for existing programs a mapping
    that allows accessible memory in page 0 is an advantage. So how
    did we get from there to where we are now?

    The blame goes to defining NULL as a pointer that is not pointing at
    anything. We have no integer value reserved to mean "not an integer"--we
    COULD have had such a value (NEG_MAX on 2's
    complement, -0 on 1's complement), but no..........

    One of the advantages of BCD systems - we could define a NULL
    pointer value that was non-zero, non-numeric, and didn't point
    to anything.

    (0xc0eeeeee).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Jan 13 10:55:15 2025
    Anton Ertl [2025-01-09 08:38:32] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    AFAIK in these cases the same compiler generates the code for the
    function and for the calls, so it should be pretty much free to use any
    calling convention it likes.
    With separate compilation, the compiler does not know which other
    compiler generates the code for the caller of a function or the callee
    of a function.

    My reply was to:

    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting.

    Can you show me an example where that happens and where the caller and
    the callee can be generated by different compilers?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Mon Jan 13 18:02:10 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Mon Jan 13 19:00:53 2025
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Mon Jan 13 21:33:32 2025
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles?

    It's less than it used to be in the days when supercomputers
    roamed the computer centers, but for these applications where
    it matters, it can be significant.

    Or do these subroutines consume so many CPU cycles that the
    overhead of the large number of parameters is lost in the noise?

    If you have many small matrices to multiply, startup overhead
    can be quite significant. Not on a 2000*2000 matrix, though.

    Or is
    there some other explanation for Mitch not considering their importance?

    I think eight arguments, passed by reference in registers, is not
    too bad.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Jan 13 21:53:55 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Jan 13 22:02:23 2025
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and producing an IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    But with functions that take 754 arguments and produce 754
    results, it seems unnecessary.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Jan 13 22:40:02 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Jan 14 02:32:23 2025
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 14 06:20:43 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    That makes SIMD-style vectorization of transcendentals...
    interesting.

    Hmmm... looking around, it seems that C++ has the same requirement
    since C++11. One more reason why Fortran is a better language
    for numerics than C++...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to All on Tue Jan 14 06:48:45 2025
    I wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles?

    It's less than it used to be in the days when supercomputers
    roamed the computer centers, but for these applications where
    it matters, it can be significant.

    Or do these subroutines consume so many CPU cycles that the
    overhead of the large number of parameters is lost in the noise?

    If you have many small matrices to multiply, startup overhead
    can be quite significant. Not on a 2000*2000 matrix, though.

    Or is
    there some other explanation for Mitch not considering their importance?

    I think eight arguments, passed by reference in registers, is not
    too bad.

    ... when the rest can be passed on the stack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Tue Jan 14 15:05:52 2025
    On 13/01/2025 23:40, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.


    You know POSIX better than I do, but AFAIK "math_errhandling" is a fixed
    value set by the implementation, usually as a macro. Certainly with a
    quick check with gcc on Linux, I could not set the bits in math_errhandling.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to All on Tue Jan 14 15:08:03 2025
    On 14/01/2025 03:32, MitchAlsup1 wrote:
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions.  Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide.  If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    Seems crazy to me too.

    gcc at least has a "-fno-math-errno" flag that skips errno setting for
    maths functions that are executed as a single instruction. That makes a
    big difference to things like "sqrt".
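
    A small example of the effect (the generated code is compiler- and
    target-dependent; the comments describe typical x86-64 gcc output and are
    worth verifying with a local compile):

    #include <math.h>

    double root(double x)
    {
        return sqrt(x);
    }

    /* With the default -fmath-errno, gcc typically emits the square-root
       instruction plus a compare and a conditional call to the library
       sqrt() for negative arguments, so that errno can be set.
       With -fno-math-errno the compare and call disappear and only the
       single square-root instruction remains. */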

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jan 14 14:22:19 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    The subroutine clearly does more than "do all the work in a single instruction".

    How does your instruction support all the functionality
    required by the POSIX specification for the sin(3) library function?

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to David Brown on Tue Jan 14 14:39:15 2025
    David Brown <david.brown@hesbynett.no> writes:
    On 13/01/2025 23:40, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing
    direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.


    You know POSIX better than I do, but AFAIK "math_errhandling" is a fixed
    value set by the implementation, usually as a macro. Certainly with a
    quick check with gcc on Linux, I could not set the bits in math_errhandling.


    Yes, the programmer in this case would instruct the compiler what
    the value of math_errhandling should be, e.g. with -ffast-math.

    https://gcc.gnu.org/wiki/FloatingPointMath

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Tue Jan 14 16:41:28 2025
    On Tue, 14 Jan 2025 14:22:19 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing >>>>>>> direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    The subroutine clearly does more than "do all the work in a single instruction".

    How does your instruction support all the functionality
    required by the POSIX specification for the sin(3) library function?

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html


    I see no problems for as long as (math_errhandling & MATH_ERRNO)==0.
    Which sounds like the more sensible choice regardless of the question of
    instruction vs library.
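
    A minimal sketch of how a portable caller can test which reporting method
    the implementation actually provides (standard C99 macros; the error
    handling here is only illustrative):

    #include <errno.h>
    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    double checked_log(double x)
    {
        errno = 0;
        feclearexcept(FE_ALL_EXCEPT);
        double r = log(x);
        if ((math_errhandling & MATH_ERRNO) && errno == EDOM)
            fprintf(stderr, "log: domain error (errno)\n");
        if ((math_errhandling & MATH_ERREXCEPT) && fetestexcept(FE_INVALID))
            fprintf(stderr, "log: domain error (FE_INVALID)\n");
        return r;
    }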

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Raising of FP exceptions is orthogonal to the question of one instruction
    vs a library call. If anything, when exceptions are enabled, with a
    single-instruction implementation it is probably easier for the exception
    handler to find the reason and generate useful diagnostics.

    As to what POSIX allows, on the manual page that you quoted I see no
    indication that the implementation is required to let the programmer
    select this or that behavior. I read it as saying the implementation is
    allowed to make the choice fully by itself.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Scott Lurndal on Tue Jan 14 16:50:26 2025
    On 14/01/2025 15:39, Scott Lurndal wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 13/01/2025 23:40, Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing >>>>>>>> direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.


    You know POSIX better than I do, but AFAIK "math_errhandling" is a fixed
    value set by the implementation, usually as a macro. Certainly with a
    quick check with gcc on Linux, I could not set the bits in math_errhandling.

    Yes, the programmer in this case would instruct the compiler what
    the value of math_errhandling should be, e.g. with -ffast-math.

    https://gcc.gnu.org/wiki/FloatingPointMath

    I would say the key flag here is "-fno-math-errno" (which is included in -ffast-math). While I personally think most floating point code could
    be used just as well with "-ffast-math", and it is certainly appropriate
    for my own code, others have significantly different opinions or
    experiences. (That's fair enough.) That one flag simply disables
    setting errno in maths functions that can (reasonably) be implemented
    inline as instructions, without affecting the results of any other
    floating point operations.

    But to my mind, this is /not/ a case of the POSIX programmer making the
    choice - it is an implementation-specific feature. A C compiler might
    choose to always use errno, or never, or have some other control of the
    use of errno. When you write "POSIX leaves it up to the programmer", I
    take that to mean POSIX specifies a function that lets you change the
    value of "math_errhandling". That is quite different from saying "gcc
    has a flag that lets you choose".

    (For my own use, I like the flag - I don't write POSIX code, I have
    never had any use for errno, and I want the compiler to generate as few instructions as it possibly can.)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to Waldek Hebisch on Tue Jan 14 09:40:27 2025
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 1/6/2025 6:11 PM, Waldek Hebisch wrote:
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there were more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked semi
    randomly into Lapack. First routine that I looked at had
    8 arguments, so within your limit. Second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    Large number of arguments is typical in old style Fortran numeric
    code.

    While there has been much discussion down thread relating to Waldek's
    other points, there hasn't been much about these.

    So, some questions. Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of
    CPU cycles? Or do these subroutines consume so many CPU cycles that the
    overhead of the large number of parameters is lost in the noise? Or is
    there some other explanation for Mitch not considering their importance?

    Some comments to this:

    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess. I am actually assuming that passing arguments in memory is
    more expensive than passing them in registers. I don't think that is controversial.


    Of course, if you can do the job with
    smaller number of arguments, then there may be some saving.
    However, large number of arguments is partially to increase
    performance.

    I agree with your example below, which I snipped. My comment was more
    about how the system implements argument passing (i.e. the number of registers used for the purpose) than about source code changes (fewer
    calls with more arguments versus more calls with fewer arguments). Specifically, I was not suggesting changing the source code to reduce
    the number of arguments.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Jan 14 18:03:36 2025
    On Tue, 14 Jan 2025 6:48:45 +0000, Thomas Koenig wrote:

    I wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    Has Lapack (and the other old style Fortran numeric
    code that Waldek mentioned) lost its/their importance as a major user of >>> CPU cycles?

    It's less than it used to be in the days when supercomputers
    roamed the computer centers, but for these applications where
    it matters, it can be significant.

    Or do these subroutines consume so many CPU cycles that the
    overhead of the large number of parameters is lost in the noise?

    If you have many small matrices to multiply, startup overhead
    can be quite significant. Not on a 2000*2000 matrix, though.

    Or is
    there some other explanation for Mitch not considering their importance?

    I think eight arguments, passed by reference in registers, is not
    too bad.

    .... when the rest can be passed on the stack.

    And those passed in registers can be stored into memory adjacent
    to the memory arguments easily.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Michael S on Tue Jan 14 18:02:29 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 14 Jan 2025 14:22:19 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:
    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Raising of FP exceptions is orthogonal to question of one instruction
    vs library call. If anything, when exceptions are enabled, with single-instruction implementation it is probably easier for exception
    handler to find the reason and generate useful diagnostics.

    It seems to me that "raise an exception" is in the IEEE 754 sense (by
    default set a sticky flag in an internal register), not in the C sense
    of raising a signal. AFAIK you can tell the system to produce a
    signal for some exceptions, but the default on Linux is not to.

    As to what POSIX allows, on the manual page that you quoted I see no indication that implementation is required to give to programmer to
    select this or that behavior. I read it like implementation is allowed
    to make the choice fully by itself.

    And if it is friendly, it can give the programmer a compiler option to
    select between the variants.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Stephen Fuld on Tue Jan 14 19:18:27 2025
    Stephen Fuld wrote:
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess.  I am actually assuming that passing arguments in memory is
    more expensive than passing them in registers.  I don't think that is controversial.

    Usually true, except for recursive functions where you have to store
    most stuff on the stack anyway, so going directly there can sometimes
    generate more compact code.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 14 18:19:12 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Jan 14 18:15:06 2025
    On Tue, 14 Jan 2025 14:22:19 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 22:40:02 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 21:53:55 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 13 Jan 2025 18:02:10 +0000, Thomas Koenig wrote:

    MitchAlsup1 <mitchalsup@aol.com> schrieb:

    errno is an atrocity all by itself; single handedly preventing >>>>>>>> direct use of SIN(), COS(), TAN(), ATAN(), exp(), ln(), pow()
    as instructions.

    Fortunately, the C standard does not require errno to be set
    for these functions. Apple, for example, does not do so.

    Nor will I.

    POSIX does, however, require errno to be set conditionally
    based on an application global variable 'math_errhandling'.

    The functions mentioned have the property of taking x as
    any IEEE 754 number (including NaNs, infinities, denorms)
    and produce a IEEE 754 number {NaNs, infinities, norms,
    denorms}.

    But if POSIX wants to spend as many cycles setting errno
    as performing the calculation, that is for POSIX to decide.

    POSIX leaves it up to the programmer to decide. If the
    programmer desires EDOM or ERANGE, they set the
    appropriate bit in math_errhandling before calling the
    sin et alia functions.

    So, now the subroutine, which computes all work in a single
    instruction, has to check a global variable to decide if it
    has to LD in TLS pointer just to set errno ?!!?

    The subroutine clearly does more than "do all the work in a single instruction".

    All of the work of computing sin(x) is performed in a single
    instruction.

    So we have a subroutine that looks like::

    double library_sin( double x )
    {
        // the work
        double r = My_66000_sin(x);   // along with setting the flag bits

        // the overhead
        if( FP_Classify( x, NaN | INFINITY | ... ) )
        {
            errno_p tls = TLS();
            if( FP_Classify( x, NaN ) )      tls->errno = errno_NaN;
            if( FP_Classify( x, INFINITY ) ) tls->errno = errno_infinity;
            ...
        }
        return r;
    }

    How does your instruction support all the functionality
    required by the POSIX specification for the sin(3) library function?

    Except for the setting of errno.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Thomas Koenig on Tue Jan 14 19:39:06 2025
    Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    Without exposing any internal discussions, it should be obvious to
    anyone "versed in the field" that the ieee754 standard has some warts
    and mistakes. It has been possible to correct very few of them since 1978.

    OTOH, Kahan & co did an amazingly good job to start with, the fact that
    they didn't really consider the needs of massively parallel
    implementations 40-50 years later cannot be blamed on them.

    It is possible that one or two of the grandfather clauses in 754 can be
    removed in the future, simply because the architectures that made those exceptional choices are going away permanently.

    I do not see any way to support things like "trap and rescale" as a way
    to handle exponent overruns, even though that was a neat idea back then.

    It is much more likely that we will simply switch to quad/f128 (or even arbitrary precision) for those few computations that could need it.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Thomas Koenig on Tue Jan 14 19:08:39 2025
    On Tue, 14 Jan 2025 18:19:12 +0000, Thomas Koenig wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    When the numeric calculation of sin() takes 150 cycles, the over-
    head matters little.

    When the numeric calculation of sin() takes 15 cycles, the over-
    head is more noticeable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Jan 14 19:24:16 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    I spent several years on one of those committees[*] in the 90s. There were math and IEEE
    FP experts who very carefully considered all the consequences of changes
    to the math interfaces.

    [*] X/Open -> The Open Group

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Terje Mathisen on Tue Jan 14 19:14:13 2025
    On Tue, 14 Jan 2025 18:39:06 +0000, Terje Mathisen wrote:

    Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    Without exposing any internal discussions, it should be obvious to
    anyone "versed in the field" that the ieee754 standard has some warts
    and mistakes. It has been possible to correct very few of them since
    1978.

    OTOH, Kahan & co did an amazingly good job to start with, the fact that
    they didn't really consider the needs of massively parallel
    implementations 40-50 years later cannot be blamed on them.

    CDC STAR and CRAY-1 were showing the massive parallelism well before
    754 ever sat down.

    In addition, the fallacy of exception and repair was also known to be
    a failure well before 754 had their first meeting.

    It is possible that one or two of the grandfather clauses in 754 can be removed in the future, simply because the architectures that made those exceptional choices are going away permanently.

    I do not see any way to support things like "trap and rescale" as a way
    to handle exponent overruns, even though that was a neat idea back then.

    And just how many EVER used said feature ???

    It is much more likely that we will simply switch to quad/f128 (or even arbitrary precision) for those few computations that could need it.

    Terje

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Terje Mathisen on Tue Jan 14 20:01:14 2025
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    Without exposing any internal discussions, it should be obvious to
    anyone "versed in the field" that the ieee754 standard has some warts
    and mistakes. It has been possible to correct very few of them since 1978.

    I'm not throwing shade on the IEEE committee; they did quite a good
    job, considering what they did and did not know.

    What I was criticising was the committee(s) which made errno handling
    for functions like sin() and cos() mandatory, and put activating
    it in a global flag.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Thomas Koenig on Tue Jan 14 23:13:40 2025
    On Tue, 14 Jan 2025 20:31:59 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    I spent several years on one of those committees[*] in the 90s.
    There were math and IEEE FP experts who very carefully considered
    all the consequences of changes to the math interfaces.

    Putting in mandatory errno handling for transcendental intrinsics,
    and making this dependent on a global flag, was a huge mistake.


    Except that they didn't make this particular mistake.
    Please read the other messages of the sub-thread.

    Either the people on that particular committee didn't consider
    the consequences, or they (second option to the one above) didn't
    understand the consequences of what they were doing. Vector computers
    had already been in service for a decade when POSIX was released,
    and a question "Would it run well on a Cray" would have answered
    itself.


    The Cray-1 was still small enough for errno-based handling of errors in
    trigs not to be a serious obstacle. That is, not the Cray-1 itself, but an
    imaginary machine with an organization similar to the Cray, but with more
    consistent FP arithmetic.

    OTOH, they can be excused if they thought that C should not
    be used for serious numerical work, and would not be. People had
    FORTRAN for that...

    I would guess that today the majority of numerical work is done from Python
    by calling libraries. Libraries tend to be written in highly
    non-portable dialects of the C language and sometimes in C++. I don't
    expect that a measurable amount of Fortran is used in the creation of the
    libraries.
    Now, whether one considers the overwhelming majority of today's numerical
    work "serious" is a separate question. But it certainly is a very
    serious business.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Tue Jan 14 20:31:59 2025
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    I spent several years on one of those committees[*] in the 90s. There were math and IEEE
    FP experts who very carefully considered all the consequences of changes
    to the math interfaces.

    Putting in mandatory errno handling for transcendental intrinsics,
    and making this dependent on a global flag, was a huge mistake.

    Either the people on that particular committee didn't consider
    the consequences, or they (second option to the one above) didn't
    understand the consequences of what they were doing. Vector computers
    had already been in service for a decade when POSIX was released,
    and a question "Would it run well on a Cray" would have answered
    itself.

    OTOH, they can be excused if they thought that C should not
    be used for serious numerical work, and would not be. People had
    FORTRAN for that...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Tue Jan 14 23:48:19 2025
    On Tue, 14 Jan 2025 19:18:27 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Stephen Fuld wrote:
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess.  I am actually assuming that passing arguments in memory
    is more expensive than passing them in registers.  I don't think
    that is controversial.

    Usually true, except for recursive functions where you have to store
    most stuff on the stack anyway, so going directly there can sometimes generate more compact code.

    Terje


    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time. Maybe, except for
    functions with 0-4 parameters.
    Do common Fortran compilers even bother with passing in registers?
    It would require replacing the natural by-reference "pointer in
    register points to value in memory" calling sequence with something like
    copy-in/copy-out, right?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Thomas Koenig on Tue Jan 14 22:05:20 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/sin.html

    Clearly there are programmers who wish to be able to detect
    certain exceptions, and POSIX allows programmers to
    select that behavior.

    Clearly, there is a committee which wanted people to be able
    to detect certain error conditions on a fine-grained level.
    One assumes that they did not consider the consequences.

    Without exposing any internal discussions, it should be obvious to
    anyone "versed in the field" that the ieee754 standard has some warts
    and mistakes. It has been possible to correct very few of them since 1978.

    I'm not throwing shade on the IEEE committe, they did quite a good
    job, considering what they did and did not know.

    What I was criticising was the comittee(s) which made errno handling
    for functions like sin() and cos() mandatory, and put activating
    it in a globel flag.

    It's not mandatory. It's listed as an optional extension, and
    even when implemented, it's opt-in at compile time.

    "The functionality described is optional. The functionality
    described is mandated by the ISO C standard only for implementations
    that define __STDC_IEC_559__."

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Terje Mathisen on Wed Jan 15 00:09:58 2025
    On Tue, 14 Jan 2025 19:39:06 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:


    It is much more likely that we will simply switch to quad/f128 (or
    even arbitrary precision) for those few computations that could need
    it.

    Terje



    Yesterday/today I had one of those computations that can benefit from quad
    and maybe from octuple precision:
    the design of an equiripple symmetric FIR filter with ~2000 coefficients
    (more commonly called taps) using the Parks-McClellan method. It is
    implemented by a Matlab/Octave function that traditionally was called remez
    and is now called firpm. I suppose the new name was invented because the
    algorithm is only similar to Remez exchange, but differs in details.

    Octave failed to do it, citing the limited precision of arithmetic as the
    reason.

    The Matlab implementation is better, and it was able to create a filter
    with ~1800 taps, which happened to be sufficient for my needs today.
    But even Matlab was unable to cope with 2000 taps.

    If I had more time, I'd try to implement the Parks-McClellan algorithm
    myself, to see the bottlenecks and see whether higher precision helps a
    lot, or just a little. Unfortunately, right now I am too busy with work.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Tue Jan 14 23:27:22 2025
    On Tue, 14 Jan 2025 21:48:19 +0000, Michael S wrote:

    On Tue, 14 Jan 2025 19:18:27 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Stephen Fuld wrote:
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess.  I am actually assuming that passing arguments in memory
    is more expensive than passing them in registers.  I don't think
    that is controversial.

    Usually true, except for recursive functions where you have to store
    most stuff on the stack anyway, so going directly there can sometimes
    generate more compact code.

    Terje


    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time.

    Pass by COMMON block was even faster.

    May be, except for
    functions with 0-4 parameters.

    Do common Fortran compilers even bother with passing in registers?

    Fortran compilers are given an ABI (leaning towards C, C++) and
    are required to "do something reasonable" in mapping Fortran
    conventions into C conventions. C subroutines on the called
    side, then, have to have a data structure identical to what
    Fortran compiler would have produced (Dope Vector). C callers
    will have to use those kinds of structures to successfully
    call Fortran entry points.

    It would require replacement of natural by-reference "pointer in
    register points to value in memory" calling sequence to something like copy-in/copy-out, right?

    No, Fortran will pass dope vectors to called subroutines. The
    called subroutine needs to understand the dope vector.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Keith Thompson on Tue Jan 14 23:39:37 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    [...]
    What I was criticising was the committee(s) which made errno handling
    for functions like sin() and cos() mandatory, and put activating
    it in a global flag.

    It's not mandatory. It's listed as an optional extension, and
    even when implemented, it's opt-in at compile time.

    "The functionality described is optional. The functionality
    described is mandated by the ISO C standard only for implementations
    that define __STDC_IEC_559__."

    I can't find that anywhere in ISO C or POSIX. What exactly are you
    quoting? ISO C doesn't tie math_errhandling to __STDC_IEC_559__.

    https://pubs.opengroup.org/onlinepubs/9799919799/help/codes.html#MX

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Keith Thompson on Tue Jan 14 23:40:43 2025
    Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    There's no requirement in ISO C or POSIX for an implementation to let
    users affect the value of math_errhandling, at compile time or
    otherwise. (And POSIX isn't directly relevant; this is all defined by
    ISO C. There might be something in POSIX that goes beyond the ISO C
    requirements.)

    gcc has "-f[no-]fast-math" and "-f[no-]math-errno" options that can
    affect the value of math_errhandling.

    Would not that qualify as at "compile time"?

    That's certainly what I meant.
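
    As a minimal, hedged C sketch of what the above means in practice (the
    values printed depend on the implementation and on flags such as gcc's
    -fno-math-errno):

    #include <errno.h>
    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        errno = 0;
        feclearexcept(FE_ALL_EXCEPT);

        volatile double x = -1.0;   /* volatile so the call isn't folded away */
        double r = sqrt(x);         /* domain error */

        /* math_errhandling says which error channels the library uses */
        if (math_errhandling & MATH_ERRNO)
            printf("errno == EDOM: %d\n", errno == EDOM);
        if (math_errhandling & MATH_ERREXCEPT)
            printf("FE_INVALID raised: %d\n", fetestexcept(FE_INVALID) != 0);

        printf("sqrt(-1.0) = %f\n", r);
        return 0;
    }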

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Michael S on Wed Jan 15 00:47:59 2025
    On Tue, 14 Jan 2025 21:13:40 +0000, Michael S wrote:


    I would guess that today the majority of numerical work is done from
    Python by calling libraries.

    New software, but many of us are still using FEM code from the 1970s.

    That is the problem with floating point software--once developed
    you can continue using it forever.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Wed Jan 15 03:31:47 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time.

    Pass by COMMON block was even faster.

    Sometimes. On machines that don't have direct addressing, such as S/360,
    the code needs to load a pointer to the data either way so it's a wash.

    Even when you do have direct addressing, if code is compiled to be
    position independent, the common block wouldn't be in the same module
    as the code that references it so it still needs to load a pointer
    from the GOT or whatever its equivalent is.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Wed Jan 15 16:50:58 2025
    On Wed, 15 Jan 2025 3:31:47 +0000, John Levine wrote:

    According to MitchAlsup1 <mitchalsup@aol.com>:
    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time.

    Pass by COMMON block was even faster.

    Sometimes. On machines that don't have direct addressing, such as
    S/360,
    the code needs to load a pointer to the data either way so it's a wash.

    Even when you do have direct addressing, if code is compiled to be
    position independent, the common block wouldn't be in the same module
    as the code that references it so it still needs to load a pointer
    from the GOT or whatever its equivalent is.

    Pass by COMMON block allows one to pass hundreds of data values in a
    single call.

    You are treating the common block as if it had but one data container.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Wed Jan 15 22:03:54 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    On Wed, 15 Jan 2025 3:31:47 +0000, John Levine wrote:

    According to MitchAlsup1 <mitchalsup@aol.com>:
    Pass by COMMON block was even faster.

    Sometimes. On machines that don't have direct addressing, such as
    S/360,
    the code needs to load a pointer to the data either way so it's a wash.

    Even when you do have direct addressing, if code is compiled to be
    position independent, the common block wouldn't be in the same module
    as the code that references it so it still needs to load a pointer
    from the GOT or whatever its equivalent is.

    Pass by COMMON block allows one to pass hundreds of data values in a
    single call.

    You are treating the common block as if it had but one data container.

    If I were that kind of programmer, I could use EQUIVALENCE to glue a
    bunch of local variables and arrays together and pass that as a
    subroutine argument. Also remember that on machines without direct
    addressing there's extra code if the size of a block exceeds the offset
    field of an instruction, 12 bits on S/360 and usually 16 on z.

    It's really a matter of taste and programming style more than efficiency.

    R's,
    John

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Thu Jan 16 03:02:44 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Tue, 14 Jan 2025 21:48:19 +0000, Michael S wrote:

    On Tue, 14 Jan 2025 19:18:27 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Stephen Fuld wrote:
    On 1/12/2025 5:20 PM, Waldek Hebisch wrote:
    You are implicitly assuming that passing a large number of
    arguments is expensive.

    I guess.  I am actually assuming that passing arguments in memory
    is more expensive than passing them in registers.  I don't think
    that is controversial.

    Usually true, except for recursive functions where you have to store
    most stuff on the stack anyway, so going directly there can sometimes
    generate more compact code.

    Terje


    I would think that for Fortran (==everything passed by reference)
    memory would beat registers most of the time.

    One still needs to pass the _values_ of addresses. Doing it in
    registers (assuming that enough are available) is likely to
    be more efficient than storing addresses in memory and
    re-fetching them later. The _relative_ difference between
    passing in registers and passing in memory is smaller, as
    there are memory references to access arguments, but registers
    are likely to be a plus (unless there is excessive spilling and
    the called routine needs to write addresses to memory and load
    them later).

    Pass by COMMON block was even faster.

    I do not think so. In LAPACK-like cases there are array arguments.
    A normal calling convention needs to store and later read parameters
    and pass addresses. COMMON would force copying of entire arrays,
    which is much less efficient than handling parameters.

    In a complicated program there could be many COMMON blocks, leading
    to worse locality than stack use (not relevant for a cacheless
    machine or one with a very big cache, but it could make a difference
    for machines with small caches).

    It would require replacement of natural by-reference "pointer in
    register points to value in memory" calling sequence to something like
    copy-in/copy-out, right?

    No, Fortran will pass dope vectors to called subroutines. The
    called subroutine needs to understand the dope vector.

    I would not say this. AFAIK in Fortran 77 the caller passes enough
    information so that the called routine can construct its own dope
    vector (if desired). IIUC that is very similar to VMTs (variably
    modified types) in C99.

    I think PL/I, Ada, Extended Pascal and probably Fortran 90 use
    dope vectors.

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Waldek Hebisch on Thu Jan 16 15:08:37 2025
    On Thu, 16 Jan 2025 3:02:44 +0000, Waldek Hebisch wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:


    Pass by COMMON block was even faster.

    I do not think so. I LAPACK-like cases there are array arguments.
    Normal calling convention needs to store and later read parameters
    and pass addresses. COMMON would force copying of entire arrays,
    much less efficienct than handling parameters.

    SUBROUTINE FOO
    COMMON /ALPHA/ i,j,k,a(100),b(100),c(100,100)

    See: no arguments, everything passed directly via the common block,
    no copying of data, no dope vectors needed.
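
    A rough C analogue of that style, as a sketch only (types and layout
    are illustrative, not how any particular Fortran compiler maps a
    common block): both caller and callee name the same statically
    allocated block, so the call itself passes nothing.

    struct alpha_common {
        int    i, j, k;
        double a[100], b[100], c[100][100];
    };

    struct alpha_common alpha;       /* one shared definition, like COMMON /ALPHA/ */

    void foo(void)                   /* SUBROUTINE FOO: no parameters */
    {
        for (int n = 0; n < alpha.k; n++)
            alpha.b[n] = 2.0 * alpha.a[n];
    }

    void caller(void)
    {
        alpha.k = 100;
        for (int n = 0; n < alpha.k; n++)
            alpha.a[n] = n;
        foo();                       /* no arguments copied, no dope vectors */
    }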

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Thu Jan 16 16:24:38 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On Thu, 16 Jan 2025 3:02:44 +0000, Waldek Hebisch wrote:

    MitchAlsup1 <mitchalsup@aol.com> wrote:


    Pass by COMMON block was even faster.

    I do not think so. I LAPACK-like cases there are array arguments.
    Normal calling convention needs to store and later read parameters
    and pass addresses. COMMON would force copying of entire arrays,
    much less efficienct than handling parameters.

    SUBROUTINE FOO
    COMMON /ALPHA/ i,j,k,a(100),b(100),c(100,100)

    See: no arguments, everything passed directly via the common block,
    no copying of data, no dope vectors needed.

    No copy only if there is a single set of arguments. If there are
    different arguments, then one needs to pass them, that is, copy
    them.
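
    In terms of the hypothetical C analogue sketched earlier, the caveat
    looks like this: calling FOO on a *different* array means copying it
    into the shared block first, which is exactly the copy that per-call
    argument passing avoids.

    extern struct alpha_common alpha;   /* the shared block from the earlier sketch */
    void foo(void);

    void call_on(const double *src, int n)
    {
        alpha.k = n;
        for (int m = 0; m < n; m++)
            alpha.a[m] = src[m];        /* copy-in before the "call" */
        foo();
    }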

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to George Neuner on Mon Jan 27 17:09:59 2025
    George Neuner <gneuner2@comcast.net> writes:

    On Mon, 6 Jan 2025 20:10:13 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting. These are FP language
    transformations that are analogous to, but potentially more complex
    than, the rewriting of object methods and their call sites to pass the current object in an OO language.

    [The difference between closure conversion and lambda lifting is the
    scope of the transformation: conversion limits code transformations to
    within the defining call chain, whereas lifting pulls the closure to
    top level making it (at least potentially) globally available.]

    In either case the original function is rewritten such that non-local variables can be passed as parameters. The function's code must be
    altered to access the non-locals - either directly as explicit
    individual parameters, or by indexing from a pointer to an environment
    data structure.

    While in a simple case this could look exactly like the OO method transformation, recall that a general closure may require access to
    non-local variables spread through multiple environments. Even if
    whole environments are passed via single pointers, there still may
    need to be multiple parameters added.

    Isn't it the case that access to all of the enclosing environments
    can be provided by passing a single pointer? I'm pretty sure it
    is.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to tr.17687@z991.linuxsc.com on Tue Jan 28 22:53:00 2025
    On Mon, 27 Jan 2025 17:09:59 -0800, Tim Rentsch
    <tr.17687@z991.linuxsc.com> wrote:

    George Neuner <gneuner2@comcast.net> writes:

    On Mon, 6 Jan 2025 20:10:13 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    Large numbers of parameters may be generated either by closure
    conversion or by lambda lifting. These are FP language
    transformations that are analogous to, but potentially more complex
    than, the rewriting of object methods and their call sites to pass the
    current object in an OO language.

    [The difference between closure conversion and lambda lifting is the
    scope of the transformation: conversion limits code transformations to
    within the defining call chain, whereas lifting pulls the closure to
    top level making it (at least potentially) globally available.]

    In either case the original function is rewritten such that non-local
    variables can be passed as parameters. The function's code must be
    altered to access the non-locals - either directly as explicit
    individual parameters, or by indexing from a pointer to an environment
    data structure.

    While in a simple case this could look exactly like the OO method
    transformation, recall that a general closure may require access to
    non-local variables spread through multiple environments. Even if
    whole environments are passed via single pointers, there still may
    need to be multiple parameters added.

    Isn't it the case that access to all of the enclosing environments
    can be provided by passing a single pointer? I'm pretty sure it
    is.

    Certainly, if the enclosing environments somehow are chained together.
    In real code though, in many instances such a chain will not already
    exist when the closure is constructed. The compiler would have to
    install pointers to the needed environments (or, alternatively,
    pointers directly to the needed values) into the new closure's
    immediate environment.
    [essentially this creates a private "display" for the closure.]

    Completely doable: it is simply that, if there are enough registers,
    passing the pointers as parameters will tend to be more performant.
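
    A small C sketch of the two shapes discussed above, with made-up names
    (illustrative only): (1) a single environment pointer with environments
    chained through a static link, and (2) a flattened closure whose
    environment holds just the needed pointers, i.e. a private display.

    #include <stdio.h>

    /* (1) one pointer, environments chained through a parent link */
    struct env {
        struct env *parent;   /* enclosing environment */
        int         x;        /* one captured variable per level, for brevity */
    };

    static int chained_body(struct env *e)    /* closure-converted function */
    {
        return e->x + e->parent->x;           /* outer variable via the static link */
    }

    /* (2) flattened closure: the compiler installs the needed pointers directly */
    struct flat_closure {
        int *inner_x;
        int *outer_x;
    };

    static int flat_body(struct flat_closure *c)
    {
        return *c->inner_x + *c->outer_x;     /* no chain walking */
    }

    int main(void)
    {
        struct env outer = { NULL, 10 };
        struct env inner = { &outer, 32 };
        printf("%d\n", chained_body(&inner));     /* 42 */

        struct flat_closure c = { &inner.x, &outer.x };
        printf("%d\n", flat_body(&c));            /* 42 */
        return 0;
    }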

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)