Here are the text size numbers:
Debian numbers from <2024Jan4.101941@mips.complang.tuwien.ac.at>:
bash grep gzip
595204 107636 46744 armhf
599832 101102 46898 riscv64
796501 144926 57729 amd64
829776 134784 56868 arm64
853892 152068 61124 i386
891128 158544 68500 armel
892688 168816 64664 s390x
1020720 170736 71088 mips64el
1168104 194900 83332 ppc64el
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
1030288 150686 79852 31492 mvme68k
779393 155764 75795 31813 vax
1302254 171505 83249 35085 amd64
1229032 178332 89180 36876 evbarm-aarch64
1539052 179055 82280 34717 amd64-daily
1374961 184458 96971 37218 i386
1247476 185792 96728 42028 evbarm-earmv7hf
1333952 187452 96328 39472 sparc
1586608 204032 106896 45408 evbppc
1536144 204320 106768 43232 hppa
1397024 216832 109792 48512 sparc64
1538536 222336 107776 44912 evbmips-mips64eb
1623952 243008 122096 50640 evbmips-mipseb
1689920 251376 120672 51168 alpha
2324752 2259984 1378000 ia64
- anton
On Tue, 17 Jun 2025 14:17:42 +0000, Anton Ertl wrote:
Here are the text size numbers:
Debian numbers from <2024Jan4.101941@mips.complang.tuwien.ac.at>:
Can you get numbers for RISC-V without compression, for the above
and for the below?
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
1030288 150686 79852 31492 mvme68k
779393 155764 75795 31813 vax
1302254 171505 83249 35085 amd64
---------
1229032 178332 89180 36876 evbarm-aarch64
1539052 179055 82280 34717 amd64-daily
1374961 184458 96971 37218 i386
1247476 185792 96728 42028 evbarm-earmv7hf
1333952 187452 96328 39472 sparc
1586608 204032 106896 45408 evbppc
1536144 204320 106768 43232 hppa
1397024 216832 109792 48512 sparc64
1538536 222336 107776 44912 evbmips-mips64eb
1623952 243008 122096 50640 evbmips-mipseb
1689920 251376 120672 51168 alpha
This appears to be the region of standard RISC architectures,
about 1.5× VAX.
---------
2324752 2259984 1378000 ia64^
Is there one zero too many on the last entry?
I would be interested in what happens to the code size if "-mcmodel=large"
is used
EricP <ThatWouldBeTelling@thevillage.com> writes:
I would be interested in what happens to the code size if "-mcmodel=large" is used
No code is generated by gcc-10.3.1. Instead, I get an error message:
gcc: error: unrecognized argument in option ‘-mcmodel=large’
gcc: note: valid arguments to ‘-mcmodel=’ are: medany medlow
I'll show numbers for medany in a different posting.
- anton
Anton Ertl wrote:
I'll show numbers for medany in a different posting.
- anton
Ok, thanks.
So IIUC for RV64 "-mcmodel=medany" is 64-bit data size and
a 32-bit program space inside a 64-bit address space.
And programs can be statically linked only.
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
which would blow up their code size and tank their performance.
And that's not a good look for them.
Documentation does say that AArch64 supports it (note that =small is the default):
https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/AArch64-Options.html#AArch64-Options
AArch64 Options
-mcmodel=tiny
Generate code for the tiny code model. The program and its statically
defined symbols must be within 1MB of each other. Programs can be
statically or dynamically linked.
-mcmodel=small
Generate code for the small code model. The program and its statically
defined symbols must be within 4GB of each other. Programs can be
statically or dynamically linked.
This is the default code model.
-mcmodel=large
Generate code for the large code model. This makes no assumptions about
addresses and sizes of sections. Programs can be statically linked only.
For x86-64, -mcmodel=large is also supported (=small is the default).
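For readers who want to reproduce the size effect, a minimal experiment (a sketch; the file and symbol names are illustrative, and it assumes a gcc and binutils for the target):

/* size_probe.c: compare text size across code models, e.g.
     gcc -O2 -mcmodel=small -c size_probe.c   (the default)
     gcc -O2 -mcmodel=large -c size_probe.c
   then compare the "text" column of:  size size_probe.o
   Under =large the compiler must materialize full-width addresses
   for 'table' instead of assuming it lies within reach of the
   program counter. */
extern long table[1000000];

long sum(void)
{
    long s = 0;
    for (int i = 0; i < 1000000; i++)
        s += table[i];
    return s;
}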
EricP <ThatWouldBeTelling@thevillage.com> writes:
So IIUC for RV64 "-mcmodel=medany" is 64-bit data size and
a 32-bit program space inside a 64-bit address space.
And programs can be statically linked only.
medany is 2GB for "a program and its statically defined symbols", and
these 2GB can be anywhere in address space. Dynamically linked
symbols can be further away AFAICT. So unless you do binaries that
are larger than 2GB, this does not appear to be a restriction.
medlow means that the binary must reside in the lower 2GB of the
address space (actually between -2GB and 2GB, but at least on 64-bit
Linux user programs cannot reside at negative addresses). Again,
dynamically linked symbols can be further away. But if both the
executable and the shared libraries are compiled for the medlow model,
they must all fit in the lower 2GB. Probably not a big problem,
either, except maybe for the largest C++ projects.
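For concreteness, a sketch of what the two models cost per symbol reference (the RISC-V assembly in the comments uses the standard %pcrel_hi/%pcrel_lo and %hi/%lo relocation operators; the symbol is illustrative):

extern long g;   /* within +-2GB of the code (medany),
                    or within the low 2GB (medlow) */
long get_g(void)
{
    return g;
    /* medany, PC-relative pair:
         1: auipc a5, %pcrel_hi(g)
            ld    a0, %pcrel_lo(1b)(a5)
       medlow, absolute pair:
            lui   a5, %hi(g)
            ld    a0, %lo(g)(a5)
       Either way two instructions, no load from a constant table. */
}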
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
How do you compute that? When I looked at the code produced for
Alpha, I got the impression that they wanted to support arbitrarily
large programs and they generated such code by default, but IIRC the
typical code for loading an absolute address was by loading it from the
global table of the current function; so it requires typically 1 load
(and a 64-bit value in the global table). It also requires setting up
the global pointer on every function entry and after every call, but
that can be amortized over several accesses to the global table.
Interestingly, when I look at the "DEC Alpha" options, there is
-msmall-data (64KB global tables are enough) and -mlarge-data (data
segment <2GB). There is also -msmall-code (code <4MB) and
-mlarge-code. The gcc manual says about these:
| When '-msmall-data' is used,
| the compiler can assume that all local symbols share the same '$gp'
| value, and thus reduce the number of instructions required for a
| function call from 4 to 1.
I think that these four are:
    ldq $27, ...($gp)     # load target address
    jsr $26, ($27)        # call target
    ldq $gp, offset($26)  # restore gp
and at the target:
target:
    ldq $gp, offset($27)  # load gp
whereas in a small/small variant it would just be
    bsr $26, target
which would blow up their code size and tank their performance.
And that's not a good look for them.
Why burden all programs with the costs of large programs the way it
is done by default on Alpha?
- anton
Anton Ertl wrote:
Why burden all programs with the costs of large programs the way it
is done by default on Alpha?
I'm not saying there shouldn't be optimizations for smaller sizes.
I'm pointing to the fact that to actually USE the 64-bit address space
there is a large increase in code size and execute cost,
and asking if that had to be so.
For example, for Alpha to load a 64-bit constant requires 6
instructions, 24 bytes. That sequence is too large, so they are pretty
much forced to use an extra LDQ to pull the offset from the constant
table located just prior to the routine entry point, which requires an
extra BAL to copy the RIP into a register as a base.
The LDQ touches the same address space as the code but now as data,
so it has to load the D-TLB with an entry redundant with the I-TLB,
and bring in a data cache line with the constants.
And after the constant is loaded it must be manually added to the base
because there is no LD/ST combined with a scaled index.
Furthermore, the actual load or store of the target value is serially
dependent on the LDQ offset and the ADD. Back when the load-to-use
latency for a cache hit was 1 clock that might look ok, but now that
it is 3 or 4 clocks it is a serious penalty.
By making it a priority for relatively cheap access to the full 64-bit
address space during an ISA design, what alternatives might have
minimized its extra cost?
First, I have two designs which load a 64-bit constant in 3 32-bit
fixed-length instructions: the prefix CONST approach, and another
using 3 opcodes that requires a temp register. In both cases the
constants can easily be fused in Decode and have zero execute cost.
My preference is for the prefix CONST as it can be used with many
other instructions besides LD and ST and doesn't require an extra
temp register.
Second, if the base register of LD, ST, or LDA (Load Address) is R31,
the zero register, then it means use the PC as base, and the extra BAL
is almost always unnecessary.
Third, recognize that whether the const offset is loaded by instructions
or from a constant table by an extra LDQ, it will be adding that offset
to the base a lot so have LD, ST, LDA with a scaled-index address mode
and eliminate the extra ADD. The prefix-CONST approach doesn't require
this because it fuses the immediate directly onto its consumer in
Decode. But have the scaled index address mode anyway.
Fourth, have a compacting linker so the programmer doesn't need to
specify a code model. The compiler emits a worst-case sequence and
the linker gets rid of all the ones it doesn't need.
So there are three alternatives to accessing full 64-bit addresses.
The CONST prefix requires 3 instructions to access the target data
with no temp register and zero execute cost if fused in Decode.
CONST value
CONST value
LDx rDst, [r31+offset]
The separate const instruction approach requires 4 instructions,
a temp register, and a scaled-index address mode,
but has no execute cost if fused in Decode
CONST1 rTemp=value
CONST2 rTemp=value
CONST2 rTemp=value
LDx rDst, [r31+rTemp<<0]
The load from constant table requires 2 instructions,
requires a temp register and scaled index address mode,
eliminates the BAL and ADD, but takes an extra data memory access.
LDQ rTemp, [r31+offset]
LDx rDst, [r31+rTemp<<0]
Irrespective of which way one chooses, the compacting linker gets
rid of any unneeded excess.
On Sun, 22 Jun 2025 14:05:38 +0000, EricP wrote:
Irrespective of which way one chooses, the compacting linker gets
rid of any unneeded excess.
Consider::
extern int64_t a,b,c,d;
and some code:
{ d = a+b+c; }
The compiler has to assume that a is in a different module than b or
c or d; and has to generate::
mitchalsup@aol.com (MitchAlsup1) writes:
Why does the compiler need to assume anything?
It simply issues
Load Register Ra from "a symbol table reference for a"
Load register Rb from "a symbol table reference for b"
etc.
The linker determines that the address refers to a
symbol in a shared object (or a different object file
included in the link) and generates the appropriate
code (GOT reference, PC-relative or absolute as
necessary).
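A sketch of how that plays out for Mitch's example (the relocation and relaxation shown in the comments are the usual x86-64 ELF ones; other targets differ in detail):

#include <stdint.h>

extern int64_t a, b, c, d;   /* definitions may be in any module */

void sum(void)
{
    d = a + b + c;
    /* For each symbol the compiler emits one relocatable access, e.g.
         movq a@GOTPCREL(%rip), %rax    # R_X86_64_REX_GOTPCRELX
         movq (%rax), %rax
       If the linker resolves the symbol within the link unit, it may
       relax the GOT load into a direct  leaq a(%rip), %rax  - so the
       compiler never had to assume where 'a' lives. */
}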
On Sun, 22 Jun 2025 20:26:41 +0000, Scott Lurndal wrote:
The linker determines that the address refers to a
symbol in a shared object (or a different object file
included in the link) and generates the appropriate
code (GOT reference, PC-relative or absolute as
necessary).
The externs are in a dynamically loaded module, and the .text
section/segment is PIC. So, ld.so is not allowed to write over
the current displacement.
mitchalsup@aol.com (MitchAlsup1) writes:
The externs are in a dynamically loaded module, and the .text
section/segment is PIC. So, ld.so is not allowed to write over
the current displacement.
The static linker does the code transformation, not the
run-time dynamic linker (which just updates the PLT/GOT).
Here is an illustrative example:
This C++ code invokes the 'dlsym' function, which is hosted
in a dynamically linked shared object (libdl.so):
sym = (get_dlp_t)dlsym(handle, "get_dlp");
if (sym == NULL) {
lp->log("Invalid DLP shared object format: %s\n", dlerror());
unregister_handle(channel);
dlclose(handle);
return 1;
}
===========================================
g++ generates this assembler code:
...
movq %rax, 784(%r13,%rbx,8)
.L14:
.LBE39:
.LBE38:
.loc 2 118 0
movl $.LC7, %esi
movq %r14, %rdi
call dlsym
.LVL19:
.loc 2 119 0
testq %rax, %rax
je .L24
.....
===========================================
The linker (ld command) generated the following trampoline and
corresponding trampoline invocation.
000000000040ccc0 <dlsym@plt>:
40ccc0: ff 25 12 56 23 00 jmpq *0x235612(%rip) # 6422d8 <_GLOBAL_OFFSET_TABLE_+0x2d8>
40ccc6: 68 58 00 00 00 pushq $0x58
40cccb: e9 60 fa ff ff jmpq 40c730 <_init+0x20>
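(For readers following the dump: the first jmpq goes through a GOT slot that initially points back at the following pushq; that pushes the relocation index 0x58 and jumps to the common stub at _init+0x20, which enters the dynamic linker's resolver. The resolver patches the GOT slot, so every later call jumps straight to the real dlsym.)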
....
412686: 0f 84 89 01 00 00 je 412815 <c_mp::channel(int, char const**, c_logger*)+0x255>
41268c: 48 83 fb 63 cmp $0x63,%rbx
412690: 77 08 ja 41269a <c_mp::channel(int, char const**, c_logger*)+0xda>
412692: 49 89 84 dd 10 03 00 mov %rax,0x310(%r13,%rbx,8)
412699: 00
41269a: be 0d 66 43 00 mov $0x43660d,%esi
41269f: 4c 89 f7 mov %r14,%rdi
4126a2: e8 19 a6 ff ff callq 40ccc0 <dlsym@plt>
4126a7: 48 85 c0 test %rax,%rax
Scott Lurndal wrote:
The static linker does the code transformation, not the
run-time dynamic linker (which just updates the PLT/GOT).
Here is an illustrative example:
This illustrates the PLT jump table usage, but Mitch's question was,
I think, with regard to variables exported by a shared library,
say errno exported by the CRTLIB C runtime library,
versus extern variables in the main link module.
How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs to load the GOT entry and
then the value, versus a direct PC-rel memref for extern a, b, c?
EricP <ThatWouldBeTelling@thevillage.com> writes:
How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs to load the GOT entry and
then the value, versus a direct PC-rel memref for extern a, b, c?
The compiler doesn't care. Absent threads, it generates a simple
reference to the errno symbol and lets the linker handle resolving
it.
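In the non-threaded legacy model that simple reference looks like the sketch below (the variable name is illustrative, standing in for the historical extern int errno):

/* Pre-threads model: errno is a plain global. The compiler emits an
   ordinary load of the symbol plus a relocation; whether that becomes
   a PC-relative access or a GOT-indirect one is the linker's call. */
extern int errno_plain;   /* hypothetical stand-in for 'errno' */

int last_error(void)
{
    return errno_plain;
}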
In the thread case:
/usr/include/bits/errno.h:
extern int *__errno_location (void) __THROW __attribute__ ((__const__));
# if !defined _LIBC || defined _LIBC_REENTRANT
/* When using threads, errno is a per-thread value. */
# define errno (*__errno_location ())
# endif
# endif /* !__ASSEMBLER__ */
#endif /* _ERRNO_H */
Scott Lurndal wrote:
In the thread case:
/usr/include/bits/errno.h:
extern int *__errno_location (void) __THROW __attribute__ ((__const__));
# if !defined _LIBC || defined _LIBC_REENTRANT
/* When using threads, errno is a per-thread value. */
# define errno (*__errno_location ())
# endif
# endif /* !__ASSEMBLER__ */
#endif /* _ERRNO_H */
That replaces a memory reference with a function call.
Compiled on godbolt with GCC x86-64 trunk -O3
#include "errno.h"
long GetErrno (void)
{ return errno;
}
"GetErrno()":
sub rsp, 8
call "__errno_location"
movsx rax, DWORD PTR [rax]
add rsp, 8
ret
What I am asking about is below.
Here are two variables: one is in a DLL export and therefore an
inter-module reference that requires an extra MOV to load the address
from the import table (what Linux calls the GOT), and the other is a
regular intra-module variable directly accessed with one PC-rel MOV.
How does GCC import a variable exported from a shared module?
This is the one example where I considered adding an indirect
addressing mode enabled by 1 bit on all LD and ST instructions
as it eliminates the need to emit different code sequences for
intra-module and inter-module accesses.
The compiler always emits a PC-rel address. Later the linker discovers
that it is a reference to a DLL export variable, sets the offset to be
the address in the GOT table, and sets the Indirect bit on the LD/ST.
An Indirect address mode has side effects for the Load Store Queue but
they are mostly the same as if there were two separate instructions.
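A small C model of the two access shapes that the proposed indirect bit would unify (got_slot_for_dll_var is an illustrative name for the linker-filled import entry, not a real symbol):

extern long exe_var;                 /* intra-module variable */
extern long *got_slot_for_dll_var;   /* hypothetical import-table entry */

/* Direct: one PC-relative load, i.e.  ld rDst,[pc+disp] */
long get_intra(void) { return exe_var; }

/* Indirect: load the address from the GOT slot, then the value, i.e.
   ld rT,[pc+disp]; ld rDst,[rT] - exactly the pair a single LD with
   the Indirect bit set would fold into one instruction. */
long get_inter(void) { return *got_slot_for_dll_var; }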
On 6/27/2025 5:33 AM, EricP wrote:
snip
But you are now allowing two cache, tlb or even page misses within one instruction, which complicates things considerably.
Clearly the linker has the freedom to recognize "__errno_location()"
and alter things as necessary. In this case, it is implemented as
a real function call. I can envision an implementation that
replaced the function call with a reference to a thread-specific
variable when compiled and linked with the proper options (e.g. when
linked with -lpthread).
How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs to load the GOT entry and
then the value, versus a direct PC-rel memref for extern a, b, c?
For example, in Microsoft one can mark the DLL export variable with
__declspec(dllexport) int errno;
Thomas Koenig <tkoenig@netcologne.de> writes:
But you are now allowing two cache, tlb or even page misses within one
instruction, which complicates things considerably.
So does RISC-V with compressed instructions. POWER doesn't.
Compressed instructions impose the requirement of up to two cache-line accesses (and, consequently, up to two TLB or cache misses) for
instruction fetch. Instruction sets with fixed-size instructions
indeed do not have this requirement.
Supporting unaligned data accesses imposes the requirement of up to two
cache-line accesses (and, consequently, up to two TLB or cache misses)
for data accesses (including stores). Power has this support, as has
every other modern general-purpose architecture.
Of course, if you added one level of indirection, that would double
the number of potential memory accesses on the data side.
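For instance, a single store like the sketch below may straddle two cache lines, and occasionally two pages, on any architecture that permits unaligned accesses (compilers turn the memcpy into one unaligned store):

#include <stdint.h>
#include <string.h>

/* Store a 64-bit value at an arbitrary byte offset. If p+off is,
   say, 60 bytes into a 64-byte line, the store touches two lines;
   4 bytes before a page boundary, it touches two pages. */
void store_u64(unsigned char *p, size_t off, uint64_t v)
{
    memcpy(p + off, &v, sizeof v);
}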
- anton
On 2025-06-27, Scott Lurndal <scott@slp53.sl.home> wrote:
Clearly the linker has the freedom to recognize "__errno_location()"
and alter things as necessary. In this case, is implemented as
a real function call. I can envision an implementation that
replaced the function call with a reference to a thread-specific variable when
compiled and linked with the proper options (e.g. when linked with
-lpthread).
That is not what linkers are supposed to do (unless you use
link-time optimization, which is a bit of a misnomer).
On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
But you are now allowing two cache, tlb or even page misses within one instruction, which complicates things considerably.
So does RISC-V with compressed instructions. POWER doesn't.
Compressed instructions impose the requirement of up to two cache-line
accesses (and, consequently, up to two TLB or cache misses) for
instruction fetch. Instruction sets with fixed-size instructions
indeed do not have this requirement.
One can allow line crossing compressed (or extended) instructions
while still disallowing page crossing of the same. You just have
to decide what is right for your architecture.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
But you are now allowing two cache, tlb or even page misses within one instruction, which complicates things considerably.
So does RISC-V with compressed instructions. POWER doesn't.
Compressed instructions impose the requirement of up to two cache-line
accesses (and, consequently, up to two TLB or cache misses) for
instruction fetch. Instruction sets with fixed-size instructions
indeed do not have this requirement.
One can allow line crossing compressed (or extended) instructions
while still disallowing page crossing of the same. You just have
to decide what is right for your architecture.
That is difficult to implement in assemblers and linkers.
Unless somebody wants to align each function, or at least each
translation unit, on a page boundary, the linker then would have
to insert NOPs for those rare cases where, after linking, a page
boundary is crossed.
And once you have put in the nop, you need to recheck all branches
if they are still in range, and you have to do a full relocation
on your code, including debug info and everything else.
And bugs in there will occur only rarely, so they will be difficult
to find and debug.
This is indeed possible (almost anything except skiing through a
revolving door), but IMHO this is something to avoid.
According to EricP <ThatWouldBeTelling@thevillage.com>:
extern int *__errno_location (void) __THROW __attribute__ ((__const__));
# if !defined _LIBC || defined _LIBC_REENTRANT
/* When using threads, errno is a per-thread value. */
# define errno (*__errno_location ())
# endif
# endif /* !__ASSEMBLER__ */
#endif /* _ERRNO_H */
That replaces a memory reference with a function call.
Not really. On all of the Unix-like systems I know, errno is a macro
wrapped around a function call that fetches the most recent error in
the current thread, done that way to avoid breaking old programs
written back before threads when errno was an extern int. It's a
peculiar special case and I don't offhand know of anything else like
that.
Here's the ABI manual for amd64 systems:
http://refspecs.linux-foundation.org/elf/x86_64-abi-0.95.pdf
Thomas Koenig wrote:
That is difficult to implement in assemblers and linkers.
Unless somebody wants to align each function, or at least each
translation unit, on a page boundary, the linker then would have
to insert NOPs for those rare cases where, after linking, a page
boundary is crossed.
And once you have put in the nop, you need to recheck all branches
if they are still in range, and you have to do a full relocation
on your code, including debug info and everything else.
And bugs in there will occur only rarely, so they will be difficult
to find and debug.
Compilers, assemblers and linkers have long supported alignment directives.
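For example, with GCC one can page-align a function directly, which buys the no-page-crossing guarantee for any function that fits in a page (a sketch; the 4096 matches a 4KB page size):

/* Force 4KB alignment so no instruction inside crosses a page
   boundary unless the function itself is larger than a page.
   The assembler equivalent is .p2align 12 (or .balign 4096)
   before the function's label. */
void hot_loop(void) __attribute__((aligned(4096)));

void hot_loop(void)
{
    /* ... */
}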
Thomas Koenig wrote:
This is indeed possible (almost anything except skiing through a
revolving door), but IMHO this is something to avoid.
Skiing through a revolving door is in fact possible, as long as you are
using xc gear, since there the bindings allow you to fold the skis up
along your body. You just need the balance to be able to use the tail
ends of your skis as stilts.
Back in uni days (as members of the uni scouts group) we tried all sorts
of funny stuff, including running a very hard obstacle course with xc
skis on.
I do agree that for most people/skiing gear/revolving doors, the
combination is effectively impossible.
Terje
John Levine wrote:
Not really. On all of the Unix-like systems I know, errno is a macro wrapped around a function call that fetches the most recent error in
the current thread, done that way to avoid breaking old programs
written back before threads when errno was an extern int. It's a
peculiar special case and I don't offhand know of anything else like
that.
I was just looking for a shared module export variable but errno was
a poor choice because everyone has replaced it with a function call.
Scott suggested signgam in math.h, but that doesn't exist in
Microsoft's math.h because MS is stuck on C-89, so it is not useful
for comparing MS and GCC.
Here's the ABI manual for amd64 systems:
http://refspecs.linux-foundation.org/elf/x86_64-abi-0.95.pdf
Thanks but I've got a copy v1.0 dated 6-Dec-2022.
That one is a draft v0.95 from 2005.
The issue I have with it is that while it does provide an overview of
the address models, it does not describe how the whole mechanism
works, or how the compiler interacts with the linker and loader.
Also it makes multiple references to x64 instructions that don't
exist in Intel or AMD docs, LEAQ and MOVABS. It does not define them,
does not show any instruction bytes, and does not reference any other
docs that do so.
I tried looking for a manual on GNU Assembler GAS and the only one is
a PDF from 1995, and it only covers the ATT syntax not specific ISA's.
I was unable to find a PDF manual for x86-64, or x64, or AMD64
anywhere. There is a web document at GNU but it has no search function
and does not explain leaq or movabs.
After a few hours of flopping about searching the web I think I have
figured out how it works, specifically how the movabs works with the
loader, and why the MS code for Windows is different from the GCC
code for Linux, and how the MS compiler uses dllimport attribute and
GCC does not.
First about LEAQ and MOVABS.
It seems LEAQ is the LEA Load Effective Address instruction with a
data type attached to it so it knows the operand size and thus the
address mode. This replaces the Intel B/W/D/QWORD PTR nomenclature.
MOVABS is more complicated and actually has two versions,
MOVABS and MOVABSx (where x is a data type b, w, d, or q).
MOVABS (no type) is really Intel "MOV r64, imm64" which loads a 64-bit immediate into a register. MOVABS has nothing to do with absolute
addresses except if the imm64 happens to be a relocatable symbol
value then it can be patched by the loader, as with all such
immediate symbols.
MOVABSx (with type) is really Intel "MOV moffs, rAn" or "MOV rAn,
moffs" where moffs is an 8, 16, 32 or 64-bit offset into a segment
register, and for the default segment registers with a base of 0 that
means the offset is really either a zero extended 32-bit or 64-bit
absolute address, and rAn is registers AL, AX, EAX, RAX (depends on
operand size). MOVABSx is a relocatable absolute address that loads
or stores to/from an "A" register.
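A concrete instance of the no-type form as gas accepts it (the constant is arbitrary; this is the only x86-64 instruction that takes a full 64-bit immediate):

#include <stdint.h>

uint64_t load_imm64(void)
{
    uint64_t v;
    /* AT&T:  movabs $imm64, %r64   <=>   Intel:  mov r64, imm64
       (encoding REX.W + B8+rd + imm64). */
    __asm__("movabs $0x5110de94f393f9ce, %0" : "=r"(v));
    return v;
}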
continuing...
The answer to my original question seems to be that MS always
generates what GCC calls Position Independent Executable enabled with
the -fPIE option, and MS always uses a large memory model whereas GCC
must enable it.
But also, GCC doesn't know whether exeVar is an intra- or inter-module
reference, so it always has to generate a worst-case access for every
program global variable. Because MS knows which global variables are
dllimports, it generates optimal code for intra- (RIP-rel) and
inter- (GOT indirect) module references.
extern long exeVar;
long GetExeVar (void)
{ return exeVar;
}
Compiled with GCC x86-64 15.1 -O3 -fPIE -mcmodel=large
GetExeVar():
.L2:
movabs r11, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_-.L2
lea rax, .L2[rip]
movabs rdx, OFFSET FLAT:exeVar@GOT
add rax, r11
mov rax, QWORD PTR [rax+rdx]
mov rax, QWORD PTR [rax]
ret
(The above GCC code also doesn't look optimal. I don't see why it
fiddles about calculating addresses when it should just use a RIP-rel
load to pull the absolute address of exeVar from the GOT and then
load it, as MS does with its imports table below.)
Compiled with MSVC latest -O3
Intra-module reference:
long GetExeVar(void) PROC                   ; GetExeVar, COMDAT
        mov eax, DWORD PTR long exeVar      ; exeVar
        ret 0
Inter-module reference:
__declspec(dllimport) long dllVar;
long GetDllVar (void)
{ return dllVar;
}
long GetDllVar(void) PROC                   ; GetDllVar, COMDAT
        mov rax, QWORD PTR __imp_long dllVar
        mov eax, DWORD PTR [rax]
        ret 0
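For completeness, the import above pairs with an export on the DLL side; the usual MSVC pattern is one shared header (BUILDING_DLL is an illustrative macro name):

/* dllvar.h - common MSVC idiom for exported data: */
#ifdef BUILDING_DLL
#  define DLLAPI __declspec(dllexport)   /* when compiling the DLL */
#else
#  define DLLAPI __declspec(dllimport)   /* when compiling a consumer */
#endif

DLLAPI extern long dllVar;   /* consumer access goes via __imp_dllVar */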
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist in Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
And since this is the basis for the ABI design for a processor
that runs almost all the desktops and servers on the planet,
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist in
Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
MOVABS imm64, r64 is AT&T syntax for Intel syntax MOV r64, imm64.
It does not define them, does not show
any instruction bytes, and does not reference any other docs that do so.
The gas manual may be the best reference about AT&T syntax.
I tried looking for a manual on GNU Assembler GAS and the only one is
a PDF from 1995, and it only covers the ATT syntax not specific ISA's.
Searching for "GNU as manual" gave me the link to <https://sourceware.org/binutils/docs/> where you can find links to
the gas manual in different formats. And searching for "AT&T" in the
table of contents brings up three subsections of <https://sourceware.org/binutils/docs/as/i386_002dDependent.html>
I was unable to find a PDF manual for x86-64, or x64, or AMD64 anywhere.
The section mentioned above says:
|The i386 version as supports both the original Intel 386 architecture
|in both 16 and 32-bit mode as well as AMD x86-64 architecture
|extending the Intel architecture to 64-bits.
- anton
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist
in Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
Yes, and an LEA instruction which just calculates an address
needs an operand size suffix... why?
Because it is not documented.
EricP <ThatWouldBeTelling@thevillage.com> writes:
And since this is the basis for the ABI design for a processor
that runs almost all the desktops and servers on the planet,
Desktops, yes. That's changing slowly, although now with
the orange clown in charge - foreign government entities are
moving away from software controlled by American companies,
and I expect the desktop migration rate from windows to linux to
increase considerably over the next decade.
Servers, not by a large margin. Linux (and a handful of
proprietary unix and linux servers, including Z-series) provide most
of the servers (excepting on-prem exchange, sharepoint
and AD systems). 60% of compute in Azure, for example,
are linux cores. The ratio is even larger in google (90% linux)
oracle and amazon(90% linux) cloud operations.
In terms of pure server numbers, windows is likely less than
20% globally.
Michael S <already5chosen@yahoo.com> writes:
It seems, you got it backward.
Got what backwards?
The bigger problem with poorly
documented x86-64 AT&T syntax is on Linux.
I've been using x86-64 AT&T syntax since 1989
(e.g. SVR4). I've never considered it poorly documented.
It's not that AT&T syntax is
not used at all on Windows, but it's less dominant here.
Your claim was that windows runs "almost all the servers on
the planet", which is clearly incorrect.
On Tue, 01 Jul 2025 13:18:33 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
The bigger problem with poorly
documented x86-64 AT&T syntax is on Linux.
I've been using x86-64 AT&T syntax since 1989
(e.g. SVR4).
No, you didn't, because x86-64 didn't exist until ~2001.
Your claim was that windows runs "almost all the servers on
the planet", which is clearly incorrect.
First, the claim was not mine.
Second, the claim was not about Windows, but about x86-64.
On 6/18/2025 1:26 AM, Anton Ertl wrote:...
You can, however compare ARM T32 and A32 in the Debian results:
bash grep gzip
595204 107636 46744 armhf ARM T32
599832 101102 46898 riscv64
796501 144926 57729 amd64
829776 134784 56868 arm64
853892 152068 61124 i386
891128 158544 68500 armel ARM A32
892688 168816 64664 s390x
1020720 170736 71088 mips64el
1168104 194900 83332 ppc64el
There may be additional differences between the two ARM 32-bit
builds, however.
What I could do relatively easily is to compile a file from gforth
with different options. The file I used is what is compiled to
engine/main-fast-ll.o
text size   compiler options
20242 -O2
18146 -Os
18146 -Os -march=rv64gc
18444 -Os -march=rv64gc -mcmodel=medany
23092 -Os -march=rv64g
Not super impressed with the 'C' extension, as it is both a pain to
decode and also the code size savings tend to be fairly modest.
IA-64 code density is bad, but one wouldn't expect it to be quite *that*
bad.
Maybe around 3-5x bigger than a RISC with 32-bit instructions.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote: [Alpha]
EricP <ThatWouldBeTelling@thevillage.com> writes:
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
How do you compute that? When I looked at the code produced for
Alpha, I got the impression that they wanted to support arbitrarily
large programs and they generated such code by default, but IIRC the
typical code for loading an absolute addres was by loading it from the
global table of the current function; so it requires typically 1 load
(and a 64-bit value in the global table). It also requires setting up
the global pointer on every function entry and after every call, but
that can be amortized over several accesses to the global table.
Yes, it is (also) using an extra memory load to pick up large immediates.
It also requires a BAL to get the IP into a register.
I have finally gotten around to turning on our working Alpha and
compiled the following program on it:
The load of signgam is achieved with the following sequence
I.e., 2 instructions, not 6....
It's interesting that a, b, x, y (which end up in the same linked unit
as main()) result in code like
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
which could be implemented more efficiently as
ldl t1,-32680(gp)
but apparently the linker just fixes up the first instruction of the
pair (either as ldq or as lda, maybe also as ldah), and maybe the
offset of the second instruction (but not in this example); the
benefit is that the linker just has to replace some instructions, but
it does not have to shrink or expand the code (which would require
changing even more instructions).
We also see that the call to foo() within the same linked unit is
linked as
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
One can see again that both code sequences have the same size, to
avoid shrinking or expanding.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
- anton
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Now with a large constant:
#include <math.h>
extern int a, b;
extern long c;
extern double x, y;
extern void foo(void);
int main()
{
c = 0x5110de94f393f9ceL;
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, c, x, y, and foo()
with gcc -Wall -O -fPIC and then linked the two files with the same
options.
The result on Alpha is:
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 58 85 bd 23 lda gp,-31400(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: fe ff 3d 24 ldah t0,-2(gp)
120000634: f8 7c 41 a4 ldq t1,31992(t0)
120000638: 78 80 3d 20 lda t0,-32648(gp)
12000063c: 00 00 41 b4 stq t1,0(t0)
120000640: 00 00 fe 2f unop
120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
120000648: 00 00 fe 2f unop
12000064c: 00 00 fe 2f unop
120000650: 68 80 3d 20 lda t0,-32664(gp)
120000654: 00 00 01 8e ldt $f16,0(t0)
120000658: 08 80 7d a7 ldq t12,-32760(gp)
12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
120000660: 02 00 ba 27 ldah gp,2(ra)
120000664: 18 85 bd 23 lda gp,-31464(gp)
120000668: 60 80 3d 20 lda t0,-32672(gp)
12000066c: 00 00 01 9c stt $f0,0(t0)
120000670: 58 80 7d 20 lda t2,-32680(gp)
120000674: 00 00 43 a0 ldl t1,0(t2)
120000678: 10 80 3d a4 ldq t0,-32752(gp)
12000067c: 00 00 21 a0 ldl t0,0(t0)
120000680: 02 04 41 40 addq t1,t0,t1
120000684: 5c 80 3d 20 lda t0,-32676(gp)
120000688: 00 00 21 a0 ldl t0,0(t0)
12000068c: 02 04 41 40 addq t1,t0,t1
120000690: 00 00 43 b0 stl t1,0(t2)
120000694: 00 04 ff 47 clr v0
120000698: 00 00 5e a7 ldq ra,0(sp)
12000069c: 10 00 de 23 lda sp,16(sp)
1200006a0: 01 80 fa 6b ret
Here's the output for RISC-V with -mcmodel=medany:
0000000000010570 <main>:
10570: 1141 addi sp,sp,-16
10572: e406 sd ra,8(sp)
10574: 00002797 auipc a5,0x2
10578: ab47b783 ld a5,-1356(a5) # 12028 <__SDATA_BEGIN__>
1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
10580: 02c000ef jal ra,105ac <foo>
10584: 8401b507 fld fa0,-1984(gp) # 12040 <y>
10588: f29ff0ef jal ra,104b0 <lgamma@plt>
1058c: 84a1b427 fsd fa0,-1976(gp) # 12048 <x>
10590: 85418713 addi a4,gp,-1964 # 12054 <a>
10594: 431c lw a5,0(a4)
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
1059a: 9fb5 addw a5,a5,a3
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
105a0: 9fb5 addw a5,a5,a3
105a2: c31c sw a5,0(a4)
105a4: 4501 li a0,0
105a6: 60a2 ld ra,8(sp)
105a8: 0141 addi sp,sp,16
105aa: 8082 ret
- anton
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b> >>>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
That's not necessarily the case - so long as the function signatures/API
don't change, new versions of the shared library will be backward compatible
with applications linked against earlier versions. So the data section
requirements for the shared library could change after the application is
linked if new data section symbols are defined in the newer version of the
shared library.
The run-time loader will know how much data space has been allocated
to the executable itself, and will append (and relocate corresponding
references in the shared object and executable) the '.data' sections from
each dynamic library loaded by the application at the time the library is
loaded - which may be at startup or via dlopen().
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Now with a large constant:
#include <math.h>
extern int a, b;
extern long c;
extern double x, y;
extern void foo(void);
int main()
{
c = 0x5110de94f393f9ceL;
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, c, x, y, and foo()
with gcc -Wall -O -fPIC and then linked the two files with the same
options.
The result on Alpha is:
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 58 85 bd 23 lda gp,-31400(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: fe ff 3d 24 ldah t0,-2(gp)
120000634: f8 7c 41 a4 ldq t1,31992(t0)
120000638: 78 80 3d 20 lda t0,-32648(gp)
12000063c: 00 00 41 b4 stq t1,0(t0)
120000640: 00 00 fe 2f unop
120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
120000648: 00 00 fe 2f unop
12000064c: 00 00 fe 2f unop
120000650: 68 80 3d 20 lda t0,-32664(gp)
120000654: 00 00 01 8e ldt $f16,0(t0)
120000658: 08 80 7d a7 ldq t12,-32760(gp)
12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
120000660: 02 00 ba 27 ldah gp,2(ra)
120000664: 18 85 bd 23 lda gp,-31464(gp)
120000668: 60 80 3d 20 lda t0,-32672(gp)
12000066c: 00 00 01 9c stt $f0,0(t0)
120000670: 58 80 7d 20 lda t2,-32680(gp)
120000674: 00 00 43 a0 ldl t1,0(t2)
120000678: 10 80 3d a4 ldq t0,-32752(gp)
12000067c: 00 00 21 a0 ldl t0,0(t0)
120000680: 02 04 41 40 addq t1,t0,t1
120000684: 5c 80 3d 20 lda t0,-32676(gp)
120000688: 00 00 21 a0 ldl t0,0(t0)
12000068c: 02 04 41 40 addq t1,t0,t1
120000690: 00 00 43 b0 stl t1,0(t2)
120000694: 00 04 ff 47 clr v0
120000698: 00 00 5e a7 ldq ra,0(sp)
12000069c: 10 00 de 23 lda sp,16(sp)
1200006a0: 01 80 fa 6b ret
Here's the output for RISC-V with -mcmodel=medany:
0000000000010570 <main>:
10570: 1141 addi sp,sp,-16
10572: e406 sd ra,8(sp)
10574: 00002797 auipc a5,0x2
10578: ab47b783 ld a5,-1356(a5) # 12028
<__SDATA_BEGIN__>
1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
10580: 02c000ef jal ra,105ac <foo>
10584: 8401b507 fld fa0,-1984(gp) # 12040
<y>
10588: f29ff0ef jal ra,104b0 <lgamma@plt>
1058c: 84a1b427 fsd fa0,-1976(gp) # 12048
<x>
10590: 85418713 addi a4,gp,-1964 # 12054 <a>
10594: 431c lw a5,0(a4)
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
1059a: 9fb5 addw a5,a5,a3
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
105a0: 9fb5 addw a5,a5,a3
105a2: c31c sw a5,0(a4)
105a4: 4501 li a0,0
105a6: 60a2 ld ra,8(sp)
105a8: 0141 addi sp,sp,16
105aa: 8082 ret
My 66000::
main: ; @main
enter r0,r0,0,0
std #5841413448022620622,[ip,c]
call foo
ldd r1,[ip,y]
call lgamma
std r1,[ip,x]
call __signgam
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Now with a large constant:
#include <math.h>
extern int a, b;
extern long c;
extern double x, y;
extern void foo(void);
int main()
{
c = 0x5110de94f393f9ceL;
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, c, x, y, and foo()
with gcc -Wall -O -fPIC and then linked the two files with the same
options.
The result on Alpha is:
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 58 85 bd 23 lda gp,-31400(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: fe ff 3d 24 ldah t0,-2(gp)
120000634: f8 7c 41 a4 ldq t1,31992(t0)
120000638: 78 80 3d 20 lda t0,-32648(gp)
12000063c: 00 00 41 b4 stq t1,0(t0)
120000640: 00 00 fe 2f unop
120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
120000648: 00 00 fe 2f unop
12000064c: 00 00 fe 2f unop
120000650: 68 80 3d 20 lda t0,-32664(gp)
120000654: 00 00 01 8e ldt $f16,0(t0)
120000658: 08 80 7d a7 ldq t12,-32760(gp)
12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
120000660: 02 00 ba 27 ldah gp,2(ra)
120000664: 18 85 bd 23 lda gp,-31464(gp)
120000668: 60 80 3d 20 lda t0,-32672(gp)
12000066c: 00 00 01 9c stt $f0,0(t0)
120000670: 58 80 7d 20 lda t2,-32680(gp)
120000674: 00 00 43 a0 ldl t1,0(t2)
120000678: 10 80 3d a4 ldq t0,-32752(gp)
12000067c: 00 00 21 a0 ldl t0,0(t0)
120000680: 02 04 41 40 addq t1,t0,t1
120000684: 5c 80 3d 20 lda t0,-32676(gp)
120000688: 00 00 21 a0 ldl t0,0(t0)
12000068c: 02 04 41 40 addq t1,t0,t1
120000690: 00 00 43 b0 stl t1,0(t2)
120000694: 00 04 ff 47 clr v0
120000698: 00 00 5e a7 ldq ra,0(sp)
12000069c: 10 00 de 23 lda sp,16(sp)
1200006a0: 01 80 fa 6b ret
Here's the output for RISC-V with -mcmodel=medany:
0000000000010570 <main>:
10570: 1141 addi sp,sp,-16
10572: e406 sd ra,8(sp)
10574: 00002797 auipc a5,0x2
10578: ab47b783 ld a5,-1356(a5) # 12028
<__SDATA_BEGIN__>
1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
10580: 02c000ef jal ra,105ac <foo>
10584: 8401b507 fld fa0,-1984(gp) # 12040
<y>
10588: f29ff0ef jal ra,104b0 <lgamma@plt>
1058c: 84a1b427 fsd fa0,-1976(gp) # 12048
<x>
10590: 85418713 addi a4,gp,-1964 # 12054 <a>
10594: 431c lw a5,0(a4)
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
1059a: 9fb5 addw a5,a5,a3
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
105a0: 9fb5 addw a5,a5,a3
105a2: c31c sw a5,0(a4)
105a4: 4501 li a0,0
105a6: 60a2 ld ra,8(sp)
105a8: 0141 addi sp,sp,16
105aa: 8082 ret
My 66000::
main: ; @main
enter r0,r0,0,0
std #5841413448022620622,[ip,c]
call foo
ldd r1,[ip,y]
call lgamma
std r1,[ip,x]
call __signgam
__signgam is an "int" variable in the shared library, not a function.
What is the purpose of 'call' here?
IBM's source code for signgam is:
#define _XOPEN_SOURCE
#include <math.h>
int *__signgam(void);
#define signgam (*__signgam())
Which is what we fed into the compiler.
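(An editorial illustration, not from the posts: with such a macro, every read
of signgam preprocesses into a dereferenced function call, which would explain
a compiler emitting one.)
/* assuming the declarations quoted above */
int *__signgam(void);
#define signgam (*__signgam())
extern int a;
void f(void)
{
    a += signgam;   /* expands to: a += (*__signgam()); */
}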
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
On Mon, 30 Jun 2025 16:51:13 +0000, EricP wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist in
Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
Yes, and an LEA instruction which just calculates an address
needs an operand size suffix... why?
LEA needs to distinguish between::
LEA Rd,[Rb+DISP16]
LEA Rd,[Rb+Ri<<s]
LEA Rd,[Rb+DISP32]
LEA Rd,[Rb+DISP64]
LEA Rd,[Rb+Ri<<s+DISP32]
LEA Rd,[Rb+Ri<<s+DISP64]
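(An editorial aside, GNU C on x86-64, not from the thread: in AT&T syntax the
"q" suffix on leaq names the 64-bit operand (destination) size rather than the
address size; the function below is invented for illustration.)
#include <stdio.h>
/* computes base + index*4 + 8 in one leaq */
static long lea_example(long base, long index)
{
    long r;
    __asm__("leaq 8(%1,%2,4), %0" : "=r"(r) : "r"(base), "r"(index));
    return r;
}
int main(void)
{
    printf("%ld\n", lea_example(100, 3));  /* prints 120 */
    return 0;
}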
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
[Alpha]
EricP <ThatWouldBeTelling@thevillage.com> writes:
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
How do you compute that? When I looked at the code produced for
Alpha, I got the impression that they wanted to support arbitrarily
large programs and they generated such code by default, but IIRC the
typical code for loading an absolute address was by loading it from the
global table of the current function; so it requires typically 1 load
(and a 64-bit value in the global table). It also requires setting up
the global pointer on every function entry and after every call, but
that can be amortized over several accesses to the global table.
Yes, it is (also) using an extra memory load to pick up large immediates.
It also requires a BAL to get the IP into a register.
I have finally gotten around to turning on our working Alpha and
compiled the following program on it:
#include <math.h>
extern int a, b;
extern double x, y;
extern void foo(void);
int main()
{
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, x, y, and foo with
gcc -Wall -O -fPIC and then linked the two files with the same options.
The result on Alpha is:
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 48 85 bd 23 lda gp,-31416(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
120000640: 68 80 3d 20 lda t0,-32664(gp)
120000644: 00 00 01 8e ldt $f16,0(t0)
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
120000658: 60 80 3d 20 lda t0,-32672(gp)
12000065c: 00 00 01 9c stt $f0,0(t0)
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
120000668: 10 80 3d a4 ldq t0,-32752(gp)
12000066c: 00 00 21 a0 ldl t0,0(t0)
120000670: 02 04 41 40 addq t1,t0,t1
120000674: 5c 80 3d 20 lda t0,-32676(gp)
120000678: 00 00 21 a0 ldl t0,0(t0)
12000067c: 02 04 41 40 addq t1,t0,t1
120000680: 00 00 43 b0 stl t1,0(t2)
120000684: 00 04 ff 47 clr v0
120000688: 00 00 5e a7 ldq ra,0(sp)
12000068c: 10 00 de 23 lda sp,16(sp)
120000690: 01 80 fa 6b ret
The load of signgam is achieved with the following sequence
120000668: 10 80 3d a4 ldq t0,-32752(gp)
12000066c: 00 00 21 a0 ldl t0,0(t0)
I.e., 2 instructions, not 6.
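(An editorial sketch in C, not from the thread: the two-instruction pattern
amounts to one load that fetches the symbol's 64-bit absolute address from a
table addressed off gp, then the access to the data itself; the names here
are invented.)
#include <stdio.h>
int far_away = 7;                             /* stands in for signgam */
static void *global_table[] = { &far_away };  /* 64-bit addresses, one per symbol */
int main(void)
{
    void **gp = global_table;   /* gp is set up on function entry */
    int *addr = gp[0];          /* "ldq": fetch the address via gp */
    printf("%d\n", *addr);      /* "ldl": fetch the value itself */
    return 0;
}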
It's interesting that a, b, x, y (which end up in the same linked unit
as main()) result in code like
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
which could be implemented more efficiently as
ldl t1,-32680(gp)
but apparently the linker just fixes up the first instruction of the
pair (either as ldq or as lda, maybe also as ldah), and maybe the
offset of the second instruction (but not in this example); the
benefit is that the linker just has to replace some instructions, but
it does not have to shrink or expand the code (which would require
changing even more instructions).
We also see that the call to foo() within the same linked unit is
linked as
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
One can see again that both code sequences have the same size, to
avoid shrinking or expanding.
The gcc manual says about these:
| When '-msmall-data' is used,
| the compiler can assume that all local symbols share the same '$gp'
| value, and thus reduce the number of instructions required for a
| function call from 4 to 1.
We see the 4 and 1 instructions above, but it's not clear to me that
there is a real benefit. The compiler cannot assume that an external reference is local, and the linker knows, but does not benefit from
it. And for references within a compilation unit, I would hope that
the compiler/assembler manages to use the smallest variant based on
actual size.
....Why burden all programs with the costs of large programs the way it
is done by default on Alpha?
I'm not saying there shouldn't be optimizations for smaller sizes.
I'm pointing to the fact that to actually USE the 64-bit address space
there is a large increase in code size and execute cost,
and asking if that had to be so.
I can use 100 GB arrays with code that is the same size as code that
limits itself to the lower 2GB of address space (there is an option on
Alpha compilers and linkers for that).
For example, for Alpha to load a 64-bit constant requires 6 instructions,
24 bytes.
I forgot to add this to the program, maybe tomorrow.
That sequence is too large so they are pretty much forced
to use an extra LDQ to pull the offset from the constant table
located just prior to the routine entry point and requires an extra
BAL to copy the RIP into a register as a base.
That's not necessary. The global pointer is derived from the function address (in t12) on entry to the function and from the return address
(in ra) after a jsr.
The LDQ touches the same address space as the code but now as data
so it has to load the D-TLB with an entry redundant with I-TBL,
and bring in a data cache line with the constants.
No, the global table is elsewhere; in the case above it's about
96KB behind the start of main(). The text ends a few hundred bytes
later, so there is no page that contains both code and data (i.e., no
TLB entries that describe the same page, not that this would be a
problem).
And after the constant is loaded it must be manually added to the base
because there is no LD/ST combined with a scaled index.
Which base? You were only mentioning constants up to now.
Furthermore, the actual load or store of the target value is serially
dependent on LDQ offset and the ADD. Back when the load-to-use latency
for a cache hit was 1 clock that might look ok, but now that it is 3 or 4
clocks it is a serious penalty.
This century, you use an OoO CPU (even the last Alpha was OoO), and
the 4-5 clocks latency of a load are added to the ready time of the
base address, i.e., gp in this case. gp only changes on far calls, so loading from a gp-relative address is rarely in the critical
dependence path.
By making it a priority for relatively cheap access to the full 64-bit
address space during an ISA design, what alternatives might have minimized >> its extra cost?
It would certainly be an interesting experiment to see how much size
and speed difference we would get if we eliminated the "mov r64,
imm64" instruction when compiling to AMD64 and used Alpha-like
techniques instead. My guess is: barely measurable.
- anton
MitchAlsup1 wrote:
On Mon, 30 Jun 2025 16:51:13 +0000, EricP wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist in
Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
Yes, and an LEA instruction which just calculates an address
needs an operand size suffix... why?
LEA needs to distinguish between::
LEA Rd,[Rb+DISP16]
LEA Rd,[Rb+Ri<<s]
LEA Rd,[Rb+DISP32]
LEA Rd,[Rb+DISP64]
LEA Rd,[Rb+Ri<<s+DISP32]
LEA Rd,[Rb+Ri<<s+DISP64]
In 64-bit mode there is no disp64, just disp8 and disp32.
There were no spare bits in the ModRM byte to indicate it.
Had there been disp64 then x64 could have had a smooth expansion
of address calculations into 64-bit space.
AMD worked around the ModRM limitation to provide at least some way to
access all of 64-bit address space. It did so by adding MOV opcodes,
to load an imm64, and to LD/ST memory using abs64 addresses.
And since those MOV's are different opcodes and not part of ModRM,
LEA does not know about those 64-bit imm64 or abs64 values and cannot
use them in general address calculations.
And all of the various compiler code models and their addressing
limitations follow from these discontinuities in addressing behavior.
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
[Alpha]
EricP <ThatWouldBeTelling@thevillage.com> writes:
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
How do you compute that? When I looked at the code produced for
Alpha, I got the impression that they wanted to support arbitrarily
large programs and they generated such code by default, but IIRC the
typical code for loading an absolute address was by loading it from the
global table of the current function; so it requires typically 1 load
(and a 64-bit value in the global table). It also requires setting up
the global pointer on every function entry and after every call, but
that can be amortized over several accesses to the global table.
Yes, it is (also) using an extra memory load to pick up large immediates.
It also requires a BAL to get the IP into a register.
I have finally gotten around to turning on our working Alpha and
compiled the following program on it:
#include <math.h>
extern int a, b;
extern double x, y;
extern void foo(void);
int main()
{
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, x, y, and foo with
gcc -Wall -O -fPIC and then linked the two files with the same options.
The result on Alpha is:
GCC Alpha manual says that the limit with -mlarge-data (the default)
is 2 GB of data. Larger data must use mmap or malloc.
-mlarge-text (the default) is 4 MB code.
GCC has no ability to generate access to Alpha's full 64-bit address space,
so there is no comparison with other ISAs.
Perhaps using A64 or RV64 would be better examples.
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 48 85 bd 23 lda gp,-31416(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
120000640: 68 80 3d 20 lda t0,-32664(gp)
120000644: 00 00 01 8e ldt $f16,0(t0)
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
120000658: 60 80 3d 20 lda t0,-32672(gp)
12000065c: 00 00 01 9c stt $f0,0(t0)
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
120000668: 10 80 3d a4 ldq t0,-32752(gp)
12000066c: 00 00 21 a0 ldl t0,0(t0)
120000670: 02 04 41 40 addq t1,t0,t1
120000674: 5c 80 3d 20 lda t0,-32676(gp)
120000678: 00 00 21 a0 ldl t0,0(t0)
12000067c: 02 04 41 40 addq t1,t0,t1
120000680: 00 00 43 b0 stl t1,0(t2)
120000684: 00 04 ff 47 clr v0
120000688: 00 00 5e a7 ldq ra,0(sp)
12000068c: 10 00 de 23 lda sp,16(sp)
120000690: 01 80 fa 6b ret
The load of signgam is achieved with the following sequence
120000668: 10 80 3d a4 ldq t0,-32752(gp)
12000066c: 00 00 21 a0 ldl t0,0(t0)
I.e., 2 instructions, not 6.
Yes, that is the same GOT table indirect two load sequence as x64.
I wasn't saying it had to use 6 instructions. I'm saying that if Alpha
wanted full access to its 64-bit address space, then its options
are either to use 6 instructions to build a 64-bit immediate
OR do two loads (maybe plus other overhead). Both those options are poor.
RV64 and A64 are in a similar boat.
I wanted an ISA option that doesn't need two dependent LD's or
6 instructions to access the whole 64-bit address space.
It's interesting that a, b, x, y (which end up in the same linked unit
as main()) result in code like
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
which could be implemented more efficiently as
ldl t1,-32680(gp)
but apparently the linker just fixes up the first instruction of the
pair (either as ldq or as lda, maybe also as ldah), and maybe the
offset of the second instruction (but not in this example); the
benefit is that the linker just has to replace some instructions, but
it does not have to shrink or expand the code (which would require
changing even more instructions).
Nop'ing the first and using the second would be better because it doesn't
use a temp register and hardware can optimize a unop away.
We also see that the call to foo() within the same linked unit is
linked as
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
One can see again that both code sequences have the same size, to
avoid shrinking or expanding.
I wondered about those nop's.
I would be cautious about depending on code expansion for optimization.
I read some online remarks about a developer adding "relaxation" to the
RV64 linker and causing it to take over an hour to run.
It is possible that the basic relaxation algorithm is factorial, O(n!)
(it has that smell to me).
https://sourceware.org/binutils/docs/as/Xtensa-Relaxation.html#index-relaxation
Always starting with largest size and compacting down may be best.
That's why I was investigating the compacting linker algorithm.
If my ISA is going to depend on it for optimization, I want to
make sure it could be implemented easily and would not have
uncontrollable pathological behavior. And it does look acceptable.
The gcc manual says about these:
| When '-msmall-data' is used,
| the compiler can assume that all local symbols share the same '$gp'
| value, and thus reduce the number of instructions required for a
| function call from 4 to 1.
We see the 4 and 1 instructions above, but it's not clear to me that
there is a real benefit. The compiler cannot assume that an external
reference is local, and the linker knows, but does not benefit from
it. And for references within a compilation unit, I would hope that
the compiler/assembler manages to use the smallest variant based on
actual size.
Yes but they also have their 2 GB limit which avoids all large
address space 'issues' (i.e., fobs them off onto the programmer).
The question is what happens to the Alpha code (or RV64 or A64) when
you remove the address space compile limits and go for the Full Monty.
....Why burden all programs with the costs of large programs the way it
is done by default on Alpha?
I'm not saying there shouldn't be optimizations for smaller sizes.
I'm pointing to the fact that to actually USE the 64-bit address space
there is a large increase in code size and execute cost,
and asking if that had to be so.
I can use 100 GB arrays with code that is the same size as code that
limits itself to the lower 2GB of address space (there is an option on
Alpha compilers and linkers for that).
100 GB is 37 bits of address. Where does that 37 number come from?
And the manual says the Alpha data limit is 2 GB.
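(Editorial arithmetic note: 2^36 bytes is about 69 GB and 2^37 bytes is about
137 GB, so indexing a 100 GB array takes 37 address bits.)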
For example, for Alpha to load a 64-bit constant requires 6 instructions,
24 bytes.
I forgot to add this to the program, maybe tomorrow.
Or pull it from the constant table just prior to the routine entry.
Long ago I read that the Alpha code standard puts constants into a table just
before the routine entry, does a BAL rTmp,+0 to copy the PC into rTmp,
then can access the constants in the table at negative rTmp offsets.
That sequence is too large so they are pretty much forced
to use an extra LDQ to pull the offset from the constant table
located just prior to the routine entry point and requires an extra
BAL to copy the RIP into a register as a base.
That's not necessary. The global pointer is derived from the function
address (in t12) on entry to the function and from the return address
(in ra) after a jsr.
Yes, I see that now.
t12 is the link register specified by the caller in its JAL to the above code.
That saves it a BAL rTmp,+0 to copy the PC as a PC-rel base.
- anton
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 2 Jul 2025 15:20:09 +0000, EricP wrote:
<snip>
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
This 4 instruction sequence becomes::
CALX [IP,,GOT[n]-.]
In my 66000 ISA.
For current architectures, function calls use the
procedure linkage table (PLT). The Global Offset Table
is only used for certain static global variables.
If you want to leverage standard tools, you may wish
to follow that paradigm in 66000.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
On Wed, 2 Jul 2025 15:20:09 +0000, EricP wrote:
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
This 4 instruction sequence becomes::
CALX [IP,,GOT[n]-.]
In my 66000 ISA.
On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050
<b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
I am well aware of that.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
MitchAlsup1 wrote:
On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) #
12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
I am well aware of that.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
If such references in the module (exe/dll) are fixed size RIP-rel disp64
plus a reference reloc in case the target moves it should work.
If the loader finds the target is different from the default at link time then
the fixed disp64 field is large enough to hold any changed value.
The compiler always emits RIP-rel disp64 data references. If the linker
finds the offset is intra-module then it sets its assigned offset value,
and marks it as compactable in a later linker stage.
If inter-module then sets it to its target's default offset,
marks it as non-compactable and emits a DISP64 reloc entry
in case the target moves.
The loader should have the code and default GOT and ro-data marked
as Copy On Write (COW) during load-reloc so any patches fault into
the page file, then switch the protection to Read-Only-Execute and
Read-Only after it is finished. Pages that do require patches get their
own private code-GOT-ro-data page copies and don't bugger up other users.
If the address space forks after loading then the children all
inherit the patched view of code-GOT-ro-data pages.
EricP wrote:
MitchAlsup1 wrote:
On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) #
12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
I am well aware of that.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
If such references in the module (exe/dll) are fixed size RIP-rel disp64
plus a reference reloc in case the target moves it should work.
If the loader finds the target is different from the default at link time then
the fixed disp64 field is large enough to hold any changed value.
The compiler always emits RIP-rel disp64 data references. If the linker
finds the offset is intra-module then it sets its assigned offset value,
and marks it as compactable in a later linker stage.
If inter-module then sets it to its target's default offset,
marks it as non-compactable and emits a DISP64 reloc entry
in case the target moves.
The loader should have the code and default GOT and ro-data marked
as Copy On Write (COW) during load-reloc so any patches fault into
the page file, then switch the protection to Read-Only-Execute and
Read-Only after it is finished. Pages that do require patches get their
own private code-GOT-ro-data page copies and don't bugger up other
users.
If the address space forks after loading then the children all
inherit the patched view of code-GOT-ro-data pages.
The downside of this approach is that the two PIC modules are bound to each
other at a fixed offset, and unless they relocate together, all the
inter-module references have to be patched, however many there are.
And there are likely more than just two modules involved.
The advantage of the GOT-indirect approach is that only the
one location needs to be patched. Plus it can use a DISP32 offset
to access the GOT. The disadvantage is that you don't discover
that you need GOT-indirect addressing until link time,
and then you need to insert an extra LD with a temp register.
And in general the compiler doesn't know whether it needs a DISP64
or DISP32 offset to access any variable, so it is already dealing
with variable sized references.
The option behind door number 3 is an indirect address mode.
That simplifies the software as the compiler only emits DISP32 offsets,
the linker only needs to set the offset to the GOT entry and flip an
indirect bit (so no extra LD insertion or temp register).
But indirect addressing is an ISA feature that once added cannot be
removed.
It adds hardware complexity in the LSQ which is already
probably the most complex module in the core.
Some of it is hardware to deal with worst-case situations that likely
never occur, like 4-page or cache-line straddles.
The conclusion I've come to is option two when combined with a
compacting linker is best.
On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:
EricP wrote:
MitchAlsup1 wrote:
On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) #
12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
I am well aware of that.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
If such references in the module (exe/dll) are fixed size RIP-rel disp64
plus a reference reloc in case the target moves it should work.
If the loader finds the target is different from the default at link time then
the fixed disp64 field is large enough to hold any changed value.
The compiler always emits RIP-rel disp64 data references. If the linker
finds the offset is intra-module then it sets its assigned offset value,
and marks it as compactable in a later linker stage.
If inter-module then sets it to its target's default offset,
marks it as non-compactable and emits a DISP64 reloc entry
in case the target moves.
The loader should have the code and default GOT and ro-data marked
as Copy On Write (COW) during load-reloc so any patches fault into
the page file, then switch the protection to Read-Only-Execute and
Read-Only after it is finished. Pages that do require patches get their
own private code-GOT-ro-data page copies and don't bugger up other
users.
If the address space forks after loading then the children all
inherit the patched view of code-GOT-ro-data pages.
The downside of this approach is that the two PIC modules are bound to each
other at a fixed offset, and unless they relocate together, all the
inter-module references have to be patched, however many there are.
And there are likely more than just two modules involved.
The advantage of the GOT-indirect approach is that only the
one location needs to be patched. Plus it can use a DISP32 offset
to access the GOT. The disadvantage is that you don't discover
that you need GOT-indirect addressing until link time,
and then you need to insert an extra LD with a temp register.
Not if you have a CALX instruction. You predict GOT access at
compile time, and when the linker resolves an extern, it can
change CALX into CALA by flipping 1 bit making it the same size
as predicted, but now control transfers to the AGEN address.
When the linker does not resolve, ld.so can do this at run time.
And in general the compiler doesn't know whether it needs a DISP64
or DISP32 offset to access any variable, so it is already dealing
with variable sized references.
It is unlikely that the address space is so cluttered that GOT
cannot be placed within ±2GB of IP--and still transfer control
to anywhere in 64-bit VAS.
The option behind door number 3 is an indirect address mode.
BINGO--that is effectively what CALX and CALA are.
That simplifies the software as the compiler only emits DISP32 offsets,
the linker only needs to set the offset to the GOT entry and flip an
indirect bit (so no extra LD insertion or temp register).
But indirect addressing is an ISA feature that once added cannot be
removed.
Note: CALX and CALA are indirect only so far as they load IP
and not any register. Also note: they are CALLs not BRs.
It adds hardware complexity in the LSQ which is already
probably the most complex module in the core.
CALX performs through ICache not DCache.
Some of it is hardware to deal with worst-case situations that likely
never occur, like 4-page or cache-line straddles.
The conclusion I've come to is option two when combined with a
compacting linker is best.
MitchAlsup1 wrote:
<snip>
On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:
EricP wrote:
And in general the compiler doesn't know whether it needs a DISP64
or DISP32 offset to access any variable, so it is already dealing
with variable sized references.
It is unlikely that the address space is so cluttered that GOT
cannot be placed within ±2GB of IP--and still transfer control
to anywhere in 64-bit VAS.
You misunderstand - in the full 64-bit address space (the large memory model)
I want to eliminate the extra address load for intra-module references
so it only indirects through GOT for inter-module references.
To support full 64-bit addresses, the approach chosen was to
turn all program memory references into two, a LD of a disp64 or
an absolute GOT address, then the data access.
That first extra memory load is an unnecessary 64-bit "tax".
Getting rid of this requires the compiler to know the difference between
an "extern" intra-module reference, which can use direct RIP-disp64
addressing,
and a "dllimport" inter-module reference, which must load the absolute
address from the GOT table first, then use that.
But GCC has no "dllimport" attribute for declarations, only MSVC does.
OR it requires the compiler emit a worst-case access sequence for every global variable access, and have the linker edit and compact the code
as it discovers which are "extern" and which are "dllimport" references,
the compacting linker approach.
On Fri, 18 Jul 2025 17:23:04 +0000, EricP wrote:
MitchAlsup1 wrote:
<snip>
On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:
EricP wrote:
And in general the compiler doesn't know whether it needs a DISP64
or DISP32 offset to access any variable, so it is already dealing
with variable sized references.
It is unlikely that the address space is so cluttered that GOT
cannot be placed within ±2GB of IP--and still transfer control
to anywhere in 64-bit VAS.
You misunderstand - in the full 64-bit address space (the large memory model)
I want to eliminate the extra address load for intra-module references
so it only indirects through GOT for inter-module references.
Yes, what we did was to make GOT 32-bit addressable from the current
module to reduce the size of the indirecting LD. But in My 66000
the indirect remains a single instruction instead of AUIPC+LDD.
To support full 64-bit addresses, the approach chosen was to
turn all program memory references into two, a LD of a disp64 or
an absolute GOT address, then the data access.
That first extra memory load is an unnecessary 64-bit "tax".
Agreed; but I would label this as the "extern" tax as it is still
required in dynamically loaded modules in the small (32-bit) model.
Getting rid of this requires the compiler to know the difference between an
"extern" intra-module reference, which can use direct RIP-disp64
addressing,
and a "dllimport" inter-module reference, which must load the absolute
address from the GOT table first, then use that.
But GCC has no "dllimport" attribute for declarations, only MSVC does.
What the compiler/linker pair needs to know is that the variable is
"extern" but will be "resolved" at link time.
OR it requires the compiler emit a worst-case access sequence for every
global variable access, and have the linker edit and compact the code
as it discovers which are "extern" and which are "dllimport" references,
the compacting linker approach.
Compacting is a lot better than expanding.
Yes. Compacting is better as you start with working (functionally correct)
but possibly oversize code and try to make it smaller working code.
Compacting can stop at any point as it is always dealing with working code.
Expanding starts with possibly broken code because a branch, call,
or global ref is out of range or the wrong kind of reference,
then expands each broken item to make it function correctly,
and then deals with all the things that broke because of those expansions,
and so on. Expansion can't stop until all broken items are fixed.
In theory these both deal with the same number of items and should
produce the same optimal result.
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Yes. Compacting is better as you start with working (functionally correct)
but possibly oversize code and try to make it smaller working code.
Compacting can stop at any point as it is always dealing with working code.
Expanding starts with possibly broken code because a branch, call,
or global ref is out of range or the wrong kind of reference,
then expands each broken item to make it function correctly,
and then deals with all the things that broke because of those expansions,
and so on. Expansion can't stop until all broken items are fixed.
In theory these both deal with the same number of items and should
produce the same optimal result.
Which theory is that? In theory the general case (where some things
can need more space as other things need less) is NP-complete (Thomas
G. Szymanski: Assembling Code for Machines with Span-Dependent
Instructions, CACM 21(4), 1978, p. 300-308). Neither "compacting" nor "expanding" is guaranteed to be optimal, but for "expanding" you just
have to make sure that you never compact, whereas with "compacting"
you could compact some things that look compactable, but find that the
result is no longer correct, because an earlier-compacted thing needs
to expand.
But let's rule out the shrinking-this-causes-growth-elsewhere cases,
then the "compacting" approach can be caughtin a steady state where it
sees no opportunity for shrinking, but one or more span-dependent instructions can be compacted. So the "expanding" approach can
produce a smaller result than the "compacting" approach.
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
And what's the problem with that?
Read more about "Assembling Span-Dependent Instructions", and
misconceptions about this topic, at <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>
- anton
Which theory is that? In theory the general case (where some things
can need more space as other things need less) is NP-complete (Thomas
G. Szymanski: Assembling Code for Machines with Span-Dependent
Instructions, CACM 21(4), 1978, p. 300-308).
That paper involves compacting A->B->C branch chains which is NP-complete.
It's been about 40 years since I wrote an assembler that did compacting for the
ROMP, but it started with all A->B branches long, and made passes over the code
compacting what it could until it didn't find any more. It didn't try to handle
branch chains, so compacting never made anything out of range.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Yes. Compacting is better as you start with working (functionally correct)
but possibly oversize code and try to make it smaller working code.
Compacting can stop at any point as it is always dealing with working code.
Expanding starts with possibly broken code because a branch, call,
or global ref is out of range or the wrong kind of reference,
then expands each broken item to make it function correctly,
and then deals with all the things that broke because of those expansions,
and so on. Expansion can't stop until all broken items are fixed.
In theory these both deal with the same number of items and should
produce the same optimal result.
Which theory is that? In theory the general case (where some things
can need more space as other things need less) is NP-complete (Thomas
G. Szymanski: Assembling Code for Machines with Span-Dependent
Instructions, CACM 21(4), 1978, p. 300-308). Neither "compacting" nor "expanding" is guaranteed to be optimal, but for "expanding" you just
have to make sure that you never compact, whereas with "compacting"
you could compact some things that look compactable, but find that the
result is no longer correct, because an earlier-compacted thing needs
to expand.
But let's rule out the shrinking-this-causes-growth-elsewhere cases,
then the "compacting" approach can be caughtin a steady state where it
sees no opportunity for shrinking, but one or more span-dependent instructions can be compacted. So the "expanding" approach can
produce a smaller result than the "compacting" approach.
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
And what's the problem with that?
Read more about "Assembling Span-Dependent Instructions", and
misconceptions about this topic, at <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>
- anton
John Levine <johnl@taugh.com> writes:
That paper involves compacting A->B->C branch chains which is NP-complete.
If you want to say that the paper tries to transform such a chain into
C, then no, that's not the case.
And actually in the case:
A: jbr B
...
B: jbr C
...
C:
there are only /simple expressions/ and the jbrs are non-pathological
in terms of the paper, and the problem of minimizing the size of a
program with only simple expressions and non-pathological
span-dependent instructions is solvable in polynomial time (the paper
gives an algorithm for doing that in section 3).
It's been about 40 years since I wrote an assembler that did compacting for the
ROMP, but it started with all A->B branches long, and made passes over the code
compacting what it could until it didn't find any more. It didn't try to handle
branch chains, so compacting never made anything out of range.
It probably did not deal with non-simple nor with pathological
span-dependent instructions, or it recognized them and always used the
long form for them (theoretically suboptimal, but rarely occurs in
practice), the way that gas does it to this day. Of course, you have
to write code to recognize when a span-dependent instruction has a
non-simple expression or is pathological, which you avoid with the
expanding approach.
Anton Ertl wrote:
For each branch, calculate both the maximum and minimum length across
all intermediate variable-length instructions: If the max is <= shortest
form, use that and lock it down (remove from variable list), if min >=
shortest long span, use that and also remove it from the list.
After a very short number of passes, most code will have settled at or
very near the theoretical optimum.
The compacting approach I was thinking of, which I described a while back,
is sweep over all items, start large, calculate lower and upper possible
address range, shrink a reference when you know it will always work,
iterate until it stops changing, freeze at those sizes.
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
And what's the problem with that?
Given a big job, your expanding linker goes away and never comes back.
In reality, both would likely terminate after 3 or 4 sweeps.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
For each branch, calculate both the maximum and minimum length across
all intermediate variable-length instructions: If the max is <= shortest
form, use that and lock it down (remove from variable list), if min >=
shortest long span, use that and also remove it from the list.
After a very short number of passes, most code will have settled at or
very near the theoretical optimum.
What do you do with those that stay in the variable list?
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
For each branch, calculate both the maximum and minimum length across
all intermediate variable-length instructions: If the max is <= shortest
form, use that and lock it down (remove from variable list), if min >=
shortest long span, use that and also remove it from the list.
After a very short number of passes, most code will have settled at or
very near the theoretical optimum.
What do you do with those that stay in the variable list?
Still don't know if it can use short or long form.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
For each branch, calculate both the maximum and minimum length across
all intermediate variable-length instructions: If the max is <= shortest
form, use that and lock it down (remove from variable list), if min >=
shortest long span, use that and also remove it from the list.
After a very short number of passes, most code will have settled at or
very near the theoretical optimum.
What do you do with those that stay in the variable list?
Still don't know if it can use short or long form.
So what happens if some are never removed from the variable list?
EricP <ThatWouldBeTelling@thevillage.com> writes:
The compacting approach I was thinking of, which I described a while back,
is sweep over all items, start large, calculate lower and upper possible
address range, shrink a reference when you know it will always work,
iterate until it stops changing, freeze at those sizes.
Taking the example from <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>:
foo:
movl foo+133-bar(%rdi),%eax
bar:
what does your approach do? What does "lower and upper possible
address range" mean? How do you know it will always work?
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
And what's the problem with that?
Given a big job, your expanding linker goes away and never comes back.
As someone wrote in this thread:
In reality, both would likely terminate after 3 or 4 sweeps.
Plus the expanding approach does not need to "calculate lower and
upper possible address range",
nor determine whether "it will always work".
If the operand needs too much space, expand (and remember to do
another sweep); that's all.
- anton
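(An editorial sketch in C, not anyone's posted code: the expanding approach
as a monotone fixed-point loop over a toy instruction stream. The sizes,
reach, and instruction stream are invented for illustration: start every
span-dependent branch short, expand whatever does not fit, and re-sweep
until nothing changes; nothing is ever shrunk, so the loop terminates.)
#include <stdbool.h>
#include <stdio.h>

#define N 3

struct branch {
    int at;      /* index of the branch instruction */
    int target;  /* index of its target instruction */
    int size;    /* current encoding: 2 (short) or 6 (long) bytes */
};

/* a toy stream of 8 instructions; branches start in the short form */
static int insn_size[8] = {2, 2, 2, 2, 2, 2, 2, 2};
static struct branch br[N] = { {1, 7, 2}, {3, 0, 2}, {5, 6, 2} };

/* byte address of instruction idx, from the current size estimates */
static int addr_of(int idx)
{
    int a = 0;
    for (int i = 0; i < idx; i++)
        a += insn_size[i];
    return a;
}

int main(void)
{
    bool changed = true;
    while (changed) {                 /* sweep until a fixed point */
        changed = false;
        for (int i = 0; i < N; i++) {
            /* displacement is relative to the next instruction */
            int disp = addr_of(br[i].target) - addr_of(br[i].at + 1);
            /* the toy short form reaches -8..+7 bytes */
            if (br[i].size == 2 && (disp < -8 || disp > 7)) {
                br[i].size = insn_size[br[i].at] = 6;  /* expand; never shrink */
                changed = true;   /* growth may push others out of range */
            }
        }
    }
    for (int i = 0; i < N; i++)
        printf("branch %d: %d bytes\n", i, br[i].size);
    return 0;
}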