Here are the text size numbers:
Debian numbers from <2024Jan4.101941@mips.complang.tuwien.ac.at>:
bash grep gzip
595204 107636 46744 armhf
599832 101102 46898 riscv64
796501 144926 57729 amd64
829776 134784 56868 arm64
853892 152068 61124 i386
891128 158544 68500 armel
892688 168816 64664 s390x
1020720 170736 71088 mips64el
1168104 194900 83332 ppc64el
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
1030288 150686 79852 31492 mvme68k
779393 155764 75795 31813 vax
1302254 171505 83249 35085 amd64
1229032 178332 89180 36876 evbarm-aarch64
1539052 179055 82280 34717 amd64-daily
1374961 184458 96971 37218 i386
1247476 185792 96728 42028 evbarm-earmv7hf
1333952 187452 96328 39472 sparc
1586608 204032 106896 45408 evbppc
1536144 204320 106768 43232 hppa
1397024 216832 109792 48512 sparc64
1538536 222336 107776 44912 evbmips-mips64eb
1623952 243008 122096 50640 evbmips-mipseb
1689920 251376 120672 51168 alpha
2324752 2259984 1378000 ia64
- anton
On Tue, 17 Jun 2025 14:17:42 +0000, Anton Ertl wrote:
Here are the text size numbers:
Debian numbers from <2024Jan4.101941@mips.complang.tuwien.ac.at>:
Can you get numbers for RISC-V without compression, for the above
and for the below?
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
1030288 150686 79852 31492 mvme68k
779393 155764 75795 31813 vax
1302254 171505 83249 35085 amd64
---------
1229032 178332 89180 36876 evbarm-aarch64
1539052 179055 82280 34717 amd64-daily
1374961 184458 96971 37218 i386
1247476 185792 96728 42028 evbarm-earmv7hf
1333952 187452 96328 39472 sparc
1586608 204032 106896 45408 evbppc
1536144 204320 106768 43232 hppa
1397024 216832 109792 48512 sparc64
1538536 222336 107776 44912 evbmips-mips64eb
1623952 243008 122096 50640 evbmips-mipseb
1689920 251376 120672 51168 alpha
This appears to be the region of standard RISC architectures,
about 1.5× VAX.
---------
2324752 2259984 1378000 ia64^
Is there one zero too many on the last entry?
I would be interested in what happens to the code size if "-mcmodel=large"
is used
EricP <ThatWouldBeTelling@thevillage.com> writes:
I would be interested in what happens to the code size if "-mcmodel=large" is used
No code is generated by gcc-10.3.1. Instead, I get an error message:
gcc: error: unrecognized argument in option ‘-mcmodel=large’
gcc: note: valid arguments to ‘-mcmodel=’ are: medany medlow
I'll show numbers for medany in a different posting.
- anton
Anton Ertl wrote:
I'll show numbers for medany in a different posting.
- anton
Ok, thanks.
So IIUC for RV64 "-mcmodel=medany" is 64-bit data size and
a 32-bit program space inside a 64-bit address space.
And programs can be statically linked only.
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
which would blow up their code size and tank their performance.
And that's not a good look for them.
Documentation does say that AArch64 supports it (note that =small is the default):
https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/AArch64-Options.html#AArch64-Options
AArch64 Options
-mcmodel=tiny
Generate code for the tiny code model. The program and its statically
defined symbols must be within 1MB of each other. Programs can be
statically or dynamically linked.
-mcmodel=small
Generate code for the small code model. The program and its statically
defined symbols must be within 4GB of each other. Programs can be
statically or dynamically linked.
This is the default code model.
-mcmodel=large
Generate code for the large code model. This makes no assumptions about
addresses and sizes of sections. Programs can be statically linked only.
For x86-64, -mcmodel=large is also supported (=small is the default).
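For readers who want to reproduce the size effect, a minimal experiment (a sketch; the file and symbol names are illustrative, and it assumes a gcc and binutils for the target):

/* size_probe.c: compare text size across code models, e.g.
     gcc -O2 -mcmodel=small -c size_probe.c   (the default)
     gcc -O2 -mcmodel=large -c size_probe.c
   then compare the "text" column of:  size size_probe.o
   Under =large the compiler must materialize full-width addresses
   for 'table' instead of assuming it lies within reach of the
   program counter. */
extern long table[1000000];

long sum(void)
{
    long s = 0;
    for (int i = 0; i < 1000000; i++)
        s += table[i];
    return s;
}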
EricP <ThatWouldBeTelling@thevillage.com> writes:
So IIUC for RV64 "-mcmodel=medany" is 64-bit data size and
a 32-bit program space inside a 64-bit address space.
And programs can be statically linked only.
medany is 2GB for "a program and its statically defined symbols", and
these 2GB can be anywhere in address space. Dynamically linked
symbols can be further away AFAICT. So unless you do binaries that
are larger than 2GB, this does not appear to be a restriction.
medlow means that the binary must reside in the lower 2GB of the
address space (actually between -2GB and 2GB, but at least on 64-bit
Linux user programs cannot reside at negative addresses). Again,
dynamically linked symbols can be further away. But if both the
executable and the shared libraries are compiled for the medlow model,
they must all fit in the lower 2GB. Probably not a big problem,
either, except maybe for the largest C++ projects.
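For concreteness, a sketch of what the two models cost per symbol reference (the RISC-V assembly in the comments uses the standard %pcrel_hi/%pcrel_lo and %hi/%lo relocation operators; the symbol is illustrative):

extern long g;   /* within +-2GB of the code (medany),
                    or within the low 2GB (medlow) */
long get_g(void)
{
    return g;
    /* medany, PC-relative pair:
         1: auipc a5, %pcrel_hi(g)
            ld    a0, %pcrel_lo(1b)(a5)
       medlow, absolute pair:
            lui   a5, %hi(g)
            ld    a0, %lo(g)(a5)
       Either way two instructions, no load from a constant table. */
}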
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
How do you compute that? When I looked at the code produced for
Alpha, I got the impression that they wanted to support arbitrarily
large programs and they generated such code by default, but IIRC the
typical code for loading an absolute address was by loading it from the
global table of the current function; so it requires typically 1 load
(and a 64-bit value in the global table). It also requires setting up
the global pointer on every function entry and after every call, but
that can be amortized over several accesses to the global table.
Interestingly, when I look at the "DEC Alpha" options, there is
-msmall-data (64KB global tables are enough) and -mlarge-data (data
segment <2GB). There is also -msmall-code (code <4MB) and
-mlarge-code. The gcc manual says about these:
| When '-msmall-data' is used,
| the compiler can assume that all local symbols share the same '$gp'
| value, and thus reduce the number of instructions required for a
| function call from 4 to 1.
I think that these four are:
    ldq $27, ...($gp)     # load target address
    jsr $26, ($27)        # call target
    ldq $gp, offset($26)  # restore gp
and at the target:
target:
    ldq $gp, offset($27)  # load gp
whereas in a small/small variant it would just be
    bsr $26, target
which would blow up their code size and tank their performance.
And that's not a good look for them.
Why burden all programs with the costs of large programs the way it
is done by default on Alpha?
- anton
Anton Ertl wrote:
Why burden all programs with the costs of large programs the way it
is done by default on Alpha?
I'm not saying there shouldn't be optimizations for smaller sizes.
I'm pointing to the fact that to actually USE the 64-bit address space
there is a large increase in code size and execute cost,
and asking if that had to be so.
For example, for Alpha to load a 64-bit constant requires 6
instructions, 24 bytes. That sequence is too large, so they are pretty
much forced to use an extra LDQ to pull the offset from the constant
table located just prior to the routine entry point, which requires an
extra BAL to copy the RIP into a register as a base.
The LDQ touches the same address space as the code but now as data,
so it has to load the D-TLB with an entry redundant with the I-TLB,
and bring in a data cache line with the constants.
And after the constant is loaded it must be manually added to the base
because there is no LD/ST combined with a scaled index.
Furthermore, the actual load or store of the target value is serially
dependent on the LDQ offset and the ADD. Back when the load-to-use
latency for a cache hit was 1 clock that might look ok, but now that
it is 3 or 4 clocks it is a serious penalty.
By making it a priority for relatively cheap access to the full 64-bit
address space during an ISA design, what alternatives might have
minimized its extra cost?
First, I have two designs which load a 64-bit constant in 3 32-bit
fixed-length instructions: the prefix CONST approach, and another
using 3 opcodes that requires a temp register. In both cases the
constants can easily be fused in Decode and have zero execute cost.
My preference is for the prefix CONST as it can be used with many
other instructions besides LD and ST and doesn't require an extra
temp register.
Second, if the base register of LD, ST, or LDA (Load Address) is R31,
the zero register, then it means use the PC as base, and the extra BAL
is almost always unnecessary.
Third, recognize that whether the const offset is loaded by instructions
or from a constant table by an extra LDQ, it will be adding that offset
to the base a lot so have LD, ST, LDA with a scaled-index address mode
and eliminate the extra ADD. The prefix-CONST approach doesn't require
this because it fuses the immediate directly onto its consumer in
Decode. But have the scaled index address mode anyway.
Fourth, have a compacting linker so the programmer doesn't need to
specify a code model. The compiler emits a worst-case sequence and
the linker gets rid of all the ones it doesn't need.
So there are three alternatives to accessing full 64-bit addresses.
The CONST prefix requires 3 instructions to access the target data
with no temp register and zero execute cost if fused in Decode.
CONST value
CONST value
LDx rDst, [r31+offset]
The separate const instruction approach requires 4 instructions,
a temp register, and a scaled-index address mode,
but has no execute cost if fused in Decode
CONST1 rTemp=value
CONST2 rTemp=value
CONST2 rTemp=value
LDx rDst, [r31+rTemp<<0]
The load from constant table requires 2 instructions,
requires a temp register and scaled index address mode,
eliminates the BAL and ADD, but takes an extra data memory access.
LDQ rTemp, [r31+offset]
LDx rDst, [r31+rTemp<<0]
Irrespective of which way one chooses, the compacting linker gets
rid of any unneeded excess.
On Sun, 22 Jun 2025 14:05:38 +0000, EricP wrote:
Irrespective of which way one chooses, the compacting linker gets
rid of any unneeded excess.
Consider::
extern int64_t a,b,c,d;
and some code:
{ d = a+b+c; }
The compiler has to assume that a is in a different module than b or
c or d; and has to generate::
mitchalsup@aol.com (MitchAlsup1) writes:
Why does the compiler need to assume anything?
It simply issues
Load Register Ra from "a symbol table reference for a"
Load register Rb from "a symbol table reference for b"
etc.
The linker determines that the address refers to a
symbol in a shared object (or a different object file
included in the link) and generates the appropriate
code (GOT reference, PC-relative or absolute as
necessary).
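A sketch of how that plays out for Mitch's example (the relocation and relaxation shown in the comments are the usual x86-64 ELF ones; other targets differ in detail):

#include <stdint.h>

extern int64_t a, b, c, d;   /* definitions may be in any module */

void sum(void)
{
    d = a + b + c;
    /* For each symbol the compiler emits one relocatable access, e.g.
         movq a@GOTPCREL(%rip), %rax    # R_X86_64_REX_GOTPCRELX
         movq (%rax), %rax
       If the linker resolves the symbol within the link unit, it may
       relax the GOT load into a direct  leaq a(%rip), %rax  - so the
       compiler never had to assume where 'a' lives. */
}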
On Sun, 22 Jun 2025 20:26:41 +0000, Scott Lurndal wrote:
The linker determines that the address refers to a
symbol in a shared object (or a different object file
included in the link) and generates the appropriate
code (GOT reference, PC-relative or absolute as
necessary).
The externs are in a dynamically loaded module, and the .text
section/segment is PIC. So, ld.so is not allowed to write over
the current displacement.
mitchalsup@aol.com (MitchAlsup1) writes:
The externs are in a dynamically loaded module, and the .text
section/segment is PIC. So, ld.so is not allowed to write over
the current displacement.
The static linker does the code transformation, not the
run-time dynamic linker (which just updates the PLT/GOT).
Here is an illustrative example:
This C++ code invokes the 'dlsym' function, which is hosted
in a dynamically linked shared object (libdl.so):
sym = (get_dlp_t)dlsym(handle, "get_dlp");
if (sym == NULL) {
lp->log("Invalid DLP shared object format: %s\n", dlerror());
unregister_handle(channel);
dlclose(handle);
return 1;
}
===========================================
g++ generates this assembler code:
...
movq %rax, 784(%r13,%rbx,8)
.L14:
.LBE39:
.LBE38:
.loc 2 118 0
movl $.LC7, %esi
movq %r14, %rdi
call dlsym
.LVL19:
.loc 2 119 0
testq %rax, %rax
je .L24
.....
===========================================
The linker (ld command) generated the following trampoline and
corresponding trampoline invocation.
000000000040ccc0 <dlsym@plt>:
40ccc0: ff 25 12 56 23 00 jmpq *0x235612(%rip) # 6422d8 <_GLOBAL_OFFSET_TABLE_+0x2d8>
40ccc6: 68 58 00 00 00 pushq $0x58
40cccb: e9 60 fa ff ff jmpq 40c730 <_init+0x20>
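(For readers following the dump: the first jmpq goes through a GOT slot that initially points back at the following pushq; that pushes the relocation index 0x58 and jumps to the common stub at _init+0x20, which enters the dynamic linker's resolver. The resolver patches the GOT slot, so every later call jumps straight to the real dlsym.)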
....
412686: 0f 84 89 01 00 00 je 412815 <c_mp::channel(int, char const**, c_logger*)+0x255>
41268c: 48 83 fb 63 cmp $0x63,%rbx
412690: 77 08 ja 41269a <c_mp::channel(int, char const**, c_logger*)+0xda>
412692: 49 89 84 dd 10 03 00 mov %rax,0x310(%r13,%rbx,8)
412699: 00
41269a: be 0d 66 43 00 mov $0x43660d,%esi
41269f: 4c 89 f7 mov %r14,%rdi
4126a2: e8 19 a6 ff ff callq 40ccc0 <dlsym@plt>
4126a7: 48 85 c0 test %rax,%rax
Scott Lurndal wrote:
The static linker does the code transformation, not the
run-time dynamic linker (which just updates the PLT/GOT).
Here is an illustrative example:
This illustrates the PLT jump table usage, but Mitch's question was,
I think, with regard to variables exported by a shared library,
say errno exported by the CRTLIB C runtime library,
versus extern variables in the main link module.
How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs to load the GOT entry and
then the value, versus a direct PC-rel memref for extern a, b, c?
EricP <ThatWouldBeTelling@thevillage.com> writes:
How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs to load the GOT entry and
then the value, versus a direct PC-rel memref for extern a, b, c?
The compiler doesn't care. Absent threads, it generates a simple
reference to the errno symbol and lets the linker handle resolving
it.
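In the non-threaded legacy model that simple reference looks like the sketch below (the variable name is illustrative, standing in for the historical extern int errno):

/* Pre-threads model: errno is a plain global. The compiler emits an
   ordinary load of the symbol plus a relocation; whether that becomes
   a PC-relative access or a GOT-indirect one is the linker's call. */
extern int errno_plain;   /* hypothetical stand-in for 'errno' */

int last_error(void)
{
    return errno_plain;
}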
In the thread case:
/usr/include/bits/errno.h:
extern int *__errno_location (void) __THROW __attribute__ ((__const__));
# if !defined _LIBC || defined _LIBC_REENTRANT
/* When using threads, errno is a per-thread value. */
# define errno (*__errno_location ())
# endif
# endif /* !__ASSEMBLER__ */
#endif /* _ERRNO_H */
Scott Lurndal wrote:
In the thread case:
/usr/include/bits/errno.h:
extern int *__errno_location (void) __THROW __attribute__ ((__const__));
# if !defined _LIBC || defined _LIBC_REENTRANT
/* When using threads, errno is a per-thread value. */
# define errno (*__errno_location ())
# endif
# endif /* !__ASSEMBLER__ */
#endif /* _ERRNO_H */
That replaces a memory reference with a function call.
Compiled on godbolt with GCC x86-64 trunk -O3
#include "errno.h"
long GetErrno (void)
{ return errno;
}
"GetErrno()":
sub rsp, 8
call "__errno_location"
movsx rax, DWORD PTR [rax]
add rsp, 8
ret
What I am asking about is below.
Here are two variables: one is in a DLL export and therefore an
inter-module reference that requires an extra MOV to load the address
from the import table (what Linux calls the GOT), and the other is a
regular intra-module variable directly accessed with one PC-rel MOV.
How does GCC import a variable exported from a shared module?
This is the one example where I considered adding an indirect
addressing mode enabled by 1 bit on all LD and ST instructions
as it eliminates the need to emit different code sequences for
intra-module and inter-module accesses.
The compiler always emits a PC-rel address. Later the linker discovers
that it is a reference to a DLL export variable, sets the offset to be
the address in the GOT table, and sets the Indirect bit on the LD/ST.
An Indirect address mode has side effects for the Load Store Queue but
they are mostly the same as if there were two separate instructions.
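A small C model of the two access shapes that the proposed indirect bit would unify (got_slot_for_dll_var is an illustrative name for the linker-filled import entry, not a real symbol):

extern long exe_var;                 /* intra-module variable */
extern long *got_slot_for_dll_var;   /* hypothetical import-table entry */

/* Direct: one PC-relative load, i.e.  ld rDst,[pc+disp] */
long get_intra(void) { return exe_var; }

/* Indirect: load the address from the GOT slot, then the value, i.e.
   ld rT,[pc+disp]; ld rDst,[rT] - exactly the pair a single LD with
   the Indirect bit set would fold into one instruction. */
long get_inter(void) { return *got_slot_for_dll_var; }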
On 6/27/2025 5:33 AM, EricP wrote:
snip
But you are now allowing two cache, tlb or even page misses within one instruction, which complicates things considerably.
Clearly the linker has the freedom to recognize "__errno_location()"
and alter things as necessary. In this case, it is implemented as
a real function call. I can envision an implementation that
replaced the function call with a reference to a thread-specific
variable when compiled and linked with the proper options (e.g. when
linked with -lpthread).
How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs to load the GOT entry and
then the value, versus a direct PC-rel memref for extern a, b, c?
For example, in Microsoft one can mark the DLL export variable with
__declspec(dllexport) int errno;
Thomas Koenig <tkoenig@netcologne.de> writes:
But you are now allowing two cache, tlb or even page misses within one
instruction, which complicates things considerably.
So does RISC-V with compressed instructions. POWER doesn't.
Compressed instructions impose the requirement of up to two cache-line accesses (and, consequently, up to two TLB or cache misses) for
instruction fetch. Instruction sets with fixed-size instructions
indeed do not have this requirement.
Supporting unaligned data accesses imposes the requirement of up to two
cache-line accesses (and, consequently, up to two TLB or cache misses)
for data accesses (including stores). Power has this support, as has
every other modern general-purpose architecture.
Of course, if you added one level of indirection, that would double
the number of potential memory accesses on the data side.
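For instance, a single store like the sketch below may straddle two cache lines, and occasionally two pages, on any architecture that permits unaligned accesses (compilers turn the memcpy into one unaligned store):

#include <stdint.h>
#include <string.h>

/* Store a 64-bit value at an arbitrary byte offset. If p+off is,
   say, 60 bytes into a 64-byte line, the store touches two lines;
   4 bytes before a page boundary, it touches two pages. */
void store_u64(unsigned char *p, size_t off, uint64_t v)
{
    memcpy(p + off, &v, sizeof v);
}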
- anton
On 2025-06-27, Scott Lurndal <scott@slp53.sl.home> wrote:
Clearly the linker has the freedom to recognize "__errno_location()"
and alter things as necessary. In this case, is implemented as
a real function call. I can envision an implementation that
replaced the function call with a reference to a thread-specific variable when
compiled and linked with the proper options (e.g. when linked with
-lpthread).
That is not what linkers are supposed to do (unless you use
link-time optimization, which is a bit of a misnomer).
On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
But you are now allowing two cache, tlb or even page misses within one instruction, which complicates things considerably.
So does RISC-V with compressed instructions. POWER doesn't.
Compressed instructions impose the requirement of up to two cache-line
accesses (and, consequently, up to two TLB or cache misses) for
instruction fetch. Instruction sets with fixed-size instructions
indeed do not have this requirement.
One can allow line crossing compressed (or extended) instructions
while still disallowing page crossing of the same. You just have
to decide what is right for your architecture.
MitchAlsup1 <mitchalsup@aol.com> schrieb:
On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:
Thomas Koenig <tkoenig@netcologne.de> writes:
But you are now allowing two cache, tlb or even page misses within one instruction, which complicates things considerably.
So does RISC-V with compressed instructions. POWER doesn't.
Compressed instructions impose the requirement of up to two cache-line
accesses (and, consequently, up to two TLB or cache misses) for
instruction fetch. Instruction sets with fixed-size instructions
indeed do not have this requirement.
One can allow line crossing compressed (or extended) instructions
while still disallowing page crossing of the same. You just have
to decide what is right for your architecture.
That is difficult to implement in assemblers and linkers.
Unless somebody wants to align each function, or at least each
translation unit, on a page boundary, the linker then would have
to insert NOPs for those rare cases where, after linking, a page
boundary is crossed.
And once you have put in the nop, you need to recheck all branches
if they are still in range, and you have to do a full relocation
on your code, including debug info and everything else.
And bugs in there will occur only rarely, so they will be difficult
to find and debug.
This is indeed possible (almost anything except skiing through a
revolving door), but IMHO this is something to avoid.
According to EricP <ThatWouldBeTelling@thevillage.com>:
extern int *__errno_location (void) __THROW __attribute__ ((__const__));
# if !defined _LIBC || defined _LIBC_REENTRANT
/* When using threads, errno is a per-thread value. */
# define errno (*__errno_location ())
# endif
# endif /* !__ASSEMBLER__ */
#endif /* _ERRNO_H */
That replaces a memory reference with a function call.
Not really. On all of the Unix-like systems I know, errno is a macro
wrapped around a function call that fetches the most recent error in
the current thread, done that way to avoid breaking old programs
written back before threads when errno was an extern int. It's a
peculiar special case and I don't offhand know of anything else like
that.
Here's the ABI manual for amd64 systems:
http://refspecs.linux-foundation.org/elf/x86_64-abi-0.95.pdf
Thomas Koenig wrote:
That is difficult to implement in assemblers and linkers.
Unless somebody wants to align each function, or at least each
translation unit, on a page boundary, the linker then would have
to insert NOPs for those rare cases where, after linking, a page
boundary is crossed.
And once you have put in the nop, you need to recheck all branches
if they are still in range, and you have to do a full relocation
on your code, including debug info and everything else.
And bugs in there will occur only rarely, so they will be difficult
to find and debug.
Compilers, assemblers and linkers have long supported alignment directives.
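For example, with GCC one can page-align a function directly, which buys the no-page-crossing guarantee for any function that fits in a page (a sketch; the 4096 matches a 4KB page size):

/* Force 4KB alignment so no instruction inside crosses a page
   boundary unless the function itself is larger than a page.
   The assembler equivalent is .p2align 12 (or .balign 4096)
   before the function's label. */
void hot_loop(void) __attribute__((aligned(4096)));

void hot_loop(void)
{
    /* ... */
}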
Thomas Koenig wrote:
This is indeed possible (almost anything except skiing through a
revolving door), but IMHO this is something to avoid.
Skiing through a revolving door is in fact possible, as long as you are
using xc gear, since there the bindings allow you to fold the skis up
along your body. You just need the balance to be able to use the tail
ends of your skis as stilts.
Back in uni days (as members of the uni scouts group) we tried all sorts
of funny stuff, including running a very hard obstacle course with xc
skis on.
I do agree that for most people/skiing gear/revolving doors, the
combination is effectively impossible.
Terje
John Levine wrote:
Not really. On all of the Unix-like systems I know, errno is a macro wrapped around a function call that fetches the most recent error in
the current thread, done that way to avoid breaking old programs
written back before threads when errno was an extern int. It's a
peculiar special case and I don't offhand know of anything else like
that.
I was just looking for a shared module export variable but errno was
a poor choice because everyone has replaced it with a function call.
Scott suggested signgam in math.h, but that doesn't exist in
Microsoft's math.h because MS is stuck on C-89, so it is not useful
for comparing MS and GCC.
Here's the ABI manual for amd64 systems:
http://refspecs.linux-foundation.org/elf/x86_64-abi-0.95.pdf
Thanks but I've got a copy v1.0 dated 6-Dec-2022.
That one is a draft v0.95 from 2005.
The issue I have with it is that while it does provide an overview of
the address models, it does not describe how the whole mechanism
works, or how the compiler interacts with the linker and loader.
Also it makes multiple references to x64 instructions that don't
exist in Intel or AMD docs, LEAQ and MOVABS. It does not define them,
does not show any instruction bytes, and does not reference any other
docs that do so.
I tried looking for a manual on GNU Assembler GAS and the only one is
a PDF from 1995, and it only covers the ATT syntax not specific ISA's.
I was unable to find a PDF manual for x86-64, or x64, or AMD64
anywhere. There is a web document at GNU but it has no search function
and does not explain leaq or movabs.
After a few hours of flopping about searching the web I think I have
figured out how it works, specifically how the movabs works with the
loader, and why the MS code for Windows is different from the GCC
code for Linux, and how the MS compiler uses dllimport attribute and
GCC does not.
First about LEAQ and MOVABS.
It seems LEAQ is the LEA Load Effective Address instruction with a
data type attached to it so it knows the operand size and thus the
address mode. This replaces the Intel B/W/D/QWORD PTR nomenclature.
MOVABS is more complicated and actually has two versions,
MOVABS and MOVABSx (where x is a data type b, w, d, or q).
MOVABS (no type) is really Intel "MOV r64, imm64" which loads a 64-bit immediate into a register. MOVABS has nothing to do with absolute
addresses except if the imm64 happens to be a relocatable symbol
value then it can be patched by the loader, as with all such
immediate symbols.
MOVABSx (with type) is really Intel "MOV moffs, rAn" or "MOV rAn,
moffs" where moffs is an 8, 16, 32 or 64-bit offset into a segment
register, and for the default segment registers with a base of 0 that
means the offset is really either a zero extended 32-bit or 64-bit
absolute address, and rAn is registers AL, AX, EAX, RAX (depends on
operand size). MOVABSx is a relocatable absolute address that loads
or stores to/from an "A" register.
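A concrete instance of the no-type form as gas accepts it (the constant is arbitrary; this is the only x86-64 instruction that takes a full 64-bit immediate):

#include <stdint.h>

uint64_t load_imm64(void)
{
    uint64_t v;
    /* AT&T:  movabs $imm64, %r64   <=>   Intel:  mov r64, imm64
       (encoding REX.W + B8+rd + imm64). */
    __asm__("movabs $0x5110de94f393f9ce, %0" : "=r"(v));
    return v;
}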
continuing...
The answer to my original question seems to be that MS always
generates what GCC calls Position Independent Executable enabled with
the -fPIE option, and MS always uses a large memory model whereas GCC
must enable it.
But also, GCC doesn't know whether exeVar is an intra- or inter-module
reference, so it always has to generate a worst-case access for every
program global variable. Because MS knows which global variables are
dllimports, it generates optimal code for intra- (RIP-rel) and
inter- (GOT indirect) module references.
extern long exeVar;
long GetExeVar (void)
{ return exeVar;
}
Compiled with GCC x86-64 15.1 -O3 -fPIE -mcmodel=large
GetExeVar():
.L2:
movabs r11, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_-.L2
lea rax, .L2[rip]
movabs rdx, OFFSET FLAT:exeVar@GOT
add rax, r11
mov rax, QWORD PTR [rax+rdx]
mov rax, QWORD PTR [rax]
ret
(The above GCC code also doesn't look optimal. I don't see why it
fiddles about calculating addresses when it should just use a RIP-rel
load to pull the absolute address of exeVar from the GOT and then
load it, as MS does with its imports table below.)
Compiled with MSVC latest -O3
Intra-module reference:
long GetExeVar(void) PROC                   ; GetExeVar, COMDAT
        mov eax, DWORD PTR long exeVar      ; exeVar
        ret 0
Inter-module reference:
__declspec(dllimport) long dllVar;
long GetDllVar (void)
{ return dllVar;
}
long GetDllVar(void) PROC                   ; GetDllVar, COMDAT
        mov rax, QWORD PTR __imp_long dllVar
        mov eax, DWORD PTR [rax]
        ret 0
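For completeness, the import above pairs with an export on the DLL side; the usual MSVC pattern is one shared header (BUILDING_DLL is an illustrative macro name):

/* dllvar.h - common MSVC idiom for exported data: */
#ifdef BUILDING_DLL
#  define DLLAPI __declspec(dllexport)   /* when compiling the DLL */
#else
#  define DLLAPI __declspec(dllimport)   /* when compiling a consumer */
#endif

DLLAPI extern long dllVar;   /* consumer access goes via __imp_dllVar */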
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist in Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
And since this is the basis for the ABI design for a processor
that runs almost all the desktops and servers on the planet,
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist in
Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
MOVABS imm64, r64 is AT&T syntax for Intel syntax MOV r64, imm64.
It does not define them, does not show
any instruction bytes, and does not reference any other docs that do so.
The gas manual may be the best reference about AT&T syntax.
I tried looking for a manual on GNU Assembler GAS and the only one is
a PDF from 1995, and it only covers the ATT syntax not specific ISA's.
Searching for "GNU as manual" gave me the link to <https://sourceware.org/binutils/docs/> where you can find links to
the gas manual in different formats. And searching for "AT&T" in the
table of contents brings up three subsections of <https://sourceware.org/binutils/docs/as/i386_002dDependent.html>
I was unable to find a PDF manual for x86-64, or x64, or AMD64 anywhere.
The section mentioned above says:
|The i386 version as supports both the original Intel 386 architecture
|in both 16 and 32-bit mode as well as AMD x86-64 architecture
|extending the Intel architecture to 64-bits.
- anton
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist
in Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
Yes, and an LEA instruction which just calculates an address
needs an operand size suffix... why?
Because it is not documented.
EricP <ThatWouldBeTelling@thevillage.com> writes:
And since this is the basis for the ABI design for a processor
that runs almost all the desktops and servers on the planet,
Desktops, yes. That's changing slowly, although now with
the orange clown in charge - foreign government entities are
moving away from software controlled by American companies,
and I expect the desktop migration rate from windows to linux to
increase considerably over the next decade.
Servers, not by a large margin. Linux (and a handful of
proprietary unix and linux servers, including Z-series) provide most
of the servers (excepting on-prem exchange, sharepoint
and AD systems). 60% of compute in Azure, for example,
are linux cores. The ratio is even larger in google (90% linux)
oracle and amazon(90% linux) cloud operations.
In terms of pure server numbers, windows is likely less than
20% globally.
Michael S <already5chosen@yahoo.com> writes:
It seems, you got it backward.
Got what backwards?
The bigger problem with poorly
documented x86-64 AT&T syntax is on Linux.
I've been using x86-64 AT&T syntax since 1989
(e.g. SVR4). I've never considered it poorly documented.
It's not that AT&T syntax is
not used at all on Windows, but it's less dominant here.
Your claim was that windows runs "almost all the servers on
the planet", which is clearly incorrect.
On Tue, 01 Jul 2025 13:18:33 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
The bigger problem with poorly
documented x86-64 AT&T syntax is on Linux.
I've been using x86-64 AT&T syntax since 1989
(e.g. SVR4).
No, you didn't, because x86-64 didn't exist until ~2001.
Your claim was that windows runs "almost all the servers on
the planet", which is clearly incorrect.
First, the claim was not mine.
Second, the claim was not about Windows, but about x86-64.
On 6/18/2025 1:26 AM, Anton Ertl wrote:...
You can, however compare ARM T32 and A32 in the Debian results:
bash grep gzip
595204 107636 46744 armhf ARM T32
599832 101102 46898 riscv64
796501 144926 57729 amd64
829776 134784 56868 arm64
853892 152068 61124 i386
891128 158544 68500 armel ARM A32
892688 168816 64664 s390x
1020720 170736 71088 mips64el
1168104 194900 83332 ppc64el
There may be additional differences between the two ARM 32-bit
builds, however.
What I could do relatively easily is to compile a file from gforth
with different options. The file I used is what is compiled to
engine/main-fast-ll.o
text size   compiler options
20242 -O2
18146 -Os
18146 -Os -march=rv64gc
18444 -Os -march=rv64gc -mcmodel=medany
23092 -Os -march=rv64g
Not super impressed with the 'C' extension, as it is both a pain to
decode and also the code size savings tend to be fairly modest.
IA-64 code density is bad, but one wouldn't expect it to be quite *that*
bad.
Maybe around 3-5x bigger than a RISC with 32-bit instructions.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote: [Alpha]
EricP <ThatWouldBeTelling@thevillage.com> writes:
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
How do you compute that? When I looked at the code produced for
Alpha, I got the impression that they wanted to support arbitrarily
large programs and they generated such code by default, but IIRC the
typical code for loading an absolute addres was by loading it from the
global table of the current function; so it requires typically 1 load
(and a 64-bit value in the global table). It also requires setting up
the global pointer on every function entry and after every call, but
that can be amortized over several accesses to the global table.
Yes, it is (also) using an extra memory load to pick up large immediates.
It also requires a BAL to get the IP into a register.
I have finally gotten around to turning on our working Alpha and
compiled the following program on it:
The load of signgam is achieved with the following sequence
I.e., 2 instructions, not 6....
It's interesting that a, b, x, y (which end up in the same linked unit
as main()) result in code like
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
which could be implemented more efficiently as
ldl t1,-32680(gp)
but apparently the linker just fixes up the first instruction of the
pair (either as ldq or as lda, maybe also as ldah), and maybe the
offset of the second instruction (but not in this example); the
benefit is that the linker just has to replace some instructions, but
it does not have to shrink or expand the code (which would require
changing even more instructions).
We also see that the call to foo() within the same linked unit is
linked as
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
One can see again that both code sequences have the same size, to
avoid shrinking or expanding.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
- anton
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Now with a large constant:
#include <math.h>
extern int a, b;
extern long c;
extern double x, y;
extern void foo(void);
int main()
{
c = 0x5110de94f393f9ceL;
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, c, x, y, and foo()
with gcc -Wall -O -fPIC and then linked the two files with the same
options.
The result on Alpha is:
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 58 85 bd 23 lda gp,-31400(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: fe ff 3d 24 ldah t0,-2(gp)
120000634: f8 7c 41 a4 ldq t1,31992(t0)
120000638: 78 80 3d 20 lda t0,-32648(gp)
12000063c: 00 00 41 b4 stq t1,0(t0)
120000640: 00 00 fe 2f unop
120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
120000648: 00 00 fe 2f unop
12000064c: 00 00 fe 2f unop
120000650: 68 80 3d 20 lda t0,-32664(gp)
120000654: 00 00 01 8e ldt $f16,0(t0)
120000658: 08 80 7d a7 ldq t12,-32760(gp)
12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
120000660: 02 00 ba 27 ldah gp,2(ra)
120000664: 18 85 bd 23 lda gp,-31464(gp)
120000668: 60 80 3d 20 lda t0,-32672(gp)
12000066c: 00 00 01 9c stt $f0,0(t0)
120000670: 58 80 7d 20 lda t2,-32680(gp)
120000674: 00 00 43 a0 ldl t1,0(t2)
120000678: 10 80 3d a4 ldq t0,-32752(gp)
12000067c: 00 00 21 a0 ldl t0,0(t0)
120000680: 02 04 41 40 addq t1,t0,t1
120000684: 5c 80 3d 20 lda t0,-32676(gp)
120000688: 00 00 21 a0 ldl t0,0(t0)
12000068c: 02 04 41 40 addq t1,t0,t1
120000690: 00 00 43 b0 stl t1,0(t2)
120000694: 00 04 ff 47 clr v0
120000698: 00 00 5e a7 ldq ra,0(sp)
12000069c: 10 00 de 23 lda sp,16(sp)
1200006a0: 01 80 fa 6b ret
Here's the output for RISC-V with -mcmodel=medany:
0000000000010570 <main>:
10570: 1141 addi sp,sp,-16
10572: e406 sd ra,8(sp)
10574: 00002797 auipc a5,0x2
10578: ab47b783 ld a5,-1356(a5) # 12028 <__SDATA_BEGIN__>
1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
10580: 02c000ef jal ra,105ac <foo>
10584: 8401b507 fld fa0,-1984(gp) # 12040 <y>
10588: f29ff0ef jal ra,104b0 <lgamma@plt>
1058c: 84a1b427 fsd fa0,-1976(gp) # 12048 <x>
10590: 85418713 addi a4,gp,-1964 # 12054 <a>
10594: 431c lw a5,0(a4)
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
1059a: 9fb5 addw a5,a5,a3
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
105a0: 9fb5 addw a5,a5,a3
105a2: c31c sw a5,0(a4)
105a4: 4501 li a0,0
105a6: 60a2 ld ra,8(sp)
105a8: 0141 addi sp,sp,16
105aa: 8082 ret
- anton
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b> >>>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
That's not necessarily the case - so long as the function signatures/API
don't change, new versions of the shared library will be backward compatible
with applications linked against earlier versions. So the data section
requirements for the shared library could change after the application is
linked if new data section symbols are defined in the newer version of the
shared library.
The run-time loader will know how much data space has been allocated
to the executable itself, and will append (and relocate corresponding
references in the shared object and executable) the '.data' sections from
each dynamic library loaded by the application at the time the library is
loaded - which may be at startup or via dlopen().
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Now with a large constant:
#include <math.h>
extern int a, b;
extern long c;
extern double x, y;
extern void foo(void);
int main()
{
c = 0x5110de94f393f9ceL;
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, c, x, y, and foo()
with gcc -Wall -O -fPIC and then linked the two files with the same
options.
The result on Alpha is:
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 58 85 bd 23 lda gp,-31400(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: fe ff 3d 24 ldah t0,-2(gp)
120000634: f8 7c 41 a4 ldq t1,31992(t0)
120000638: 78 80 3d 20 lda t0,-32648(gp)
12000063c: 00 00 41 b4 stq t1,0(t0)
120000640: 00 00 fe 2f unop
120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
120000648: 00 00 fe 2f unop
12000064c: 00 00 fe 2f unop
120000650: 68 80 3d 20 lda t0,-32664(gp)
120000654: 00 00 01 8e ldt $f16,0(t0)
120000658: 08 80 7d a7 ldq t12,-32760(gp)
12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
120000660: 02 00 ba 27 ldah gp,2(ra)
120000664: 18 85 bd 23 lda gp,-31464(gp)
120000668: 60 80 3d 20 lda t0,-32672(gp)
12000066c: 00 00 01 9c stt $f0,0(t0)
120000670: 58 80 7d 20 lda t2,-32680(gp)
120000674: 00 00 43 a0 ldl t1,0(t2)
120000678: 10 80 3d a4 ldq t0,-32752(gp)
12000067c: 00 00 21 a0 ldl t0,0(t0)
120000680: 02 04 41 40 addq t1,t0,t1
120000684: 5c 80 3d 20 lda t0,-32676(gp)
120000688: 00 00 21 a0 ldl t0,0(t0)
12000068c: 02 04 41 40 addq t1,t0,t1
120000690: 00 00 43 b0 stl t1,0(t2)
120000694: 00 04 ff 47 clr v0
120000698: 00 00 5e a7 ldq ra,0(sp)
12000069c: 10 00 de 23 lda sp,16(sp)
1200006a0: 01 80 fa 6b ret
Here's the output for RISC-V with -mcmodel=medany:
0000000000010570 <main>:
10570: 1141 addi sp,sp,-16
10572: e406 sd ra,8(sp)
10574: 00002797 auipc a5,0x2
10578: ab47b783 ld a5,-1356(a5) # 12028
<__SDATA_BEGIN__>
1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
10580: 02c000ef jal ra,105ac <foo>
10584: 8401b507 fld fa0,-1984(gp) # 12040
<y>
10588: f29ff0ef jal ra,104b0 <lgamma@plt>
1058c: 84a1b427 fsd fa0,-1976(gp) # 12048
<x>
10590: 85418713 addi a4,gp,-1964 # 12054 <a>
10594: 431c lw a5,0(a4)
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
1059a: 9fb5 addw a5,a5,a3
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
105a0: 9fb5 addw a5,a5,a3
105a2: c31c sw a5,0(a4)
105a4: 4501 li a0,0
105a6: 60a2 ld ra,8(sp)
105a8: 0141 addi sp,sp,16
105aa: 8082 ret
My 66000::
main: ; @main
enter r0,r0,0,0
std #5841413448022620622,[ip,c]
call foo
ldd r1,[ip,y]
call lgamma
std r1,[ip,x]
call __signgam
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Now with a large constant:
#include <math.h>
extern int a, b;
extern long c;
extern double x, y;
extern void foo(void);
int main()
{
c = 0x5110de94f393f9ceL;
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, c, x, y, and foo()
with gcc -Wall -O -fPIC and then linked the two files with the same
options.
The result on Alpha is:
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 58 85 bd 23 lda gp,-31400(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: fe ff 3d 24 ldah t0,-2(gp)
120000634: f8 7c 41 a4 ldq t1,31992(t0)
120000638: 78 80 3d 20 lda t0,-32648(gp)
12000063c: 00 00 41 b4 stq t1,0(t0)
120000640: 00 00 fe 2f unop
120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
120000648: 00 00 fe 2f unop
12000064c: 00 00 fe 2f unop
120000650: 68 80 3d 20 lda t0,-32664(gp)
120000654: 00 00 01 8e ldt $f16,0(t0)
120000658: 08 80 7d a7 ldq t12,-32760(gp)
12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
120000660: 02 00 ba 27 ldah gp,2(ra)
120000664: 18 85 bd 23 lda gp,-31464(gp)
120000668: 60 80 3d 20 lda t0,-32672(gp)
12000066c: 00 00 01 9c stt $f0,0(t0)
120000670: 58 80 7d 20 lda t2,-32680(gp)
120000674: 00 00 43 a0 ldl t1,0(t2)
120000678: 10 80 3d a4 ldq t0,-32752(gp)
12000067c: 00 00 21 a0 ldl t0,0(t0)
120000680: 02 04 41 40 addq t1,t0,t1
120000684: 5c 80 3d 20 lda t0,-32676(gp)
120000688: 00 00 21 a0 ldl t0,0(t0)
12000068c: 02 04 41 40 addq t1,t0,t1
120000690: 00 00 43 b0 stl t1,0(t2)
120000694: 00 04 ff 47 clr v0
120000698: 00 00 5e a7 ldq ra,0(sp)
12000069c: 10 00 de 23 lda sp,16(sp)
1200006a0: 01 80 fa 6b ret
Here's the output for RISC-V with -mcmodel=medany:
0000000000010570 <main>:
10570: 1141 addi sp,sp,-16
10572: e406 sd ra,8(sp)
10574: 00002797 auipc a5,0x2
10578: ab47b783 ld a5,-1356(a5) # 12028
<__SDATA_BEGIN__>
1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
10580: 02c000ef jal ra,105ac <foo>
10584: 8401b507 fld fa0,-1984(gp) # 12040
<y>
10588: f29ff0ef jal ra,104b0 <lgamma@plt>
1058c: 84a1b427 fsd fa0,-1976(gp) # 12048
<x>
10590: 85418713 addi a4,gp,-1964 # 12054 <a>
10594: 431c lw a5,0(a4)
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
1059a: 9fb5 addw a5,a5,a3
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
105a0: 9fb5 addw a5,a5,a3
105a2: c31c sw a5,0(a4)
105a4: 4501 li a0,0
105a6: 60a2 ld ra,8(sp)
105a8: 0141 addi sp,sp,16
105aa: 8082 ret
My 66000::
main: ; @main
enter r0,r0,0,0
std #5841413448022620622,[ip,c]
call foo
ldd r1,[ip,y]
call lgamma
std r1,[ip,x]
call __signgam
__signgam is an "int" variable in the shared library, not a function.
What is the purpose of 'call' here?
IBM's source code for signgam is:
#define _XOPEN_SOURCE
#include <math.h>
int *__signgam(void);
#define signgam (*__signgam())
Which is what we fed into the compiler.
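(An editorial illustration, not from the posts: with such a macro, every read
of signgam preprocesses into a dereferenced function call, which would explain
a compiler emitting one.)
/* assuming the declarations quoted above */
int *__signgam(void);
#define signgam (*__signgam())
extern int a;
void f(void)
{
    a += signgam;   /* expands to: a += (*__signgam()); */
}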
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
On Mon, 30 Jun 2025 16:51:13 +0000, EricP wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist in
Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
Yes, and an LEA instruction which just calculates an address
needs an operand size suffix... why?
LEA needs to distinguish between::
LEA Rd,[Rb+DISP16]
LEA Rd,[Rb+Ri<<s]
LEA Rd,[Rb+DISP32]
LEA Rd,[Rb+DISP64]
LEA Rd,[Rb+Ri<<s+DISP32]
LEA Rd,[Rb+Ri<<s+DISP64]
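(An editorial aside, GNU C on x86-64, not from the thread: in AT&T syntax the
"q" suffix on leaq names the 64-bit operand (destination) size rather than the
address size; the function below is invented for illustration.)
#include <stdio.h>
/* computes base + index*4 + 8 in one leaq */
static long lea_example(long base, long index)
{
    long r;
    __asm__("leaq 8(%1,%2,4), %0" : "=r"(r) : "r"(base), "r"(index));
    return r;
}
int main(void)
{
    printf("%ld\n", lea_example(100, 3));  /* prints 120 */
    return 0;
}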
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
[Alpha]
EricP <ThatWouldBeTelling@thevillage.com> writes:
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
How do you compute that? When I looked at the code produced for
Alpha, I got the impression that they wanted to support arbitrarily
large programs and they generated such code by default, but IIRC the
typical code for loading an absolute address was by loading it from the
global table of the current function; so it requires typically 1 load
(and a 64-bit value in the global table). It also requires setting up
the global pointer on every function entry and after every call, but
that can be amortized over several accesses to the global table.
Yes, it is (also) using an extra memory load to pick up large immediates.
It also requires a BAL to get the IP into a register.
I have finally gotten around to turning on our working Alpha and
compiled the following program on it:
#include <math.h>
extern int a, b;
extern double x, y;
extern void foo(void);
int main()
{
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, x, y, and foo with
gcc -Wall -O -fPIC and then linked the two files with the same options.
The result on Alpha is:
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 48 85 bd 23 lda gp,-31416(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
120000640: 68 80 3d 20 lda t0,-32664(gp)
120000644: 00 00 01 8e ldt $f16,0(t0)
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
120000658: 60 80 3d 20 lda t0,-32672(gp)
12000065c: 00 00 01 9c stt $f0,0(t0)
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
120000668: 10 80 3d a4 ldq t0,-32752(gp)
12000066c: 00 00 21 a0 ldl t0,0(t0)
120000670: 02 04 41 40 addq t1,t0,t1
120000674: 5c 80 3d 20 lda t0,-32676(gp)
120000678: 00 00 21 a0 ldl t0,0(t0)
12000067c: 02 04 41 40 addq t1,t0,t1
120000680: 00 00 43 b0 stl t1,0(t2)
120000684: 00 04 ff 47 clr v0
120000688: 00 00 5e a7 ldq ra,0(sp)
12000068c: 10 00 de 23 lda sp,16(sp)
120000690: 01 80 fa 6b ret
The load of signgam is achieved with the following sequence
120000668: 10 80 3d a4 ldq t0,-32752(gp)
12000066c: 00 00 21 a0 ldl t0,0(t0)
I.e., 2 instructions, not 6.
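(An editorial sketch in C, not from the thread: the two-instruction pattern
amounts to one load that fetches the symbol's 64-bit absolute address from a
table addressed off gp, then the access to the data itself; the names here
are invented.)
#include <stdio.h>
int far_away = 7;                             /* stands in for signgam */
static void *global_table[] = { &far_away };  /* 64-bit addresses, one per symbol */
int main(void)
{
    void **gp = global_table;   /* gp is set up on function entry */
    int *addr = gp[0];          /* "ldq": fetch the address via gp */
    printf("%d\n", *addr);      /* "ldl": fetch the value itself */
    return 0;
}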
It's interesting that a, b, x, y (which end up in the same linked unit
as main()) result in code like
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
which could be implemented more efficiently as
ldl t1,-32680(gp)
but apparently the linker just fixes up the first instruction of the
pair (either as ldq or as lda, maybe also as ldah), and maybe the
offset of the second instruction (but not in this example); the
benefit is that the linker just has to replace some instructions, but
it does not have to shrink or expand the code (which would require
changing even more instructions).
We also see that the call to foo() within the same linked unit is
linked as
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
One can see again that both code sequences have the same size, to
avoid shrinking or expanding.
The gcc manual says about these:
| When '-msmall-data' is used,
| the compiler can assume that all local symbols share the same '$gp'
| value, and thus reduce the number of instructions required for a
| function call from 4 to 1.
We see the 4 and 1 instructions above, but it's not clear to me that
there is a real benefit. The compiler cannot assume that an external reference is local, and the linker knows, but does not benefit from
it. And for references within a compilation unit, I would hope that
the compiler/assembler manages to use the smallest variant based on
actual size.
....Why burden all programs with the costs of large programs the way it
is done by default on Alpha?
I'm not saying there shouldn't be optimizations for smaller sizes.
I'm pointing to the fact that to actually USE the 64-bit address space
there is a large increase in code size and execute cost,
and asking if that had to be so.
I can use 100 GB arrays with code that is the same size as code that
limits itself to the lower 2GB of address space (there is an option on
Alpha compilers and linkers for that).
For example, for Alpha to load a 64-bit constant requires 6 instructions,
24 bytes.
I forgot to add this to the program, maybe tomorrow.
That sequence is too large so they are pretty much forced
to use an extra LDQ to pull the offset from the constant table
located just prior to the routine entry point and requires an extra
BAL to copy the RIP into a register as a base.
That's not necessary. The global pointer is derived from the function address (in t12) on entry to the function and from the return address
(in ra) after a jsr.
The LDQ touches the same address space as the code but now as data
so it has to load the D-TLB with an entry redundant with I-TBL,
and bring in a data cache line with the constants.
No, the global table is elsewhere; in the case above it's about
96KB behind the start of main(). The text ends a few hundred bytes
later, so there is no page that contains both code and data (i.e., no
TLB entries that describe the same page, not that this would be a
problem).
And after the constant is loaded it must be manually added to the base
because there is no LD/ST combined with a scaled index.
Which base? You were only mentioning constants up to now.
Furthermore, the actual load or store of the target value is serially
dependent on LDQ offset and the ADD. Back when the load-to-use latency
for a cache hit was 1 clock that might look ok, but now that it is 3 or 4
clocks it is a serious penalty.
This century, you use an OoO CPU (even the last Alpha was OoO), and
the 4-5 clocks latency of a load are added to the ready time of the
base address, i.e., gp in this case. gp only changes on far calls, so loading from a gp-relative address is rarely in the critical
dependence path.
By making it a priority for relatively cheap access to the full 64-bit
address space during an ISA design, what alternatives might have minimized >> its extra cost?
It would certainly be an interesting experiment to see how much size
and speed difference we would get if we eliminated the "mov r64,
imm64" instruction when compiling to AMD64 and used Alpha-like
techniques instead. My guess is: barely measurable.
- anton
MitchAlsup1 wrote:
On Mon, 30 Jun 2025 16:51:13 +0000, EricP wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
[System V ABI for AMD64]
Also it makes multiple references to x64 instructions that don't exist in
Intel or AMD docs, LEAQ and MOVABS.
LEAQ is AT&T syntax for LEA with a "quadword" operand.
Yes, and an LEA instruction which just calculates an address
needs an operand size suffix... why?
LEA needs to distinguish between::
LEA Rd,[Rb+DISP16]
LEA Rd,[Rb+Ri<<s]
LEA Rd,[Rb+DISP32]
LEA Rd,[Rb+DISP64]
LEA Rd,[Rb+Ri<<s+DISP32]
LEA Rd,[Rb+Ri<<s+DISP64]
In 64-bit mode there is no disp64, just disp8 and disp32.
There were no spare bits in the ModRM byte to indicate it.
Had there been disp64 then x64 could have had a smooth expansion
of address calculations into 64-bit space.
AMD worked around the ModRM limitation to provide at least some way to
access all of 64-bit address space. It did so by adding MOV opcodes,
to load an imm64, and to LD/ST memory using abs64 addresses.
And since those MOV's are different opcodes and not part of ModRM,
LEA does not know about those 64-bit imm64 or abs64 values and cannot
use them in general address calculations.
And all of the various compiler code models and their addressing
limitations follow from these discontinuities in addressing behavior.
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Anton Ertl wrote:
[Alpha]
EricP <ThatWouldBeTelling@thevillage.com> writes:
I suspect this is because almost every code or data address would
require a 6 instruction sequence to load it into a register for use,
How do you compute that? When I looked at the code produced for
Alpha, I got the impression that they wanted to support arbitrarily
large programs and they generated such code by default, but IIRC the
typical code for loading an absolute address was by loading it from the
global table of the current function; so it requires typically 1 load
(and a 64-bit value in the global table). It also requires setting up
the global pointer on every function entry and after every call, but
that can be amortized over several accesses to the global table.
Yes, it is (also) using an extra memory load to pick up large immediates.
It also requires a BAL to get the IP into a register.
I have finally gotten around to turning on our working Alpha and
compiled the following program on it:
#include <math.h>
extern int a, b;
extern double x, y;
extern void foo(void);
int main()
{
foo();
x = lgamma(y);
a += signgam;
a += b;
return 0;
}
I have compiled this and the file containing a, b, x, y, and foo with
gcc -Wall -O -fPIC and then linked the two files with the same options.
The result on Alpha is:
GCC Alpha manual says that the limit with -mlarge-data (the default)
is 2 GB of data. Larger data must use mmap or malloc.
-mlarge-text (the default) is 4 MB code.
GCC has no ability to generate access to Alpha's full 64-bit address space,
so there is no comparison with other ISAs.
Perhaps using A64 or RV64 would be better examples.
0000000120000620 <main>:
120000620: 02 00 bb 27 ldah gp,2(t12)
120000624: 48 85 bd 23 lda gp,-31416(gp)
120000628: f0 ff de 23 lda sp,-16(sp)
12000062c: 00 00 5e b7 stq ra,0(sp)
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
120000640: 68 80 3d 20 lda t0,-32664(gp)
120000644: 00 00 01 8e ldt $f16,0(t0)
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
120000658: 60 80 3d 20 lda t0,-32672(gp)
12000065c: 00 00 01 9c stt $f0,0(t0)
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
120000668: 10 80 3d a4 ldq t0,-32752(gp)
12000066c: 00 00 21 a0 ldl t0,0(t0)
120000670: 02 04 41 40 addq t1,t0,t1
120000674: 5c 80 3d 20 lda t0,-32676(gp)
120000678: 00 00 21 a0 ldl t0,0(t0)
12000067c: 02 04 41 40 addq t1,t0,t1
120000680: 00 00 43 b0 stl t1,0(t2)
120000684: 00 04 ff 47 clr v0
120000688: 00 00 5e a7 ldq ra,0(sp)
12000068c: 10 00 de 23 lda sp,16(sp)
120000690: 01 80 fa 6b ret
The load of signgam is achieved with the following sequence
120000668: 10 80 3d a4 ldq t0,-32752(gp)
12000066c: 00 00 21 a0 ldl t0,0(t0)
I.e., 2 instructions, not 6.
Yes, that is the same GOT table indirect two load sequence as x64.
I wasn't saying it had to use 6 instructions. I'm saying that if Alpha
wanted full access to its 64-bit address space, then its options
are either to use 6 instructions to build a 64-bit immediate
OR do two loads (maybe plus other overhead). Both those options are poor.
RV64 and A64 are in a similar boat.
I wanted an ISA option that doesn't need two dependent LD's or
6 instructions to access the whole 64-bit address space.
It's interesting that a, b, x, y (which end up in the same linked unit
as main()) result in code like
120000660: 58 80 7d 20 lda t2,-32680(gp)
120000664: 00 00 43 a0 ldl t1,0(t2)
which could be implemented more efficiently as
ldl t1,-32680(gp)
but apparently the linker just fixes up the first instruction of the
pair (either as ldq or as lda, maybe also as ldah), and maybe the
offset of the second instruction (but not in this example); the
benefit is that the linker just has to replace some instructions, but
it does not have to shrink or expand the code (which would require
changing even more instructions).
Nop'ing the first and using the second would be better because it doesn't
use a temp register and hardware can optimize a unop away.
We also see that the call to foo() within the same linked unit is
linked as
120000630: 00 00 fe 2f unop
120000634: 17 00 40 d3 bsr ra,120000694 <foo>
120000638: 00 00 fe 2f unop
12000063c: 00 00 fe 2f unop
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
One can see again that both code sequences have the same size, to
avoid shrinking or expanding.
I wondered about those nop's.
I would be cautious about depending on code expansion for optimization.
I read some online remarks about a developer adding "relaxation" to the
RV64 linker and causing it to take over an hour to run.
It is possible that the basic relaxation algorithm is factorial, O(n!)
(it has that smell to me).
https://sourceware.org/binutils/docs/as/Xtensa-Relaxation.html#index-relaxation
Always starting with largest size and compacting down may be best.
That's why I was investigating the compacting linker algorithm.
If my ISA is going to depend on it for optimization, I want to
make sure it could be implemented easily and would not have
uncontrollable pathological behavior. And it does look acceptable.
The gcc manual says about these:
| When '-msmall-data' is used,
| the compiler can assume that all local symbols share the same '$gp'
| value, and thus reduce the number of instructions required for a
| function call from 4 to 1.
We see the 4 and 1 instructions above, but it's not clear to me that
there is a real benefit. The compiler cannot assume that an external
reference is local, and the linker knows, but does not benefit from
it. And for references within a compilation unit, I would hope that
the compiler/assembler manages to use the smallest variant based on
actual size.
Yes but they also have their 2 GB limit which avoids all large
address space 'issues' (i.e., fobs them off onto the programmer).
The question is what happens to the Alpha code (or RV64 or A64) when
you remove the address space compile limits and go for the Full Monty.
....Why burden all programs with the costs of large programs the way it
is done by default on Alpha?
I'm not saying there shouldn't be optimizations for smaller sizes.
I'm pointing to the fact that to actually USE the 64-bit address space
there is a large increase in code size and execute cost,
and asking if that had to be so.
I can use 100 GB arrays with code that is the same size as code that
limits itself to the lower 2GB of address space (there is an option on
Alpha compilers and linkers for that).
100 GB is 37 bits of address. Where does that 37 number come from?
And the manual says the Alpha data limit is 2 GB.
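(Editorial arithmetic note: 2^36 bytes is about 69 GB and 2^37 bytes is about
137 GB, so indexing a 100 GB array takes 37 address bits.)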
For example, for Alpha to load a 64-bit constant requires 6 instructions,
24 bytes.
I forgot to add this to the program, maybe tomorrow.
Or pull it from the constant table just prior to the routine entry.
Long ago I read that the Alpha code standard puts constants into a table just
before the routine entry, does a BAL rTmp,+0 to copy the PC into rTmp,
then can access the constants in the table at negative rTmp offsets.
That sequence is too large so they are pretty much forced
to use an extra LDQ to pull the offset from the constant table
located just prior to the routine entry point and requires an extra
BAL to copy the RIP into a register as a base.
That's not necessary. The global pointer is derived from the function
address (in t12) on entry to the function and from the return address
(in ra) after a jsr.
Yes, I see that now.
t12 is the link register specified by the caller in its JAL to the above code.
That saves it a BAL rTmp,+0 to copy the PC as a PC-rel base.
- anton
mitchalsup@aol.com (MitchAlsup1) writes:
On Wed, 2 Jul 2025 15:20:09 +0000, EricP wrote:
<snip>
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
This 4 instruction sequence becomes::
CALX [IP,,GOT[n]-.]
In my 66000 ISA.
For current architectures, function calls use the
procedure linkage table (PLT). The Global Offset Table
is only used for certain static global variables.
If you want to leverage standard tools, you may wish
to follow that paradigm in 66000.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
On Wed, 2 Jul 2025 15:20:09 +0000, EricP wrote:
whereas the call to lgamma() in a shared library is linked as
120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
120000654: 18 85 bd 23 lda gp,-31464(gp)
This 4 instruction sequence becomes::
CALX [IP,,GOT[n]-.]
In my 66000 ISA.
On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) # 12050
<b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060
<__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
I am well aware of that.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
MitchAlsup1 wrote:
On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) #
12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
I am well aware of that.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
If such references in the module (exe/dll) are fixed size RIP-rel disp64
plus a reference reloc in case the target moves it should work.
If the loader finds the target is different from the default at link time then
the fixed disp64 field is large enough to hold any changed value.
The compiler always emits RIP-rel disp64 data references. If the linker
finds the offset is intra-module then it sets its assigned offset value,
and marks it as compactable in a later linker stage.
If inter-module then sets it to its target's default offset,
marks it as non-compactable and emits a DISP64 reloc entry
in case the target moves.
The loader should have the code and default GOT and ro-data marked
as Copy On Write (COW) during load-reloc so any patches fault into
the page file, then switch the protection to Read-Only-Execute and
Read-Only after it is finished. Pages that do require patches get their
own private code-GOT-ro-data page copies and don't bugger up other users.
If the address space forks after loading then the children all
inherit the patched view of code-GOT-ro-data pages.
EricP wrote:
MitchAlsup1 wrote:
On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) #
12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
I am well aware of that.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
If such references in the module (exe/dll) are fixed size RIP-rel disp64
plus a reference reloc in case the target moves it should work.
If the loader finds the target is different from the default at link time then
the fixed disp64 field is large enough to hold any changed value.
The compiler always emits RIP-rel disp64 data references. If the linker
finds the offset is intra-module then it sets its assigned offset value,
and marks it as compactable in a later linker stage.
If inter-module then sets it to its target's default offset,
marks it as non-compactable and emits a DISP64 reloc entry
in case the target moves.
The loader should have the code and default GOT and ro-data marked
as Copy On Write (COW) during load-reloc so any patches fault into
the page file, then switch the protection to Read-Only-Execute and
Read-Only after it is finished. Pages that do require patches get their
own private code-GOT-ro-data page copies and don't bugger up other
users.
If the address space forks after loading then the children all
inherit the patched view of code-GOT-ro-data pages.
The downside of this approach is that the two PIC modules are bound to each
other at a fixed offset, and unless they relocate together, all the
inter-module references have to be patched, however many there are.
And there are likely more than just two modules involved.
The advantage of the GOT-indirect approach is that only the
one location needs to be patched. Plus it can use a DISP32 offset
to access the GOT. The disadvantage is that you don't discover
that you need GOT-indirect addressing until link time,
and then you need to insert an extra LD with a temp register.
And in general the compiler doesn't know whether it needs a DISP64
or DISP32 offset to access any variable, so it is already dealing
with variable sized references.
The option behind door number 3 is an indirect address mode.
That simplifies the software as the compiler only emits DISP32 offsets,
the linker only needs to set the offset to the GOT entry and flip an
indirect bit (so no extra LD insertion or temp register).
But indirect addressing is an ISA feature that once added cannot be
removed.
It adds hardware complexity in the LSQ which is already
probably the most complex module in the core.
Some of it is hardware to deal with worst-case situations that likely
never occur, like 4-page or cache-line straddles.
The conclusion I've come to is option two when combined with a
compacting linker is best.
On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:
EricP wrote:
MitchAlsup1 wrote:
On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.
The global variables are accessed through gp using a single
instruction in nearly all cases, e.g.:
1059c: 8501a683 lw a3,-1968(gp) #
12050 <b>
This includes signgam:
10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).
Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?
The problem with that approach is that it assumes that the shared library
you access at run-time is the exact same one you linked your binary against.
I am well aware of that.
But is there any way for the code to be emitted without indirection
and standard ISA displacement fields without those being resolved
by the linker (ld) and remain PIC ?!?
If such references in the module (exe/dll) are fixed size RIP-rel disp64
plus a reference reloc in case the target moves it should work.
If the loader finds the target is different from the default at link time then
the fixed disp64 field is large enough to hold any changed value.
The compiler always emits RIP-rel disp64 data references. If the linker
finds the offset is intra-module then it sets its assigned offset value,
and marks it as compactable in a later linker stage.
If inter-module then sets it to its target's default offset,
marks it as non-compactable and emits a DISP64 reloc entry
in case the target moves.
The loader should have the code and default GOT and ro-data marked
as Copy On Write (COW) during load-reloc so any patches fault into
the page file, then switch the protection to Read-Only-Execute and
Read-Only after it is finished. Pages that do require patches get their
own private code-GOT-ro-data page copies and don't bugger up other
users.
If the address space forks after loading then the children all
inherit the patched view of code-GOT-ro-data pages.
The downside of this approach is that the two PIC modules are bound to each
other at a fixed offset, and unless they relocate together, all the
inter-module references have to be patched, however many there are.
And there are likely more than just two modules involved.
The advantage of the GOT-indirect approach is that only the
one location needs to be patched. Plus it can use a DISP32 offset
to access the GOT. The disadvantage is that you don't discover
that you need GOT-indirect addressing until link time,
and then you need to insert an extra LD with a temp register.
Not if you have a CALX instruction. You predict GOT access at
compile time, and when the linker resolves an extern, it can
change CALX into CALA by flipping 1 bit making it the same size
as predicted, but now control transfers to the AGEN address.
When the linker does not resolve, ld.so can do this at run time.
And in general the compiler doesn't know whether it needs a DISP64
or DISP32 offset to access any variable, so it is already dealing
with variable sized references.
It is unlikely that the address space is so cluttered that GOT
cannot be placed within ±2GB of IP--and still transfer control
to anywhere in 64-bit VAS.
The option behind door number 3 is an indirect address mode.
BINGO--that is effectively what CALX and CALA are.
That simplifies the software as the compiler only emits DISP32 offsets,
the linker only needs to set the offset to the GOT entry and flip an
indirect bit (so no extra LD insertion or temp register).
But indirect addressing is an ISA feature that once added cannot be
removed.
Note: CALX and CALA are indirect only so far as they load IP
and not any register. Also note: they are CALLs not BRs.
It adds hardware complexity in the LSQ which is already
probably the most complex module in the core.
CALX performs through ICache not DCache.
Some of it is hardware to deal with worst-case situations that likely
never occur, like 4-page or cache-line straddles.
The conclusion I've come to is option two when combined with a
compacting linker is best.
MitchAlsup1 wrote:
<snip>
On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:
EricP wrote:
And in general the compiler doesn't know whether it needs a DISP64
or DISP32 offset to access any variable, so it is already dealing
with variable sized references.
It is unlikely that the address space is so cluttered that GOT
cannot be placed within ±2GB of IP--and still transfer control
to anywhere in 64-bit VAS.
You misunderstand - in the full 64-bit address space (the large memory model)
I want to eliminate the extra address load for intra-module references
so it only indirects through GOT for inter-module references.
To support full 64-bit addresses, the approach chosen was to
turn all program memory references into two, a LD of a disp64 or
an absolute GOT address, then the data access.
That first extra memory load is an unnecessary 64-bit "tax".
Getting rid of this requires the compiler to know the difference between
an "extern" intra-module reference, which can use direct RIP-disp64
addressing,
and a "dllimport" inter-module reference, which must load the absolute
address from the GOT table first, then use that.
But GCC has no "dllimport" attribute for declarations, only MSVC does.
OR it requires the compiler emit a worst-case access sequence for every global variable access, and have the linker edit and compact the code
as it discovers which are "extern" and which are "dllimport" references,
the compacting linker approach.
On Fri, 18 Jul 2025 17:23:04 +0000, EricP wrote:
MitchAlsup1 wrote:
<snip>
On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:
EricP wrote:
And in general the compiler doesn't know whether it needs a DISP64
or DISP32 offset to access any variable, so it is already dealing
with variable sized references.
It is unlikely that the address space is so cluttered that GOT
cannot be placed within ±2GB of IP--and still transfer control
to anywhere in 64-bit VAS.
You misunderstand - in the full 64-bit address space (the large memory model)
I want to eliminate the extra address load for intra-module references
so it only indirects through GOT for inter-module references.
Yes, what we did was to make GOT 32-bit addressable from the current
module to reduce the size of the indirecting LD. But in My 66000
the indirect remains a single instruction instead of AUIPC+LDD.
To support full 64-bit addresses, the approach chosen was to
turn all program memory references into two, a LD of a disp64 or
an absolute GOT address, then the data access.
That first extra memory load is an unnecessary 64-bit "tax".
Agreed; but I would label this as the "extern" tax as it is still
required in dynamically loaded modules in the small (32-bit) model.
Getting rid of this requires the compiler to know the difference between an
"extern" intra-module reference, which can use direct RIP-disp64
addressing,
and a "dllimport" inter-module reference, which must load the absolute
address from the GOT table first, then use that.
But GCC has no "dllimport" attribute for declarations, only MSVC does.
What the compiler/linker pair needs to know is that the variable is
"extern" but will be "resolved" at link time.
OR it requires the compiler emit a worst-case access sequence for every
global variable access, and have the linker edit and compact the code
as it discovers which are "extern" and which are "dllimport" references,
the compacting linker approach.
Compacting is a lot better than expanding.
Yes. Compacting is better as you start with working (functionally correct)
but possibly oversize code and try to make it smaller working code.
Compacting can stop at any point as it is always dealing with working code.
Expanding starts with possibly broken code because a branch, call,
or global ref is out of range or the wrong kind of reference,
then expands each broken item to make it function correctly,
and then deals with all the things that broke because of those expansions,
and so on. Expansion can't stop until all broken items are fixed.
In theory these both deal with the same number of items and should
produce the same optimal result.
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Yes. Compacting is better as you start with working (functionally correct)
but possibly oversize code and try to make it smaller working code.
Compacting can stop at any point as it is always dealing with working code.
Expanding starts with possibly broken code because a branch, call,
or global ref is out of range or the wrong kind of reference,
then expands each broken item to make it function correctly,
and then deals with all the things that broke because of those expansions,
and so on. Expansion can't stop until all broken items are fixed.
In theory these both deal with the same number of items and should
produce the same optimal result.
Which theory is that? In theory the general case (where some things
can need more space as other things need less) is NP-complete (Thomas
G. Szymanski: Assembling Code for Machines with Span-Dependent
Instructions, CACM 21(4), 1978, p. 300-308). Neither "compacting" nor "expanding" is guaranteed to be optimal, but for "expanding" you just
have to make sure that you never compact, whereas with "compacting"
you could compact some things that look compactable, but find that the
result is no longer correct, because an earlier-compacted thing needs
to expand.
But let's rule out the shrinking-this-causes-growth-elsewhere cases,
then the "compacting" approach can be caughtin a steady state where it
sees no opportunity for shrinking, but one or more span-dependent instructions can be compacted. So the "expanding" approach can
produce a smaller result than the "compacting" approach.
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
And what's the problem with that?
Read more about "Assembling Span-Dependent Instructions", and
misconceptions about this topic, at <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>
- anton
Which theory is that? In theory the general case (where some things
can need more space as other things need less) is NP-complete (Thomas
G. Szymanski: Assembling Code for Machines with Span-Dependent
Instructions, CACM 21(4), 1978, p. 300-308).
That paper involves compacting A->B->C branch chains which is NP-complete.
It's been about 40 years since I wrote an assembler that did compacting for the
ROMP, but it started with all A->B branches long, and made passes over the code
compacting what it could until it didn't find any more. It didn't try to handle
branch chains, so compacting never made anything out of range.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Yes. Compacting is better as you start with working (functionally correct)
but possibly oversize code and try to make it smaller working code.
Compacting can stop at any point as it is always dealing with working code.
Expanding starts with possibly broken code because a branch, call,
or global ref is out of range or the wrong kind of reference,
then expands each broken item to make it function correctly,
and then deals with all the things that broke because of those expansions,
and so on. Expansion can't stop until all broken items are fixed.
In theory these both deal with the same number of items and should
produce the same optimal result.
Which theory is that? In theory the general case (where some things
can need more space as other things need less) is NP-complete (Thomas
G. Szymanski: Assembling Code for Machines with Span-Dependent
Instructions, CACM 21(4), 1978, p. 300-308). Neither "compacting" nor "expanding" is guaranteed to be optimal, but for "expanding" you just
have to make sure that you never compact, whereas with "compacting"
you could compact some things that look compactable, but find that the
result is no longer correct, because an earlier-compacted thing needs
to expand.
But let's rule out the shrinking-this-causes-growth-elsewhere cases,
then the "compacting" approach can be caughtin a steady state where it
sees no opportunity for shrinking, but one or more span-dependent instructions can be compacted. So the "expanding" approach can
produce a smaller result than the "compacting" approach.
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
And what's the problem with that?
Read more about "Assembling Span-Dependent Instructions", and
misconceptions about this topic, at <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>
- anton
John Levine <johnl@taugh.com> writes:
That paper involves compacting A->B->C branch chains which is NP-complete.
If you want to say that the paper tries to transform such a chain into
C, then no, that's not the case.
And actually in the case:
A: jbr B
...
B: jbr C
...
C:
there are only /simple expressions/ and the jbrs are non-pathological
in terms of the paper, and the problem of minimizing the size of a
program with only simple expressions and non-pathological
span-dependent instructions is solvable in polynomial time (the paper
gives an algorithm for doing that in section 3).
It's been about 40 years since I wrote an assembler that did compacting for the
ROMP, but it started with all A->B branches long, and made passes over the code
compacting what it could until it didn't find any more. It didn't try to handle
branch chains, so compacting never made anything out of range.
It probably did not deal with non-simple nor with pathological
span-dependent instructions, or it recognized them and always used the
long form for them (theoretically suboptimal, but rarely occurs in
practice), the way that gas does it to this day. Of course, you have
to write code to recognize when a span-dependent instruction has a
non-simple expression or is pathological, which you avoid with the
expanding approach.
Anton Ertl wrote:
For each branch, calculate both the maximum and minimum length across
all intermediate variable-length instructions: If the max is <= shortest
form, use that and lock it down (remove from variable list), if min >=
shortest long span, use that and also remove it from the list.
After a very short number of passes, most code will have settled at or
very near the theoretical optimum.
The compacting approach I was thinking of, which I described a while back,
is sweep over all items, start large, calculate lower and upper possible
address range, shrink a reference when you know it will always work,
iterate until it stops changing, freeze at those sizes.
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
And what's the problem with that?
Given a big job, your expanding linker goes away and never comes back.
In reality, both would likely terminate after 3 or 4 sweeps.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
For each branch, calculate both the maximum and minimum length across
all intermediate variable-length instructions: If the max is <= shortest
form, use that and lock it down (remove from variable list), if min >=
shortest long span, use that and also remove it from the list.
After a very short number of passes, most code will have settled at or
very near the theoretical optimum.
What do you do with those that stay in the variable list?
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
For each branch, calculate both the maximum and minimum length across
all intermediate variable-length instructions: If the max is <= shortest
form, use that and lock it down (remove from variable list), if min >=
shortest long span, use that and also remove it from the list.
After a very short number of passes, most code will have settled at or
very near the theoretical optimum.
What do you do with those that stay in the variable list?
Still don't know if it can use short or long form.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
For each branch, calculate both the maximum and minimum length across
all intermediate variable-length instructions: If the max is <= shortest
form, use that and lock it down (remove from variable list), if min >=
shortest long span, use that and also remove it from the list.
After a very short number of passes, most code will have settled at or
very near the theoretical optimum.
What do you do with those that stay in the variable list?
Still don't know if it can use short or long form.
So what happens if some are never removed from the variable list?
EricP <ThatWouldBeTelling@thevillage.com> writes:
The compacting approach I was thinking of, which I described a while back,
is sweep over all items, start large, calculate lower and upper possible
address range, shrink a reference when you know it will always work,
iterate until it stops changing, freeze at those sizes.
Taking the example from <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>:
foo:
movl foo+133-bar(%rdi),%eax
bar:
what does your approach do? What does "lower and upper possible
address range" mean? How do you know it will always work?
The difference is that when faced
with a pathological case compacting can just give up at any point
while expansion must run to completion.
And what's the problem with that?
Given a big job, your expanding linker goes away and never comes back.
As someone wrote in this thread:
In reality, both would likely terminate after 3 or 4 sweeps.
Plus the expanding approach does not need to "calculate lower and
upper possible address range",
nor determine whether "it will always work".
If the operand needs too much space, expand (and remember to do
another sweep); that's all.
- anton
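(An editorial sketch in C, not anyone's posted code: the expanding approach
as a monotone fixed-point loop over a toy instruction stream. The sizes,
reach, and instruction stream are invented for illustration: start every
span-dependent branch short, expand whatever does not fit, and re-sweep
until nothing changes; nothing is ever shrunk, so the loop terminates.)
#include <stdbool.h>
#include <stdio.h>

#define N 3

struct branch {
    int at;      /* index of the branch instruction */
    int target;  /* index of its target instruction */
    int size;    /* current encoding: 2 (short) or 6 (long) bytes */
};

/* a toy stream of 8 instructions; branches start in the short form */
static int insn_size[8] = {2, 2, 2, 2, 2, 2, 2, 2};
static struct branch br[N] = { {1, 7, 2}, {3, 0, 2}, {5, 6, 2} };

/* byte address of instruction idx, from the current size estimates */
static int addr_of(int idx)
{
    int a = 0;
    for (int i = 0; i < idx; i++)
        a += insn_size[i];
    return a;
}

int main(void)
{
    bool changed = true;
    while (changed) {                 /* sweep until a fixed point */
        changed = false;
        for (int i = 0; i < N; i++) {
            /* displacement is relative to the next instruction */
            int disp = addr_of(br[i].target) - addr_of(br[i].at + 1);
            /* the toy short form reaches -8..+7 bytes */
            if (br[i].size == 2 && (disp < -8 || disp > 7)) {
                br[i].size = insn_size[br[i].at] = 6;  /* expand; never shrink */
                changed = true;   /* growth may push others out of range */
            }
        }
    }
    for (int i = 0; i < N; i++)
        printf("branch %d: %d bytes\n", i, br[i].size);
    return 0;
}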