• Re: Code density

    From MitchAlsup1@21:1/5 to Anton Ertl on Tue Jun 17 18:01:34 2025
    On Tue, 17 Jun 2025 14:17:42 +0000, Anton Ertl wrote:

    Here are the text size numbers:

    Debian numbers from <2024Jan4.101941@mips.complang.tuwien.ac.at>:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Can you get numbers for RISC-V without compression ?? for the above
    and for the below.


    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64
    1030288 150686 79852 31492 mvme68k
    779393 155764 75795 31813 vax

    1302254 171505 83249 35085 amd64
    1229032 178332 89180 36876 evbarm-aarch64
    1539052 179055 82280 34717 amd64-daily
    1374961 184458 96971 37218 i386
    1247476 185792 96728 42028 evbarm-earmv7hf
    1333952 187452 96328 39472 sparc
    1586608 204032 106896 45408 evbppc
    1536144 204320 106768 43232 hppa
    ---------
    1397024 216832 109792 48512 sparc64
    1538536 222336 107776 44912 evbmips-mips64eb
    1623952 243008 122096 50640 evbmips-mipseb
    1689920 251376 120672 51168 alpha

    This appears to be the region of standard RISC architectures
    about 1.5× VAX
    ---------
    2324752 2259984 1378000 ia64
    ^
    Is there 1 too many zeros on the last entry ??
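    As a rough sanity check of the "about 1.5× VAX" figure, here is a sketch using the ksh column of the table above (sizes copied verbatim; VAX ksh = 155764):

```python
# Ratio of each "standard RISC" ksh text size to the VAX ksh text size,
# using the NetBSD numbers quoted above.
vax_ksh = 155764
ksh = {
    "sparc64": 216832,
    "evbmips-mips64eb": 222336,
    "evbmips-mipseb": 243008,
    "alpha": 251376,
}
ratios = {arch: round(size / vax_ksh, 2) for arch, size in ksh.items()}
print(ratios)  # the ratios cluster around 1.4-1.6
```

    So "about 1.5×" holds for this column; the libc column runs higher, roughly 1.8-2.2×.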


    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Tue Jun 17 23:55:48 2025
    MitchAlsup1 wrote:
    On Tue, 17 Jun 2025 14:17:42 +0000, Anton Ertl wrote:

    Here are the text size numbers:

    Debian numbers from <2024Jan4.101941@mips.complang.tuwien.ac.at>:

    bash grep gzip
    595204 107636 46744 armhf
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    Can you get numbers for RISC-V without compression ?? for the above
    and for the below.

    The numbers for RV64 look suspiciously low.
    For RV64 there are multiple "code models" for building addresses.

    https://www.sifive.com/blog/all-aboard-part-4-risc-v-code-models

    https://starfivetech.com/uploads/optimizing-riscv-software.pdf

    I suspect the numbers Anton quotes for RV64 are for the GCC defaults

    https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/RISC-V-Options.html

    which for code models is "-mcmodel=medlow". Other RV64 code models are "-mcmodel=medany", and (strangely not listed there but referenced on
    some web pages) is "-mcmodel=large" for full 64 bit offsets.

    "-mcmodel=medlow" compiles for only 32-bit offsets in the unsigned low 2GB
    and upper 2GB which is essentially compiling 64-bit code as though 32-bit.
    All static code and data offsets must be in that 2 GB range because it uses
    the 2 instruction sequence to load a 32 bit offset to a register or PC.

    I would be interested in what happens to the code size if "-mcmodel=large"
    is used (if that is indeed supported), which presumably allows one to
    directly address static declarations in the full 64-bit address space.
    (The documentation is nonexistent, so I'm just guessing.)

    Other optimizations such as "-flto" for "link time optimization"
    are listed but not really documented beyond the name.


    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64
    1030288 150686 79852 31492 mvme68k
    779393 155764 75795 31813 vax

    1302254 171505 83249 35085 amd64
    1229032 178332 89180 36876 evbarm-aarch64
    1539052 179055 82280 34717 amd64-daily
    1374961 184458 96971 37218 i386
    1247476 185792 96728 42028 evbarm-earmv7hf
    1333952 187452 96328 39472 sparc
    1586608 204032 106896 45408 evbppc
    1536144 204320 106768 43232 hppa
    ---------
    1397024 216832 109792 48512 sparc64
    1538536 222336 107776 44912 evbmips-mips64eb
    1623952 243008 122096 50640 evbmips-mipseb
    1689920 251376 120672 51168 alpha

    This appears to be the region of standard RISC architectures
    about 1.5× VAX
    ---------
    2324752 2259984 1378000 ia64
    ^
    Is there 1 too many zeros on the last entry ??


    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Wed Jun 18 06:22:41 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I would be interested in what happens to the code size if "-mcmodel=large"
    is used

    No code is generated by gcc-10.3.1. Instead, I get an error message

    gcc: error: unrecognized argument in option ‘-mcmodel=large’
    gcc: note: valid arguments to ‘-mcmodel=’ are: medany medlow

    I'll show numbers for medany in a different posting.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Jun 18 06:26:43 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    Can you get numbers for RISC-V without compression ?? for the above
    and for the below.

    You can do it as easily as I can: Set up a build server for Debian
    (above) and one for NetBSD, then change the RISC-V compiler settings
    not to compress, then measure the text sizes. I don't have the time
    to do that, however.

    You can, however, compare ARM T32 and A32 in the Debian results:

    bash grep gzip
    595204 107636 46744 armhf ARM T32
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel ARM A32
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    There may be additional differences between the two ARM 32-bit
    builds, however.

    What I could do relatively easily is to compile a file from gforth
    with different options. The file I used is what is compiled to engine/main-fast-ll.o

    text size compiler options
    20242 -O2
    18146 -Os
    18146 -Os -march=rv64gc
    18444 -Os -march=rv64gc -mcmodel=medany
    23092 -Os -march=rv64g

    So, for this file, the compressed instructions provide a factor 1.27 improvement in code density.

    That's surprisingly little. I would expect similar code density for
    RV64G as for MIPS64 (the instruction sets are similar, the addressing
    modes the same), and both in the Debian and in the NetBSD results the
    factors between these two architectures look to be larger. Either main-fast-ll.o is an outlier, or the default unrolling and inlining
    options for MIPS64 are more aggressive than for RV64.
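    For comparison, a quick sketch computing the factor for this file against the Debian mips64el/riscv64 factors quoted above (keeping in mind that the Debian builds may also differ in compiler defaults):

```python
# rv64g / rv64gc text-size factor for main-fast-ll.o, versus the
# Debian mips64el / riscv64 factors from the table quoted above.
rv64g, rv64gc = 23092, 18146
file_factor = round(rv64g / rv64gc, 2)

debian = {  # program: (mips64el, riscv64)
    "bash": (1020720, 599832),
    "grep": (170736, 101102),
    "gzip": (71088, 46898),
}
debian_factors = {p: round(m / r, 2) for p, (m, r) in debian.items()}
print(file_factor, debian_factors)
```

    Every Debian factor exceeds the single-file 1.27, which is what makes the main-fast-ll.o number look low.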

    2324752 2259984 1378000 ia64
    ^
    Is there 1 too many zeros on the last entry ??

    No. NetBSD only has statically-linked IA64 binaries, while all the
    other architectures have dynamically-linked binaries. So the IA64
    numbers include the parts of the libraries that the binary calls and
    are therefore not comparable to the results for the other
    architectures.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Wed Jun 18 09:32:14 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I would be interested in what happens to the code size if "-mcmodel=large" is used

    No code is generated by gcc-10.3.1. Instead, I get an error message

    gcc: error: unrecognized argument in option ‘-mcmodel=large’
    gcc: note: valid arguments to ‘-mcmodel=’ are: medany medlow

    I'll show numbers for medany in a different posting.

    - anton

    Ok, thanks.
    So IIUC for RV64 "-mcmodel=medany" is 64-bit data size and
    a 32-bit program space inside a 64-bit address space.
    And programs can be statically linked only.

    I suspect this is because almost every code or data address would
    require a 6 instruction sequence to load it into a register for use,
    which would blow up their code size and tank their performance.
    And that's not a good look for them.

    Documentation does say that Aarch64 supports it (note =small is default):

    https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/AArch64-Options.html#AArch64-Options

    AArch64 Options

    -mcmodel=tiny
    Generate code for the tiny code model. The program and its statically
    defined symbols must be within 1MB of each other. Programs can be
    statically or dynamically linked.

    -mcmodel=small
    Generate code for the small code model. The program and its statically
    defined symbols must be within 4GB of each other. Programs can be
    statically or dynamically linked.
    This is the default code model.

    -mcmodel=large
    Generate code for the large code model. This makes no assumptions about
    addresses and sizes of sections. Programs can be statically linked only.

    For x86-64 -mcmodel=large is also supported, =medium is the default.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kerr-Mudd, John@21:1/5 to EricP on Wed Jun 18 16:19:27 2025
    On Wed, 18 Jun 2025 09:32:14 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I would be interested in what happens to the code size if "-mcmodel=large" is used

    No code is generated by gcc-10.3.1. Instead, I get an error message

    gcc: error: unrecognized argument in option ‘-mcmodel=large’
    gcc: note: valid arguments to ‘-mcmodel=’ are: medany medlow

    I'll show numbers for medany in a different posting.

    - anton

    Ok, thanks.
    So IIUC for RV64 "-mcmodel=medany" is 64-bit data size and
    a 32-bit program space inside a 64-bit address space.
    And programs can be statically linked only.

    I suspect this is because almost every code or data address would
    require a 6 instruction sequence to load it into a register for use,
    which would blow up their code size and tank their performance.
    And that's not a good look for them.

    Documentation does say that Aarch64 supports it (note =small is default):

    https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/AArch64-Options.html#AArch64-Options

    AArch64 Options

    -mcmodel=tiny
    Generate code for the tiny code model. The program and its statically
    defined symbols must be within 1MB of each other. Programs can be
    statically or dynamically linked.

    -mcmodel=small
    Generate code for the small code model. The program and its statically
    defined symbols must be within 4GB of each other. Programs can be
    statically or dynamically linked.
    This is the default code model.

    I'm so retro that I remember when 'small model' meant <64k.

    -mcmodel=large
    Generate code for the large code model. This makes no assumptions about
    addresses and sizes of sections. Programs can be statically linked only.

    For x86-64 -mcmodel=large is also supported, =medium is the default.




    --
    Bah, and indeed Humbug.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Wed Jun 18 18:27:40 2025
    On Wed, 18 Jun 2025 13:32:14 +0000, EricP wrote:

    https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/AArch64-Options.html#AArch64-Options

    AArch64 Options

    -mcmodel=tiny
    Generate code for the tiny code model. The program and its statically
    defined symbols must be within 1MB of each other. Programs can be
    statically or dynamically linked.

    -mcmodel=small
    Generate code for the small code model. The program and its
    statically
    defined symbols must be within 4GB of each other. Programs can be
    statically or dynamically linked.
    This is the default code model.

    -mcmodel=large
    Generate code for the large code model. This makes no assumptions
    about
    addresses and sizes of sections. Programs can be statically linked
    only.

    For x86-64 -mcmodel=large is also supported, =medium is the default.

    My 66000 Options (IIRC)

    Where "program" means the statically linked object module.
    Dynamically linked modules can be added at will via GOT.

    -mcmodel=tiny
    The program (.text) must fit in a 28-bit address space.
    The data must fit in a 32-bit address space.
    GOT contains Word entries.
    GOT must be within 4GB of instruction accessing GOT.

    -mcmodel=small
    The program must fit in a 32-bit address space.
    The data must fit in a 32-bit address space.
    GOT contains Word entries.
    GOT must be within 4GB of instruction accessing GOT.

    -mcmodel=large
    The program must fit in a 63-bit address space.
    The data must fit in a 63-bit address space.
    GOT contains DoubleWord entries.

    I don't know if we have finalized a default yet.

    OH, and BTW, this is not a compiler option, but a linker option.
    So you can have a single compiled library, that gets linked under
    any model.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Thu Jun 19 09:21:25 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    So IIUC for RV64 "-mcmodel=medany" is 64-bit data size and
    a 32-bit program space inside a 64-bit address space.
    And programs can be statically linked only.

    medany is 2GB for "a program and its statically defined symbols", and
    these 2GB can be anywhere in address space. Dynamically linked
    symbols can be further away AFAICT. So unless you do binaries that
    are larger than 2GB, this does not appear to be a restriction.

    medlow means that the binary must reside in the lower 2GB of the
    address space (actually between -2GB and 2GB, but at least on 64-bit
    Linux user programs cannot reside at negative addresses). Again,
    dynamically linked symbols can be further away. But if both the
    executable and the shared libraries are compiled for the medlow model,
    they must all fit in the lower 2GB. Probably not a big problem,
    either, except maybe for the largest C++ projects.

    I suspect this is because almost every code or data address would
    require a 6 instruction sequence to load it into a register for use,

    How do you compute that? When I looked at the code produced for
    Alpha, I got the impression that they wanted to support arbitrarily
    large programs and they generated such code by default, but IIRC the
    typical code for loading an absolute address was by loading it from the
    global table of the current function; so it requires typically 1 load
    (and a 64-bit value in the global table). It also requires setting up
    the global pointer on every function entry and after every call, but
    that can be amortized over several accesses to the global table.

    Interestingly, when I look at the "DEC Alpha" options, there is
    -msmall-data (64KB global tables are enough) and -mlarge-data (data
    segment <2GB). There is also -msmall-code (code <4MB) and
    -mlarge-code. The gcc manual says about these:

    | When '-msmall-data' is used,
    | the compiler can assume that all local symbols share the same '$gp'
    | value, and thus reduce the number of instructions required for a
    | function call from 4 to 1.

    I think that these four are:

    ldq $27, ...($gp) #load target address
    jsr $26, ($27) #call target
    ldq $gp, offset($26) #restore gp

    and at the target:

    target:
    ldq $gp, offset($27) #load gp

    whereas in a small/small variant it would just be

    bsr $26, target

    which would blow up their code size and tank their performance.
    And that's not a good look for them.

    Why burden all programs with the costs of large programs the way it
    is done by default on Alpha?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Sun Jun 22 10:05:38 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    So IIUC for RV64 "-mcmodel=medany" is 64-bit data size and
    a 32-bit program space inside a 64-bit address space.
    And programs can be statically linked only.

    medany is 2GB for "a program and its statically defined symbols", and
    these 2GB can be anywhere in address space. Dynamically linked
    symbols can be further away AFAICT. So unless you do binaries that
    are larger than 2GB, this does not appear to be a restriction.

    medlow means that the binary must reside in the lower 2GB of the
    address space (actually between -2GB and 2GB, but at least on 64-bit
    Linux user programs cannot reside at negative addresses). Again,
    dynamically linked symbols can be further away. But if both the
    executable and the shared libraries are compiled for the medlow model,
    they must all fit in the lower 2GB. Probably not a big problem,
    either, except maybe for the largest C++ projects.

    I suspect this is because almost every code or data address would
    require a 6 instruction sequence to load it into a register for use,

    How do you compute that? When I looked at the code produced for
    Alpha, I got the impression that they wanted to support arbitrarily
    large programs and they generated such code by default, but IIRC the
    typical code for loading an absolute address was by loading it from the
    global table of the current function; so it requires typically 1 load
    (and a 64-bit value in the global table). It also requires setting up
    the global pointer on every function entry and after every call, but
    that can be amortized over several accesses to the global table.

    Yes, it is (also) using an extra memory load to pick up large immediates.
    It also requires a BAL to get the IP into a register.

    Interestingly, when I look at the "DEC Alpha" options, there is
    -msmall-data (64KB global tables are enough) and -mlarge-data (data
    segment <2GB). There is also -msmall-code (code <4MB) and
    -mlarge-code. The gcc manual says about these:

    | When '-msmall-data' is used,
    | the compiler can assume that all local symbols share the same '$gp'
    | value, and thus reduce the number of instructions required for a
    | function call from 4 to 1.

    I think that these four are:

    ldq $27, ...($gp) #load target address
    jsr $26, ($27) #call target
    ldq $gp, offset($26) #restore gp

    and at the target:

    target:
    ldq $gp, offset($27) #load gp

    whereas in a small/small variant it would just be

    bsr $26, target

    And the large-text limit is 4 MB of code.
    Above 2 GB data or 4 MB code you must use dynamic allocation.


    which would blow up their code size and tank their performance.
    And that's not a good look for them.

    Why burden all programs with the costs of large programs the way it
    is done by default on Alpha?

    - anton

    I'm not saying there shouldn't be optimizations for smaller sizes.
    I'm pointing to the fact that to actually USE the 64-bit address space
    there is a large increase in code size and execute cost,
    and asking if that had to be so.

    For example, for Alpha to load a 64-bit constant requires 6 instructions,
    24 bytes. That sequence is too large so they are pretty much forced
    to use an extra LDQ to pull the offset from the constant table
    located just prior to the routine entry point, plus an extra
    BAL to copy the RIP into a register as a base.
    The LDQ touches the same address space as the code but now as data
    so it has to load the D-TLB with an entry redundant with the I-TLB,
    and bring in a data cache line with the constants.
    And after the constant is loaded it must be manually added to the base
    because there is no LD/ST combined with a scaled index.

    Furthermore, the actual load or store of the target value is serially
    dependent on the LDQ of the offset and the ADD. Back when the load-to-use
    latency for a cache hit was 1 clock that might have looked ok, but now
    that it is 3 or 4 clocks it is a serious penalty.

    Had relatively cheap access to the full 64-bit address space been a
    priority during the ISA design, what alternatives might have minimized
    its extra cost?

    First, I have two designs which load a 64-bit constant in 3 32-bit fixed
    length instructions, the prefix CONST approach, and another using 3 opcodes
    and requires a temp register. In both cases the constants can easily be
    fused in Decode and have zero execute cost. My preference is for the
    prefix CONST as it can be used with many other instructions besides
    LD and ST and doesn't require an extra temp register.

    Second, if the base register of LD, ST, or LDA (Load Address) is R31,
    the zero register, then it means use the PC as base, and the extra BAL
    is almost always unnecessary.

    Third, recognize that whether the const offset is loaded by instructions
    or from a constant table by an extra LDQ, it will be adding that offset
    to the base a lot so have LD, ST, LDA with a scaled-index address mode
    and eliminate the extra ADD. The prefix-CONST approach doesn't require
    this because it fuses the immediate directly onto its consumer in Decode.
    But have the scaled index address mode anyway.

    Fourth, have a compacting linker so the programmer doesn't need to specify
    a code model. The compiler emits a worst-case sequence and the linker
    gets rid of all the ones it doesn't need.

    So there are three alternatives to accessing full 64-bit addresses.

    The CONST prefix requires 3 instructions to access the target data
    with no temp register and zero execute cost if fused in Decode.
    CONST value
    CONST value
    LDx rDst, [r31+offset]

    The separate-const approach requires 4 instructions,
    needs a temp register and a scaled-index address mode,
    but has no execute cost if fused in Decode
    CONST1 rTemp=value
    CONST2 rTemp=value
    CONST3 rTemp=value
    LDx rDst, [r31+rTemp<<0]

    The load from constant table requires 2 instructions,
    requires a temp register and scaled index address mode,
    eliminates the BAL and ADD, but takes an extra data memory access.
    LDQ rTemp, [r31+offset]
    LDx rDst, [r31+rTemp<<0]

    Irrespective of which way one chooses, the compacting linker gets
    rid of any unneeded excess.
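    Tallying the static code cost of the three alternatives above (a sketch; assumes 4-byte fixed instructions and counts the 8-byte constant-table entry against the third scheme):

```python
# Static size, in bytes, of each 64-bit-address access scheme above.
INSN = 4  # fixed 32-bit instructions
schemes = {
    "const_prefix":   3 * INSN,       # CONST, CONST, LDx
    "separate_const": 4 * INSN,       # three CONST ops + LDx
    "table_load":     2 * INSN + 8,   # LDQ, LDx + 8-byte table entry
}
print(schemes)
```

    So the table-load scheme matches the separate-const scheme in total bytes once its table entry is counted; it trades instruction fetches for an extra data-side memory access.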

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Sun Jun 22 17:35:13 2025
    On Sun, 22 Jun 2025 14:05:38 +0000, EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    ---------------------

    I'm not saying there shouldn't be optimizations for smaller sizes.
    I'm pointing to the fact that to actually USE the 64-bit address space
    there is a large increase in code size and execute cost,
    and asking if that had to be so.

    For example, for Alpha to load a 64-bit constant requires 6
    instructions,
    24 bytes.

    The corresponding data for My 66000 is:
    1 instruction:: LD Rd,[IP,,DISP64-.]
    3 words in .text
    LD pipeline latency 3 (instead of several with arithmetic)

    That sequence is too large so they are pretty much forced
    to use an extra LDQ to pull the offset from the constant table

    At least doubling the LD latency and adding even more dependent instructions.

    RISC-V is no better.

    located just prior to the routine entry point and requires an extra
    BAL to copy the RIP into a register as a base.
    The LDQ touches the same address space as the code but now as data
    so it has to load the D-TLB with an entry redundant with the I-TLB,
    and bring in a data cache line with the constants.
    And after the constant is loaded it must be manually added to the base because there is no LD/ST combined with a scaled index.

    Furthermore, the actual load or store of the target value is serially
    dependent on the LDQ of the offset and the ADD. Back when the load-to-use
    latency for a cache hit was 1 clock that might have looked ok, but now
    that it is 3 or 4 clocks it is a serious penalty.

    Making the execution window grow by the added latency in order to
    stumble over all the ILP not available due to the dependence
    latencies.

    Had relatively cheap access to the full 64-bit address space been a
    priority during the ISA design, what alternatives might have
    minimized its extra cost?

    I would argue that making 64-bit access cheap is (IS) what minimizes the
    cost of huge address spaces.

    First, I have two designs which load a 64-bit constant in 3 32-bit fixed length instructions, the prefix CONST approach, and another using 3
    opcodes
    and requires a temp register. In both cases the constants can easily be
    fused in Decode and have zero execute cost. My preference is for the
    prefix CONST as it can be used with many other instructions besides
    LD and ST and doesn't require an extra temp register.

    Universal constants provides this without wasting instructions, memory accesses, or register uses.

    Second, if the base register of LD, ST, or LDA (Load Address) is R31,
    the zero register, then it means use the PC as base, and the extra BAL
    is almost always unnecessary.

    I use R0, which CAN contain any data the program wants to put in it,
    as the proxy for IP.

    Third, recognize that whether the const offset is loaded by instructions
    or from a constant table by an extra LDQ, it will be adding that offset
    to the base a lot so have LD, ST, LDA with a scaled-index address mode
    and eliminate the extra ADD. The prefix-CONST approach doesn't require
    this because it fuses the immediate directly onto its consumer in
    Decode. But have the scaled index address mode anyway.

    Sure,

    Fourth, have a compacting linker so the programmer doesn't need to
    specify
    a code model. The compiler emits a worst-case sequence and the linker
    gets rid of all the ones it doesn't need.

    So there are three alternatives to accessing full 64-bit addresses.

    The CONST prefix requires 3 instructions to access the target data
    with no temp register and zero execute cost if fused in Decode.
    CONST value
    CONST value
    LDx rDst, [r31+offset]

    The separate-const approach requires 4 instructions,
    needs a temp register and a scaled-index address mode,
    but no execute cost if fused in Decode
    CONST1 rTemp=value
    CONST2 rTemp=value
    CONST3 rTemp=value
    LDx rDst, [r31+rTemp<<0]

    The load from constant table requires 2 instructions,
    requires a temp register and scaled index address mode,
    eliminates the BAL and ADD, but takes an extra data memory access.
    LDQ rTemp, [r31+offset]
    LDx rDst, [r31+rTemp<<0]

    Why not just::

    LDx Rd,[ip,,disp64]

    1 instruction
    3 words
    no added latency
    no added instructions
    easy to shrink if DISP64 has no higher order bits set.

    Irrespective of which way one chooses, the compacting linker gets
    rid of any unneeded excess.

    Consider::

    extern int64_t a,b,c,d;

    and some code:

    { d = a+b+c; }

    The compiler has to assume that a is in a different module than b or
    c or d; and has to generate::

    LDD Rt,[IP,,&GOT[a]-.]
    LDD Ru,[IP,,&GOT[b]-.]
    LDD Rv,[IP,,&GOT[c]-.]
    LDD Rw,[IP,,&GOT[d]-.]
    then
    LDD Ra,[Rt]
    LDD Rb,[Ru]
    LDD Rc,[Rv]
    ADD R8,Ra,Rb
    ADD R8,R8,Rc
    STD R8,[Rw]

    When the linker figures out that a,b, and d are in the same module
    it can shrink the code to::

    LDD Rt,[IP,,&GOT[a]-.]
    LDD Rv,[IP,,&GOT[c]-.]
    then
    LDD Ra,[Rt]
    LDD Rb,[Rt+8]
    LDD Rc,[Rv]
    ADD R8,Ra,Rb
    ADD R8,R8,Rc
    STD R8,[Rt+16]
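    Counting the instructions in the two sequences above shows what the compacting linker buys here (a sketch; counts taken directly from the listings):

```python
# Instruction counts for the worst-case and linker-shrunk sequences above.
worst_case = 4 + 3 + 2 + 1  # 4 GOT loads, 3 data loads, 2 adds, 1 store
shrunk     = 2 + 3 + 2 + 1  # only 2 GOT loads survive; a, b, d share a base
print(worst_case, shrunk)
```

    Two GOT loads (and two GOT entries) disappear once a, b, and d are known to be adjacent in the same module.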

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sun Jun 22 20:26:41 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 14:05:38 +0000, EricP wrote:


    Irrespective of which way one chooses, the compacting linker gets
    rid of any unneeded excess.

    Consider::

    extern int64_t a,b,c,d;

    and some code:

    { d = a+b+c; }

    The compiler has to assume that a is in a different module than b or
    c or d; and has to generate::

    Why does the compiler need to assume anything?

    It simply issues
    Load Register Ra from "a symbol table reference for a"
    Load register Rb from "a symbol table reference for b"
    etc.

    The linker determines that the address refers to a
    symbol in a shared object (or a different object file
    included in the link) and generates the appropriate
    code (GOT reference, PC-relative or absolute as
    necessary).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Jun 26 01:12:00 2025
    On Sun, 22 Jun 2025 20:26:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 14:05:38 +0000, EricP wrote:


    Irrespective of which way one chooses, the compacting linker gets
    rid of any unneeded excess.

    Consider::

    extern int64_t a,b,c,d;

    and some code:

    { d = a+b+c; }

    The compiler has to assume that a is in a different module than b or
    c or d; and has to generate::

    Why does the compiler need to assume anything?

    It simply issues
    Load Register Ra from "a symbol table reference for a"
    Load register Rb from "a symbol table reference for b"
    etc.

    The linker determines that the address refers to a
    symbol in a shared object (or a different object file
    included in the link) and generates the appropriate
    code (GOT reference, PC-relative or absolute as
    necessary).

    The externs are in a dynamically loaded module, and the .text
    section/segment is PIC. So, ld.so is not allowed to write over
    the current displacement.

    I don't see how one can do what you suggest and have .text
    remain PIC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Jun 26 15:17:18 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 20:26:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 14:05:38 +0000, EricP wrote:


    Irrespective of which way one chooses, the compacting linker gets
    rid of any unneeded excess.

    Consider::

    extern int64_t a,b,c,d;

    and some code:

    { d = a+b+c; }

    The compiler has to assume that a is in a different module than b or
    c or d; and has to generate::

    Why does the compiler need to assume anything?

    It simply issues
    Load Register Ra from "a symbol table reference for a"
    Load register Rb from "a symbol table reference for b"
    etc.

    The linker determines that the address refers to a
    symbol in a shared object (or a different object file
    included in the link) and generates the appropriate
    code (GOT reference, PC-relative or absolute as
    necessary).

The externs are in a dynamically loaded module, and the .text
section/segment is PIC. So, ld.so is not allowed to write over
the current displacement.

    The static linker does the code transformation, not the
    run-time dynamic linker (which just updates the PLT/GOT).

    Here is an illustrative example:

    This C++ code invokes the 'dlsym' function, which is hosted
    in a dynamically linked shared object (libdl.so):


sym = (get_dlp_t)dlsym(handle, "get_dlp");
if (sym == NULL) {
    lp->log("Invalid DLP shared object format: %s\n", dlerror());
    unregister_handle(channel);
    dlclose(handle);
    return 1;
}

    ===========================================
    g++ generates this assembler code:

    ...
    movq %rax, 784(%r13,%rbx,8)
    .L14:
    .LBE39:
    .LBE38:
    .loc 2 118 0
    movl $.LC7, %esi
    movq %r14, %rdi
    call dlsym
    .LVL19:
    .loc 2 119 0
    testq %rax, %rax
    je .L24

    ....

    ===========================================
    The linker (ld command) generated the following trampoline and
    corresponding trampoline invocation.

    000000000040ccc0 <dlsym@plt>:
    40ccc0: ff 25 12 56 23 00 jmpq *0x235612(%rip) # 6422d8 <_GLOBAL_OFFSET_TABLE_+0x2d8>
    40ccc6: 68 58 00 00 00 pushq $0x58
    40cccb: e9 60 fa ff ff jmpq 40c730 <_init+0x20>
    ...

    412686: 0f 84 89 01 00 00 je 412815 <c_mp::channel(int, char const**, c_logger*)+0x255>
    41268c: 48 83 fb 63 cmp $0x63,%rbx
    412690: 77 08 ja 41269a <c_mp::channel(int, char const**, c_logger*)+0xda>
    412692: 49 89 84 dd 10 03 00 mov %rax,0x310(%r13,%rbx,8)
    412699: 00
    41269a: be 0d 66 43 00 mov $0x43660d,%esi
    41269f: 4c 89 f7 mov %r14,%rdi
    4126a2: e8 19 a6 ff ff callq 40ccc0 <dlsym@plt>
    4126a7: 48 85 c0 test %rax,%rax

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Fri Jun 27 07:48:57 2025
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 20:26:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 14:05:38 +0000, EricP wrote:
Irrespective of which way one chooses, the compacting linker gets
    rid of any unneeded excess.
    Consider::

    extern int64_t a,b,c,d;

    and some code:

    { d = a+b+c; }

    The compiler has to assume that a is in a different module than b or
    c or d; and has to generate::
    Why does the compiler need to assume anything?

    It simply issues
    Load Register Ra from "a symbol table reference for a"
    Load register Rb from "a symbol table reference for b"
    etc.

    The linker determines that the address refers to a
    symbol in a shared object (or a different object file
    include in the link) and generates the appropriate
    code (GOT reference, PC-relative or absolute as
    necessary).
    The extern's are in a dynamically loaded module, and the .text
    section/segment is PIC. So, ld.so is not allowed to write over
    the current displacement.

    The static linker does the code transformation, not the
    run-time dynamic linker (which just updates the PLT/GOT).

    Here is an illustrative example:

    This C++ code invokes the 'dlsym' function, which is hosted
    in a dynamically linked shared object (libdl.so):


sym = (get_dlp_t)dlsym(handle, "get_dlp");
if (sym == NULL) {
    lp->log("Invalid DLP shared object format: %s\n", dlerror());
    unregister_handle(channel);
    dlclose(handle);
    return 1;
}

    ===========================================
    g++ generates this assembler code:

    ...
    movq %rax, 784(%r13,%rbx,8)
    ..L14:
    ..LBE39:
    ..LBE38:
    .loc 2 118 0
    movl $.LC7, %esi
    movq %r14, %rdi
    call dlsym
    ..LVL19:
    .loc 2 119 0
    testq %rax, %rax
    je .L24

    .....

    ===========================================
    The linker (ld command) generated the following trampoline and
    corresponding trampoline invocation.

    000000000040ccc0 <dlsym@plt>:
    40ccc0: ff 25 12 56 23 00 jmpq *0x235612(%rip) # 6422d8 <_GLOBAL_OFFSET_TABLE_+0x2d8>
    40ccc6: 68 58 00 00 00 pushq $0x58
    40cccb: e9 60 fa ff ff jmpq 40c730 <_init+0x20>
    ....

    412686: 0f 84 89 01 00 00 je 412815 <c_mp::channel(int, char const**, c_logger*)+0x255>
    41268c: 48 83 fb 63 cmp $0x63,%rbx
    412690: 77 08 ja 41269a <c_mp::channel(int, char const**, c_logger*)+0xda>
    412692: 49 89 84 dd 10 03 00 mov %rax,0x310(%r13,%rbx,8)
    412699: 00
    41269a: be 0d 66 43 00 mov $0x43660d,%esi
    41269f: 4c 89 f7 mov %r14,%rdi
    4126a2: e8 19 a6 ff ff callq 40ccc0 <dlsym@plt>
    4126a7: 48 85 c0 test %rax,%rax


    This illustrates the PLT jump table usage but Mitch's question was
    I think with regard to variables exported by a shared library,
    say errno exported by the CRTLIB C runtime library,
    verses extern variables in the main link module.

How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs (load the GOT entry, then
the value), versus a direct PC-rel memref for extern a, b, c?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Fri Jun 27 08:33:40 2025
    EricP wrote:
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 20:26:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 14:05:38 +0000, EricP wrote:
Irrespective of which way one chooses, the compacting linker gets
rid of any unneeded excess.
    Consider::

    extern int64_t a,b,c,d;

    and some code:

    { d = a+b+c; }

The compiler has to assume that a is in a different module than b or
c or d; and has to generate::
    Why does the compiler need to assume anything?

    It simply issues
    Load Register Ra from "a symbol table reference for a"
    Load register Rb from "a symbol table reference for b"
    etc.

    The linker determines that the address refers to a
    symbol in a shared object (or a different object file
    include in the link) and generates the appropriate
    code (GOT reference, PC-relative or absolute as
    necessary).
    The extern's are in a dynamically loaded module, and the .text
    section/segment is PIC. So, ld.so is not allowed to write over
    the current displacement.

    The static linker does the code transformation, not the
    run-time dynamic linker (which just updates the PLT/GOT).

    Here is an illustrative example:


    This illustrates the PLT jump table usage but Mitch's question was
    I think with regard to variables exported by a shared library,
    say errno exported by the CRTLIB C runtime library,
    verses extern variables in the main link module.

How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs (load the GOT entry, then
the value), versus a direct PC-rel memref for extern a, b, c?

    For example, in Microsoft one can mark the DLL export variable with
    __declspec(dllexport) int errno;

    in the header file and the compiler knows that all references
    to errno require an extra level of indirection.
    I've seen no such equivalent attribute in the GCC world.

    This is the one example where I considered adding an indirect
    addressing mode enabled by 1 bit on all LD and ST instructions
    as it eliminates the need to emit different code sequences for
    intra-module and inter-module accesses.

    The compiler always emits a PC-rel address. Later the linker discovers
that it is a reference to a DLL export variable and sets the offset to be
    to the address in the GOT table and sets the Indirect bit on the LD/ST.

    An Indirect address mode has side effects for the Load Store Queue but
    they are mostly the same as if there were two separate instructions.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Fri Jun 27 13:55:16 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 20:26:41 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Sun, 22 Jun 2025 14:05:38 +0000, EricP wrote:
Irrespective of which way one chooses, the compacting linker gets
rid of any unneeded excess.
    Consider::

    extern int64_t a,b,c,d;

    and some code:

    { d = a+b+c; }

The compiler has to assume that a is in a different module than b or
c or d; and has to generate::
    Why does the compiler need to assume anything?

    It simply issues
    Load Register Ra from "a symbol table reference for a"
    Load register Rb from "a symbol table reference for b"
    etc.

    The linker determines that the address refers to a
    symbol in a shared object (or a different object file
included in the link) and generates the appropriate
    code (GOT reference, PC-relative or absolute as
    necessary).
    The extern's are in a dynamically loaded module, and the .text
    section/segment is PIC. So, ld.so is not allowed to write over
    the current displacement.

    The static linker does the code transformation, not the
    run-time dynamic linker (which just updates the PLT/GOT).

    Here is an illustrative example:

    This C++ code invokes the 'dlsym' function, which is hosted
    in a dynamically linked shared object (libdl.so):


sym = (get_dlp_t)dlsym(handle, "get_dlp");
if (sym == NULL) {
    lp->log("Invalid DLP shared object format: %s\n", dlerror());
    unregister_handle(channel);
    dlclose(handle);
    return 1;
}

    ===========================================
    g++ generates this assembler code:

    ...
    movq %rax, 784(%r13,%rbx,8)
    ..L14:
    ..LBE39:
    ..LBE38:
    .loc 2 118 0
    movl $.LC7, %esi
    movq %r14, %rdi
    call dlsym
    ..LVL19:
    .loc 2 119 0
    testq %rax, %rax
    je .L24

    .....

    ===========================================
    The linker (ld command) generated the following trampoline and
    corresponding trampoline invocation.

    000000000040ccc0 <dlsym@plt>:
    40ccc0: ff 25 12 56 23 00 jmpq *0x235612(%rip) # 6422d8 <_GLOBAL_OFFSET_TABLE_+0x2d8>
    40ccc6: 68 58 00 00 00 pushq $0x58
    40cccb: e9 60 fa ff ff jmpq 40c730 <_init+0x20>
    ....

    412686: 0f 84 89 01 00 00 je 412815 <c_mp::channel(int, char const**, c_logger*)+0x255>
    41268c: 48 83 fb 63 cmp $0x63,%rbx
    412690: 77 08 ja 41269a <c_mp::channel(int, char const**, c_logger*)+0xda>
    412692: 49 89 84 dd 10 03 00 mov %rax,0x310(%r13,%rbx,8)
    412699: 00
    41269a: be 0d 66 43 00 mov $0x43660d,%esi
    41269f: 4c 89 f7 mov %r14,%rdi
    4126a2: e8 19 a6 ff ff callq 40ccc0 <dlsym@plt>
    4126a7: 48 85 c0 test %rax,%rax


    This illustrates the PLT jump table usage but Mitch's question was
    I think with regard to variables exported by a shared library,
    say errno exported by the CRTLIB C runtime library,
    verses extern variables in the main link module.

How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs (load the GOT entry, then
the value), versus a direct PC-rel memref for extern a, b, c?

    The compiler doesn't care. Absent threads, it generates a simple
    reference to the errno symbol and lets the linker handle resolving
    it.


    In the thread case:

    /usr/include/bits/errno.h:

    extern int *__errno_location (void) __THROW __attribute__ ((__const__));

    # if !defined _LIBC || defined _LIBC_REENTRANT
    /* When using threads, errno is a per-thread value. */
    # define errno (*__errno_location ())
    # endif
    # endif /* !__ASSEMBLER__ */
    #endif /* _ERRNO_H */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Fri Jun 27 15:09:50 2025
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:

    This illustrates the PLT jump table usage but Mitch's question was
    I think with regard to variables exported by a shared library,
    say errno exported by the CRTLIB C runtime library,
    verses extern variables in the main link module.

How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs (load the GOT entry, then
the value), versus a direct PC-rel memref for extern a, b, c?

    The compiler doesn't care. Absent threads, it generates a simple
    reference to the errno symbol and lets the linker handle resolving
    it.


    In the thread case:

    /usr/include/bits/errno.h:

    extern int *__errno_location (void) __THROW __attribute__ ((__const__));

    # if !defined _LIBC || defined _LIBC_REENTRANT
    /* When using threads, errno is a per-thread value. */
    # define errno (*__errno_location ())
    # endif
    # endif /* !__ASSEMBLER__ */
    #endif /* _ERRNO_H */

    That replaces a memory reference with a function call.
    Compiled on godbolt with GCC x86-64 trunk -O3

    #include "errno.h"

    long GetErrno (void)
    { return errno;
    }

    "GetErrno()":
    sub rsp, 8
    call "__errno_location"
    movsx rax, DWORD PTR [rax]
    add rsp, 8
    ret

    What I am asking about is below.
    Here are two variables, one is in a DLL export and therefore an
    inter-module reference that requires an extra MOV to load the address
    from the import table (what Linux calls the GOT), and the other is a
    a regular intra-module variable directly accessed with one PC-rel MOV.

    How does GCC import a variable exported from a shared module?

    Compiled on godbolt MSVC v19.2 -O2

    __declspec(dllimport) long dllVar;
    extern long exeVar;

    long GetDllVar (void)
    { return dllVar;
    }

    long GetDllVar(void) PROC ; GetDllVar, COMDAT
    mov rax, QWORD PTR __imp_long dllVar
    mov eax, DWORD PTR [rax]
    ret 0

    long GetExeVar (void)
    { return exeVar;
    }

    long GetExeVar(void) PROC ; GetExeVar, COMDAT
    mov eax, DWORD PTR long exeVar ; exeVar
    ret 0

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Fri Jun 27 21:01:33 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:

    This illustrates the PLT jump table usage but Mitch's question was
    I think with regard to variables exported by a shared library,
    say errno exported by the CRTLIB C runtime library,
    verses extern variables in the main link module.

How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs (load the GOT entry, then
the value), versus a direct PC-rel memref for extern a, b, c?

    The compiler doesn't care. Absent threads, it generates a simple
    reference to the errno symbol and lets the linker handle resolving
    it.


    In the thread case:

    /usr/include/bits/errno.h:

    extern int *__errno_location (void) __THROW __attribute__ ((__const__));

    # if !defined _LIBC || defined _LIBC_REENTRANT
    /* When using threads, errno is a per-thread value. */
    # define errno (*__errno_location ())
    # endif
    # endif /* !__ASSEMBLER__ */
    #endif /* _ERRNO_H */

    That replaces a memory reference with a function call.

    Yes, that's pretty clear from the above fragment of errno.h.

    Clearly the linker has the freedom to recognize "__errno_location()"
and alter things as necessary. In this case, it is implemented as
    a real function call. I can envision an implementation that
    replaced the function call with a reference to a thread-specific variable when compiled and linked with the proper options (e.g. when linked with
    -lpthread).

    For global data, take the math.h 'signgam' for instance:


    Compiled on godbolt with GCC x86-64 trunk -O3

    #include "errno.h"

    long GetErrno (void)
    { return errno;
    }

    "GetErrno()":
    sub rsp, 8
    call "__errno_location"
    movsx rax, DWORD PTR [rax]
    add rsp, 8
    ret

    What I am asking about is below.
    Here are two variables, one is in a DLL export and therefore an
    inter-module reference that requires an extra MOV to load the address
    from the import table (what Linux calls the GOT), and the other is a
    a regular intra-module variable directly accessed with one PC-rel MOV.

    How does GCC import a variable exported from a shared module?

    In this example, signgam is a variable (int) exported from libm.so.

    $ cat /tmp/a.c
    #include <math.h>

    int
    main(int argc, const char **argv, const char **envp)
    {
    double d = 0.135;

    signgam = 3u;

    (void) trunc(d);

    return signgam;
    }

    $ cc -S -D_USE_XOPEN /tmp/a.c

    main:
    .LFB0:
    .cfi_startproc
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    movl %edi, -20(%rbp)
    movq %rsi, -32(%rbp)
    movq %rdx, -40(%rbp)
    movsd .LC0(%rip), %xmm0
    movsd %xmm0, -8(%rbp)
    movl $3, signgam(%rip)
    movl signgam(%rip), %eax
    popq %rbp
    .cfi_def_cfa 7, 8

    The assembler pass doesn't change the output presented by
    the compiler, it just adds 'signgam' to the undefined symbol
    table in the resulting object file and generates the rip-relative
    reference pointing to the symbol table entry for fixup by the
    linker.

    After linking with -lm:

    0000000000401106 <main>:
    401106: 55 push %rbp
    401107: 48 89 e5 mov %rsp,%rbp
    40110a: 89 7d ec mov %edi,-0x14(%rbp)
    40110d: 48 89 75 e0 mov %rsi,-0x20(%rbp)
    401111: 48 89 55 d8 mov %rdx,-0x28(%rbp)
    401115: f2 0f 10 05 bb 10 00 movsd 0x10bb(%rip),%xmm0 # 4021d8 <__dso_handle+0x8>
    40111c: 00
    40111d: f2 0f 11 45 f8 movsd %xmm0,-0x8(%rbp)
    401122: c7 05 d8 2e 00 00 03 movl $0x3,0x2ed8(%rip) # 404004 <__signgam@GLIBC_2.23>
    401129: 00 00 00
    40112c: 8b 05 d2 2e 00 00 mov 0x2ed2(%rip),%eax # 404004 <__signgam@GLIBC_2.23>
    401132: 5d pop %rbp
    401133: c3 ret

    $ ldd /tmp/a
    linux-vdso.so.1 (0x00007f19a941d000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f19a9314000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f19a9120000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f19a941f000)

    $ nm -D /lib64/libm.so.6 | grep signgam
    00000000000e5008 B __signgam@@GLIBC_2.23
    00000000000e5008 V signgam@@GLIBC_2.2.5

    $ objdump -x /tmp/a
    Sections:
    Idx Name Size VMA LMA File off Algn
    ...

    19 .got 00000010 0000000000403fd8 0000000000403fd8 00002fd8 2**3
    CONTENTS, ALLOC, LOAD, DATA
    20 .got.plt 00000018 0000000000403fe8 0000000000403fe8 00002fe8 2**3
    CONTENTS, ALLOC, LOAD, DATA
    21 .data 00000004 0000000000404000 0000000000404000 00003000 2**0
    CONTENTS, ALLOC, LOAD, DATA
    22 .bss 0000000c 0000000000404004 0000000000404004 00003004 2**2
    ALLOC


0x2ed8 + %rip lands directly in the .bss region (0x404004).

    (gdb) x/d 0x404004
    0x404004 <signgam@GLIBC_2.2.5>: 3

The linker allocated space in the bss for the signgam variable
    "exported" by libm.so. All the compiler did was tell the linker
    which symbol to reference - the compiler doesn't know that signgam
    is in another object file, an archive library or a shared object.

    A similar fixup would be made to the .data section if the library
    had pre-initialized 'signgam' rather than just leaving it
    uninitialized.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Fri Jun 27 21:44:32 2025
    According to EricP <ThatWouldBeTelling@thevillage.com>:
    extern int *__errno_location (void) __THROW __attribute__ ((__const__));

    # if !defined _LIBC || defined _LIBC_REENTRANT
    /* When using threads, errno is a per-thread value. */
    # define errno (*__errno_location ())
    # endif
    # endif /* !__ASSEMBLER__ */
    #endif /* _ERRNO_H */

    That replaces a memory reference with a function call.

    Not really. On all of the Unix-like systems I know, errno is a macro
    wrapped around a function call that fetches the most recent error in
    the current thread, done that way to avoid breaking old programs
    written back before threads when errno was an extern int. It's a
    peculiar special case and I don't offhand know of anything else like
    that.

    Here's the ABI manual for amd64 systems:

    http://refspecs.linux-foundation.org/elf/x86_64-abi-0.95.pdf

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stephen Fuld@21:1/5 to EricP on Fri Jun 27 22:41:54 2025
    On 6/27/2025 5:33 AM, EricP wrote:

    snip

    This is the one example where I considered adding an indirect
    addressing mode enabled by 1 bit on all LD and ST instructions
    as it eliminates the need to emit different code sequences for
    intra-module and inter-module accesses.

    The compiler always emits a PC-rel address. Later the linker discovers
that it is a reference to a DLL export variable and sets the offset to be
    to the address in the GOT table and sets the Indirect bit on the LD/ST.

    An Indirect address mode has side effects for the Load Store Queue but
    they are mostly the same as if there were two separate instructions.

    But you are now allowing two cache, tlb or even page misses within one instruction, which complicates things considerably. Yes, it is
    "somewhat" the same as two separate instructions, except you have to
    keep track of which of the two you got and get the pipeline back to the
correct place, etc. There is a reason why one of the main tenets of
    RISC is only one memory address per instruction.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Stephen Fuld on Sat Jun 28 07:45:23 2025
    On 2025-06-28, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 6/27/2025 5:33 AM, EricP wrote:

    snip

    This is the one example where I considered adding an indirect
    addressing mode enabled by 1 bit on all LD and ST instructions
    as it eliminates the need to emit different code sequences for
    intra-module and inter-module accesses.

    The compiler always emits a PC-rel address. Later the linker discovers
that it is a reference to a DLL export variable and sets the offset to be
    to the address in the GOT table and sets the Indirect bit on the LD/ST.

    An Indirect address mode has side effects for the Load Store Queue but
    they are mostly the same as if there were two separate instructions.

    But you are now allowing two cache, tlb or even page misses within one instruction, which complicates things considerably.

    So does RISC-V with compressed instructions. POWER doesn't.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Scott Lurndal on Sat Jun 28 09:07:10 2025
    On 2025-06-27, Scott Lurndal <scott@slp53.sl.home> wrote:

    Clearly the linker has the freedom to recognize "__errno_location()"
and alter things as necessary. In this case, it is implemented as
    a real function call. I can envision an implementation that
    replaced the function call with a reference to a thread-specific variable when
    compiled and linked with the proper options (e.g. when linked with -lpthread).

    That is not what linkers are supposed to do (unless you use
link-time optimization, which is a bit of a misnomer).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to ThatWouldBeTelling@thevillage.com on Sat Jun 28 07:02:47 2025
    On Fri, 27 Jun 2025 07:48:57 -0400, EricP
    <ThatWouldBeTelling@thevillage.com> wrote:


How does the compiler know that it needs to go through the GOT table
to access errno, which requires two mem refs (load the GOT entry, then
the value), versus a direct PC-rel memref for extern a, b, c?


    For quite a long time now, errno has been thread-local. errno is a
    macro that accesses the current thread's private copy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to ThatWouldBeTelling@thevillage.com on Sat Jun 28 08:05:37 2025
    On Fri, 27 Jun 2025 08:33:40 -0400, EricP
    <ThatWouldBeTelling@thevillage.com> wrote:


    For example, in Microsoft one can mark the DLL export variable with
    __declspec(dllexport) int errno;

    But how many DLLs actually export data globally? That too easily can
make the DLL non-reentrant, which greatly limits its usefulness.

    [I know non-reentrant DLLs were a thing ... back with Windows 3.x. I
    wrote some applications back in the day that had to work around it by
    loading multiple copies of a particular device's API DLL so that the
    programs could control multiple instances of the device.
    I haven't encountered anything like that for decades.]

    Windows DLLs can have their own private heap(s) too, but almost all
    choose to use the program's heap instead. Keeping track of private
    heap allocations on behalf of multiple programs so you can clean up
    if/when they terminate is just a lot of extra programmer effort.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Jun 28 12:00:23 2025
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    But you are now allowing two cache, tlb or even page misses within one
    instruction, which complicates things considerably.

    So does RISC-V with compressed instructions. POWER doesn't.

    Compressed instructions impose the requirement of up to two cache-line accesses (and, consequently, up to two TLB or cache misses) for
    instruction fetch. Instruction sets with fixed-size instructions
    indeed do not have this requirement.

    Power10 has prefixed instructions, which are 64 bits, which serve
    to access 34-bit constants. To quote version 3.1 of the ISA:

    "Prefixed instructions do not cross 64-byte instruction address
    boundaries. When a prefixed instruction crosses a 64-byte boundary,
    the system alignment error handler is invoked."

    In practice, that means that functions have to be aligned to a
    64-byte boundary (presumably a cache line) and that the occasional
    nop may be required; prefixed instructions aren't all that common.
    It is fairly trivial to add that requirement to an assembler.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Thomas Koenig on Sat Jun 28 11:11:08 2025
    Thomas Koenig <tkoenig@netcologne.de> writes:
    But you are now allowing two cache, tlb or even page misses within one
    instruction, which complicates things considerably.

    So does RISC-V with compressed instructions. POWER doesn't.

    Compressed instructions impose the requirement of up to two cache-line
    accesses (and, consequently, up to two TLB or cache misses) for
    instruction fetch. Instruction sets with fixed-size instructions
    indeed do not have this requirement.

Supporting unaligned data accesses imposes the requirement of up to two
    cache-line accesses (and, consequently, up to two TLB or cache misses)
    for data accesses (including stores). Power has this support, as has
    every other modern general-purpose architecture.

    Of course, if you added one level of indirection, that would double
    the number of potential memory accesses on the data side.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sat Jun 28 16:01:29 2025
    On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    But you are now allowing two cache, tlb or even page misses within one
    instruction, which complicates things considerably.

    So does RISC-V with compressed instructions. POWER doesn't.

    Compressed instructions impose the requirement of up to two cache-line accesses (and, consequently, up to two TLB or cache misses) for
    instruction fetch. Instruction sets with fixed-size instructions
    indeed do not have this requirement.

    One can allow line crossing compressed (or extended) instructions
    while still disallowing page crossing of the same. You just have
    to decide what is right for your architecture.

Supporting unaligned data accesses imposes the requirement of up to two
    cache-line accesses (and, consequently, up to two TLB or cache misses)
    for data accesses (including stores). Power has this support, as has
    every other modern general-purpose architecture.

    Of course, if you added one level of indirection, that would double
    the number of potential memory accesses on the data side.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Sat Jun 28 16:30:58 2025
    According to Thomas Koenig <tkoenig@netcologne.de>:
    On 2025-06-27, Scott Lurndal <scott@slp53.sl.home> wrote:

    Clearly the linker has the freedom to recognize "__errno_location()"
and alter things as necessary. In this case, it is implemented as
    a real function call. I can envision an implementation that
    replaced the function call with a reference to a thread-specific variable when
    compiled and linked with the proper options (e.g. when linked with
    -lpthread).

    That is not what linkers are supposed to do (unless you use
link-time optimization, which is a bit of a misnomer).

    I don't know any linker that does that. As I said yesterday, errno
    is an odd special case that uses a C macro to wrap a function call.
    It's not the way anyone does normal inter-module references.

    Yesterday's message had a link to the amd64 ABI manual for anyone who
    wonders how this actually works.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Sun Jun 29 14:54:28 2025
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
But you are now allowing two cache, tlb or even page misses within one
instruction, which complicates things considerably.

    So does RISC-V with compressed instructions. POWER doesn't.

    Compressed instructions impose the requirement of up to two cache-line
    accesses (and, consequently, up to two TLB or cache misses) for
    instruction fetch. Instruction sets with fixed-size instructions
    indeed do not have this requirement.

    One can allow line crossing compressed (or extended) instructions
    while still disallowing page crossing of the same. You just have
    to decide what is right for your architecture.

    That is difficult to implement in assemblers and linkers.
    Unless somebody wants to align each function, or at least each
    translation unit, on a page boundary, the linker then would have
    to insert NOPs for those rare cases where, after linking, a page
    boundary is crossed.

    And once you have put in the nop, you need to recheck all branches
    if they are still in range, and you have to do a full relocation
    on your code, including debug info and everything else.

    And bugs in there will occur only rarely, so they will be difficult
    to find and debug.

    This is indeed possible (almost anything except skiing through a
    revolving door), but IMHO this is something to avoid.

  • From Terje Mathisen@21:1/5 to Thomas Koenig on Sun Jun 29 18:01:41 2025
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    But you are now allowing two cache, tlb or even page misses within
    one instruction, which complicates things considerably.

    So does RISC-V with compressed instructions. POWER doesn't.

    Compressed instructions impose the requirement of up to two cache-line
    accesses (and, consequently, up to two TLB or cache misses) for
    instruction fetch. Instruction sets with fixed-size instructions
    indeed do not have this requirement.

    One can allow line crossing compressed (or extended) instructions
    while still disallowing page crossing of the same. You just have
    to decide what is right for your architecture.

    That is difficult to implement in assemblers and linkers.
    Unless somebody wants to align each function, or at least each
    translation unit, on a page boundary, the linker then would have
    to insert NOPs for those rare cases where, after linking, a page
    boundary is crossed.

    And once you have put in the nop, you need to recheck all branches
    if they are still in range, and you have to do a full relocation
    on your code, including debug info and everything else.

    And bugs in there will occur only rarely, so they will be difficult
    to find and debug.

    This is indeed possible (almost anything except skiing through a
    revolving door), but IMHO this is something to avoid.

    Skiing through a revolving door is in fact possible, as long as you are
    using xc gear, since there the binding allow you to fold the skis up
    along your body. You just need the balance to be able to use the tail
    ends of your skis as stilts.

    Back in uni days (as members of the uni scouts group) we tried all sorts
    of funny stuff, including running a very hard obstacle course with xc
    skis on.

    I do agree that for most people/skiing gear/revolving doors, the
    combination is effectively impossible.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to Thomas Koenig on Sun Jun 29 13:21:39 2025
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    But you are now allowing two cache, tlb or even page misses within
    one instruction, which complicates things considerably.
    So does RISC-V with compressed instructions. POWER doesn't.
    Compressed instructions impose the requirement of up to two cache-line
    accesses (and, consequently, up to two TLB or cache misses) for
    instruction fetch. Instruction sets with fixed-size instructions
    indeed do not have this requirement.
    One can allow line crossing compressed (or extended) instructions
    while still disallowing page crossing of the same. You just have
    to decide what is right for your architecture.

    That is difficult to implement in assemblers and linkers.
    Unless somebody wants to align each function, or at least each
    translation unit, on a page boundary, the linker then would have
    to insert NOPs for those rare cases where, after linking, a page
    boundary is crossed.

    And once you have put in the nop, you need to recheck all branches
    if they are still in range, and you have to do a full relocation
    on your code, including debug info and everything else.

    And bugs in there will occur only rarely, so they will be difficult
    to find and debug.

    Compilers, assemblers and linkers have long supported alignment
    directives: memory sections on 64k boundaries, some data on page
    boundaries, routine entries on 16 bytes, loop starts on 4 bytes.
    An implied non-page-straddle alignment for instructions looks like a
    variation on the loop start alignment.

    It is more interesting when there are multiple offset sizes for branch,
    call or mem refs. For example, my ISA has 16, 32 and 64 bit offsets.

    Some time ago we were discussing compacting linkers and Ivan described
    the algorithm he uses. It was pretty straight forward so I built a
    little trial program in a few hours using x86 8 and 32 bit offsets.

    It has only 4 item kinds to consider: alignment directives,
    fixed size byte block declarations, zero sized symbol defs,
    variable sized symbol refs of 2 or 5 bytes, all items in a linked list.
    Each item has status bits (resolved flag, etc), and two address fields:
    the lowest possible address and highest possible address.
    If the largest possible offset between items always fits into the
    smallest bucket, or the smallest possible offset is always greater
    than the largest bucket, then those items resolve and are removed
    from the pending list.
    Then recalculate the lowest and highest addresses and check again.

    For testing I disassembled the trial program, hand coded the code and data blocks into a test table, and ran it through itself. Without any attempt at optimization it did a perfect compaction after 3 sweeps.

    If anyone is interested, the rate at which it compacts is determined by the bucket sizes - smaller buckets like 1 byte offset take longer to determine
    what can fit into them than larger ones because whether an individual offset
    is 1 or 4 bytes has more chance of affecting other compactions.
    In the worst case it could take N sweeps to compact N variable sized items. With offset buckets of 2, 4 or 8 bytes almost all items would pack into
    the 2 byte bucket on the first sweep leaving few items for the second sweep.

  • From EricP@21:1/5 to John Levine on Sun Jun 29 14:02:32 2025
    John Levine wrote:
    According to EricP <ThatWouldBeTelling@thevillage.com>:
    extern int *__errno_location (void) __THROW __attribute__ ((__const__));
    # if !defined _LIBC || defined _LIBC_REENTRANT
    /* When using threads, errno is a per-thread value. */
    # define errno (*__errno_location ())
    # endif
    # endif /* !__ASSEMBLER__ */
    #endif /* _ERRNO_H */
    That replaces a memory reference with a function call.

    Not really. On all of the Unix-like systems I know, errno is a macro
    wrapped around a function call that fetches the most recent error in
    the current thread, done that way to avoid breaking old programs
    written back before threads when errno was an extern int. It's a
    peculiar special case and I don't offhand know of anything else like
    that.

    I was just looking for a shared module export variable but errno was
    a poor choice because everyone has replaced it with a function call.

    Scott suggested signgam in math.h but that doesn't exist in Microsoft's
    math.h because MS is stuck on C-89 so not useful for comparing MS and GCC.

    Here's the ABI manual for amd64 systems:

    http://refspecs.linux-foundation.org/elf/x86_64-abi-0.95.pdf

    Thanks but I've got a copy v1.0 dated 6-Dec-2022.
    That one is a draft v0.95 from 2005.

    The issue I have with it is that while it does provide an overview of
    the address models, it does not describe how the whole mechanism works,
    how the compiler interacts with the linker and loader.
    Also it makes multiple references to x64 instructions that don't exist in
    Intel or AMD docs, LEAQ and MOVABS. It does not define them, does not show
    any instruction bytes, and does not reference any other docs that do so.

    I tried looking for a manual on GNU Assembler GAS and the only one is
    a PDF from 1995, and it only covers the ATT syntax not specific ISA's.
    I was unable to find a PDF manual for x86-64, or x64, or AMD64 anywhere.
    There is a web document at GNU but it has no search function
    and does not explain leaq or movabs.

    After a few hours of flopping about searching the web I think I have
    figured out how it works, specifically how the movabs works with the loader, and why the MS code for Windows is different from the GCC code for Linux,
    and how the MS compiler uses dllimport attribute and GCC does not.

    First about LEAQ and MOVABS.
    It seems LEAQ is the LEA Load Effective Address instruction with a data type attached to it so it knows the operand size and thus the address mode.
    This replaces the Intel B/W/D/QWORD PTR nomenclature.

    MOVABS is more complicated and actually has two versions,
    MOVABS and MOVABSx (where x is a data type b, w, d, or q).

    MOVABS (no type) is really Intel "MOV r64, imm64" which loads a 64-bit immediate into a register. MOVABS has nothing to do with absolute addresses except if the imm64 happens to be a relocatable symbol value then it
    can be patched by the loader, as with all such immediate symbols.

    MOVABSx (with type) is really Intel "MOV moffs, rAn" or "MOV rAn, moffs"
    where moffs is an 8, 16, 32 or 64-bit offset into a segment register,
    and for the default segment registers with a base of 0 that means the
    offset is really either a zero extended 32-bit or 64-bit absolute address,
    and rAn is registers AL, AX, EAX, RAX (depends on operand size). MOVABSx is
    a relocatable absolute address that loads or stores to/from an "A" register.

    continuing...

    The answer to my original question seems to be that MS always generates
    what GCC calls Position Independent Executable enabled with the -fPIE option, and MS always uses a large memory model whereas GCC must enable it.

    But also as GCC doesn't know if exeVar is an intra- or inter- module
    reference so it always has to generate a worst-case access for every
    program global variable. Because MS knows which global variables are dllimports
    it generates optimal code for intra- (RIP-rel) and inter- (GOT indirect)
    module references.

    extern long exeVar;

    long GetExeVar (void)
    { return exeVar;
    }

    Compiled with GCC x86-64 15.1 -O3 -fPIE -mcmodel=large

    GetExeVar():
    .L2:
    movabs r11, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_-.L2
    lea rax, .L2[rip]
    movabs rdx, OFFSET FLAT:exeVar@GOT
    add rax, r11
    mov rax, QWORD PTR [rax+rdx]
    mov rax, QWORD PTR [rax]
    ret

    (The above GCC code also doesn't look optimal. I don't see why it fiddles
    about calculating addresses when it should just use a RIP-rel
    load to pull the absolute address of exeVar from the GOT and then load it,
    as MS does with its imports table below.)

    Compiled with MSVC latest -O3
    Intra-module reference:

    long GetExeVar(void) PROC ; GetExeVar, COMDAT
    mov eax, DWORD PTR long exeVar ; exeVar
    ret 0

    Inter-module reference:

    __declspec(dllimport) long dllVar;

    long GetDllVar (void)
    { return dllVar;
    }

    long GetDllVar(void) PROC ; GetDllVar, COMDAT
    mov rax, QWORD PTR __imp_long dllVar
    mov eax, DWORD PTR [rax]
    ret 0

  • From Thomas Koenig@21:1/5 to EricP on Sun Jun 29 20:41:04 2025
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    But you are now allowing two cache, tlb or even page misses within
    one instruction, which complicates things considerably.
    So does RISC-V with compressed instructions. POWER doesn't.
    Compressed instructions impose the requirement of up to two cache-line
    accesses (and, consequently, up to two TLB or cache misses) for
    instruction fetch. Instruction sets with fixed-size instructions
    indeed do not have this requirement.
    One can allow line crossing compressed (or extended) instructions
    while still disallowing page crossing of the same. You just have
    to decide what is right for your architecture.

    That is difficult to implement in assemblers and linkers.
    Unless somebody wants to align each function, or at least each
    translation unit, on a page boundary, the linker then would have
    to insert NOPs for those rare cases where, after linking, a page
    boundary is crossed.

    And once you have put in the nop, you need to recheck all branches
    if they are still in range, and you have to do a full relocation
    on your code, including debug info and everything else.

    And bugs in there will occur only rarely, so they will be difficult
    to find and debug.

    Compilers, assemblers and linkers have long supported alignment directives.

    Sure, it's possible to align on a page boundary, like I wrote above.
    It is just something that you probably _want_ to avoid for every
    translation unit.

    [..]

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Sun Jun 29 20:50:56 2025
    On Sun, 29 Jun 2025 16:01:41 +0000, Terje Mathisen wrote:

    Thomas Koenig wrote:
    MitchAlsup1 <mitchalsup@aol.com> schrieb:
    On Sat, 28 Jun 2025 11:11:08 +0000, Anton Ertl wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    But you are now allowing two cache, tlb or even page misses within
    one instruction, which complicates things considerably.

    So does RISC-V with compressed instructions. POWER doesn't.

    Compressed instructions impose the requirement of up to two cache-line
    accesses (and, consequently, up to two TLB or cache misses) for
    instruction fetch. Instruction sets with fixed-size instructions
    indeed do not have this requirement.

    One can allow line crossing compressed (or extended) instructions
    while still disallowing page crossing of the same. You just have
    to decide what is right for your architecture.

    That is difficult to implement in assemblers and linkers.
    Unless somebody wants to align each function, or at least each
    translation unit, on a page boundary, the linker then would have
    to insert NOPs for those rare cases where, after linking, a page
    boundary is crossed.

    And once you have put in the nop, you need to recheck all branches
    if they are still in range, and you have to do a full relocation
    on your code, including debug info and everything else.

    And bugs in there will occur only rarely, so they will be difficult
    to find and debug.

    This is indeed possible (almost anything except skiing through a
    revolving door), but IMHO this is something to avoid.

    Skiing through a revolving door is in fact possible, as long as you are
    using xc gear, since there the binding allow you to fold the skis up
    along your body. You just need the balance to be able to use the tail
    ends of your skis as stilts.

    Drilling a hole in the air is also possible--as long as you don't
    mind the air re-filing the hole after you stop drilling, too.

    Back in uni days (as members of the uni scouts group) we tried all sorts
    of funny stuff, including running a very hard obstacle course with xc
    skis on.

    I do agree that for most people/skiing gear/revolving doors, the
    combination is effectively impossible.

    Terje

  • From Anton Ertl@21:1/5 to EricP on Mon Jun 30 06:21:32 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that don't exist
    in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.

    MOVABS imm64, r64 is AT&T syntax for Intel syntax MOV r64, imm64.

    It does not define them, does not show
    any instruction bytes, and does not reference any other docs that do so.

    The gas manual may be the best reference about AT&T syntax.

    I tried looking for a manual on GNU Assembler GAS and the only one is
    a PDF from 1995, and it only covers the ATT syntax not specific ISA's.

    Searching for "GNU as manual" gave me the link to <https://sourceware.org/binutils/docs/> where you can find links to
    the gas manual in different formats. And searching for "AT&T" in the
    table of contents brings up three subsections of <https://sourceware.org/binutils/docs/as/i386_002dDependent.html>

    I was unable to find a PDF manual for x86-64, or x64, or AMD64 anywhere.

    The section mentioned above says:

    |The i386 version as supports both the original Intel 386 architecture
    |in both 16 and 32-bit mode as well as AMD x86-64 architecture
    |extending the Intel architecture to 64-bits.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Michael S@21:1/5 to EricP on Mon Jun 30 13:55:03 2025
    On Sun, 29 Jun 2025 14:02:32 -0400
    EricP <ThatWouldBeTelling@thevillage.com> wrote:

    John Levine wrote:
    According to EricP <ThatWouldBeTelling@thevillage.com>:
    extern int *__errno_location (void) __THROW __attribute__
    ((__const__));

    # if !defined _LIBC || defined _LIBC_REENTRANT
    /* When using threads, errno is a per-thread value. */
    # define errno (*__errno_location ())
    # endif
    # endif /* !__ASSEMBLER__ */
    #endif /* _ERRNO_H */
    That replaces a memory reference with a function call.

    Not really. On all of the Unix-like systems I know, errno is a macro wrapped around a function call that fetches the most recent error in
    the current thread, done that way to avoid breaking old programs
    written back before threads when errno was an extern int. It's a
    peculiar special case and I don't offhand know of anything else like
    that.

    I was just looking for a shared module export variable but errno was
    a poor choice because everyone has replaced it with a function call.

    Scott suggested signgam in math.h but that doesn't exist in
    Microsoft's math.h because MS is stuck on C-89 so not useful for
    comparing MS and GCC.


    This particular case is not related to Microsoft's refusal to support
    certain parts of C99 (mostly those parts that were made optional in C11).
    signgam was never a part of any C standard. It's a POSIX extension.

    Here's the ABI manual for amd64 systems:

    http://refspecs.linux-foundation.org/elf/x86_64-abi-0.95.pdf

    Thanks but I've got a copy v1.0 dated 6-Dec-2022.
    That one is a draft v0.95 from 2005.

    The issue I have with it is that while it does provide an overview of
    the address models, it does not describe how the whole mechanism
    works, how the compiler interacts with linker and loader.
    Also it makes multiple references to x64 instructions that don't
    exist in Intel or AMD docs, LEAQ and MOVABS. It does not define them,
    does not show any instruction bytes, and does not reference any other
    docs that do so.


    When you are not sure about the meaning of AT&T mnemonics you can utilize
    objdump (or Microsoft's dumpbin) to see your object files in Intel asm.
    objdump -d -M intel yourfile.o
    It works with exe files as well.

    I tried looking for a manual on GNU Assembler GAS and the only one is
    a PDF from 1995, and it only covers the ATT syntax not specific ISA's.
    I was unable to find a PDF manual for x86-64, or x64, or AMD64
    anywhere. There is a web document at GNU but it has no search function
    and does not explain leaq or movabs.

    After a few hours of flopping about searching the web I think I have
    figured out how it works, specifically how the movabs works with the
    loader, and why the MS code for Windows is different from the GCC
    code for Linux, and how the MS compiler uses dllimport attribute and
    GCC does not.

    First about LEAQ and MOVABS.
    It seems LEAQ is the LEA Load Effective Address instruction with a
    data type attached to it so it knows the operand size and thus the
    address mode. This replaces the Intel B/W/D/QWORD PTR nomenclature.

    MOVABS is more complicated and actually has two versions,
    MOVABS and MOVABSx (where x is a data type b, w, d, or q).

    MOVABS (no type) is really Intel "MOV r64, imm64" which loads a 64-bit immediate into a register. MOVABS has nothing to do with absolute
    addresses except if the imm64 happens to be a relocatable symbol
    value then it can be patched by the loader, as with all such
    immediate symbols.

    MOVABSx (with type) is really Intel "MOV moffs, rAn" or "MOV rAn,
    moffs" where moffs is an 8, 16, 32 or 64-bit offset into a segment
    register, and for the default segment registers with a base of 0 that
    means the offset is really either a zero extended 32-bit or 64-bit
    absolute address, and rAn is registers AL, AX, EAX, RAX (depends on
    operand size). MOVABSx is a relocatable absolute address that loads
    or stores to/from an "A" register.

    continuing...

    The answer to my original question seems to be that MS always
    generates what GCC calls Position Independent Executable enabled with
    the -fPIE option, and MS always uses a large memory model whereas GCC
    must enable it.

    But also as GCC doesn't know if exeVar is an intra- or inter- module reference so it always has to generate a worst case access for every
    program global variable. Because MS knows which global variables
    are dllimports it generates optimal code for intra- (RIP-rel) and
    inter- (GOT indirect) module references.

    extern long exeVar;

    long GetExeVar (void)
    { return exeVar;
    }

    Compiled with GCC x86-64 15.1 -O3 -fPIE -mcmodel=large

    GetExeVar():
    .L2:
    movabs r11, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_-.L2
    lea rax, .L2[rip]
    movabs rdx, OFFSET FLAT:exeVar@GOT
    add rax, r11
    mov rax, QWORD PTR [rax+rdx]
    mov rax, QWORD PTR [rax]
    ret

    (The above GCC code also doesn't look optimal. I don't see why it
    fiddles about calculating addresses when it should just
    use a RIP-rel load to pull the absolute address of exeVar from the
    GOT and then load it, as MS does with its imports table below.)

    Compiled with MSVC latest -O3
    Intra-module reference:

    long GetExeVar(void) PROC ; GetExeVar,
    COMDAT mov eax, DWORD PTR long exeVar ; exeVar
    ret 0

    Inter-module reference:

    __declspec(dllimport) long dllVar;

    long GetDllVar (void)
    { return dllVar;
    }

    long GetDllVar(void) PROC ; GetDllVar,
    COMDAT mov rax, QWORD PTR __imp_long dllVar
    mov eax, DWORD PTR [rax]
    ret 0

  • From Scott Lurndal@21:1/5 to EricP on Mon Jun 30 17:13:51 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that don't exist
    in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.


    And since this is the basis for the ABI design for a processor
    that runs almost all the desktops and servers on the planet,

    Desktops, yes. That's changing slowly, although now with
    the orange clown in charge - foreign government entities are
    moving away from software controlled by American companies,
    and I expect the desktop migration rate from windows to linux to increase considerably over the next decade.

    Servers, not by a large margin. Linux (and a handful of
    proprietary unix and linux servers, including Z-series) provide most
    of the servers (excepting on-prem exchange, sharepoint
    and AD systems). 60% of compute in Azure, for example,
    are linux cores. The ratio is even larger in google (90% linux)
    oracle and amazon(90% linux) cloud operations.

    In terms of pure server numbers, windows is likely less than
    20% globally.

  • From EricP@21:1/5 to Anton Ertl on Mon Jun 30 12:51:13 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that don't exist in
    Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.

    Yes, and an LEA instruction which just calculates an address
    needs an operand size suffix... why?
    Because it is not documented.

    MOVABS imm64, r64 is AT&T syntax for Intel syntax MOV r64, imm64.

    It does not define them, does not show
    any instruction bytes, and does not reference any other docs that do so.

    The gas manual may be the best reference about AT&T syntax.

    I tried looking for a manual on GNU Assembler GAS and the only one is
    a PDF from 1995, and it only covers the ATT syntax not specific ISA's.

    Searching for "GNU as manual" gave me the link to <https://sourceware.org/binutils/docs/> where you can find links to
    the gas manual in different formats. And searching for "AT&T" in the
    table of contents brings up three subsections of <https://sourceware.org/binutils/docs/as/i386_002dDependent.html>

    Yes that is the unsearchable web page manual I referred to.

    I was unable to find a PDF manual for x86-64, or x64, or AMD64 anywhere.

    The section mentioned above says:

    |The i386 version as supports both the original Intel 386 architecture
    |in both 16 and 32-bit mode as well as AMD x86-64 architecture
    |extending the Intel architecture to 64-bits.

    - anton

    Yes, it does say that doesn't it.

    It says about MOVABS in section '9.16.3 i386 Syntactical Considerations', '9.16.3.1 AT&T Syntax versus Intel Syntax':

    "In 64-bit code, ‘movabs’ can be used to encode the ‘mov’ instruction
    with the 64-bit displacement or immediate operand."

    but doesn't say why or how, or mention any variants,
    or which actual x64 instruction(s) it actually maps to.

    The section '9.16.4 i386-Mnemonics' doesn't mention MOVABS or LEA at all.

    In x64 mode it is important to know exactly when you are dealing
    with a "displacement" which are 8 or 32 bits sign extended to 64 bits,
    and an "address" which are zero extended to 64 bits, if necessary.
    Or what instructions support which variants.
    (That's why some code models only work on 2GB and others are 4GB,
    or only certain MOV instructions actually support 64-bit addresses.)

    And since this is the basis for the ABI design for a processor
    that runs almost all the desktops and servers on the planet,
    it seems to me that it was at least worth a mention.

  • From MitchAlsup1@21:1/5 to EricP on Mon Jun 30 17:11:46 2025
    On Mon, 30 Jun 2025 16:51:13 +0000, EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that don't exist
    in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.

    Yes, and an LEA instruction which just calculates an address
    needs an operand size suffix... why?

    LEA needs to distinguish between:
    LEA Rd,[Rb+DISP16]
    LEA Rd,[Rb+Ri<<s]
    LEA Rd,[Rb+DISP32]
    LEA Rd,[Rb+DISP64]
    LEA Rd,[Rb+Ri<<s+DISP32]
    LEA Rd,[Rb+Ri<<s+DISP64]

    Because it is not documented.

    {Gomer Pyle mode = ON}

    Surprise, surprise, surprise !

  • From Anton Ertl@21:1/5 to EricP on Mon Jun 30 16:08:05 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [Alpha]
    I suspect this is because almost every code or data address would
    require a 6 instruction sequence to load it into a register for use,

    How do you compute that? When I looked at the code produced for
    Alpha, I got the impression that they wanted to support arbitrarily
    large programs and they generated such code by default, but IIRC the
    typical code for loading an absolute addres was by loading it from the
    global table of the current function; so it requires typically 1 load
    (and a 64-bit value in the global table). It also requires setting up
    the global pointer on every function entry and after every call, but
    that can be amortized over several accesses to the global table.

    Yes, it is (also) using an extra memory load to pick up large immediates.
    It also requires a BAL to get the IP into a register.

    I have finally gotten around to turning on our working Alpha and
    compiled the following program on it:

    #include <math.h>
    extern int a, b;
    extern double x, y;
    extern void foo(void);

    int main()
    {
    foo();
    x = lgamma(y);
    a += signgam;
    a += b;
    return 0;
    }

    I have compiled this and the file containing a, b, x, y, and foo with
    gcc -Wall -O -fPIC and then linked the two files with the same options.

    The result on Alpha is:

    0000000120000620 <main>:
    120000620: 02 00 bb 27 ldah gp,2(t12)
    120000624: 48 85 bd 23 lda gp,-31416(gp)
    120000628: f0 ff de 23 lda sp,-16(sp)
    12000062c: 00 00 5e b7 stq ra,0(sp)
    120000630: 00 00 fe 2f unop
    120000634: 17 00 40 d3 bsr ra,120000694 <foo>
    120000638: 00 00 fe 2f unop
    12000063c: 00 00 fe 2f unop
    120000640: 68 80 3d 20 lda t0,-32664(gp)
    120000644: 00 00 01 8e ldt $f16,0(t0)
    120000648: 08 80 7d a7 ldq t12,-32760(gp)
    12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
    120000650: 02 00 ba 27 ldah gp,2(ra)
    120000654: 18 85 bd 23 lda gp,-31464(gp)
    120000658: 60 80 3d 20 lda t0,-32672(gp)
    12000065c: 00 00 01 9c stt $f0,0(t0)
    120000660: 58 80 7d 20 lda t2,-32680(gp)
    120000664: 00 00 43 a0 ldl t1,0(t2)
    120000668: 10 80 3d a4 ldq t0,-32752(gp)
    12000066c: 00 00 21 a0 ldl t0,0(t0)
    120000670: 02 04 41 40 addq t1,t0,t1
    120000674: 5c 80 3d 20 lda t0,-32676(gp)
    120000678: 00 00 21 a0 ldl t0,0(t0)
    12000067c: 02 04 41 40 addq t1,t0,t1
    120000680: 00 00 43 b0 stl t1,0(t2)
    120000684: 00 04 ff 47 clr v0
    120000688: 00 00 5e a7 ldq ra,0(sp)
    12000068c: 10 00 de 23 lda sp,16(sp)
    120000690: 01 80 fa 6b ret

    The load of signgam is achieved with the following sequence

    120000668: 10 80 3d a4 ldq t0,-32752(gp)
    12000066c: 00 00 21 a0 ldl t0,0(t0)

    I.e., 2 instructions, not 6.

    It's interesting that a, b, x, y (which end up in the same linked unit
    as main()) result in code like

    120000660: 58 80 7d 20 lda t2,-32680(gp)
    120000664: 00 00 43 a0 ldl t1,0(t2)

    which could be implemented more efficiently as

    ldl t1,-32680(gp)

    but apparently the linker just fixes up the first instruction of the
    pair (either as ldq or as lda, maybe also as ldah), and maybe the
    offset of the second instruction (but not in this example); the
    benefit is that the linker just has to replace some instructions, but
    it does not have to shrink or expand the code (which would require
    changing even more instructions).

    We also see that the call to foo() within the same linked unit is
    linked as

    120000630: 00 00 fe 2f unop
    120000634: 17 00 40 d3 bsr ra,120000694 <foo>
    120000638: 00 00 fe 2f unop
    12000063c: 00 00 fe 2f unop

    whereas the call to lgamma() in a shared library is linked as

    120000648: 08 80 7d a7 ldq t12,-32760(gp)
    12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
    120000650: 02 00 ba 27 ldah gp,2(ra)
    120000654: 18 85 bd 23 lda gp,-31464(gp)

    One can see again that both code sequences have the same size, to
    avoid shrinking or expanding.

    The gcc manual says about these:

    | When '-msmall-data' is used,
    | the compiler can assume that all local symbols share the same '$gp'
    | value, and thus reduce the number of instructions required for a
    | function call from 4 to 1.

    We see the 4 and 1 instructions above, but it's not clear to me that
    there is a real benefit. The compiler cannot assume that an external
    reference is local, and the linker knows, but does not benefit from
    it. And for references within a compilation unit, I would hope that
    the compiler/assembler manages to use the smallest variant based on
    actual size.

    Why burden all programs with the costs of large programs the way it
    is done by default on Alpha?
    ...
    I'm not saying there shouldn't be optimizations for smaller sizes.
    I'm pointing to the fact that to actually USE the 64-bit address space
    there is a large increase in code size and execute cost,
    and asking if that had to be so.

    I can use 100 GB arrays with code that is the same size as code that
    limits itself to the lower 2GB of address space (there is an option on
    Alpha compilers and linkers for that).

    For example, for Alpha to load a 64-bit constant requires 6 instructions,
    24 bytes.

    I forgot to add this to the program, maybe tomorrow.

    That sequence is too large so they are pretty much forced
    to use an extra LDQ to pull the offset from the constant table
    located just prior to the routine entry point and requires an extra
    BAL to copy the RIP into a register as a base.

    That's not necessary. The global pointer is derived from the function
    address (in t12) on entry to the function and from the return address
    (in ra) after a jsr.

    The LDQ touches the same address space as the code but now as data
    so it has to load the D-TLB with an entry redundant with I-TBL,
    and bring in a data cache line with the constants.

No, the global table is elsewhere; in the case above it's about
    96KB behind the start of main(). The text ends a few hundred bytes
    later, so there is no page that contains both code and data (i.e., no
    TLB entries that describe the same page, not that this would be a
    problem).

And after the constant is loaded it must be manually added to the base
because there is no LD/ST combined with a scaled index.

    Which base? You were only mentioning constants up to now.

Furthermore, the actual load or store of the target value is serially
dependent on the LDQ offset and the ADD. Back when the load-to-use latency
for a cache hit was 1 clock that might have looked OK, but now that it is
3 or 4 clocks it is a serious penalty.

    This century, you use an OoO CPU (even the last Alpha was OoO), and
the 4-5 clock latency of a load is added to the ready time of the
    base address, i.e., gp in this case. gp only changes on far calls, so
    loading from a gp-relative address is rarely in the critical
    dependence path.

By making it a priority for relatively cheap access to the full 64-bit
address space during an ISA design, what alternatives might have minimized
its extra cost?

    It would certainly be an interesting experiment to see how much size
    and speed difference we would get if we eliminated the "mov r64,
    imm64" instruction when compiling to AMD64 and used Alpha-like
    techniques instead. My guess is: barely measurable.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Scott Lurndal on Tue Jul 1 13:11:44 2025
    On Mon, 30 Jun 2025 17:13:51 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that don't
    exist in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.


    And since this is the basis for the ABI design for a processor
    that runs almost all the desktops and servers on the planet,

    Desktops, yes. That's changing slowly, although now with
    the orange clown in charge - foreign government entities are
    moving away from software controlled by American companies,
    and I expect the desktop migration rate from windows to linux to
    increase considerably over the next decade.

    Servers, not by a large margin. Linux (and a handful of
    proprietary unix and linux servers, including Z-series) provide most
    of the servers (excepting on-prem exchange, sharepoint
    and AD systems). 60% of compute in Azure, for example,
    are linux cores. The ratio is even larger in google (90% linux)
    oracle and amazon(90% linux) cloud operations.

    In terms of pure server numbers, windows is likely less than
    20% globally.

It seems you got it backwards. The bigger problem with poorly
    documented x86-64 AT&T syntax is on Linux. It's not that AT&T syntax is
    not used at all on Windows, but it's less dominant here.

  • From Michael S@21:1/5 to Scott Lurndal on Tue Jul 1 16:21:44 2025
    On Tue, 01 Jul 2025 13:18:33 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 30 Jun 2025 17:13:51 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that
    don't exist in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.


    And since this is the basis for the ABI design for a processor
    that runs almost all the desktops and servers on the planet,

    Desktops, yes. That's changing slowly, although now with
    the orange clown in charge - foreign government entities are
    moving away from software controlled by American companies,
    and I expect the desktop migration rate from windows to linux to
    increase considerably over the next decade.

    Servers, not by a large margin. Linux (and a handful of
    proprietary unix and linux servers, including Z-series) provide
    most of the servers (excepting on-prem exchange, sharepoint
    and AD systems). 60% of compute in Azure, for example,
    are linux cores. The ratio is even larger in google (90% linux)
    oracle and amazon(90% linux) cloud operations.

    In terms of pure server numbers, windows is likely less than
    20% globally.

    It seems, you got it backward.

    Got what backwards?

    The bigger problem with poorly
    documented x86-64 AT&T syntax is on Linux.

    I've been using x86-64 AT&T syntax since 1989
    (e.g. SVR4). I've never considered it poorly documented.

    It's not that AT&T syntax is
    not used at all on Windows, but it's less dominant here.


    Your claim was that windows runs "almost all the servers on
    the planet", which is clearly incorrect.

  • From Scott Lurndal@21:1/5 to Michael S on Tue Jul 1 13:18:33 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 30 Jun 2025 17:13:51 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that don't
    exist in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.


    And since this is the basis for the ABI design for a processor
    that runs almost all the desktops and servers on the planet,

    Desktops, yes. That's changing slowly, although now with
    the orange clown in charge - foreign government entities are
    moving away from software controlled by American companies,
    and I expect the desktop migration rate from windows to linux to
    increase considerably over the next decade.

    Servers, not by a large margin. Linux (and a handful of
    proprietary unix and linux servers, including Z-series) provide most
    of the servers (excepting on-prem exchange, sharepoint
    and AD systems). 60% of compute in Azure, for example,
    are linux cores. The ratio is even larger in google (90% linux)
    oracle and amazon(90% linux) cloud operations.

    In terms of pure server numbers, windows is likely less than
    20% globally.

    It seems, you got it backward.

    Got what backwards?

    The bigger problem with poorly
    documented x86-64 AT&T syntax is on Linux.

    I've been using x86-64 AT&T syntax since 1989
    (e.g. SVR4). I've never considered it poorly documented.

    It's not that AT&T syntax is
    not used at all on Windows, but it's less dominant here.


    Your claim was that windows runs "almost all the servers on
    the planet", which is clearly incorrect.

  • From Michael S@21:1/5 to Scott Lurndal on Tue Jul 1 16:28:59 2025
    On Tue, 01 Jul 2025 13:18:33 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 30 Jun 2025 17:13:51 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that
    don't exist in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.


    And since this is the basis for the ABI design for a processor
    that runs almost all the desktops and servers on the planet,

    Desktops, yes. That's changing slowly, although now with
    the orange clown in charge - foreign government entities are
    moving away from software controlled by American companies,
    and I expect the desktop migration rate from windows to linux to
    increase considerably over the next decade.

    Servers, not by a large margin. Linux (and a handful of
    proprietary unix and linux servers, including Z-series) provide
    most of the servers (excepting on-prem exchange, sharepoint
    and AD systems). 60% of compute in Azure, for example,
    are linux cores. The ratio is even larger in google (90% linux)
    oracle and amazon(90% linux) cloud operations.

    In terms of pure server numbers, windows is likely less than
    20% globally.

    It seems, you got it backward.

    Got what backwards?

    The bigger problem with poorly
    documented x86-64 AT&T syntax is on Linux.

    I've been using x86-64 AT&T syntax since 1989
    (e.g. SVR4).

    No, you didn't, because x86-64 didn't exist until ~2001.
My impression from reading Eric's post is that AT&T syntax for i386 is comparatively better documented.

    I've never considered it poorly documented.

    It's not that AT&T syntax is
    not used at all on Windows, but it's less dominant here.


    Your claim was that windows runs "almost all the servers on
    the planet", which is clearly incorrect.

    First, the claim was not mine.
    Second, the claim was not about Windows, but about x86-64.

  • From Scott Lurndal@21:1/5 to Michael S on Tue Jul 1 14:08:27 2025
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 01 Jul 2025 13:18:33 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 30 Jun 2025 17:13:51 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that
    don't exist in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.


    And since this is the basis for the ABI design for a processor
    that runs almost all the desktops and servers on the planet,



    The bigger problem with poorly
    documented x86-64 AT&T syntax is on Linux.

    I've been using x86-64 AT&T syntax since 1989
    (e.g. SVR4).

    No, you didn't, because x86-64 didn't exist until ~2001.

    Ah, I meant AT&T syntax, not necessarily related to the
    AMD64 extensions, which I used in 2004
    pretty extensively. Your point, however, is taken.

    I did have a document at the time from AMD that fully documented
    the extensions - I'll have to see if I can dig it up.

    $ grep -i abs boot/*
boot/setup64.S: movabsq $PHYSMAP_BASE, %r12 # Base DVMM virtual address
boot/setup64.S: movabsq $handlerlist, %r11 # List of interrupt handlers
boot/setup64.S: movabsq $debugger, %rdi # Set this
boot/setup64.S: movabsq $_ZN10c_debugger10early_initEv, %rcx # Call ::early_init
    boot/setup64.S: movabsq $__call_constructors, %rcx
    boot/setup64.S: movabsq $dvmm_bsp_start, %rcx # We use an indirect jump to invoke main
    boot/setup64.S: movabsq $dvmm_ap_start, %rcx # AP, use 'dvmm_ap_start'


    Your claim was that windows runs "almost all the servers on
    the planet", which is clearly incorrect.

    First, the claim was not mine.
    Second, the claim was not about Windows, but about x86-64.

    Yes, I see that now.

  • From EricP@21:1/5 to Michael S on Tue Jul 1 11:09:07 2025
    Michael S wrote:
    On Mon, 30 Jun 2025 17:13:51 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that don't
    exist in Intel or AMD docs, LEAQ and MOVABS.
    LEAQ is AT&T syntax for LEA with a "quadword" operand.
    And since this is the basis for the ABI design for a processor
    that runs almost all the desktops and servers on the planet,
    Desktops, yes. That's changing slowly, although now with
    the orange clown in charge - foreign government entities are
    moving away from software controlled by American companies,
    and I expect the desktop migration rate from windows to linux to
    increase considerably over the next decade.

    Servers, not by a large margin. Linux (and a handful of
    proprietary unix and linux servers, including Z-series) provide most
    of the servers (excepting on-prem exchange, sharepoint
    and AD systems). 60% of compute in Azure, for example,
    are linux cores. The ratio is even larger in google (90% linux)
    oracle and amazon(90% linux) cloud operations.

    In terms of pure server numbers, windows is likely less than
    20% globally.

    It seems, you got it backward. The bigger problem with poorly
    documented x86-64 AT&T syntax is on Linux. It's not that AT&T syntax is
    not used at all on Windows, but it's less dominant here.

I meant the x86-64 processor as a planetary class of machines, and was
making the observation that if one is targeting that potential market then
it seems in one's own self-interest to have documentation of a quality
level that, for example, saves everyone from having to disassemble the
code to figure out what it actually does.

    Not that Windows doesn't need lots of disassembly too,
    but back then they already had most of that market.

  • From Anton Ertl@21:1/5 to BGB on Tue Jul 1 15:23:30 2025
    BGB <cr88192@gmail.com> writes:
    On 6/18/2025 1:26 AM, Anton Ertl wrote:
    You can, however compare ARM T32 and A32 in the Debian results:

    bash grep gzip
    595204 107636 46744 armhf ARM T32
    599832 101102 46898 riscv64
    796501 144926 57729 amd64
    829776 134784 56868 arm64
    853892 152068 61124 i386
    891128 158544 68500 armel ARM A32
    892688 168816 64664 s390x
    1020720 170736 71088 mips64el
    1168104 194900 83332 ppc64el

    There may be additional differences between the two ARM 32-bit
    builds, however.

    What I could do relatively easily is to compile a file from gforth
    with different options. The file I used is what is compiled to
    engine/main-fast-ll.o

text size compiler options
    20242 -O2
    18146 -Os
    18146 -Os -march=rv64gc
    18444 -Os -march=rv64gc -mcmodel=medany
    23092 -Os -march=rv64g
    ...
    Not super impressed with the 'C' extension, as it is both a pain to
    decode and also the code size savings tend to be fairly modest.

    That may be the case. If you look at the bottom line, i.e., which
    platform has the smallest text size, in the table above RV64GC
    (riscv64) is #1 for two programs and #2 for one program. Maybe RV64G
    is only modestly worse and may be dense enough for your needs. OTOH,
    if you want to use existing binaries, you have found that many
software distributions compile for RV64GC, and recompiling the
distribution yourself may be more pain than implementing the C
extension.

    IA-64 code density is bad, but one wouldn't expect it to be quite *that*
    bad.

    Maybe around 3-5x bigger than a RISC with 32-bit instructions.

    You find IA-64 results in earlier code density measurements by me.
    E.g., from <2017Aug9.140559@mips.complang.tuwien.ac.at>:

    bash grep gzip
    398384 88084 47944 armhf
    584340 130872 68276 armel
    588972 129096 66892 amd64
    604656 131804 66268 i386
    637620 133868 72712 s390
    638912 140544 71744 sparc
    674912 141120 74032 mipsel
    674912 141168 74112 mips
    680928 139664 74272 powerpc
    688052 150680 75908 s390x
    1539872 322432 158656 ia64

    armel is probably ARM A32 (32-bit instructions), armhf is probably ARM
    T32 (16-bit and 32-bit instructions). ia64 is ~2.5x bigger than armel.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to Anton Ertl on Tue Jul 1 16:08:09 2025
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [Alpha]
    I suspect this is because almost every code or data address would
    require a 6 instruction sequence to load it into a register for use,

    How do you compute that? When I looked at the code produced for
    Alpha, I got the impression that they wanted to support arbitrarily
    large programs and they generated such code by default, but IIRC the
    typical code for loading an absolute addres was by loading it from the
    global table of the current function; so it requires typically 1 load
    (and a 64-bit value in the global table). It also requires setting up
    the global pointer on every function entry and after every call, but
    that can be amortized over several accesses to the global table.

Yes, it is (also) using an extra memory load to pick up large immediates.
It also requires a BAL to get the IP into a register.

    I have finally gotten around to turning on our working Alpha and
    compiled the following program on it:

    Now with a large constant:

    #include <math.h>
    extern int a, b;
    extern long c;
    extern double x, y;
    extern void foo(void);

    int main()
    {
    c = 0x5110de94f393f9ceL;
    foo();
    x = lgamma(y);
    a += signgam;
    a += b;
    return 0;
    }

    I have compiled this and the file containing a, b, c, x, y, and foo()
with gcc -Wall -O -fPIC and then linked the two files with the same
    options.

    The result on Alpha is:

    0000000120000620 <main>:
    120000620: 02 00 bb 27 ldah gp,2(t12)
    120000624: 58 85 bd 23 lda gp,-31400(gp)
    120000628: f0 ff de 23 lda sp,-16(sp)
    12000062c: 00 00 5e b7 stq ra,0(sp)
    120000630: fe ff 3d 24 ldah t0,-2(gp)
    120000634: f8 7c 41 a4 ldq t1,31992(t0)
    120000638: 78 80 3d 20 lda t0,-32648(gp)
    12000063c: 00 00 41 b4 stq t1,0(t0)
    120000640: 00 00 fe 2f unop
    120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
    120000648: 00 00 fe 2f unop
    12000064c: 00 00 fe 2f unop
    120000650: 68 80 3d 20 lda t0,-32664(gp)
    120000654: 00 00 01 8e ldt $f16,0(t0)
    120000658: 08 80 7d a7 ldq t12,-32760(gp)
    12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
    120000660: 02 00 ba 27 ldah gp,2(ra)
    120000664: 18 85 bd 23 lda gp,-31464(gp)
    120000668: 60 80 3d 20 lda t0,-32672(gp)
    12000066c: 00 00 01 9c stt $f0,0(t0)
    120000670: 58 80 7d 20 lda t2,-32680(gp)
    120000674: 00 00 43 a0 ldl t1,0(t2)
    120000678: 10 80 3d a4 ldq t0,-32752(gp)
    12000067c: 00 00 21 a0 ldl t0,0(t0)
    120000680: 02 04 41 40 addq t1,t0,t1
    120000684: 5c 80 3d 20 lda t0,-32676(gp)
    120000688: 00 00 21 a0 ldl t0,0(t0)
    12000068c: 02 04 41 40 addq t1,t0,t1
    120000690: 00 00 43 b0 stl t1,0(t2)
    120000694: 00 04 ff 47 clr v0
    120000698: 00 00 5e a7 ldq ra,0(sp)
    12000069c: 10 00 de 23 lda sp,16(sp)
    1200006a0: 01 80 fa 6b ret

    The load of signgam is achieved with the following sequence

    120000678: 10 80 3d a4 ldq t0,-32752(gp)
    12000067c: 00 00 21 a0 ldl t0,0(t0)

    I.e., 2 instructions, not 6.

    It's interesting that a, b, x, y (which end up in the same linked unit
    as main()) result in code like

    120000660: 58 80 7d 20 lda t2,-32680(gp)
    120000664: 00 00 43 a0 ldl t1,0(t2)

    which could be implemented more efficiently as

    ldl t1,-32680(gp)

    but apparently the linker just fixes up the first instruction of the
    pair (either as ldq or as lda, maybe also as ldah), and maybe the
    offset of the second instruction (but not in this example); the
    benefit is that the linker just has to replace some instructions, but
    it does not have to shrink or expand the code (which would require
    changing even more instructions).

    We also see that the call to foo() within the same linked unit is
    linked as

    120000630: 00 00 fe 2f unop
    120000634: 17 00 40 d3 bsr ra,120000694 <foo>
    120000638: 00 00 fe 2f unop
    12000063c: 00 00 fe 2f unop

    whereas the call to lgamma() in a shared library is linked as

    120000648: 08 80 7d a7 ldq t12,-32760(gp)
    12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
    120000650: 02 00 ba 27 ldah gp,2(ra)
    120000654: 18 85 bd 23 lda gp,-31464(gp)

    One can see again that both code sequences have the same size, to
    avoid shrinking or expanding.
    ...
For example, for Alpha to load a 64-bit constant requires 6 instructions,
24 bytes.

    The load of the large constant looks as follows:

    120000630: fe ff 3d 24 ldah t0,-2(gp)
    120000634: f8 7c 41 a4 ldq t1,31992(t0)

    Two instructions (8 bytes) plus 8 bytes of data. Interestingly, while
    the global pointer points to $120018B78 (99672 bytes after the start
    of main()) if I compute it correctly, the constant is placed at
$120000870 (99080 bytes before the gp, and 592 bytes after the start
    of main(), and it's on the same 8KB page as main(); so in this case,
    there is indeed a DTLB entry that points to the same page as an ITLB
    entry.

    However, the code, rodata, and data could be on separate pages, and
    AFAICS, these pages could be close enough to each other to make the
    ldah instructions unnecessary. And if we look at the RISC-V code,
    that's what happens there.

    That sequence is too large so they are pretty much forced
    to use an extra LDQ to pull the offset from the constant table
    located just prior to the routine entry point and requires an extra
    BAL to copy the RIP into a register as a base.

    The constant is actually located behind the code (not just the code
    for main(), but all the code), and there is no BAL (actually, Alpha
    has no instruction named BAL; do you mean BSR?).

By making it a priority for relatively cheap access to the full 64-bit
address space during an ISA design, what alternatives might have minimized
its extra cost?

    It's interesting to look at how the same C code comes out on other
    instruction sets. You can find the source code and disassembly output
    for main() for Alpha, AMD64, ARM A64, and RV64GC (both default and
    with -mcmodel=medany) on <http://www.complang.tuwien.ac.at/anton/memory-references/>. I'll
    present RISC-V in full and highlights from the others.

    Here's the output for RISC-V with -mcmodel=medany:

    0000000000010570 <main>:
    10570: 1141 addi sp,sp,-16
    10572: e406 sd ra,8(sp)
    10574: 00002797 auipc a5,0x2
    10578: ab47b783 ld a5,-1356(a5) # 12028 <__SDATA_BEGIN__>
    1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
    10580: 02c000ef jal ra,105ac <foo>
    10584: 8401b507 fld fa0,-1984(gp) # 12040 <y>
    10588: f29ff0ef jal ra,104b0 <lgamma@plt>
    1058c: 84a1b427 fsd fa0,-1976(gp) # 12048 <x>
    10590: 85418713 addi a4,gp,-1964 # 12054 <a>
    10594: 431c lw a5,0(a4)
    10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
    1059a: 9fb5 addw a5,a5,a3
    1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
    105a0: 9fb5 addw a5,a5,a3
    105a2: c31c sw a5,0(a4)
    105a4: 4501 li a0,0
    105a6: 60a2 ld ra,8(sp)
    105a8: 0141 addi sp,sp,16
    105aa: 8082 ret

    Here the constant is located at 0x12028, and with -mcmodel=medany,
    this is addressed using auipc (add upper immediate to PC), whereas
    with the default model it is located at the same address, but
    addressed using gp.

    The global variables are accessed through gp using a single
    instruction in nearly all cases, e.g.:

    1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>

    This includes signgam:

    10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>

It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).

    The medany variant accesses a by first computing its address in a4,
    Alpha style:

    10590: 85418713 addi a4,gp,-1964 # 12054 <a>
    10594: 431c lw a5,0(a4)
    ...
    105a2: c31c sw a5,0(a4)

    By contrast, the default variant accesses a directly through the gp:

    1058c: 8541a783 lw a5,-1964(gp) # 12054 <a>
    ...
    1059c: 84f1aa23 sw a5,-1964(gp) # 12054 <a>

    It's not clear why the memory model should make a difference here, but
    it does.

    The two calls are both using pc-relative jal instead or
    register-indirect jalr. This works for lgamma() by generating code
    for a trampoline in the binary that contains main():

    00000000000104b0 <lgamma@plt>:
    104b0: 00002e17 auipc t3,0x2
    104b4: b60e3e03 ld t3,-1184(t3) # 12010 <lgamma@GLIBC_2.27>
    104b8: 000e0367 jalr t1,t3

    RISC-V is quite similar to Alpha, yet produces much more compact code.
    My guess is that the linker actually does growing or shrinking here
    (and I have read complaints about the slowness of RISC-V linking), and
    this pays off in the instruction count and code size.

    ARM A64 synthesizes the large constant instead of loading it from
    memory:

    894: d29f39c1 mov x1, #0xf9ce // #63950
    898: f2be7261 movk x1, #0xf393, lsl #16
    89c: f2dbd281 movk x1, #0xde94, lsl #32
    8a0: f2ea2201 movk x1, #0x5110, lsl #48

    The accesses to the global variables are quite long-winded, e.g., here
    we have an access to b or signgam:

    8d4: 90000082 adrp x2, 10000 <__FRAME_END__+0xf510>
    8d8: f947ec42 ldr x2, [x2, #4056]
    8dc: b9400042 ldr w2, [x2]

    The calls work as on RISC-V.

    On AMD64 the constant is loaded as follows:

    113d: 48 b8 ce f9 93 f3 94 movabs $0x5110de94f393f9ce,%rax
    1144: de 10 51

    The calls seem to be handled as on RISC-V. The global variables are
    accessed using rip-relative addressing:

    1168: 8b 05 c2 2e 00 00 mov 0x2ec2(%rip),%eax # 4030 <__signgam@GLIBC_2.23>

Again signgam is located close to a and c.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Jul 1 20:03:04 2025
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    --------------
    Here the constant is located at 0x12028, and with -mcmodel=medany,
    this is addressed using auipc (add upper immediate to PC), whereas
    with the default model it is located at the same address, but
    addressed using gp.

    The global variables are accessed through gp using a single
    instruction in nearly all cases, e.g.:

    1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>

    This includes signgam:

    10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>

    It's unclear to me how signgam ends up right besides c, even though it
    is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).

Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?


    - anton

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jul 1 21:07:13 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    --------------
    Here the constant is located at 0x12028, and with -mcmodel=medany,
    this is addressed using auipc (add upper immediate to PC), whereas
    with the default model it is located at the same address, but
    addressed using gp.

    The global variables are accessed through gp using a single
    instruction in nearly all cases, e.g.:

    1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>

    This includes signgam:

    10596: 8601a683 lw a3,-1952(gp) # 12060
    <__signgam@@GLIBC_2.27>

    It's unclear to me how signgam ends up right besides c, even though it
    is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).

Is there any way other than having the static linker preassign
extern variables to static-link-resolved addresses?

    The problem with that approach is that it assumes that the shared library
    you access at run-time is the exact same one you linked your binary against.

    That's not necessarily the case - so long as the function
    signatures/API don't change, new versions of the shared library will
    be backward compatible with applications linked against earlier
    versions. So the data section requirements for the shared library
    could change after the application is linked if new data section
    symbols are defined in the newer version of the shared library.

    The run-time loader will know how much data space has been allocated
    to the executable itself, and will append (and relocate corresponding
    references in the shared object and executable) the '.data' sections
    from each dynamic library loaded by the application at the time the
    library is loaded - which may be at startup or via dlopen().

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue Jul 1 21:49:47 2025
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    ---------------
    Now with a large constant:

    #include <math.h>
    extern int a, b;
    extern long c;
    extern double x, y;
    extern void foo(void);

    int main()
    {
    c = 0x5110de94f393f9ceL;
    foo();
    x = lgamma(y);
    a += signgam;
    a += b;
    return 0;
    }

    I have compiled this and the file containing a, b, c, x, y, and foo()
    with gcc -Wall -O -fPIC and then linked the two files with the same
    options.

    The result on Alpha is:

    0000000120000620 <main>:
    120000620: 02 00 bb 27 ldah gp,2(t12)
    120000624: 58 85 bd 23 lda gp,-31400(gp)
    120000628: f0 ff de 23 lda sp,-16(sp)
    12000062c: 00 00 5e b7 stq ra,0(sp)
    120000630: fe ff 3d 24 ldah t0,-2(gp)
    120000634: f8 7c 41 a4 ldq t1,31992(t0)
    120000638: 78 80 3d 20 lda t0,-32648(gp)
    12000063c: 00 00 41 b4 stq t1,0(t0)
    120000640: 00 00 fe 2f unop
    120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
    120000648: 00 00 fe 2f unop
    12000064c: 00 00 fe 2f unop
    120000650: 68 80 3d 20 lda t0,-32664(gp)
    120000654: 00 00 01 8e ldt $f16,0(t0)
    120000658: 08 80 7d a7 ldq t12,-32760(gp)
    12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
    120000660: 02 00 ba 27 ldah gp,2(ra)
    120000664: 18 85 bd 23 lda gp,-31464(gp)
    120000668: 60 80 3d 20 lda t0,-32672(gp)
    12000066c: 00 00 01 9c stt $f0,0(t0)
    120000670: 58 80 7d 20 lda t2,-32680(gp)
    120000674: 00 00 43 a0 ldl t1,0(t2)
    120000678: 10 80 3d a4 ldq t0,-32752(gp)
    12000067c: 00 00 21 a0 ldl t0,0(t0)
    120000680: 02 04 41 40 addq t1,t0,t1
    120000684: 5c 80 3d 20 lda t0,-32676(gp)
    120000688: 00 00 21 a0 ldl t0,0(t0)
    12000068c: 02 04 41 40 addq t1,t0,t1
    120000690: 00 00 43 b0 stl t1,0(t2)
    120000694: 00 04 ff 47 clr v0
    120000698: 00 00 5e a7 ldq ra,0(sp)
    12000069c: 10 00 de 23 lda sp,16(sp)
    1200006a0: 01 80 fa 6b ret

    ------------------
    Here's the output for RISC-V with -mcmodel=medany:

    0000000000010570 <main>:
    10570: 1141 addi sp,sp,-16
    10572: e406 sd ra,8(sp)
    10574: 00002797 auipc a5,0x2
    10578: ab47b783 ld a5,-1356(a5) # 12028 <__SDATA_BEGIN__>
    1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
    10580: 02c000ef jal ra,105ac <foo>
    10584: 8401b507 fld fa0,-1984(gp) # 12040 <y>
    10588: f29ff0ef jal ra,104b0 <lgamma@plt>
    1058c: 84a1b427 fsd fa0,-1976(gp) # 12048 <x>
    10590: 85418713 addi a4,gp,-1964 # 12054 <a>
    10594: 431c lw a5,0(a4)
    10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
    1059a: 9fb5 addw a5,a5,a3
    1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
    105a0: 9fb5 addw a5,a5,a3
    105a2: c31c sw a5,0(a4)
    105a4: 4501 li a0,0
    105a6: 60a2 ld ra,8(sp)
    105a8: 0141 addi sp,sp,16
    105aa: 8082 ret

    --------------------
    My 66000::
    main: ; @main
    enter r0,r0,0,0
    std #5841413448022620622,[ip,c]
    call foo
    ldd r1,[ip,y]
    call lgamma
    std r1,[ip,x]
    call __signgam
    lduw r1,[r1]
    lduw r2,[ip,a]
    add r1,r2,r1
    lduw r2,[ip,b]
    add r1,r1,r2
    stw r1,[ip,a]
    mov r1,#0
    exit r0,r0,0,0

    - anton

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Jul 1 23:20:46 2025
    On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    --------------
    Here the constant is located at 0x12028, and with -mcmodel=medany,
    this is addressed using auipc (add upper immediate to PC), whereas
    with the default model it is located at the same address, but
    addressed using gp.

    The global variables are accessed through gp using a single
    instruction in nearly all cases, e.g.:

    1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
    This includes signgam:

    10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>

    It's unclear to me how signgam ends up right beside c, even though it
    is defined in libm.so.6, i.e., in a separate binary (and I have
    checked that libm is actually linked dynamically).

    Is there any way other than having the static linker preassign
    extern variables to statically resolved addresses?

    The problem with that approach is that it assumes that the shared
    library you access at run-time is the exact same one you linked your
    binary against.

    I am well aware of that.

    But is there any way for the code to be emitted without indirection,
    using standard ISA displacement fields, without those being resolved
    by the linker (ld), and still remain PIC ?!?

    That's not necessarily the case - so long as the function
    signatures/API don't change, new versions of the shared library will
    be backward compatible with applications linked against earlier
    versions. So the data section requirements for the shared library
    could change after the application is linked if new data section
    symbols are defined in the newer version of the shared library.

    The run-time loader will know how much data space has been allocated
    to the executable itself, and will append (and relocate corresponding
    references in the shared object and executable) the '.data' sections
    from each dynamic library loaded by the application at the time the
    library is loaded - which may be at startup or via dlopen().

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jul 1 23:26:05 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    ---------------
    Now with a large constant:

    #include <math.h>
    extern int a, b;
    extern long c;
    extern double x, y;
    extern void foo(void);

    int main()
    {
    c = 0x5110de94f393f9ceL;
    foo();
    x = lgamma(y);
    a += signgam;
    a += b;
    return 0;
    }

    I have compiled this and the file containing a, b, c, x, y, and foo()
    with gcc -Wall -O -fPIC and then linked the two files with the same
    options.

    The result on Alpha is:

    0000000120000620 <main>:
    120000620: 02 00 bb 27 ldah gp,2(t12)
    120000624: 58 85 bd 23 lda gp,-31400(gp)
    120000628: f0 ff de 23 lda sp,-16(sp)
    12000062c: 00 00 5e b7 stq ra,0(sp)
    120000630: fe ff 3d 24 ldah t0,-2(gp)
    120000634: f8 7c 41 a4 ldq t1,31992(t0)
    120000638: 78 80 3d 20 lda t0,-32648(gp)
    12000063c: 00 00 41 b4 stq t1,0(t0)
    120000640: 00 00 fe 2f unop
    120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
    120000648: 00 00 fe 2f unop
    12000064c: 00 00 fe 2f unop
    120000650: 68 80 3d 20 lda t0,-32664(gp)
    120000654: 00 00 01 8e ldt $f16,0(t0)
    120000658: 08 80 7d a7 ldq t12,-32760(gp)
    12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
    120000660: 02 00 ba 27 ldah gp,2(ra)
    120000664: 18 85 bd 23 lda gp,-31464(gp)
    120000668: 60 80 3d 20 lda t0,-32672(gp)
    12000066c: 00 00 01 9c stt $f0,0(t0)
    120000670: 58 80 7d 20 lda t2,-32680(gp)
    120000674: 00 00 43 a0 ldl t1,0(t2)
    120000678: 10 80 3d a4 ldq t0,-32752(gp)
    12000067c: 00 00 21 a0 ldl t0,0(t0)
    120000680: 02 04 41 40 addq t1,t0,t1
    120000684: 5c 80 3d 20 lda t0,-32676(gp)
    120000688: 00 00 21 a0 ldl t0,0(t0)
    12000068c: 02 04 41 40 addq t1,t0,t1
    120000690: 00 00 43 b0 stl t1,0(t2)
    120000694: 00 04 ff 47 clr v0
    120000698: 00 00 5e a7 ldq ra,0(sp)
    12000069c: 10 00 de 23 lda sp,16(sp)
    1200006a0: 01 80 fa 6b ret

    ------------------
    Here's the output for RISC-V with -mcmodel=medany:

    0000000000010570 <main>:
    10570: 1141 addi sp,sp,-16
    10572: e406 sd ra,8(sp)
    10574: 00002797 auipc a5,0x2
    10578: ab47b783 ld a5,-1356(a5) # 12028 <__SDATA_BEGIN__>
    1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
    10580: 02c000ef jal ra,105ac <foo>
    10584: 8401b507 fld fa0,-1984(gp) # 12040 <y>
    10588: f29ff0ef jal ra,104b0 <lgamma@plt>
    1058c: 84a1b427 fsd fa0,-1976(gp) # 12048 <x>
    10590: 85418713 addi a4,gp,-1964 # 12054 <a>
    10594: 431c lw a5,0(a4)
    10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
    1059a: 9fb5 addw a5,a5,a3
    1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
    105a0: 9fb5 addw a5,a5,a3
    105a2: c31c sw a5,0(a4)
    105a4: 4501 li a0,0
    105a6: 60a2 ld ra,8(sp)
    105a8: 0141 addi sp,sp,16
    105aa: 8082 ret

    --------------------
    My 66000::
    main: ; @main
    enter r0,r0,0,0
    std #5841413448022620622,[ip,c]
    call foo
    ldd r1,[ip,y]
    call lgamma
    std r1,[ip,x]
    call __signgam

    __signgam is an "int" variable in the shared library, not a function.

    What is the purpose of 'call' here?

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jul 2 00:04:55 2025
    On Tue, 1 Jul 2025 23:26:05 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    ---------------
    Now with a large constant:

    #include <math.h>
    extern int a, b;
    extern long c;
    extern double x, y;
    extern void foo(void);

    int main()
    {
    c = 0x5110de94f393f9ceL;
    foo();
    x = lgamma(y);
    a += signgam;
    a += b;
    return 0;
    }

    I have compiled this and the file containing a, b, c, x, y, and foo()
    with gcc -Wall -O -fPIC and then linked the two files with the same
    options.

    The result on Alpha is:

    0000000120000620 <main>:
    120000620: 02 00 bb 27 ldah gp,2(t12)
    120000624: 58 85 bd 23 lda gp,-31400(gp)
    120000628: f0 ff de 23 lda sp,-16(sp)
    12000062c: 00 00 5e b7 stq ra,0(sp)
    120000630: fe ff 3d 24 ldah t0,-2(gp)
    120000634: f8 7c 41 a4 ldq t1,31992(t0)
    120000638: 78 80 3d 20 lda t0,-32648(gp)
    12000063c: 00 00 41 b4 stq t1,0(t0)
    120000640: 00 00 fe 2f unop
    120000644: 17 00 40 d3 bsr ra,1200006a4 <foo>
    120000648: 00 00 fe 2f unop
    12000064c: 00 00 fe 2f unop
    120000650: 68 80 3d 20 lda t0,-32664(gp)
    120000654: 00 00 01 8e ldt $f16,0(t0)
    120000658: 08 80 7d a7 ldq t12,-32760(gp)
    12000065c: 00 40 5b 6b jsr ra,(t12),120000660 <main+0x40>
    120000660: 02 00 ba 27 ldah gp,2(ra)
    120000664: 18 85 bd 23 lda gp,-31464(gp)
    120000668: 60 80 3d 20 lda t0,-32672(gp)
    12000066c: 00 00 01 9c stt $f0,0(t0)
    120000670: 58 80 7d 20 lda t2,-32680(gp)
    120000674: 00 00 43 a0 ldl t1,0(t2)
    120000678: 10 80 3d a4 ldq t0,-32752(gp)
    12000067c: 00 00 21 a0 ldl t0,0(t0)
    120000680: 02 04 41 40 addq t1,t0,t1
    120000684: 5c 80 3d 20 lda t0,-32676(gp)
    120000688: 00 00 21 a0 ldl t0,0(t0)
    12000068c: 02 04 41 40 addq t1,t0,t1
    120000690: 00 00 43 b0 stl t1,0(t2)
    120000694: 00 04 ff 47 clr v0
    120000698: 00 00 5e a7 ldq ra,0(sp)
    12000069c: 10 00 de 23 lda sp,16(sp)
    1200006a0: 01 80 fa 6b ret

    ------------------
    Here's the output for RISC-V with -mcmodel=medany:

    0000000000010570 <main>:
    10570: 1141 addi sp,sp,-16
    10572: e406 sd ra,8(sp)
    10574: 00002797 auipc a5,0x2
    10578: ab47b783 ld a5,-1356(a5) # 12028 <__SDATA_BEGIN__>
    1057c: 84f1bc23 sd a5,-1960(gp) # 12058 <c>
    10580: 02c000ef jal ra,105ac <foo>
    10584: 8401b507 fld fa0,-1984(gp) # 12040 <y>
    10588: f29ff0ef jal ra,104b0 <lgamma@plt>
    1058c: 84a1b427 fsd fa0,-1976(gp) # 12048 <x>
    10590: 85418713 addi a4,gp,-1964 # 12054 <a>
    10594: 431c lw a5,0(a4)
    10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>
    1059a: 9fb5 addw a5,a5,a3
    1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>
    105a0: 9fb5 addw a5,a5,a3
    105a2: c31c sw a5,0(a4)
    105a4: 4501 li a0,0
    105a6: 60a2 ld ra,8(sp)
    105a8: 0141 addi sp,sp,16
    105aa: 8082 ret

    --------------------
    My 66000::
    main: ; @main
    enter r0,r0,0,0
    std #5841413448022620622,[ip,c]
    call foo
    ldd r1,[ip,y]
    call lgamma
    std r1,[ip,x]
    call __signgam

    __signgam is an "int" variable in the shared library, not a function.

    What is the purpose of 'call' here?

    IBM's source code for signgam is:

    #define _XOPEN_SOURCE
    #include <math.h>

    int *__signgam(void);
    #define signgam (*__signgam())

    Which is what we fed into the compiler.

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Jul 2 05:18:53 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    IBM's source code for signgam is:

    #define _XOPEN_SOURCE
    #include <math.h>

    int *__signgam(void);
    #define signgam (*__signgam())

    Which is what we fed into the compiler.

    The Linux systems on which I compiled the code used glibc, and that defines

    extern int signgam;

    in <math.h>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Jul 2 05:25:57 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    --------------
    Here the constant is located at 0x12028, and with -mcmodel=medany,
    this is addressed using auipc (add upper immediate to PC), whereas
    with the default model it is located at the same address, but
    addressed using gp.

    The global variables are accessed through gp using a single
    instruction in nearly all cases, e.g.:

    1059c: 8501a683 lw a3,-1968(gp) # 12050 <b>

    This includes signgam:

    10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>

    It's unclear to me how signgam ends up right beside c, even though it
    is defined in libm.so.6, i.e., in a separate binary (and I have
    checked that libm is actually linked dynamically).

    Is there any way other than having the static linker preassign
    extern variables to statically resolved addresses?

    My expectation was that the signgam variable would be in (the ELF
    equivalent of) the data or bss segment of libm.so.6, and would
    therefore need some indirection to access.

    When I do

    objdump -T /lib64/lp64d/libm.so.6|grep signgam

    I get

    000000000008b0a4 g    DO .bss  0000000000000004  GLIBC_2.27 __signgam
    000000000008b0a4 w    DO .bss  0000000000000004  GLIBC_2.27 signgam

    So signgam is a weak symbol (w) that is neither global nor local, is
    not a constructor, not a warning, not indirect or evaluated during
    reloc processing, is a dynamic symbol (D) and an object (O).

    Not sure what the global __signgam has to do with it. When I do

    objdump -R /lib64/lp64d/libm.so.6|grep signgam

    (dynamic relocation information) I see

    000000000008b080 R_RISCV_64 __signgam@@GLIBC_2.27

    which is exactly the name we see in the disassembly code above.

    So now my theory is that libm.so.6 contains information that tells the
    linker of any executable that links to it to create the variable in
    the executable, and the library itself uses indirection to access that
    variable, and during dynamic linking, this indirection is initialized.
    If my theory is correct, any future versions of glibc will have to use
    a compatible mechanism for implementing signgam.

    In the glibc version used on our ancient Alpha, this mechanism has not
    been used, so there we saw the indirection.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From EricP@21:1/5 to All on Wed Jul 2 08:06:20 2025
    MitchAlsup1 wrote:
    On Mon, 30 Jun 2025 16:51:13 +0000, EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that don't
    exist in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.

    Yes, and an LEA instruction which just calculates an address
    needs an operand size suffix... why?

    LEA need to distinguish between::
    LEA Rd,[Rb+DISP16]
    LEA Rd,[Rb+Ri<<s]
    LEA Rd,[Rb+DISP32]
    LEA Rd,[Rb+DISP64]
    LEA Rd,[Rb+Ri<<s+DISP32]
    LEA Rd,[Rb+Ri<<s+DISP64]

    In 64-bit mode there is no disp64, just disp8 and disp32,
    There were no spare bits in the ModRM byte to indicate it.

    Had there been disp64 then x64 could have had a smooth expansion
    of address calculations into 64-bit space.

    AMD worked around the ModRM limitation to provide at least some way to
    access all of 64-bit address space. It did so by adding MOV opcodes,
    to load an imm64, and to LD/ST memory using abs64 addresses.

    And since those MOV's are different opcodes and not part of ModRM,
    LEA does not know about those 64-bit imm64 or abs64 values and cannot
    use them in general address calculations.

    And all of the various compiler code models and their addressing
    limitations follow from these discontinuities in addressing behavior.

  • From EricP@21:1/5 to Anton Ertl on Wed Jul 2 11:20:09 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [Alpha]
    I suspect this is because almost every code or data address would
    require a 6 instruction sequence to load it into a register for use,
    How do you compute that? When I looked at the code produced for
    Alpha, I got the impression that they wanted to support arbitrarily
    large programs and they generated such code by default, but IIRC the
    typical code for loading an absolute addres was by loading it from the
    global table of the current function; so it requires typically 1 load
    (and a 64-bit value in the global table). It also requires setting up
    the global pointer on every function entry and after every call, but
    that can be amortized over several accesses to the global table.
    Yes, it is (also) using an extra memory load to pick up large immediates.
    It also requires a BAL to get the IP into a register.

    I have finally gotten around to turning on our working Alpha and
    compiled the following program on it:

    #include <math.h>
    extern int a, b;
    extern double x, y;
    extern void foo(void);

    int main()
    {
    foo();
    x = lgamma(y);
    a += signgam;
    a += b;
    return 0;
    }

    I have compiled this and the file containing a, b, x, y, and foo with
    gcc -Wall -O -fPIC and then linked the two files with the same options.

    The result on Alpha is:

    GCC Alpha manual says that the limit with -mlarge-data (the default)
    is 2 GB of data. Larger data must use mmap or malloc.
    -mlarge-text (the default) is 4 MB code.

    GCC has no ability to generate access to Alpha's full 64-bit address space
    so there is no comparison with other ISA's.

    Perhaps using A64 or RV64 would be better examples.


    0000000120000620 <main>:
    120000620: 02 00 bb 27 ldah gp,2(t12)
    120000624: 48 85 bd 23 lda gp,-31416(gp)
    120000628: f0 ff de 23 lda sp,-16(sp)
    12000062c: 00 00 5e b7 stq ra,0(sp)
    120000630: 00 00 fe 2f unop
    120000634: 17 00 40 d3 bsr ra,120000694 <foo>
    120000638: 00 00 fe 2f unop
    12000063c: 00 00 fe 2f unop
    120000640: 68 80 3d 20 lda t0,-32664(gp)
    120000644: 00 00 01 8e ldt $f16,0(t0)
    120000648: 08 80 7d a7 ldq t12,-32760(gp)
    12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
    120000650: 02 00 ba 27 ldah gp,2(ra)
    120000654: 18 85 bd 23 lda gp,-31464(gp)
    120000658: 60 80 3d 20 lda t0,-32672(gp)
    12000065c: 00 00 01 9c stt $f0,0(t0)
    120000660: 58 80 7d 20 lda t2,-32680(gp)
    120000664: 00 00 43 a0 ldl t1,0(t2)
    120000668: 10 80 3d a4 ldq t0,-32752(gp)
    12000066c: 00 00 21 a0 ldl t0,0(t0)
    120000670: 02 04 41 40 addq t1,t0,t1
    120000674: 5c 80 3d 20 lda t0,-32676(gp)
    120000678: 00 00 21 a0 ldl t0,0(t0)
    12000067c: 02 04 41 40 addq t1,t0,t1
    120000680: 00 00 43 b0 stl t1,0(t2)
    120000684: 00 04 ff 47 clr v0
    120000688: 00 00 5e a7 ldq ra,0(sp)
    12000068c: 10 00 de 23 lda sp,16(sp)
    120000690: 01 80 fa 6b ret

    The load of signgam is achieved with the following sequence

    120000668: 10 80 3d a4 ldq t0,-32752(gp)
    12000066c: 00 00 21 a0 ldl t0,0(t0)

    I.e., 2 instructions, not 6.

    Yes, that is the same GOT-table indirect two-load sequence as x64.

    I wasn't saying it had to use 6 instructions. I'm saying that if Alpha
    wanted full access to its 64-bit address space, then its options
    are either to use 6 instructions to build a 64-bit immediate
    OR do two loads (maybe plus other overhead). Both those options are poor.

    RV64 and A64 are in a similar boat.

    I wanted an ISA option that doesn't need two dependent LD's or
    6 instructions to access the whole 64-bit address space.

    It's interesting that a, b, x, y (which end up in the same linked unit
    as main()) result in code like

    120000660: 58 80 7d 20 lda t2,-32680(gp)
    120000664: 00 00 43 a0 ldl t1,0(t2)

    which could be implemented more efficiently as

    ldl t1,-32680(gp)

    but apparently the linker just fixes up the first instruction of the
    pair (either as ldq or as lda, maybe also as ldah), and maybe the
    offset of the second instruction (but not in this example); the
    benefit is that the linker just has to replace some instructions, but
    it does not have to shrink or expand the code (which would require
    changing even more instructions).

    Nop'ing the first and using the second would be better because it doesn't
    use a temp register and hardware can optimize a unop away.

    We also see that the call to foo() within the same linked unit is
    linked as

    120000630: 00 00 fe 2f unop
    120000634: 17 00 40 d3 bsr ra,120000694 <foo>
    120000638: 00 00 fe 2f unop
    12000063c: 00 00 fe 2f unop

    whereas the call to lgamma() in a shared library is linked as

    120000648: 08 80 7d a7 ldq t12,-32760(gp)
    12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
    120000650: 02 00 ba 27 ldah gp,2(ra)
    120000654: 18 85 bd 23 lda gp,-31464(gp)

    One can see again that both code sequences have the same size, to
    avoid shrinking or expanding.

    I wondered about those nop's.

    I would be cautious about depending on code expansion for optimization.
    I read some online remarks about a developer adding "relaxation" to the
    RV64 linker and causing it to take over an hour to run.
    It is possible that the basic relaxation algorithm is factorial, O(n!)
    (it has that smell to me).

    https://sourceware.org/binutils/docs/as/Xtensa-Relaxation.html#index-relaxation

    Always starting with largest size and compacting down may be best.
    That's why I was investigating the compacting linker algorithm.
    If my ISA is going to depend on it for optimization, I want to
    make sure it could be implemented easily and would not have
    uncontrollable pathological behavior. And it does look acceptable.

    The gcc manual says about these:

    | When '-msmall-data' is used,
    | the compiler can assume that all local symbols share the same '$gp'
    | value, and thus reduce the number of instructions required for a
    | function call from 4 to 1.

    We see the 4 and 1 instructions above, but it's not clear to me that
    there is a real benefit. The compiler cannot assume that an external
    reference is local, and the linker knows, but does not benefit from
    it. And for references within a compilation unit, I would hope that
    the compiler/assembler manages to use the smallest variant based on
    actual size.

    Yes but they also have their 2 GB limit which avoids all large
    address space 'issues' (i.e., fobs them off onto the programmer).

    The question is what happens to the Alpha code (or RV64 or A64) when
    you remove the address space compile limits and go for the Full Monty.


    Why burden all programs with the costs of large programs the way it
    is done by default on Alpha?
    ....
    I'm not saying there shouldn't be optimizations for smaller sizes.
    I'm pointing to the fact that to actually USE the 64-bit address space
    there is a large increase in code size and execute cost,
    and asking if that had to be so.

    I can use 100 GB arrays with code that is the same size as code that
    limits itself to the lower 2GB of address space (there is an option on
    Alpha compilers and linkers for that).

    100 GB is 37 bits of address. Where does that 37 number come from?
    And the manual says the Alpha data limit is 2 GB.

    For example, for Alpha to load a 64-bit constant requires 6 instructions,
    24 bytes.

    I forgot to add this to the program, maybe tomorrow.

    Or pull it from the constant table just prior to the routine entry.

    Long ago I read Alpha code standard puts constants into a table just
    before the routine entry, does a BAL rTmp,+0 to copy the PC into rTmp,
    then can access the constants in the table at negative rTmp offsets.

    That sequence is too large so they are pretty much forced
    to use an extra LDQ to pull the offset from the constant table
    located just prior to the routine entry point and requires an extra
    BAL to copy the RIP into a register as a base.

    That's not necessary. The global pointer is derived from the function address (in t12) on entry to the function and from the return address
    (in ra) after a jsr.

    Yes, I see that now.
    t12 is the link register specified by the caller in its JAL to above code.
    That saves it a BAL rTmp,+0 to copy the PC as a PC-rel base.

    The LDQ touches the same address space as the code but now as data
    so it has to load the D-TLB with an entry redundant with the I-TLB,
    and bring in a data cache line with the constants.

    No, the global table is elsewhere; in the case above it's about
    96KB behind the start of main(). The text ends a few hundred bytes
    later, so there is no page that contains both code and data (i.e., no
    TLB entries that describe the same page, not that this would be a
    problem).

    I'm referring to the routine's constants table I describe above.

    And after the constant is loaded it must be manually added to the base
    because there is no LD/ST combined with a scaled index.

    Which base? You were only mentioning constants up to now.

    The base for the constant table is the PC copied by the BAL+0.

    Furthermore, the actual load or store of the target value is serially
    dependent on the LDQ offset and the ADD. Back when the load-to-use
    latency for a cache hit was 1 clock that might look OK, but now that
    it is 3 or 4 clocks it is a serious penalty.

    This century, you use an OoO CPU (even the last Alpha was OoO), and
    the 4-5 clocks latency of a load are added to the ready time of the
    base address, i.e., gp in this case. gp only changes on far calls, so
    loading from a gp-relative address is rarely in the critical
    dependence path.

    We are comparing apples and oranges because GCC for Alpha only supports
    direct access to a restricted 2 GB address space.

    Probably the reason they don't support the direct 64-bit access is
    (a) all the things I said would happen would, and
    (b) back then no one would have had code requiring such large data sets
    so no backwards compatibility issues, and
    (c) by the time programmers wanted >2GB data sets the Alpha was dead.

    By making it a priority for relatively cheap access to the full 64-bit
    address space during an ISA design, what alternatives might have
    minimized its extra cost?

    It would certainly be an interesting experiment to see how much size
    and speed difference we would get if we eliminated the "mov r64,
    imm64" instruction when compiling to AMD64 and used Alpha-like
    techniques instead. My guess is: barely measurable.

    - anton

    It's probably better to use A64 and RV64 for such comparisons
    as at least their active market might drive compiler enhancements.

  • From MitchAlsup1@21:1/5 to EricP on Wed Jul 2 15:38:26 2025
    On Wed, 2 Jul 2025 12:06:20 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Mon, 30 Jun 2025 16:51:13 +0000, EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [System V ABI for AMD64]
    Also it makes multiple references to x64 instructions that don't
    exist in Intel or AMD docs, LEAQ and MOVABS.

    LEAQ is AT&T syntax for LEA with a "quadword" operand.

    Yes, and an LEA instruction which just calculates an address
    needs an operand size suffix... why?

    LEA need to distinguish between::
    LEA Rd,[Rb+DISP16]
    LEA Rd,[Rb+Ri<<s]
    LEA Rd,[Rb+DISP32]
    LEA Rd,[Rb+DISP64]
    LEA Rd,[Rb+Ri<<s+DISP32]
    LEA Rd,[Rb+Ri<<s+DISP64]

    In 64-bit mode there is no disp64, just disp8 and disp32,
    There were no spare bits in the ModRM byte to indicate it.

    In My 66000, there is.

    Had there been disp64 then x64 could have had a smooth expansion
    of address calculations into 64-bit space.

    Something RISC-V should have learned.

    AMD worked around the ModRM limitation to provide at least some way to
    access all of 64-bit address space. It did so by adding MOV opcodes,
    to load an imm64, and to LD/ST memory using abs64 addresses.

    And since those MOV's are different opcodes and not part of ModRM,
    LEA does not know about those 64-bit imm64 or abs64 values and cannot
    use them in general address calculations.

    And all of the various compiler code models and their addressing
    limitations follow from these discontinuities in addressing behavior.

    And it hurts--so, don't do it your ISA.

  • From MitchAlsup1@21:1/5 to EricP on Wed Jul 2 15:45:54 2025
    On Wed, 2 Jul 2025 15:20:09 +0000, EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [Alpha]
    I suspect this is because almost every code or data address would
    require a 6 instruction sequence to load it into a register for use,
    How do you compute that? When I looked at the code produced for
    Alpha, I got the impression that they wanted to support arbitrarily
    large programs and they generated such code by default, but IIRC the
typical code for loading an absolute address was by loading it from the
global table of the current function; so it typically requires 1 load
(and a 64-bit value in the global table). It also requires setting up
the global pointer on every function entry and after every call, but
    that can be amortized over several accesses to the global table.
    Yes, it is (also) using an extra memory load to pick up large
    immediates.
    It also requires a BAL to get the IP into a register.

    I have finally gotten around to turning on our working Alpha and
    compiled the following program on it:

    #include <math.h>
    extern int a, b;
    extern double x, y;
    extern void foo(void);

    int main()
    {
    foo();
    x = lgamma(y);
    a += signgam;
    a += b;
    return 0;
    }

    I have compiled this and the file containing a, b, x, y, and foo with
gcc -Wall -O -fPIC and then linked the two files with the same options.

    The result on Alpha is:

    GCC Alpha manual says that the limit with -mlarge-data (the default)
    is 2 GB of data. Larger data must use mmap or malloc.
    -mlarge-text (the default) is 4 MB code.

    GCC has no ability to generate access to Alpha's full 64-bit address
    space
    so there is no comparison with other ISA's.

    Perhaps using A64 or RV64 would be better examples.


    0000000120000620 <main>:
    120000620: 02 00 bb 27 ldah gp,2(t12)
    120000624: 48 85 bd 23 lda gp,-31416(gp)
    120000628: f0 ff de 23 lda sp,-16(sp)
    12000062c: 00 00 5e b7 stq ra,0(sp)
    120000630: 00 00 fe 2f unop
    120000634: 17 00 40 d3 bsr ra,120000694 <foo>
    120000638: 00 00 fe 2f unop
    12000063c: 00 00 fe 2f unop
    120000640: 68 80 3d 20 lda t0,-32664(gp)
    120000644: 00 00 01 8e ldt $f16,0(t0)
    120000648: 08 80 7d a7 ldq t12,-32760(gp)
    12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
    120000650: 02 00 ba 27 ldah gp,2(ra)
    120000654: 18 85 bd 23 lda gp,-31464(gp)
    120000658: 60 80 3d 20 lda t0,-32672(gp)
    12000065c: 00 00 01 9c stt $f0,0(t0)
    120000660: 58 80 7d 20 lda t2,-32680(gp)
    120000664: 00 00 43 a0 ldl t1,0(t2)
    120000668: 10 80 3d a4 ldq t0,-32752(gp)
    12000066c: 00 00 21 a0 ldl t0,0(t0)
    120000670: 02 04 41 40 addq t1,t0,t1
    120000674: 5c 80 3d 20 lda t0,-32676(gp)
    120000678: 00 00 21 a0 ldl t0,0(t0)
    12000067c: 02 04 41 40 addq t1,t0,t1
    120000680: 00 00 43 b0 stl t1,0(t2)
    120000684: 00 04 ff 47 clr v0
    120000688: 00 00 5e a7 ldq ra,0(sp)
    12000068c: 10 00 de 23 lda sp,16(sp)
    120000690: 01 80 fa 6b ret

    The load of signgam is achieved with the following sequence

    120000668: 10 80 3d a4 ldq t0,-32752(gp)
    12000066c: 00 00 21 a0 ldl t0,0(t0)

    I.e., 2 instructions, not 6.

    Yes, that is the same GOT table indirect two load sequence as x64.
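As an aside, the ldah/lda pairs in the listing above reach a 32-bit gp-relative offset by splitting it into two sign-extended 16-bit halves (ldah adds imm*65536, lda adds a sign-extended 16-bit imm). A minimal sketch of that split, assuming the usual hi/lo carry adjustment:

```python
# Sketch: split a 32-bit offset into the (hi, lo) pair used by an
# ldah/lda sequence, where both halves are sign-extended 16-bit values.

def hi_lo_split(offset: int):
    lo = ((offset & 0xFFFF) ^ 0x8000) - 0x8000   # sign-extended low 16 bits
    hi = (offset - lo) >> 16                      # carry folded into hi
    assert hi * 65536 + lo == offset
    return hi, lo

# The gp setup in the listing, ldah gp,2(t12); lda gp,-31416(gp),
# reaches t12 + 99656, i.e. hi=2, lo=-31416:
assert hi_lo_split(2 * 65536 - 31416) == (2, -31416)
```

For offsets beyond roughly ±2 GB the hi half no longer fits in 16 signed bits, which is exactly where the longer multi-instruction sequences come in.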

    I wasn't saying it had to use 6 instructions. I'm saying that if Alpha
wanted full access to its 64-bit address space, then its options
    are either to use 6 instructions to build a 64-bit immediate
    OR do two loads (maybe plus other overhead). Both those options are
    poor.

    Both waste instructions and registers.

    RV64 and A64 are in a similar boat.

    Whereas My 66000 is not.

    I wanted an ISA option that doesn't need two dependent LD's or
    6 instructions to access the whole 64-bit address space.

    My 66000 has what you want.

    It's interesting that a, b, x, y (which end up in the same linked unit
    as main()) result in code like

    120000660: 58 80 7d 20 lda t2,-32680(gp)
    120000664: 00 00 43 a0 ldl t1,0(t2)

    which could be implemented more efficiently as

    ldl t1,-32680(gp)

    but apparently the linker just fixes up the first instruction of the
    pair (either as ldq or as lda, maybe also as ldah), and maybe the
    offset of the second instruction (but not in this example); the
    benefit is that the linker just has to replace some instructions, but
    it does not have to shrink or expand the code (which would require
    changing even more instructions).

Nop'ing the first and using the second would be better because it doesn't
use a temp register, and hardware can optimize a unop away.

    We also see that the call to foo() within the same linked unit is
    linked as

    120000630: 00 00 fe 2f unop
    120000634: 17 00 40 d3 bsr ra,120000694 <foo>
    120000638: 00 00 fe 2f unop
    12000063c: 00 00 fe 2f unop

    whereas the call to lgamma() in a shared library is linked as

    120000648: 08 80 7d a7 ldq t12,-32760(gp)
    12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
    120000650: 02 00 ba 27 ldah gp,2(ra)
    120000654: 18 85 bd 23 lda gp,-31464(gp)

    This 4 instruction sequence becomes::

    CALX [IP,,GOT[n]-.]

    In my 66000 ISA.

    One can see again that both code sequences have the same size, to
    avoid shrinking or expanding.

    I wondered about those nop's.

    I would be cautious about depending on code expansion for optimization.
    I read some online remarks about a developer adding "relaxation" to the
    RV64 linker and causing it to take over an hour to run.
It is possible that the basic relaxation algorithm is factorial, O(n!)
(it has that smell to me).

    https://sourceware.org/binutils/docs/as/Xtensa-Relaxation.html#index-relaxation

    Always starting with largest size and compacting down may be best.
    That's why I was investigating the compacting linker algorithm.
    If my ISA is going to depend on it for optimization, I want to
    make sure it could be implemented easily and would not have
    uncontrollable pathological behavior. And it does look acceptable.

    The gcc manual says about these:

    | When '-msmall-data' is used,
    | the compiler can assume that all local symbols share the same '$gp'
    | value, and thus reduce the number of instructions required for a
    | function call from 4 to 1.

    We see the 4 and 1 instructions above, but it's not clear to me that
    there is a real benefit. The compiler cannot assume that an external
    reference is local, and the linker knows, but does not benefit from
    it. And for references within a compilation unit, I would hope that
    the compiler/assembler manages to use the smallest variant based on
    actual size.

    Yes but they also have their 2 GB limit which avoids all large
    address space 'issues' (ie, fobs them off onto the programmer).

    The question is what happens to the Alpha code (or RV64 or A64) when
    you remove the address space compile limits and go for the Full Monty.

    Bad things (well unexpected at best).


    Why burden all programs with the costs of large programs the way it
    is done by default on Alpha?
    ....
    I'm not saying there shouldn't be optimizations for smaller sizes.
    I'm pointing to the fact that to actually USE the 64-bit address space
    there is a large increase in code size and execute cost,
    and asking if that had to be so.

    I can use 100 GB arrays with code that is the same size as code that
    limits itself to the lower 2GB of address space (there is an option on
    Alpha compilers and linkers for that).

100 GB needs 37 bits of address. Where does that 37 come from?
And the manual says the Alpha data limit is 2 GB.

    For example, for Alpha to load a 64-bit constant requires 6
    instructions,
    24 bytes.

    I forgot to add this to the program, maybe tomorrow.

    Or pull it from the constant table just prior to the routine entry.

Long ago I read that the Alpha code standard puts constants into a table just
before the routine entry, does a BAL rTmp,+0 to copy the PC into rTmp,
and then accesses the constants in the table at negative rTmp offsets.

    That sequence is too large so they are pretty much forced
    to use an extra LDQ to pull the offset from the constant table
    located just prior to the routine entry point and requires an extra
    BAL to copy the RIP into a register as a base.

    That's not necessary. The global pointer is derived from the function
    address (in t12) on entry to the function and from the return address
    (in ra) after a jsr.

    Yes, I see that now.
t12 is the link register specified by the caller in its JAL to the above code.
    That saves it a BAL rTmp,+0 to copy the PC as a PC-rel base.

    A wasted register when you have Universal Constants.



    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jul 2 17:06:37 2025
    On Wed, 2 Jul 2025 16:56:42 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 2 Jul 2025 15:20:09 +0000, EricP wrote:


    <snip>


    whereas the call to lgamma() in a shared library is linked as

    120000648: 08 80 7d a7 ldq t12,-32760(gp)
12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
120000650: 02 00 ba 27 ldah gp,2(ra)
    120000654: 18 85 bd 23 lda gp,-31464(gp)

    This 4 instruction sequence becomes::

    CALX [IP,,GOT[n]-.]

    In my 66000 ISA.

    For current architectures, function calls use the
    procedure linkage table (PLT). The Global Offset Table
    is only used for certain static global variables.

    We figured out how to do this without a PLT and remain PIC
    using just GOT; and adjusted ISA so that the needed parts
    are present.

    If you want to leverage standard tools, you may wish
    to follow that paradigm in 66000.

    In addition, we do not need 4 control transfers to get to
    and back from an external subroutine call--just 2--the
    CALX and the RET.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Wed Jul 2 17:14:13 2025
    According to MitchAlsup1 <mitchalsup@aol.com>:
    But is there any way for the code to be emitted without indirection
    and standard ISA displacement fields without those being resolved
    by the linker (ld) and remain PIC ?!?

I don't think so. The code points to the GOT, which is a fixed distance away
but in another page, so the code can be read-only and the GOT is patched for
the current process.

    This sort of arrangement goes way back. It was in all versions of ELF libraries
    and something like it was in TSS/360 in 1966.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
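The arrangement described above (shared read-only code, per-process patched GOT) can be sketched as follows. The addresses and the signgam slot are made-up illustrations, not real layouts:

```python
# Sketch of GOT indirection: read-only code refers to fixed GOT slots;
# only the GOT is patched per process, so code pages stay shared even
# when a library lands at a different base address in each process.

class Process:
    def __init__(self, libm_base):
        # per-process GOT, patched by the loader at load time
        self.got = {"signgam": libm_base + 0x40}
        self.memory = {self.got["signgam"]: 7}   # fake data cell

    def load_signgam(self):
        # two dependent accesses, as in the Alpha listing:
        addr = self.got["signgam"]   # ldq t0, slot(gp)
        return self.memory[addr]     # ldl t0, 0(t0)

# The same (shared) code sequence works with libm at different bases:
p1, p2 = Process(0x7F0000000000), Process(0x500000000000)
assert p1.load_signgam() == 7 and p2.load_signgam() == 7
assert p1.got["signgam"] != p2.got["signgam"]
```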
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jul 2 16:56:42 2025
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 2 Jul 2025 15:20:09 +0000, EricP wrote:


    <snip>


    whereas the call to lgamma() in a shared library is linked as

    120000648: 08 80 7d a7 ldq t12,-32760(gp)
    12000064c: 00 40 5b 6b jsr ra,(t12),120000650 <main+0x30>
    120000650: 02 00 ba 27 ldah gp,2(ra)
    120000654: 18 85 bd 23 lda gp,-31464(gp)

    This 4 instruction sequence becomes::

    CALX [IP,,GOT[n]-.]

    In my 66000 ISA.

    For current architectures, function calls use the
    procedure linkage table (PLT). The Global Offset Table
    is only used for certain static global variables.

    If you want to leverage standard tools, you may wish
    to follow that paradigm in 66000.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Wed Jul 2 15:38:48 2025
    MitchAlsup1 wrote:
    On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    --------------
    Here the constant is located at 0x12028, and with -mcmodel=medany,
    this is addressed using auipc (add upper immediate to PC), whereas
    with the default model it is located at the same address, but
    addressed using gp.

    The global variables are accessed through gp using a single
    instruction in nearly all cases, e.g.:

    1059c: 8501a683 lw a3,-1968(gp) # 12050
    <b>

    This includes signgam:

    10596: 8601a683 lw a3,-1952(gp) # 12060
    <__signgam@@GLIBC_2.27>

It's unclear to me how signgam ends up right beside c, even though it
is defined in libm.so.6, i.e., in a separate binary (and I have
checked that libm is actually linked dynamically).

    Is there any way other than having the static linker preassign
    extern variable to static link resolved addresses.

    The problem with that approach is that it assumes that the shared
    library
    you access at run-time is the exact same one you linked your binary
    against.

    I am well aware of that.

    But is there any way for the code to be emitted without indirection
    and standard ISA displacement fields without those being resolved
    by the linker (ld) and remain PIC ?!?

If such references in the module (exe/dll) are fixed-size RIP-rel disp64,
plus a reference reloc in case the target moves, it should work.
If the loader finds the target is different from its link-time default,
the fixed disp64 field is large enough to hold any changed value.

    The compiler always emits RIP-rel disp64 data references. If the linker
    finds the offset is intra-module then it sets its assigned offset value,
    and marks it as compactable in a later linker stage.
    If inter-module then sets it to its target's default offset,
    marks it as non-compactable and emits a DISP64 reloc entry
    in case the target moves.

    The loader should have the code and default GOT and ro-data marked
    as Copy On Write (COW) during load-reloc so any patches fault into
    the page file, then switch the protection to Read-Only-Execute and
    Read-Only after it is finished. Pages that do require patches get their
    own private code-GOT-ro-data page copies and don't bugger up other users.

If the address space forks after loading, then the children all
    inherit the patched view of code-GOT-ro-data pages.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to EricP on Thu Jul 3 08:41:06 2025
    EricP wrote:
    MitchAlsup1 wrote:
    On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    --------------
    Here the constant is located at 0x12028, and with -mcmodel=medany,
    this is addressed using auipc (add upper immediate to PC), whereas
    with the default model it is located at the same address, but
    addressed using gp.

    The global variables are accessed through gp using a single
    instruction in nearly all cases, e.g.:

    1059c: 8501a683 lw a3,-1968(gp) #
    12050 <b>

    This includes signgam:

10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>

    It's unclear to me how signgam ends up right besides c, even though it >>>>> is defined in libm.so.6, i.e., in a separate binary (and I have
    checked that is libm actually linked dynamically).

    Is there any way other than having the static linker preassign
    extern variable to static link resolved addresses.

    The problem with that approach is that it assumes that the shared
    library
    you access at run-time is the exact same one you linked your binary
    against.

    I am well aware of that.

    But is there any way for the code to be emitted without indirection
    and standard ISA displacement fields without those being resolved
    by the linker (ld) and remain PIC ?!?

    If such references in the module (exe/dll) are fixed size RIP-rel disp64
    plus a reference reloc in case the target moves it should work.
    If loader finds target is different than default at link time then
    the fixed disp64 field is large enough to hold any changed value.

    The compiler always emits RIP-rel disp64 data references. If the linker
    finds the offset is intra-module then it sets its assigned offset value,
    and marks it as compactable in a later linker stage.
    If inter-module then sets it to its target's default offset,
    marks it as non-compactable and emits a DISP64 reloc entry
    in case the target moves.

    The loader should have the code and default GOT and ro-data marked
    as Copy On Write (COW) during load-reloc so any patches fault into
    the page file, then switch the protection to Read-Only-Execute and
    Read-Only after it is finished. Pages that do require patches get their
    own private code-GOT-ro-data page copies and don't bugger up other users.

    If the address space forks afterward loading then the children all
    inherit the patched view of code-GOT-ro-data pages.

The downside of this approach is that the two PIC modules are bound to each
other at a fixed offset, and unless they relocate together all the
inter-module references have to be patched, however many there are.
And there are likely more than just two modules involved.

    The advantage of the GOT-indirect approach is that only the
    one location needs to be patched. Plus it can use a DISP32 offset
    to access the GOT. The disadvantage is that you don't discover
    that you need GOT-indirect addressing until link time,
    and then you need to insert an extra LD with a temp register.
    And in general the compiler doesn't know whether it needs a DISP64
    or DISP32 offset to access any variable, so it is already dealing
    with variable sized references.

    The option behind door number 3 is an indirect address mode.
    That simplifies the software as the compiler only emits DISP32 offsets,
    the linker only needs to set the offset to the GOT entry and flip an
    indirect bit (so no extra LD insertion or temp register).
    But indirect addressing is an ISA feature that once added cannot be removed.
    It adds hardware complexity in the LSQ which is already
    probably the most complex module in the core.
    Some of it hardware to deal with worst case situations that likely
    never occur, like 4 page or cache line straddles.

    The conclusion I've come to is option two when combined with a
    compacting linker is best.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Wed Jul 16 01:18:59 2025
    On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:

    EricP wrote:
    MitchAlsup1 wrote:
    On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    --------------
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.

    The global variables are accessed through gp using a single
    instruction in nearly all cases, e.g.:

    1059c: 8501a683 lw a3,-1968(gp) #
    12050 <b>

    This includes signgam:

10596: 8601a683 lw a3,-1952(gp) # 12060 <__signgam@@GLIBC_2.27>

    It's unclear to me how signgam ends up right besides c, even though it >>>>>> is defined in libm.so.6, i.e., in a separate binary (and I have
    checked that is libm actually linked dynamically).

    Is there any way other than having the static linker preassign
    extern variable to static link resolved addresses.

    The problem with that approach is that it assumes that the shared
    library
    you access at run-time is the exact same one you linked your binary
    against.

    I am well aware of that.

    But is there any way for the code to be emitted without indirection
    and standard ISA displacement fields without those being resolved
    by the linker (ld) and remain PIC ?!?

    If such references in the module (exe/dll) are fixed size RIP-rel disp64
    plus a reference reloc in case the target moves it should work.
    If loader finds target is different than default at link time then
    the fixed disp64 field is large enough to hold any changed value.

    The compiler always emits RIP-rel disp64 data references. If the linker
    finds the offset is intra-module then it sets its assigned offset value,
    and marks it as compactable in a later linker stage.
    If inter-module then sets it to its target's default offset,
    marks it as non-compactable and emits a DISP64 reloc entry
    in case the target moves.

    The loader should have the code and default GOT and ro-data marked
    as Copy On Write (COW) during load-reloc so any patches fault into
    the page file, then switch the protection to Read-Only-Execute and
    Read-Only after it is finished. Pages that do require patches get their
    own private code-GOT-ro-data page copies and don't bugger up other
    users.

    If the address space forks afterward loading then the children all
    inherit the patched view of code-GOT-ro-data pages.

The downside of this approach is that the two PIC modules are bound to each
other at a fixed offset, and unless they relocate together all the
inter-module references have to be patched, however many there are.
    And there are likely more than just two modules involved.

    The advantage of the GOT-indirect approach is that only the
    one location needs to be patched. Plus it can use a DISP32 offset
    to access the GOT. The disadvantage is that you don't discover
    that you need GOT-indirect addressing until link time,
    and then you need to insert an extra LD with a temp register.

    Not if you have a CALX instruction. You predict GOT access at
    compile time, and when the linker resolves an extern, it can
    change CALX into CALA by flipping 1 bit making it the same size
    as predicted, but now control transfers to the AGEN address.
    When the linker does not resolve, ld.so can do this at run time.
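The CALX-to-CALA fixup can be sketched as a one-bit patch. The encoding and bit position below are invented for illustration (My 66000's actual encodings are not given here); the point is that resolving an extern only flips one bit in an already-sized instruction, so no code expansion or extra temp register is needed:

```python
# Sketch: link-time CALX -> CALA conversion by flipping a single bit.
# Opcode value and bit position are hypothetical.

INDIRECT_BIT = 1 << 26              # hypothetical "go through GOT" bit

def link_call(instr_word: int, symbol_resolved: bool) -> int:
    """CALX (indirect through the GOT) becomes CALA (direct to the AGEN
    address) when the linker resolves the callee; otherwise ld.so can
    perform the same flip at run time."""
    if symbol_resolved:
        return instr_word & ~INDIRECT_BIT   # CALX -> CALA
    return instr_word                        # leave for ld.so

calx = 0x0A000000 | INDIRECT_BIT | 0x1234   # made-up CALX encoding
cala = link_call(calx, symbol_resolved=True)
assert cala == (calx & ~INDIRECT_BIT)
assert bin(calx ^ cala).count("1") == 1     # exactly one bit differs
```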

    And in general the compiler doesn't know whether it needs a DISP64
    or DISP32 offset to access any variable, so it is already dealing
    with variable sized references.

    It is unlikely that the address space is so cluttered that GOT
    cannot be placed within ±2GB of IP--and still transfer control
    to anywhere in 64-bit VAS.

    The option behind door number 3 is an indirect address mode.

    BINGO--that is effectively what CALX and CALA are.

    That simplifies the software as the compiler only emits DISP32 offsets,
    the linker only needs to set the offset to the GOT entry and flip an
    indirect bit (so no extra LD insertion or temp register).
    But indirect addressing is an ISA feature that once added cannot be
    removed.

    Note: CALX and CALA are indirect only so far as they load IP
    and not any register. Also note: they are CALLs not BRs.

    It adds hardware complexity in the LSQ which is already
    probably the most complex module in the core.

    CALX performs through ICache not DCache.

    Some of it hardware to deal with worst case situations that likely
    never occur, like 4 page or cache line straddles.

    The conclusion I've come to is option two when combined with a
    compacting linker is best.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Jul 18 13:23:04 2025
    MitchAlsup1 wrote:
    On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:

    EricP wrote:
    MitchAlsup1 wrote:
    On Tue, 1 Jul 2025 21:07:13 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Tue, 1 Jul 2025 16:08:09 +0000, Anton Ertl wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    --------------
Here the constant is located at 0x12028, and with -mcmodel=medany,
this is addressed using auipc (add upper immediate to PC), whereas
with the default model it is located at the same address, but
addressed using gp.

    The global variables are accessed through gp using a single
    instruction in nearly all cases, e.g.:

    1059c: 8501a683 lw a3,-1968(gp) #
    12050 <b>

    This includes signgam:

    10596: 8601a683 lw a3,-1952(gp) # 12060 >>>>>>> <__signgam@@GLIBC_2.27>

    It's unclear to me how signgam ends up right besides c, even
    though it
    is defined in libm.so.6, i.e., in a separate binary (and I have
    checked that is libm actually linked dynamically).

    Is there any way other than having the static linker preassign
    extern variable to static link resolved addresses.

    The problem with that approach is that it assumes that the shared
    library
    you access at run-time is the exact same one you linked your binary
    against.

    I am well aware of that.

    But is there any way for the code to be emitted without indirection
    and standard ISA displacement fields without those being resolved
    by the linker (ld) and remain PIC ?!?

If such references in the module (exe/dll) are fixed size RIP-rel disp64
plus a reference reloc in case the target moves it should work.
    If loader finds target is different than default at link time then
    the fixed disp64 field is large enough to hold any changed value.

    The compiler always emits RIP-rel disp64 data references. If the linker
finds the offset is intra-module then it sets its assigned offset value,
and marks it as compactable in a later linker stage.
    If inter-module then sets it to its target's default offset,
    marks it as non-compactable and emits a DISP64 reloc entry
    in case the target moves.

    The loader should have the code and default GOT and ro-data marked
    as Copy On Write (COW) during load-reloc so any patches fault into
    the page file, then switch the protection to Read-Only-Execute and
    Read-Only after it is finished. Pages that do require patches get their
    own private code-GOT-ro-data page copies and don't bugger up other
    users.

    If the address space forks afterward loading then the children all
    inherit the patched view of code-GOT-ro-data pages.

    The downside of this approach is the two PIC modules are bound to each
    other at a fixed offset and unless they relocate together then all the
    inter-module references have to be patched, however many there are.
    And there are likely more than just two modules involved.

    The advantage of the GOT-indirect approach is that only the
    one location needs to be patched. Plus it can use a DISP32 offset
    to access the GOT. The disadvantage is that you don't discover
    that you need GOT-indirect addressing until link time,
    and then you need to insert an extra LD with a temp register.

    Not if you have a CALX instruction. You predict GOT access at
    compile time, and when the linker resolves an extern, it can
    change CALX into CALA by flipping 1 bit making it the same size
    as predicted, but now control transfers to the AGEN address.
    When the linker does not resolve, ld.so can do this at run time.

    And in general the compiler doesn't know whether it needs a DISP64
    or DISP32 offset to access any variable, so it is already dealing
    with variable sized references.

    It is unlikely that the address space is so cluttered that GOT
    cannot be placed within ±2GB of IP--and still transfer control
    to anywhere in 64-bit VAS.

    You misunderstand - in the full 64-bit address space (the large memory model)
    I want to eliminate the extra address load for intra-module references
    so it only indirects through GOT for inter-module references.

    To support full 64-bit addresses, the approach chosen was to
    turn all program memory references into two, a LD of a disp64 or
    an absolute GOT address, then the data access.

    That first extra memory load is an unnecessary 64-bit "tax".

Getting rid of this requires the compiler to know the difference between an
"extern" intra-module reference, which can use direct RIP-disp64 addressing,
and a "dllimport" inter-module reference, which must load the absolute
address from the GOT table first, then use that.
    But GCC has no "dllimport" attribute for declarations, only MSVC does.

OR it requires the compiler to emit a worst-case access sequence for every
    global variable access, and have the linker edit and compact the code
    as it discovers which are "extern" and which are "dllimport" references,
    the compacting linker approach.

    The option behind door number 3 is an indirect address mode.

    BINGO--that is effectively what CALX and CALA are.

    That simplifies the software as the compiler only emits DISP32 offsets,
    the linker only needs to set the offset to the GOT entry and flip an
    indirect bit (so no extra LD insertion or temp register).
    But indirect addressing is an ISA feature that once added cannot be
    removed.

    Note: CALX and CALA are indirect only so far as they load IP
    and not any register. Also note: they are CALLs not BRs.

    It adds hardware complexity in the LSQ which is already
    probably the most complex module in the core.

    CALX performs through ICache not DCache.

    Some of it hardware to deal with worst case situations that likely
    never occur, like 4 page or cache line straddles.

    The conclusion I've come to is option two when combined with a
    compacting linker is best.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jul 18 19:54:21 2025
    On Fri, 18 Jul 2025 17:23:04 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:

    EricP wrote:
    <snip>
    And in general the compiler doesn't know whether it needs a DISP64
    or DISP32 offset to access any variable, so it is already dealing
    with variable sized references.

    It is unlikely that the address space is so cluttered that GOT
    cannot be placed within ±2GB of IP--and still transfer control
    to anywhere in 64-bit VAS.

    You misunderstand - in the full 64-bit address space (the large memory
    model)
    I want to eliminate the extra address load for intra-module references
    so it only indirects through GOT for inter-module references.

    Yes, what we did was to make GOT 32-bit addressable from the current
    module to reduce the size of the indirecting LD. But in My 66000
the indirect remains a single instruction instead of AUIPC+LDD.

    To support full 64-bit addresses, the approach chosen was to
    turn all program memory references into two, a LD of a disp64 or
    an absolute GOT address, then the data access.

    That first extra memory load is an unnecessary 64-bit "tax".

    Agreed; but I would label this as the "extern" tax as it is still
    required in dynamically loaded modules in the small (32-bit) model.

Getting rid of this requires the compiler to know the difference between an
"extern" intra-module reference, which can use direct RIP-disp64 addressing,
and a "dllimport" inter-module reference, which must load the absolute
address from the GOT table first, then use that.
    But GCC has no "dllimport" attribute for declarations, only MSVC does.

    What the compiler/linker pair needs to know is that the variable is
    "extern" but will be "resolved" at link time.

    OR it requires the compiler emit a worst-case access sequence for every global variable access, and have the linker edit and compact the code
    as it discovers which are "extern" and which are "dllimport" references,
    the compacting linker approach.

    Compacting is a lot better than expanding.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Sun Jul 20 13:05:21 2025
    MitchAlsup1 wrote:
    On Fri, 18 Jul 2025 17:23:04 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Thu, 3 Jul 2025 12:41:06 +0000, EricP wrote:

    EricP wrote:
    <snip>
    And in general the compiler doesn't know whether it needs a DISP64
    or DISP32 offset to access any variable, so it is already dealing
    with variable sized references.

    It is unlikely that the address space is so cluttered that GOT
    cannot be placed within ±2GB of IP--and still transfer control
    to anywhere in 64-bit VAS.

    You misunderstand - in the full 64-bit address space (the large memory
    model) I want to eliminate the extra address load for intra-module
    references so it only indirects through GOT for inter-module references.

    Yes, what we did was to make GOT 32-bit addressable from the current
    module to reduce the size of the indirecting LD. But in My 66000
    the indirect remains a single instruction instead of AUIPC+LDD.

    To support full 64-bit addresses, the approach chosen was to
    turn all program memory references into two, a LD of a disp64 or
    an absolute GOT address, then the data access.

    That first extra memory load is an unnecessary 64-bit "tax".

    Agreed; but I would label this as the "extern" tax as it is still
    required in dynamically loaded modules in the small (32-bit) model.

    Getting rid of this requires the compiler to know the difference between
    an "extern" intra-module reference, which can use direct RIP-disp64
    addressing, and a "dllimport" inter-module reference, which must load
    the absolute address from the GOT table first, then use that.
    But GCC has no "dllimport" attribute for declarations; only MSVC does.

    What the compiler/linker pair needs to know is that the variable is
    "extern" but will be "resolved" at link time.

    I wanted to avoid the traditional approach of editing the language source
    code to add all sorts of implementation specific compiler attributes for
    global variables, like dllimport, stdcall, etc for MS, (GCC has its own
    list of attributes, as do all compilers for all languages).

    One way to solve these kinds of issues could be a compile command-line
    option to specify a definitions file (or files) that provides all the
    extra symbol attribute information a compiler needs to generate optimal
    code for different circumstances, not just global references but also
    optimal ABIs for specific routines:
    -sym_attrib=<file_name>,<file_name>,<file_name>,...

    For example, if I am compiling code that references the C RTL and I
    intend to link with the shared CRTL.DLL then I compile with
    -sym_attrib=CRTLSHR.DEF
    and that tells the compiler more information about all the symbols in
    the C RTL, specific to the shared DLL version.

    A different file would be used when the C RTL is included by the linker.

    This -sym_attrib file could also include ABI info for specific routines,
    like changing the register assignments for specific arguments of specific routines.

    OR it requires the compiler emit a worst-case access sequence for every
    global variable access, and have the linker edit and compact the code
    as it discovers which are "extern" and which are "dllimport" references,
    the compacting linker approach.

    Compacting is a lot better than expanding.

    Yes. Compacting is better as you start with working (functionally
    correct) but possibly oversize code and try to make it smaller working
    code. Compacting can stop at any point as it is always dealing with
    working code.

    Expanding starts with possibly broken code because a branch, call,
    or global ref is out of range or the wrong kind of reference,
    then expands each broken item to make it function correctly,
    and then deals with all the things that broke because of those expansions,
    and so on. Expansion can't stop until all broken items are fixed.

    In theory these both deal with the same number of items and should
    produce the same optimal result. The difference is that when faced
    with a pathological case compacting can just give up at any point
    while expansion must run to completion.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Sun Jul 20 17:33:29 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Yes. Compacting is better as you start with working (functionally correct)
    but possibly oversize code and tries to make it smaller working code.
    Compacting can stop at any point as it is always dealing with working code.

    Expanding starts with possible broken code because a branch, call,
    or global ref is out of range or the wrong kind of reference,
    then expands each broken item to make it function correctly,
    and then deals with all the things that broke because of those expansions,
    and so on. Expansion can't stop until all broken items are fixed.

    In theory these both deal with the same number of items and should
    produce the same optimal result.

    Which theory is that? In theory the general case (where some things
    can need more space as other things need less) is NP-complete (Thomas
    G. Szymanski: Assembling Code for Machines with Span-Dependent
    Instructions, CACM 21(4), 1978, p. 300-308). Neither "compacting" nor "expanding" is guaranteed to be optimal, but for "expanding" you just
    have to make sure that you never compact, whereas with "compacting"
    you could compact some things that look compactable, but find that the
    result is no longer correct, because an earlier-compacted thing needs
    to expand.

    But let's rule out the shrinking-this-causes-growth-elsewhere cases,
    then the "compacting" approach can be caught in a steady state where it
    sees no opportunity for shrinking, but one or more span-dependent
    instructions can be compacted. So the "expanding" approach can
    produce a smaller result than the "compacting" approach.

    The difference is that when faced
    with a pathological case compacting can just give up at any point
    while expansion must run to completion.

    And what's the problem with that?

    Read more about "Assembling Span-Dependent Instructions", and
    misconceptions about this topic, at <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Jul 20 20:09:09 2025
    On Sun, 20 Jul 2025 17:33:29 +0000, Anton Ertl wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Yes. Compacting is better as you start with working (functionally
    correct) but possibly oversize code and tries to make it smaller
    working code. Compacting can stop at any point as it is always dealing
    with working code.

    Expanding starts with possible broken code because a branch, call,
    or global ref is out of range or the wrong kind of reference,
    then expands each broken item to make it function correctly,
    and then deals with all the things that broke because of those
    expansions,
    and so on. Expansion can't stop until all broken items are fixed.

    In theory these both deal with the same number of items and should
    produce the same optimal result.

    Which theory is that? In theory the general case (where some things
    can need more space as other things need less) is NP-complete (Thomas
    G. Szymanski: Assembling Code for Machines with Span-Dependent
    Instructions, CACM 21(4), 1978, p. 300-308). Neither "compacting" nor "expanding" is guaranteed to be optimal, but for "expanding" you just
    have to make sure that you never compact, whereas with "compacting"
    you could compact some things that look compactable, but find that the
    result is no longer correct, because an earlier-compacted thing needs
    to expand.
    -------------------
    Mc 88100 used a compacting linker. The compiler would produce "large"
    code, and the linker would compact it. There were several properties
    to observe::

    a) the linker would make a pass over the code and assign preliminary
    addresses to code and to data using the large model. The displacements
    can only shrink from this point.

    b) a second pass would compact instructions when the code or data
    could be addressed with the small model. Code size can only shrink
    by performing this procedure.

    c) once a piece of code had been (b)ed its addresses became fixed
    so there were certain pieces that were not optimal but something
    like 98% of all dynamic references were as optimal as they could be.

    d) at no step along the 2 passes is any of the code non-executable.
    The only things possibly left out are long references that could
    have been compacted.

    e) I am willing to live with that 2% degradation.
    -------------------
    We are now living in a world where certain ISAs have great difficulty
    in accessing sections/segments that are "very far away"--RISC-V, as an
    example, accessing a statically positioned piece of data that is
    farther away than 4GB. RISC-V is faced with building a pointer to
    that reference and loading that pointer in order to access that
    datum, or pasting a bunch of bits together in order to perform
    such an access. In the former case, you get a doubleword pointer in
    .data within reach of AUIPC+LDD, so the cost is 5 words (2 in .data,
    3 instructions: AUIPC+LDD+LD/ST). In the latter case, one has to
    create the top 32 bits and then merge with AUIPC+LDA(lo()); IIRC this
    is 6 instructions.

    Access to dynamically linked subroutines that can be placed "way
    far away" have similar problems we generally solve with GOT and
    PLTs.

    In RISC-V's case:: the compacting linker has to find the DW in .DATA
    and the 3 instructions and express the same semantic content in fewer
    instructions, and then eliminate the DW in .DATA.
    -------------------
    With My 66000 ISA the instructions stay the same, but the size of the displacement changes--except in the case where CALX is converted into
    CALA and this is performed by flipping a single bit in the minor opcode.

    CALL can reach ±2^27 bytes
    CALA can reach ±2^33 bytes or 2^64 bytes with a long displacement
    CALX can reach ±2^33 bytes or 2^64 bytes with a long displacement

    My 66000 SW model does not use a PLT and avoids the delay of the
    Trampoline. CALX [IP,,GOT[name#]-.] transfers control to the
    subroutine at the address contained in that GOT entry.

    But let's rule out the shrinking-this-causes-growth-elsewhere cases,

    Mc 88100 and My 66000 do not have this problem. Compacting A can only
    add opportunities to compact B.

    then the "compacting" approach can be caught in a steady state where it
    sees no opportunity for shrinking, but one or more span-dependent instructions can be compacted. So the "expanding" approach can
    produce a smaller result than the "compacting" approach.

    Automatic (wire) routers have this problem too. I once "river routed"
    the 138 wires between the multiplier array and the accumulator tree
    causing the layout of that block to be only 40% the size of the auto
    routed equivalent.
    -------------------
    A 2-pass linker using always-correct code that gets within a couple
    of percent of optimal is about as good as one can expect/need. We
    (the Mc 88100 designers) never found any reason, in any of the
    SPEC-like benchmarks and the resulting performance, to doubt our
    choices.


    The difference is that when faced
    with a pathological case compacting can just give up at any point
    while expansion must run to completion.

    And what's the problem with that?

    Read more about "Assembling Span-Dependent Instructions", and
    misconceptions about this topic, at <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Mon Jul 21 09:01:53 2025
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Which theory is that? In theory the general case (where some things
    can need more space as other things need less) is NP-complete (Thomas
    G. Szymanski: Assembling Code for Machines with Span-Dependent
    Instructions, CACM 21(4), 1978, p. 300-308).

    That paper involves compacting A->B->C branch chains which is NP-complete.

    It's been about 40 years since I wrote an assembler that did compacting for the ROMP, but it started with all A->B branches long, and made passes over the code compacting what it could until it didn't find any more. It didn't try to handle branch chains, so compacting never made anything out of range.

    That worked well enough and as I recall two passes was invariably enough.

    R's,
    John
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to John Levine on Mon Jul 21 12:33:14 2025
    John Levine <johnl@taugh.com> writes:
    That paper involves compacting A->B->C branch chains which is NP-complete.

    If you want to say that the paper tries to transform such a chain into
    C, then no, that's not the case.

    And actually in the case:

    A: jbr B
    ...
    B: jbr C
    ...
    C:

    there are only /simple expressions/ and the jbrs are non-pathological
    in terms of the paper, and the problem of minimizing the size of a
    program with only simple expressions and non-pathological
    span-dependent instructions is solvable in polynomial time (the paper
    gives an algorithm for doing that in section 3).

    It's been about 40 years since I wrote an assembler that did compacting for the
    ROMP, but it started with all A->B branches long, and made passes over the code
    compacting what it could until it didn't find any more. It didn't try to handle
    branch chains, so compacting never made anything out of range.

    It probably did not deal with nonsimple nor with pathological
    span-dependent instructions, or it recognized them and always used the
    long form for them (theoretically suboptimal, but rarely occurs in
    practice), the way that gas does it to this day. Of course, you have
    to write code to recognize when a span-dependent instruction has a
    non-simple expression or is pathological, which you avoid with the
    expanding approach.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Mon Jul 21 10:50:20 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Yes. Compacting is better as you start with working (functionally correct)
    but possibly oversize code and tries to make it smaller working code.
    Compacting can stop at any point as it is always dealing with working code.
    Expanding starts with possible broken code because a branch, call,
    or global ref is out of range or the wrong kind of reference,
    then expands each broken item to make it function correctly,
    and then deals with all the things that broke because of those expansions,
    and so on. Expansion can't stop until all broken items are fixed.

    In theory these both deal with the same number of items and should
    produce the same optimal result.

    Which theory is that? In theory the general case (where some things
    can need more space as other things need less) is NP-complete (Thomas
    G. Szymanski: Assembling Code for Machines with Span-Dependent
    Instructions, CACM 21(4), 1978, p. 300-308). Neither "compacting" nor "expanding" is guaranteed to be optimal, but for "expanding" you just
    have to make sure that you never compact, whereas with "compacting"
    you could compact some things that look compactable, but find that the
    result is no longer correct, because an earlier-compacted thing needs
    to expand.

    Thanks for the reference, I'll have a look.
    I wasn't thinking of those shrink-then-reexpand or expand-then-reshrink
    approaches as they are obviously potentially metastable.
    Those are also the ones I thought smelled of factorial O(N!) cost.

    I'm comparing compact-only and expand-only approaches.

    The compacting approach I was thinking of, which I described a while
    back, is to sweep over all items: start large, calculate the lower and
    upper possible address range, shrink a reference when you know it will
    always work, iterate until it stops changing, then freeze at those sizes.
    The worst case is that each sweep shrinks only one item,
    so it requires N sweeps of N items, or O(N^2) cost.
    But it can stop at any point.

    The expand approach is similar but starts small and expands when a
    reference is broken (out of range). Its worst case is when fixing a
    reference breaks one or more others (but crucially this can only
    happen at most once per item), so it also requires N sweeps of N items,
    or O(N^2) cost.
    But it can't stop iterating until all references are fixed.

    In reality, both would likely terminate after 3 or 4 sweeps.
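
    The compact-only sweep with lower/upper address bounds can be sketched
    as follows. This is my reading of the approach, not code from the
    thread; SHORT, LONG, and SHORT_RANGE are invented two-encoding numbers,
    and anything still undecided when the sweeps settle is left long, which
    is always correct:

```python
# Sketch (my construction) of the compact-only sweep: branches start
# undecided between a SHORT and a LONG encoding, data items are fixed.
# A branch is frozen SHORT only when even the worst-case (all-large)
# layout keeps its target in range, and frozen LONG only when even the
# best-case (all-small) layout cannot; leftovers default to LONG.

SHORT, LONG = 2, 6      # hypothetical encoding sizes in bytes
SHORT_RANGE = 127       # hypothetical reach of the SHORT offset

def span(items, decided, i, t, worst):
    """Bytes between branch i and target t, using LONG (worst=True) or
    SHORT (worst=False) for every still-undecided branch in between."""
    a, b = (i, t) if i < t else (t, i)
    total = 0
    for j in range(a, b):
        kind, _ = items[j]
        if kind == 'branch' and decided[j] is None:
            total += LONG if worst else SHORT
        else:
            total += decided[j]
    return total

def compact(items):
    """items: list of ('branch', target_index) or ('data', None)."""
    decided = [SHORT if kind == 'data' else None for kind, _ in items]
    changed = True
    while changed:
        changed = False
        for i, (kind, t) in enumerate(items):
            if kind == 'branch' and decided[i] is None:
                if span(items, decided, i, t, worst=True) <= SHORT_RANGE:
                    decided[i], changed = SHORT, True
                elif span(items, decided, i, t, worst=False) > SHORT_RANGE:
                    decided[i], changed = LONG, True
    return [LONG if d is None else d for d in decided]
```

    Note that this can be stopped after any sweep and still yield working
    (if occasionally oversize) code, which is the property argued for above.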

    But let's rule out the shrinking-this-causes-growth-elsewhere cases,
    then the "compacting" approach can be caught in a steady state where it
    sees no opportunity for shrinking, but one or more span-dependent instructions can be compacted. So the "expanding" approach can
    produce a smaller result than the "compacting" approach.

    Yes, the compact approach I described will miss co-dependent shrinks
    where either both shrink or neither shrinks, whereas expand will catch
    those. Those should only occur when there are two or more references that cross
    ranges that both happen to be right of the packing boundary,
    which should be infrequent and is harmless for compacting as it only
    means the occasional reference will be too big but still working.

    The difference is that when faced
    with a pathological case compacting can just give up at any point
    while expansion must run to completion.

    And what's the problem with that?

    Given a big job, your expanding linker goes away and never comes back.
    You can't produce a working product, your company goes bankrupt,
    you die a penniless pauper in a homeless shelter.

    My compacting linker always produces a product, which I ship to customers,
    rake in the big bucks, and live happily ever after.

    Read more about "Assembling Span-Dependent Instructions", and
    misconceptions about this topic, at <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>

    - anton

    Right, but I was never even considering those metastable approaches.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Jul 21 17:01:08 2025
    Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    That paper involves compacting A->B->C branch chains which is NP-complete.

    If you want to say that the paper tries to transform such a chain into
    C, then no, that's not the case.

    And actually in the case:

    A: jbr B
    ...
    B: jbr C
    ...
    C:

    there are only /simple expressions/ and the jbrs are non-pathological
    in terms of the paper, and the problem of minimizing the size of a
    program with only simple expressions and non-pathological
    span-dependent instructions is solvable in polynomial time (the paper
    gives an algorithm for doing that in section 3).

    It's been about 40 years since I wrote an assembler that did compacting for the
    ROMP, but it started with all A->B branches long, and made passes over the code
    compacting what it could until it didn't find any more. It didn't try to handle
    branch chains, so compacting never made anything out of range.

    It probably did not deal with nonsimple nor with pathological
    long form for them (theoretically suboptimal, but rarely occurs in
    practice), the way that gas does it to this day. Of course, you have
    to write code to recognize when a span-dependent instruction has a
    non-simple expression or is pathological, which you avoid with the
    expanding approach.

    This is a recurring subject, personally I've found that the algorithm
    used by Ivan G for Mill is "Good Enough" (TM).

    For every branch or RIP-dependent load instruction with multiple
    possible encodings, create both/all versions and determine how long each
    will be.

    For each branch, calculate both the maximum and minimum length across
    all intermediate variable-length instructions: If the max is <= shortest
    form, use that and lock it down (remove from variable list), if min >=
    shortest long span, use that and also remove it from the list.

    After a very short number of passes, most code will have settled at or
    very near the theoretical optimum.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Mon Jul 21 15:26:55 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    For each branch, calculate both the maximum and minimum length across
    all intermediate variable-length instructions: If the max is <= shortest
    form, use that and lock it down (remove from variable list), if min >=
    shortest long span, use that and also remove it from the list.

    After a very short number of passes, most code will have settled at or
    very near the theoretical optimum.

    What do you do with those that stay in the variable list?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to EricP on Mon Jul 21 15:28:41 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    The compacting approach I was thinking of, which I described a while back,
    is sweep over all items, start large, calculate lower and upper possible
    address range, shrink a reference when you know it will always work,
    iterate until it stops changing, freeze at those sizes.

    Taking the example from <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>:

    foo:
    movl foo+133-bar(%rdi),%eax
    bar:

    what does your approach do? What does "lower and upper possible
    address range" mean? How do you know it will always work?

    The difference is that when faced
    with a pathological case compacting can just give up at any point
    while expansion must run to completion.

    And what's the problem with that?

    Given a big job, your expanding linker goes away and never comes back.

    As someone wrote in this thread:

    In reality, both would likely terminate after 3 or 4 sweeps.

    Plus the expanding approach does not need to "calculate lower and
    upper possible address range", nor determine whether "it will always
    work". If the operand needs too much space, expand (and remember to do
    another sweep); that's all.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Jul 21 18:06:33 2025
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    For each branch, calculate both the maximum and minimum length across
    all intermediate variable-length instructions: If the max is <= shortest
    form, use that and lock it down (remove from variable list), if min >=
    shortest long span, use that and also remove it from the list.

    After a very short number of passes, most code will have settled at or
    very near the theoretical optimum.

    What do you do with those that stay in the variable list?

    Still don't know if it can use short or long form.


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Terje Mathisen on Mon Jul 21 17:26:28 2025
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    For each branch, calculate both the maximum and minimum length across
    all intermediate variable-length instructions: If the max is <= shortest
    form, use that and lock it down (remove from variable list), if min >=
    shortest long span, use that and also remove it from the list.

    After a very short number of passes, most code will have settled at or
    very near the theoretical optimum.

    What do you do with those that stay in the variable list?

    Still don't know if it can use short or long form.

    So what happens if some are never removed from the variable list?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon Jul 21 19:31:42 2025
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    For each branch, calculate both the maximum and minimum length across
    all intermediate variable-length instructions: If the max is <= shortest
    form, use that and lock it down (remove from variable list), if min >=
    shortest long span, use that and also remove it from the list.

    After a very short number of passes, most code will have settled at or
    very near the theoretical optimum.

    What do you do with those that stay in the variable list?

    Still don't know if it can use short or long form.

    So what happens if some are never removed from the variable list?

    Just use the long form since the code will always work that way.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Mon Jul 21 17:08:49 2025
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    The compacting approach I was thinking of, which I described a while back,
    is sweep over all items, start large, calculate lower and upper possible
    address range, shrink a reference when you know it will always work,
    iterate until it stops changing, freeze at those sizes.

    Taking the example from <http://www.complang.tuwien.ac.at/anton/assembling-span-dependent.html>:

    foo:
    movl foo+133-bar(%rdi),%eax
    bar:

    what does your approach do? What does "lower and upper possible
    address range" mean? How do you know it will always work?

    This is the method Terje is describing, that Ivan uses.

    Assuming that there are only two offset sizes...

    Each item has two potential assigned addresses.
    The lower address is the sum of all prior smaller object sizes.
    The upper address is the sum of all prior larger object sizes.

    Then (remembering that forward and backward offsets have different
    ranges), if the largest offset difference between a reference and its
    target fits into the small offset size, or the smallest offset
    difference only fits into the large offset size, mark the item
    resolved and fix it at that size.

    The above compacting also works with alignment directives.
    Alignments have as much effect on the results as the variable-sized
    offsets, or more. An alignment behaves like a variable-size object of
    0..(A-1) bytes whose size depends on what address it starts on.
    These also have two sizes, one for the lower and one for the upper
    address, and alignments can change size each time new addresses are
    assigned.
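
    A tiny illustration (with hypothetical addresses) of why an alignment
    directive acts as a variable-size object: the padding it inserts
    depends on the address it lands at, so it differs between the
    lower-bound and upper-bound layouts.

```python
# Hypothetical example: the same align-8 directive costs a different
# number of padding bytes depending on where the preceding code ends,
# so its size under the lower-bound layout can differ from its size
# under the upper-bound layout.

def align_pad(addr, a):
    """Padding bytes needed to round addr up to a multiple of a."""
    return (-addr) % a

lo_addr, hi_addr = 10, 14       # assumed lower/upper address before an align-8
print(align_pad(lo_addr, 8))    # 6 bytes if everything before it shrank
print(align_pad(hi_addr, 8))    # 2 bytes if everything stayed large
```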

    The difference is that when faced
    with a pathological case compacting can just give up at any point
    while expansion must run to completion.
    And what's the problem with that?
    Given a big job, your expanding linker goes away and never comes back.

    As somone wrote in this thread:

    In reality, both would likely terminate after 3 or 4 sweeps.

    For non-pathological cases.

    Plus the expanding approach does not need to "calculate lower and
    upper possible address range",

    two subtracts

    nor determine whether "it will always work".

    two compares

    If the operand needs too much space, expand (and remember to do
    another sweep); that's all.

    - anton

    I have no control over whether a pathological case can occur.
    But I can see that they are possible.

    If doing a link with a large number of items, my preference would be
    to have a link optimizer method that can be terminated after a
    specified number of iterations and still produce a working exe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)