• Compiler use of instructions (was: Oops)

    From Anton Ertl@21:1/5 to mitchalsup@aol.com on Sun May 12 06:36:33 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Anton Ertl wrote:
    1) The compiler writers found it too hard to use the complex
    instructions or addressing modes. For some kinds of instructions that
    is the case (e.g, for the AES instructions in Intel and AMD CPUs), but
    at least these days such instructions are there for use in libraries
    written in assembly language/with intrinsics.

    The 801 was correct on this::

    The compiler must be developed at the same time as ISA, if ISA has it
    and the compiler cannot use it then why is it there {yes there are
    certain privileged instructions lacking this property}

    In case of the AMD64+ AES instructions, they are there to support
    efficient AES libraries that do not have the cache side channel for
    which Daniel Bernstein presented an exploit in:

    @Unpublished{bernstein05,
    author = {Bernstein, Daniel J.},
    title = {Cache-timing attacks on {AES}},
    note = {},
    year = {2005},
    url={https://cr.yp.to/antiforgery/cachetiming-20050414.pdf},
    OPTannote = {}
    }

    So they are there because AES consumes significant amounts of CPU in
    some application scenarios, and because providing these instructions
    is helpful for security.

    In instruction sets like the S/360, VAX and 8086 that were designed
    when significant amounts of software were still written in assembly
    language, there are some instructions that are designed for use by
    assembly language programmers that compilers tend not to use or not
    use well (e.g., LODS on the 8086, IA-32 and AMD64). But these kinds
    of instructions have not been added in the last decades, since
    assembly-language programming has become a tiny niche.

    Conversely, if the
    compiler could almost use an instruction but does not, then adjust
    the instruction specification so the compiler can !!

    Do you have something specific in mind?

    2) Some instructions are slower than a sequence of simpler
    instructions, so compilers will avoid them even if they would
    otherwise use them.

    VAX CALL instructions did more work than what was required, it did
    the work it was specified to perform as rapidly as the HW could perform
    the specified task. It took 10 years to figure out that the CALL/RET
    overhead was excessive and wasteful.

    Interestingly, when I look into <https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf>, the
    point 2) is mentioned on page 27 as "Irrational Implementations", and
    the examples given are the S/370 load-multiple instruction for fewer
    than 4 registers and the VAX INDEX instruction. On page 30 they
    mention that PUSHL R0 is slower than MOVL R0, -(SP) on the VAX 11/780.

    In <https://dl.acm.org/doi/pdf/10.1145/2465.214917> from 1985 (<8
    years after the VAX was introduced) the authors already report that
    Michael L. Powell of DEC found the following for his experimental
    Modula-2 compiler: "By replacing the CALL instruction with a sequence
    of simple instructions that do only what is necessary, Powell was able
    to improve performance by 20 percent." I found
    <https://dl.acm.org/doi/pdf/10.1145/502874.502905> about this
    compiler, published in June 1984; it says:

    |On most processors, the procedure calling sequence defines a standard
    |interface between languages. As such, it is often more general than
    |required by a particular programming language. The compiler can detect
    |procedures that are called only by Modula-2 routines in the current
    |compilation, and replace the more expensive procedure call mechanism
    |with a simpler, faster one.

    Given how often the VAX call is mentioned, I would expect to find some
    paper with a more elaborate analysis (and I dimly remember reading
    such), but came up empty in short web searches. Anyway, let's look at
    the VAX CALLS instruction <http://odl.sysworks.biz/disk$vaxdocmar002/opsys/vmsos721/4515/4515pro_020.html>
    <https://people.computing.clemson.edu/~mark/subroutines/vax.html>.

    What it pushes on the stack:

    argument count
    registers specified in a mask that is at the start of the callee
    old pc
    old fp
    old ap (argument pointer)
    mask|psw
    condition handler (initially 0)

    Of these:

    * Pushing the argument count is not done in modern calling
    conventions. Instead, for languages like C with variable argument
    numbers, the caller is responsible for removing the arguments from
    the stack (if there are any on the stack), and in C the callee has
    to know how many arguments there are (e.g., from the format string
    for printf()). I guess VAX CALLS pushes the argument count in order
    to have a common base for these things.

    * The registers specified in the mask would be the callee-saved
    registers that the callee is using (and apparently in the usual VAX
    calling convention all registers are considered to be callee-saved,
    to minimize the code size of the call and function entry code).
    Modern calling conventions also have caller-saved registers, which
    makes leaf calls cheaper (up to a point). The mask feature could
    also be used for such calling conventions, but for many leaf
    functions the mask would then be empty.

    * Old PC is also stored by the simpler JSB instruction. RISCs store
    the old PC in a register and leave the saving to a separate store
    instruction (which is unnecessary for a leaf function).

    * Old FP is stored in frame-pointer-based calling conventions, but a
    frame pointer tends to be optional and is only used for functions
    where the stack grows by a variable amount (e.g., alloca()) or is
    needed for introspective purposes (debugging and such).

    * Old AP: an argument pointer seems unnecessary given that the
    compiler knows how big the data saved by the CALLS instruction is.
    I also have never seen it in a calling convention other than that of
    the VAX. Maybe they added this so that the backtrace can easily
    access the arguments without having to know anything about the stack
    frames.

    * mask|psw: Given the use of the mask for saving the registers on
    call, the return instruction also needs the mask; either as
    immediate argument, or on the stack; the latter is better for stack
    unwinding (throwing exceptions, debugging, and such). Modern
    calling conventions treat flags (if present) as caller-saved, but
    given that they had space left, saving the psw seems like a good use
    of the space.

    * condition handler: no idea what that was good for.

    It seems to me that a lot of this caters to having a good software
    ecosystem where lots of languages can call each other and stuff like
    debuggers easily know lots of things about the program. As the RISCs
    have demonstrated, this has a cost in performance, but OTOH, have we
    seen comparisons of the size and capabilities of backtrace-generating
    code and the like on VAX and RISCs?

    Of course, if the performance cost of these features is so high that
    compiler writers prefer to avoid the full-featured call instruction,
    the instruction misses the target, too.

    That has been reported by both the IBM 801
    project about some S/370 instructions and by the Berkeley RISC project
    about the VAX. I don't remember any reports about addressing modes
    with that problem.

    The problem with address modes is their serial decode, not with the
    ability to craft any operand the instruction needs. The second
    problem with VAX-like addressing modes is that they are overly
    expressive: all operands can be constants, whereas a good compiler
    will never need more than 1 constant per instruction (because
    otherwise some constant arithmetic could be performed at compile
    (or link) time).

    It's not just the VAX addressing modes. Consider how frequently the
    addressing modes that the 68020 had but the 68000 did not were used
    by compilers; I don't think they were used often. As for decoding,
    AFAIK parallel decoding is a solved problem (at a cost, but still).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun May 12 13:28:26 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:

    [snip description of VAX CALL instruction]


    * condition handler: no idea what that was good for.

    When unwinding the stack, if a condition handler was
    defined for a stack frame, it would be invoked to allow
    the function at that stack level to do any necessary
    cleanup (like closing files, deallocating memory, etc.)

    Sort of like C++ exceptions.

  • From John Levine@21:1/5 to All on Sun May 12 21:46:34 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Of course, if the performance cost of these features is so high that
    compiler writers prefer to avoid the full-featured call instruction,
    the instruction misses the target, too.

    The Vax suffered badly from the second system syndrome, in which its
    designers loaded it up with all the stuff that wouldn't fit on a
    PDP-11. Its design also seems to assume a world where main memory is
    expensive and microcode ROM is much faster, hence the dense
    byte-coded instructions that demand to be decoded one operand at a
    time and the
    tiny 512 byte pages.

    I also don't think they thought through the costs of implementing
    their design. The Vax chapter in the Computer Engineering book says
    that they wanted to have a single system wide procedure call, which is
    fine, but it should have been evident even at the time that all of the
    memory traffic for saving and restoring registers, including ones the
    routine doesn't need, would be really slow.

    It's perfectly possible to have a standard calling sequence without
    baking it all into microcode. All of the languages on S/360 use the
    same calling sequence, but it's made from several instructions, and
    you can leave out the ones you don't need, e.g., providing a new save
    area in leaf routines, or stack management in non-recursive routines.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
