• Compiler use of instructions (was: Oops)

    From Anton Ertl@21:1/5 to mitchalsup@aol.com on Sun May 12 06:36:33 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Anton Ertl wrote:
    1) The compiler writers found it too hard to use the complex
    instructions or addressing modes. For some kinds of instructions that
    is the case (e.g, for the AES instructions in Intel and AMD CPUs), but
    at least these days such instructions are there for use in libraries
    written in assembly language/with intrinsics.

    The 801 was correct on this::

    The compiler must be developed at the same time as ISA, if ISA has it
    and the compiler cannot use it then why is it there {yes there are
    certain privileged instructions lacking this property}

    In case of the AMD64+ AES instructions, they are there to support
    efficient AES libraries that do not have the cache side channel for
    which Daniel Bernstein presented an exploit in:

    @Unpublished{bernstein05,
    author = {Bernstein, Daniel J.},
    title = {Cache-timing attacks on {AES}},
    note = {},
    year = {2005},
    url={https://cr.yp.to/antiforgery/cachetiming-20050414.pdf},
    OPTannote = {}
    }

    So they are there because AES consumes significant amounts of CPU in
    some application scenarios, and because providing these instructions
    is helpful for security.

    In instruction sets like the S/360, VAX and 8086 that were designed
    when significant amounts of software were still written in assembly
    language, there are some instructions that are designed for use by
    assembly language programmers that compilers tend not to use or not
    use well (e.g., LODS on the 8086, IA-32 and AMD64). But these kinds
    of instructions have not been added in the last decades, since
    assembly-language programming has become a tiny niche.

    Conversely, if the
    compiler could almost use an instruction but does not, then adjust
    the instruction specification so the compiler can !!

    Do you have something specific in mind?

    2) Some instructions are slower than a sequence of simpler
    instructions, so compilers will avoid them even if they would
    otherwise use them.

    VAX CALL instructions did more work than what was required, it did
    the work it was specified to perform as rapidly as the HW could perform
    the specified task. It took 10 years to figure out that the CALL/RET
    overhead was excessive and wasteful.

    Interestingly, when I look into <https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pdf>, the
    point 2) is mentioned on page 27 as "Irrational Implementations", and
    the examples given are the S/370 load-multiple instruction for fewer
    than 4 registers and the VAX INDEX instruction. On page 30 they
    mention that PUSHL R0 is slower than MOVL R0, -(SP) on the VAX 11/780.

    In <https://dl.acm.org/doi/pdf/10.1145/2465.214917> from 1985 (<8
    years after the VAX was introduced) the authors already report that
    Michael L. Powell of DEC found the following for his experimental
    Modula-2 compiler: "By replacing the CALL instruction with a sequence
    of simple instructions that do only what is necessary, Powell was able
    to improve performance by 20 percent." I found
    <https://dl.acm.org/doi/pdf/10.1145/502874.502905> about this
    compiler, published in June 1984; it says:

    |On most processors, the procedure calling sequence defines a standard
    |interface between languages. As such, it is often more general than
    |required by a particular programming language. The compiler can detect
    |procedures that are called only by Modula-2 routines in the current
    |compilation, and replace the more expensive procedure call mechanism
    |with a simpler, faster one.

    Given how often the VAX call is mentioned, I would expect to find some
    paper with a more elaborate analysis (and I dimly remember reading
    such), but came up empty in short web searches. Anyway, let's look at
    the VAX CALLS instruction <http://odl.sysworks.biz/disk$vaxdocmar002/opsys/vmsos721/4515/4515pro_020.html>
    <https://people.computing.clemson.edu/~mark/subroutines/vax.html>.

    What it pushes on the stack:

    argument count
    registers specified in a mask that is at the start of the callee
    old pc
    old fp
    old ap (argument pointer)
    mask|psw
    condition handler (initially 0)

    Of these:

    * Pushing the argument count is not done in modern calling
    conventions. Instead, for languages like C with variable argument
    numbers, the caller is responsible for removing the arguments from
    the stack (if there are any on the stack), and in C the callee has
    to know how many arguments there are (e.g., from the format string
    for printf()). I guess VAX CALLS pushes the argument count in order
    to have a common base for these things.

    * The registers specified in the mask would be the callee-saved
    registers that the callee is using (and apparently in the usual VAX
    calling convention all registers are considered to be callee-saved,
    to minimize the code size of the call and function entry code).
    Modern calling conventions also have caller-saved registers, which
    makes leaf calls cheaper (up to a point). The mask feature could
    also be used for such calling conventions, but for many leaf
    functions the mask would then be empty.

    * Old PC is also stored by the simpler JSB instruction. RISCs store
    the old PC in a register and leave the saving to a separate store
    instruction (which is unnecessary for a leaf function).

    * Old FP is stored in frame-pointer-based calling conventions, but a
    frame pointer tends to be optional and is only used for functions
    where the stack grows by a variable amount (e.g., alloca()) or is
    needed for introspective purposes (debugging and such).

    * Old AP: an argument pointer seems unnecessary given that the
    compiler knows how big the data saved by the CALLS instruction is.
    I also have never seen it in a calling convention other than that of
    the VAX. Maybe they added this so that the backtrace can easily
    access the arguments without having to know anything about the stack
    frames.

    * mask|psw: Given the use of the mask for saving the registers on
    call, the return instruction also needs the mask; either as
    immediate argument, or on the stack; the latter is better for stack
    unwinding (throwing exceptions, debugging, and such). Modern
    calling conventions treat flags (if present) as caller-saved, but
    given that they had space left, saving the psw seems like a good use
    of the space.

    * condition handler: no idea what that was good for.

    It seems to me that a lot of this caters to having a good software
    ecosystem where lots of languages can call each other and stuff like
    debuggers easily know lots of things about the program. As the RISCs
    have demonstrated, this has a cost in performance, but OTOH, have we
    seen comparisons of the size and capabilities of backtrace-generating
    code and the like on VAX and RISCs?

    Of course, if the performance cost of these features is so high that
    compiler writers prefer to avoid the full-featured call instruction,
    the instruction misses the target, too.

    That has been reported by both the IBM 801
    project about some S/370 instructions and by the Berkeley RISC project
    about the VAX. I don't remember any reports about addressing modes
    with that problem.

    The problem with address modes is their serial decode, not with the
    ability to craft any operand the instruction needs. The second
    problem with VAX-like addressing modes is that they are overly
    expressive: all operands can be constants, whereas a good compiler
    will never need more than 1 constant per instruction (because
    otherwise some constant arithmetic could be performed at compile
    (or link) time).

    It's not just the VAX addressing modes. Consider how frequently the
    addressing modes that the 68020 had but the 68000 did not were used
    by compilers; I don't think they were used often. As for decoding,
    AFAIK parallel decoding is a solved problem (at a cost, but still).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Anton Ertl on Sun May 12 13:28:26 2024
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:

    [snip description of VAX CALL instruction]


    * condition handler: no idea what that was good for.

    When unwinding the stack, if a condition handler was
    defined for a stack frame, it would be invoked to allow
    the function at that stack level to do any necessary
    cleanup (like closing files, deallocating memory, etc.)

    Sort of like C++ exceptions.

  • From John Levine@21:1/5 to All on Sun May 12 21:46:34 2024
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Of course, if the performance cost of these features is so high that
    compiler writers prefer to avoid the full-featured call instruction,
    the instruction misses the target, too.

    The Vax suffered badly from the second system syndrome, in which its
    designers loaded it up with all the stuff that wouldn't fit on a
    PDP-11. Its design also seems to assume a world where main memory is
    expensive and microcode ROM is much faster, hence the dense
    byte-coded instructions that demand to be decoded one operand at a
    time and the
    tiny 512 byte pages.

    I also don't think they thought through the costs of implementing
    their design. The Vax chapter in the Computer Engineering book says
    that they wanted to have a single system wide procedure call, which is
    fine, but it should have been evident even at the time that all of the
    memory traffic for saving and restoring registers, including ones the
    routine doesn't need, would be really slow.

    It's perfectly possible to have a standard calling sequence without
    baking it all into microcode. All of the languages on S/360 use the
    same calling sequence, but it's made from several instructions, and
    you can leave out the ones you don't need, e.g., providing a new save
    area in leaf routines, or stack management in non-recursive routines.


    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
