Q+ uses vector tables for exception handling. Exceptions are processed
at the commit stage of the processor, so loading a vector from memory
at the commit stage has some issues: the pipeline must be stalled while
the vector is loaded. Using a cache of vector addresses might help.
There are only 256 vectors in a vector table, so it is small. With a
vector table cache, the vector would be immediately available at the
commit stage most of the time, with single-cycle access. The table
would store the vector and the current process id. If the process id of
the vector does not match the current process id, then the vector table
entry would need to be loaded. The machine would stall until the vector
is loaded. Modifying a vector table entry would need to invalidate the
corresponding entry in the cache.
There are four vector tables, one for each operating mode, but that is
still only 1024 vectors to cache.
The machine-level vector table should be pretty stable, so it would have
single-cycle access most of the time.
A dedicated vector cache would need to be able to have virtual addresses
translated to physical ones, so it might need to borrow an address
translation port from the I$ or D$.
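For concreteness, a minimal C model of the lookup such a cache would
perform; the direct-mapped 256-entry organization, the field widths,
and all names here are assumptions for illustration, not the actual
Q+ design:

#include <stdint.h>
#include <stdbool.h>

#define NUM_VECTORS 256

/* assumed helper: fetches an entry from the in-memory vector table */
extern uint64_t load_vector_from_table(uint8_t cause);

typedef struct {
    uint64_t vector;   /* handler address, privilege bits, etc. */
    uint32_t pid;      /* process id the entry was loaded under */
    bool     valid;
} vcache_entry_t;

static vcache_entry_t vcache[NUM_VECTORS];

/* Hit path: a valid entry whose pid matches gives single-cycle access.
   Miss path: the pipeline stalls while the entry is reloaded. */
uint64_t vcache_lookup(uint8_t cause, uint32_t cur_pid)
{
    vcache_entry_t *e = &vcache[cause];
    if (!e->valid || e->pid != cur_pid) {
        e->vector = load_vector_from_table(cause);  /* stall here */
        e->pid    = cur_pid;
        e->valid  = true;
    }
    return e->vector;
}

/* A store that modifies a vector table entry must invalidate the copy. */
void vcache_invalidate(uint8_t cause)
{
    vcache[cause].valid = false;
}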
On 2024-02-04 11:05 a.m., MitchAlsup1 wrote:
Robert Finch wrote:
Q+ uses vector tables for exception handling. Exceptions are processed
at the commit stage of the processor. So, loading a vector from memory
at the commit stage has some issues. The pipeline must be stalled so
the vector can be loaded. Using a cache of vector addresses might help.
There are only 256 vectors in a vector table. It’s small. With a
vector table cache, the vector would be immediately available at the
commit stage most of the time, single cycle access. The table would
store the vector and the current process id. If the process id of the
vector does not match the current process id then the vector table
entry would need to be loaded. The machine would stall until the
vector is loaded. Modifying the vector table entries would need to
invalidate the vector table entry in the cache.
In my case, under advice of counsel (EricP), I punted!
Since My 66000 has a CALX instruction--a LD to IP and storing of return
address into R0--the HW performs a context switch to the interrupt (or
exception) dispatcher. The dispatcher checks that the vector number
is within range of the appropriate table, and CALXs the ISR (or ESR).
Since many ISRs set up delayed cleanup routines, I made the priority
of pending cleanup routines visible to the SVR (supervisor return)
instruction;...
This means the table can be anywhere, on any boundary, and any size;
while simplifying "special" control transfers.
The common path through the Dispatchers is 4 instructions, and these are
placed on cache line boundaries. SW does not have to check for
pending softIRQs or DPCs (or a host of other medium-high-priority
pending threads): SVR transfers control to the highest-priority pending
thread that can run (affinity) on this core.
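A minimal sketch in C of that dispatcher common path as I read the
description; the table layout, the names, and the bad-vector fallback
are assumptions, not My 66000 specifics (CALX and SVR are the
instructions named above):

typedef void (*isr_t)(void);

struct dispatch_table {
    unsigned count;   /* number of valid vectors in the SW-owned table */
    isr_t   *isr;     /* handler entry points                          */
};

void dispatch(struct dispatch_table *t, unsigned vector)
{
    if (vector >= t->count)   /* range-check the vector number          */
        vector = 0;           /* assumed: slot 0 is a bad-vector handler */
    t->isr[vector]();         /* the CALX: load the target and call it   */
    /* an SVR here would resume the highest-priority pending thread
       that can run (affinity) on this core */
}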
There are four vector tables, one for each operating mode, but that is
still only 1024 vectors to cache.
More than a cache line starts to become problematic.
The machine level vector table should be pretty stable, so would have
single cycle access most of the time.
A dedicated vector cache would need to be able have virtual addresses
translated to physical ones. So, it might need to borrow an address
translation port from the I$ or D$.
Or, you could punt and make an efficient means to transfer control
through a table of SW's making.
I would try punting, but I would likely kick the ball sideways :) There
is a memory-indirect call instruction, but I realized I could not use it
because it is micro-coded and uses (hidden) registers. It is just an
ordinary load followed by a jump-register. It would stomp on registers
being used by the app if it were used to invoke an ISR. So, I must build
something additional, or perhaps re-write the indirect call as a state
machine. Either way, the machine is going to stall at an exception.
Maybe I should make it an indirect-call cache, more general purpose. I
can see why many RISC machines just jump to the address of the vector
rather than attempting to load it, but Q+ ISR vectors also include a
privilege level. I want the CPU to load the vector so that vectors may
be implemented as simple gateways. There may be more info associated
with them at some point.
The cache would be loading only the cache-line with the vector it needs,
so just a single cache line load is required.
I suppose I could add another port to the I$ to do loads through the I$.
It might be simpler than building another cache.
Robert Finch wrote:
Q+ uses vector tables for exception handling. Exceptions are processed
at the commit stage of the processor. So, loading a vector from memory
at the commit stage has some issues. The pipeline must be stalled so the
vector can be loaded. Using a cache of vector addresses might help.
There are only 256 vectors in a vector table. It’s small. With a vector
table cache, the vector would be immediately available at the commit
stage most of the time, single cycle access. The table would store the
vector and the current process id. If the process id of the vector does
not match the current process id then the vector table entry would need
to be loaded. The machine would stall until the vector is loaded.
Modifying the vector table entries would need to invalidate the vector
table entry in the cache.
There are four vector tables, one for each operating mode, but that is
still only 1024 vectors to cache.
The machine level vector table should be pretty stable, so would have
single cycle access most of the time.
A dedicated vector cache would need to be able have virtual addresses
translated to physical ones. So, it might need to borrow an address
translation port from the I$ or D$.
I'll pass on my answer to your question but also say that it is the
wrong question. The question you need to answer is not how you get
into an exception/interrupt/error handler but how you plan to get out
and return to the previous state. That tells you what information
you need to construct and where during entry. So design the return
mechanism first, then design the entry mechanism to support it.
It sounds like you may be saying 'exception' when you mean exceptions,
interrupts, and errors; my ISA reserves just 16 exception codes,
of which only 12 were actually used. If you are mixing these concepts,
I suggest you keep them separate in your mind.
Exceptions are defined by the ISA: PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of the IO subsystem and
usually model dependent. Exceptions are synchronous and associated with
a specific instruction; interrupts are asynchronous. Interrupts are
maskable; exceptions are usually not (FPU exceptions being the...
exception).
Exceptions are recognized within instructions and are either faults,
which roll back to a precise state, or traps, which roll forward to a
precise state.
Interrupts are recognized between instructions.
Errors are in their own category. Some are synchronous, like detection
of a memory ECC error on read; others are asynchronous, like detection
of an ECC error on write. Some errors are faults or traps; others are
aborts, which are imprecise and may leave parts of a core
in a scrambled state.
Faults and traps are potentially restartable, aborts are not.
Firstly, I recommend peeling errors off and handling them separately.
That leaves exceptions and interrupts.
Exceptions are, or appear to be, recognized at commit/retire.
They partly behave like mispredicted branches.
As I only have up to 16 exception codes then all I need is a
simple method for *calculating* an exception jump address.
This eliminates the need to access memory to get an exception vector.
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
I'll pass on my answer to your question but also say that it is the
wrong question. The question you need to answer is not how you get
into an exception/interrupt/error handler but how you plan to get out
and return to the previous state. That tells you what information
you need to construct and where during entry. So design the return
mechanism first, then design the entry mechanism to support it.
Here, Here.
Should that not be, instead, 'Hear! Hear!'?
On 2024-02-04 3:03 p.m., MitchAlsup1 wrote:
I do not talk to myself very often.
And use different terminology (at least to yourself) in talking about them.
Yes, indeed. How to recognize that somebody else needs to run is a HW
problem; who to run is a SW problem--do not get them mixed up.
{{Thanks to EricP for being so patient with me here}}
*****
Okay, scrapped loading the vector from a table, too complex. Q+ now just jumps to the vector address calculated from the cause code. Reduced the calculation to just 16 branch points. Added a branch point for
‘alternate cause’ when the cause code is greater than 15, which just
lets software process things.
Q+ looks more like an early micro. On reset the branch table is located
in the last 256 bytes of memory, but it is relocatable.
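A sketch of the address calculation this implies, in C; the 16-byte
spacing, the reset base value, and folding 'alternate cause' into the
last slot are my assumptions for illustration:

#include <stdint.h>

/* Relocatable branch-table base; on reset it points at the last
   256 bytes of memory (16 branch points x 16 bytes each, assumed). */
static uint64_t vec_base = 0xFFFFFFFFFFFFFF00ull;

uint64_t exception_target(uint32_t cause)
{
    /* cause codes above 15 share the 'alternate cause' branch point */
    uint32_t slot = (cause > 15) ? 15 : cause;
    return vec_base + (uint64_t)slot * 16;  /* pure arithmetic, no load */
}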
Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
I'll pass on my answer to your question but also say that it is the
wrong question. The question you need to answer is not how you get
into an exception/interrupt/error handler but how you plan to get out
and return to the previous state. That tells you what information
you need to construct and where during entry. So design the return
mechanism first, then design the entry mechanism to support it.
Here, Here.
Should that not be, instead, 'Hear! Hear!'?
If you insist.
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable,
exceptions are usually not (FPU exceptions being the... exception).
Integer OVERFLOW is very often maskable.
Misaligned memory access is often maskable.
No Translation is sometimes maskable: LDs return 0, STs are discarded.
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable,
exceptions are usually not (FPU exceptions being the... exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit, which therefore becomes part of the ABI, is inherited and
preserved across calls, and must be saved and restored.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Better to have specific instructions that fault on integer overflow.
In the past changing that mask bit required a pipeline flush.
That flush can be eliminated but costs extra logic in Decode
that maps the integer instructions + (future) mask bit state to
the same uOps as you would have with specific instructions.
That mask register leads to other questions like what happens to it
on interrupts or exceptions.
In the end that overflow mask bit costs more to implement and costs
more at run time to manage than specific integer overflow instructions.
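A specific trapping add needs no global state at all; its behavior lives
in the opcode. A sketch of both flavors in C, using the GCC/Clang
checked-arithmetic builtin, with raise(SIGFPE) standing in for the
hardware trap:

#include <signal.h>
#include <stdint.h>

/* Model of an add-with-trap instruction: nothing to save or toggle. */
int64_t add_trapping(int64_t a, int64_t b)
{
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))
        raise(SIGFPE);   /* stands in for the overflow trap */
    return r;
}

/* Model of the plain wrapping add, for code that doesn't care. */
int64_t add_wrapping(int64_t a, int64_t b)
{
    return (int64_t)((uint64_t)a + (uint64_t)b);
}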
Misaligned memory access is often maskable.
I don't have this because it implies an ISA-defined global mask-enable
register which becomes part of the ABI. But if I did, then it should be
enabled/disabled from both user and super mode, which leads to all the
same issues as overflow mask bits.
No Translation is sometimes maskable: LDs return 0, STs are discarded.
I tentatively have Load Conditional LDC and Store Conditional STC instructions (nothing to do with atomics) which test if the address == 0
and skip the LD or ST if so.
Those memory-conditional instructions complement the other
register-conditional instructions, Move Conditional MVC and Load
Immediate Conditional LIC.
This was to address the criticisms of the limited usefulness of the
RISC CMOV reg-reg instructions.
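A sketch of the LDC/STC semantics in C, from the description above; what
the destination holds after a skipped load is not stated, so returning 0
is an assumption:

#include <stdint.h>

/* LDC: skip the load when the address is null. */
static inline int64_t ldc(const int64_t *p)
{
    return p ? *p : 0;   /* assumed: a skipped load yields 0 */
}

/* STC: discard the store when the address is null. */
static inline void stc(int64_t *p, int64_t v)
{
    if (p)
        *p = v;
}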
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and restored.
What it requires are languages that define what to do on integer
overflow. ADA is a candidate, C is not.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and
restored.
What it requires are languages that define what to do on integer
overflow. ADA is a candidate, C is not.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
That's entirely architecture and language specific.
Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and
restored.
What it requires are languages that define what to do on integer
overflow. ADA is a candidate, C is not.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
That's entirely architecture and language specific.
When a language has specifications for overflow, et al., the ISA needs
an easy means to perform those activities.
When a language does not, it should not be manipulating those means. So
the ISA needs an easy means to avoid stimulating those activities.
When a language has specifications for overflow, et al., the ISA needs
an easy means to perform those activities.
When a language does not, it should not be manipulating those means. So
the ISA needs an easy means to avoid stimulating those activities.
For us (Burroughs) it was simple. If the application cared about
overflow, it could check for it after arithmetic operations with
a branch-on-overflow instruction. The state toggle was sticky and
only reset by the branch.
Mostly COBOL.
Scott Lurndal wrote:
When a language has specifications for overflow, et al., the ISA needs
an easy means to perform those activities.
When a language does not, it should not be manipulating those means. So
the ISA needs an easy means to avoid stimulating those activities.
For us (Burroughs) it was simple. If the application cared about
overflow, it could check for it after arithmetic operations with
a branch-on-overflow instruction. The state toggle was sticky and
only reset by the branch.
Mostly COBOL.
I was annoyed when writing C on VAX that::
for( i = 1; i ; i <<= 1 )
{ }
overflowed instead of terminating the loop.
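The same loop written over an unsigned type terminates cleanly, since
unsigned arithmetic wraps by definition in C -- a small sketch:

/* Visits each single-bit value 1, 2, 4, ...; the shift then wraps the
   bit off the top, i becomes 0, and the loop ends -- no overflow. */
unsigned int i;
for (i = 1; i != 0; i <<= 1) {
    /* ... */
}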
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and
restored.
What it requires are languages that define what to do on integer overflow. ADA is a candidate, C is not.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
Better to have specific instructions that fault on integer overflow.
In the past changing that mask bit required a pipeline flush.
Seems excessive--just drop the bit in the pipeline and let it flow.
Just like the FP exception enables in a multi-threaded core.
That flush can be eliminated but costs extra logic in Decode
that maps the integer instructions + (future) mask bit state to
the same uOps as you would have with specific instructions.
That mask register leads to other questions like what happens to it
on interrupts or exceptions.
It is part of the auto-saved thread-state--just like the current IP,
current safe-stack pointer, ...
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and
restored.
What it requires are languages that define what to do on integer
overflow.
ADA is a candidate, C is not.
Funny you mention that because it was VAX Ada85 where I first encountered
the problem with the VAX overflow enable flag. VAX Fortran77 optionally allowed overflow detection but we turned that off for other reasons.
Ada85 requires detecting signed integer overflow, but VAX address
calculations are wrapping. This meant the compiler should have toggled
the overflow enable as needed for each instruction. Of course that would
have killed performance, so they didn't; they just left overflow traps on.
It was quite easy to get VAX Ada85, or *any other* integer-overflow-
trapping language, to generate an extraneous overflow trap by simply
declaring an array with its bounds set just so.
When an array had a non-zero index base like [100..200], then rather
than always subtracting the lower bound from the index, compilers used a
biased base, which offsets the array address by the lower-bound index
amount. This is perfectly safe because the array index is checked
independently.
By just setting the array index range accordingly it would cause the
address base-bias calculation to wrap and erroneously trigger an overflow trap on an otherwise perfectly legal declaration.
The *only* solution is to have separate wrapping and trapping instructions.
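To make the biased-base hazard concrete, a C sketch; the bound values
are contrived to trigger the wrap, mirroring the VAX Ada case described
above:

#include <stdint.h>

/* Ada-style  A : array (Low .. High) of 8-byte elements.
   Compilers address A(i) as  biased_base + i*8  where
   biased_base = base - Low*8, avoiding a subtract per access.
   This is safe because i itself is range-checked independently. */
int64_t element_addr(int64_t biased_base, int64_t i)
{
    return biased_base + i * 8;
}

/* The hazard: computing the bias can overflow even though every access
   is legal.  With Low = INT64_MIN/8, Low*8 is INT64_MIN, and
   base - INT64_MIN overflows int64_t -- with overflow traps left on,
   setting up a perfectly valid declaration traps. */
int64_t make_bias(int64_t base, int64_t low)
{
    return base - low * 8;
}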
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
That is language defined.
Modula 2 has all four: signed, unsigned, trapping, and wrapping integers.
And if it were up to me, all languages would have them too.
Unsigned integers just have a different base value than signed.
Whether an operation is wrapping, trapping, or saturating
is an independent property from its base value.
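In C terms, the orthogonality might look like this: one base value
(unsigned 32-bit here) with three independent overflow behaviors,
sketched with the GCC/Clang overflow builtin:

#include <signal.h>
#include <stdint.h>

uint32_t add_u32_wrap(uint32_t a, uint32_t b)
{
    return a + b;   /* unsigned wraps by definition in C */
}

uint32_t add_u32_trap(uint32_t a, uint32_t b)
{
    uint32_t r;
    if (__builtin_add_overflow(a, b, &r))
        raise(SIGFPE);   /* stand-in for a trapping add */
    return r;
}

uint32_t add_u32_sat(uint32_t a, uint32_t b)
{
    uint32_t r;
    return __builtin_add_overflow(a, b, &r) ? UINT32_MAX : r;
}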
Better to have specific instructions that fault on integer overflow.
In the past changing that mask bit required a pipeline flush.
Seems excessive--just drop the bit in the pipeline and let it flow.
Just like the FP exception enables in a multi-threaded core.
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
In the past changing that mask bit required a pipeline flush.
Seems excessive--just drop the bit in the pipeline and let it flow.
Just like the FP exception enables in a multi-threaded core.
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for at
least a couple of decades; I am hopeful that our next iteration (which
just started a couple of months ago with the preliminary work) will
finally allow a more sane/optimizable alternative.
Terje
Terje Mathisen wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
In the past changing that mask bit required a pipeline flush.
Seems excessive--just drop the bit in the pipeline and let it flow.
Just like the FP exception enables in a multi-threaded core.
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for at
least a couple of decades, I am hopeful that our next iteration (which
just started a couple of months ago with the preliminary work) will
finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
EricP wrote:
Terje Mathisen wrote:
EricP wrote:
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for
at least a couple of decades, I am hopeful that our next iteration
(which just started a couple of months ago with the preliminary work)
will finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
I just wonder how they intend to make it backwards compatible with
code using the 754-specified enables and mask access functions.
MitchAlsup1 wrote:
EricP wrote:
Terje Mathisen wrote:
EricP wrote:
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for
at least a couple of decades, I am hopeful that our next iteration
(which just started a couple of months ago with the preliminary work)
will finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
I just wonder how they intend to make it backwards compatible with
code using the 754-specified enables and mask access functions.
To what extent does it need to be compatible?
Say the new design has its own registers, own instructions, ABI,
control and status routines, selected by a compile switch /NEWFPU.
And a routine attribute __newfpucall.
The only issues would come up when a new fpu routine called an old one,
and that might just consist of shuffling information between registers.
If the data registers are the same as the existing ones then that
just leaves making sure the old control and status are set properly.
And while technically the status register is part of the ABI,
does any code use the status bits to pass values between caller and callee?
I would expect most code considers the status bits to be dont_care on
call entry and ignores them after return.
In which case it doesn't need to have status flag compatibility either.
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Terje Mathisen wrote:
EricP wrote:
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for
at least a couple of decades, I am hopeful that our next iteration
(which just started a couple of months ago with the preliminary
work) will finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
I just wonder how they intend to make it backwards compatible with
code using the 754-specified enables and mask access functions.
To what extent does it need to be compatible?
A compiler has to be able to recompile old dusty deck source code that accesses the old-prescribed intrinsic functions:: both 2008 and 2019
lowerFlags()
raiseFlags()
testFlags()
testsavedFlags()
saveallFlags()
restoreFlags()
I don't see how one can be compatible with this part of the specification
if the implementation does not have flags !?! And I don't see how an
implementation with flags can have the desired performance of an
implementation without flags !?!
Up to this point IEEE 754 has been upwards compatible (except for that
2008 MAX and MIN with NaN thing)
Say the new design has its own registers, own instructions, ABI,
control and status routines, selected by a compile switch /NEWFPU.
And a routine attribute __newfpucall.
The only issues would come up when a new fpu routine called an old one,
and that might just consist of shuffling information between registers.
People are still compiling FORTRAN codes from 1965 !!
If the data registers are the same as the existing ones then that
just leaves making sure the old control and status are set properly.
And interrogated properly.
And while technically the status register is part of the ABI,
does any code use the status bits to pass values between caller and
callee?
As far as I know, no, nor the other direction.
I would expect most code considers the status bits to be dont_care on
call entry and ignores them after return.
Status bits are there to catch the "if anything bad happened" over the
last 61 million FP calculations, without having to decorate the code
with millions of checks.
In which case it doesn't need to have status flag compatibility either.
On 2/9/2024 2:14 PM, EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Terje Mathisen wrote:
EricP wrote:
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for
at least a couple of decades, I am hopeful that our next iteration
(which just started a couple of months ago with the preliminary
work) will finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
I just wonder how they intend to make it backwards compatible with
code using the 754-specified enables and mask access functions.
To what extent does it need to be compatible?
Say the new design has its own registers, own instructions, ABI,
control and status routines, selected by a compile switch /NEWFPU.
And a routine attribute __newfpucall.
The only issues would come up when a new fpu routine called an old one,
and that might just consist of shuffling information between registers.
If the data registers are the same as the existing ones then that
just leaves making sure the old control and status are set properly.
And while technically the status register is part of the ABI,
does any code use the status bits to pass values between caller and callee?
I would expect most code considers the status bits to be dont_care on
call entry and ignores them after return.
In which case it doesn't need to have status flag compatibility either.
What could probably work well in my case:
No explicit control registers;
Things like rounding behavior are specified relative to the type or per-operator.
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
BGB wrote:
What could probably work well in my case:
No explicit control registers;
Things like rounding behavior are specified relative to the type or
per-operator.
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
Granted, yeah, this is probably a case where a dynamic rounding mode
makes more sense...
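For reference, the dynamic-mode version of that loop in standard C uses
the real <fenv.h> interface; that RNE..RMI maps onto these four modes in
this order is an assumption:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

static const int modes[4] = {
    FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD
};

void fill(double a[][4], const double *b, const double *c, int max)
{
    int old = fegetround();
    for (int j = 0; j < max; j++)
        for (int i = 0; i < 4; i++) {
            fesetround(modes[i]);   /* the setRoundingMode() above */
            a[j][i] = b[j] + c[j];
        }
    fesetround(old);   /* restore the caller's mode */
}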
BGB wrote:
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
BGB wrote:
What could probably work well in my case:
No explicit control registers;
Things like rounding behavior are specified relative to the type or
per-operator.
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
Granted, yeah, this is probably a case where a dynamic rounding mode
makes more sense...
What's wrong with that switch statement approach?
The question is how often does this come up?
Because if it's just one person who does interval arithmetic every once
in a while, then why can't they have two (or whatever) sets of routines?
EricP wrote:
BGB wrote:
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
BGB wrote:
What could probably work well in my case:
No explicit control registers;
Things like rounding behavior are specified relative to the type or
per-operator.
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
Granted, yeah, this is probably a case where a dynamic rounding mode
makes more sense...
What's wrong with that switch statement approach?
Could be slower; is vastly larger:: 13 inst for the former, versus
about 21 for the latter.
The question is how often does this come up?
Code designed to test the implementation (Verification) uses these things
a lot. Actual user codes: I don't think I have ever seen one used; they
typically test before a calculation rather than after.
Because if it's just one person who does interval arithmetic every once
in a while, then why can't they have two (or whatever) sets of routines?
IA is another animal altogether.
BGB wrote:
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
Granted, yeah, this is probably a case where a dynamic rounding mode
makes more sense...
How would you express::
double function1( double x, double y )
{
return x + y;
}
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = function1( b[j], c[j] );
}
where modification of the rounding mode is out of scope ???
MitchAlsup1 wrote:
EricP wrote:
I would expect most code consider the status bits to be dont_care on
call entry and are ignored after return.
Status bits are there to catch the "if anything bad happened" over the
last 61 million FP calculations, without having to decorate the code
with millions of checks.
In which case it doesn't need to have status flag compatibility either.
Right.
Shades of "fastmath" and "subnormal_is_zero", where the latter isn't
needed, as Mitch has shown us.
Terje
Terje Mathisen wrote:
MitchAlsup1 wrote:
EricP wrote:
I would expect most code consider the status bits to be dont_care on
call entry and are ignored after return.
Status bits are there to catch the "if anything bad happened" over the
last 61 million FP calculations, without having to decorate the code
with millions of checks.
In which case it doesn't need to have status flag compatibility either.
Right.
Shades of "fastmath" and "subnormal_is_zero", where the latter isn't
needed as Mitch have shown us.
Terje
What is this "subnormal_is_zero" option that is not needed?
On x86 there are two control bit options, "denormals are zero" DAZ and
"flush to zero" FTZ.
DAZ affects input operands that are denormal, turning them
into zero before the operation and NOT setting the Denormal status flag.
FTZ affects output results, turning an underflowed result that would
produce a denormal into a zero and SETTING the Precision and Denormal
status flags.
A separate DAZ control option really is only needed on the float load FLD
instruction, to load denorms into a register as zero.
After that, the FTZ control would prevent them from being created.
DAZ and FTZ are currently dynamically settable.
I can't think of any reason code would want to change these dynamically.
In other words, an algorithm is either designed to deal with denorms or
it isn't, and the code would be written accordingly.
So it looks like DAZ could be dropped, and FTZ could be encoded in the
operation instructions like the round mode, which gets rid of the last
of the control bits, except the exception mask bits, which can be dealt
with separately.
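For concreteness, this is how code sets those two x86 bits today, via
the real SSE intrinsics; setting both once at thread start is the
"designed without denorms" case:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE     */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

void disable_denormals(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* FTZ: outputs */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* DAZ: inputs  */
}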
Rather than having exception mask bits on every operation instruction
I was thinking there could be instructions to test or trap on a status
mask.
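Standard C already exposes the test half of that pattern through the
sticky status word (real <fenv.h> calls; a trap-on-mask instruction
would be the new part). A sketch:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* Run a long computation, then ask once whether anything bad happened. */
int compute_and_check(double *out, const double *x, int n)
{
    feclearexcept(FE_ALL_EXCEPT);
    for (int i = 0; i < n; i++)
        out[i] = 1.0 / x[i];   /* may raise sticky status flags */
    return fetestexcept(FE_OVERFLOW | FE_DIVBYZERO | FE_INVALID);
}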
BGB wrote:
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
By the way, from what I read this seems to be how Nvidia CUDA works,
with round mode and control flags like flush-to-zero set at compile
time on individual instructions, or invoked through intrinsics.
(They don't publish their ISA details, just high level documents.)
The GPU also has no float status flags and no exceptions.
From what I read, apparently this can make it difficult to find
errors in code as it has to be run in simulation.
Having the status attached to each float register might solve this issue.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
By the way, from what I read this seems to be how Nvidia CUDA works,
with round mode and control flags like flush-to-zero set at compile
time on individual instructions, or invoked through intrinsics.
(They don't publish their ISA details, just high level documents.)
https://arxiv.org/abs/1903.07486 offers quite a lot of insight.
MitchAlsup1 wrote:
Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
I'll pass on my answer to your question but also say that it is the
wrong question. The question you need to answer is not how you get
into an exception/interrupt/error handler but how you plan to get out
and return to the previous state. That tells you what information
you need to construct and where during entry. So design the return
mechanism first, then design the entry mechanism to support it.
Here, Here.
Should that not be, instead, 'Hear! Hear!'?
If you insist.
I am pretty sure that 'hear! hear!' originated in the British
Parliament (which has amazingly non-regulated discussions!), but that
it has been written/CCed as 'here! here!' in many places/over many
years.