Q+ uses vector tables for exception handling. Exceptions are processed
at the commit stage of the processor, so loading a vector from memory
at the commit stage has some issues: the pipeline must be stalled while
the vector is loaded. Using a cache of vector addresses might help.
There are only 256 vectors in a vector table, so it is small. With a
vector table cache, the vector would be immediately available at the
commit stage most of the time, with single-cycle access. The table
would store the vector and the current process id. If the process id of
the vector does not match the current process id, then the vector table
entry would need to be loaded. The machine would stall until the vector
is loaded. Modifying a vector table entry would need to invalidate the
corresponding entry in the cache.
There are four vector tables, one for each operating mode, but that is
still only 1024 vectors to cache.
The machine-level vector table should be pretty stable, so it would have
single-cycle access most of the time.
A dedicated vector cache would need to be able to have virtual addresses
translated to physical ones, so it might need to borrow an address
translation port from the I$ or D$.
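For concreteness, a minimal C model of the lookup such a cache would
perform; the direct-mapped 256-entry organization, the field widths,
and all names here are assumptions for illustration, not the actual
Q+ design:

#include <stdint.h>
#include <stdbool.h>

#define NUM_VECTORS 256

/* assumed helper: fetches an entry from the in-memory vector table */
extern uint64_t load_vector_from_table(uint8_t cause);

typedef struct {
    uint64_t vector;   /* handler address, privilege bits, etc. */
    uint32_t pid;      /* process id the entry was loaded under */
    bool     valid;
} vcache_entry_t;

static vcache_entry_t vcache[NUM_VECTORS];

/* Hit path: a valid entry whose pid matches gives single-cycle access.
   Miss path: the pipeline stalls while the entry is reloaded. */
uint64_t vcache_lookup(uint8_t cause, uint32_t cur_pid)
{
    vcache_entry_t *e = &vcache[cause];
    if (!e->valid || e->pid != cur_pid) {
        e->vector = load_vector_from_table(cause);  /* stall here */
        e->pid    = cur_pid;
        e->valid  = true;
    }
    return e->vector;
}

/* A store that modifies a vector table entry must invalidate the copy. */
void vcache_invalidate(uint8_t cause)
{
    vcache[cause].valid = false;
}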
On 2024-02-04 11:05 a.m., MitchAlsup1 wrote:
Robert Finch wrote:
Q+ uses vector tables for exception handling. Exceptions are processed
at the commit stage of the processor. So, loading a vector from memory
at the commit stage has some issues. The pipeline must be stalled so
the vector can be loaded. Using a cache of vector addresses might help.
There are only 256 vectors in a vector table. It’s small. With a
vector table cache, the vector would be immediately available at the
commit stage most of the time, single cycle access. The table would
store the vector and the current process id. If the process id of the
vector does not match the current process id then the vector table
entry would need to be loaded. The machine would stall until the
vector is loaded. Modifying the vector table entries would need to
invalidate the vector table entry in the cache.
In my case, under advice of counsel (EricP), I punted!
Since My 66000 has a CALX instruction--a LD to IP and storing of return
address into R0--the HW performs a context switch to the interrupt (or
exception) dispatcher. The dispatcher checks that the vector number
is within range of the appropriate table, and CALXs the ISR (or ESR).
Since many ISRs set up delayed cleanup routines, I made the priority
of pending cleanup routines visible to the SVR (supervisor return)
instruction;...
This means the table can be anywhere, on any boundary, and any size;
while simplifying "special" control transfers.
The common path through the Dispatchers is 4 instructions, and these are
placed on cache line boundaries. SW does not have to check for
pending softIRQs or DPCs (or a host of other medium-high-priority
pending threads): SVR transfers control to the highest-priority pending
thread that can run (affinity) on this core.
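A minimal sketch in C of that dispatcher common path as I read the
description; the table layout, the names, and the bad-vector fallback
are assumptions, not My 66000 specifics (CALX and SVR are the
instructions named above):

typedef void (*isr_t)(void);

struct dispatch_table {
    unsigned count;   /* number of valid vectors in the SW-owned table */
    isr_t   *isr;     /* handler entry points                          */
};

void dispatch(struct dispatch_table *t, unsigned vector)
{
    if (vector >= t->count)   /* range-check the vector number          */
        vector = 0;           /* assumed: slot 0 is a bad-vector handler */
    t->isr[vector]();         /* the CALX: load the target and call it   */
    /* an SVR here would resume the highest-priority pending thread
       that can run (affinity) on this core */
}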
There are four vector tables, one for each operating mode, but that is
still only 1024 vectors to cache.
More than a cache line starts to become problematic.
The machine level vector table should be pretty stable, so would have
single cycle access most of the time.
A dedicated vector cache would need to be able have virtual addresses
translated to physical ones. So, it might need to borrow an address
translation port from the I$ or D$.
Or, you could punt and make an efficient means to transfer control
through a table of SW's making.
I would try punting, but I would likely kick the ball sideways :) There
is a memory-indirect call instruction, but I realized I could not use it
because it is micro-coded and uses (hidden) registers. It is just an
ordinary load followed by a jump-register. It would stomp on registers
being used by the app if it were used to invoke an ISR. So, I must build
something additional, or perhaps re-write the indirect call as a state
machine. Either way, the machine is going to stall at an exception.
Maybe I should make it an indirect-call cache, more general purpose. I
can see why many RISC machines just jump to the address of the vector
rather than attempting to load it, but Q+ ISR vectors also include a
privilege level. I want the CPU to load the vector so that vectors may
be implemented as simple gateways. There may be more info associated
with them at some point.
The cache would be loading only the cache-line with the vector it needs,
so just a single cache line load is required.
I suppose I could add another port to the I$ to do loads through the I$.
It might be simpler than building another cache.
Robert Finch wrote:
Q+ uses vector tables for exception handling. Exceptions are processed
at the commit stage of the processor. So, loading a vector from memory
at the commit stage has some issues. The pipeline must be stalled so the
vector can be loaded. Using a cache of vector addresses might help.
There are only 256 vectors in a vector table. It’s small. With a vector
table cache, the vector would be immediately available at the commit
stage most of the time, single cycle access. The table would store the
vector and the current process id. If the process id of the vector does
not match the current process id then the vector table entry would need
to be loaded. The machine would stall until the vector is loaded.
Modifying the vector table entries would need to invalidate the vector
table entry in the cache.
There are four vector tables, one for each operating mode, but that is
still only 1024 vectors to cache.
The machine level vector table should be pretty stable, so would have
single cycle access most of the time.
A dedicated vector cache would need to be able have virtual addresses
translated to physical ones. So, it might need to borrow an address
translation port from the I$ or D$.
I'll pass on my answer to your question but also say that it is the
wrong question. The question you need to answer is not how you get
into an exception/interrupt/error handler but how you plan to get out
and return to the previous state. That tells you what information
you need to construct and where during entry. So design the return
mechanism first, then design the entry mechanism to support it.
It sounds like you may be saying 'exception' when you mean exceptions,
interrupts, and errors; my ISA reserves just 16 exception codes,
of which only 12 were actually used. If you are mixing these concepts,
I suggest you keep them separate in your mind.
Exceptions are defined by the ISA: PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of the IO subsystem and
usually model dependent. Exceptions are synchronous and associated with
a specific instruction; interrupts are asynchronous. Interrupts are
maskable; exceptions are usually not (FPU exceptions being the...
exception).
Exceptions are recognized within instructions and are either faults,
which roll back to a precise state, or traps, which roll forward to a
precise state.
Interrupts are recognized between instructions.
Errors are in their own category. Some are synchronous, like detection
of a memory ECC error on read; others are asynchronous, like detection
of an ECC error on write. Some errors are faults or traps; others are
aborts, which are imprecise and may leave parts of a core
in a scrambled state.
Faults and traps are potentially restartable, aborts are not.
Firstly, I recommend peeling errors off and handling them separately.
That leaves exceptions and interrupts.
Exceptions are, or appear to be, recognized at commit/retire.
They partly behave like mispredicted branches.
As I only have up to 16 exception codes then all I need is a
simple method for *calculating* an exception jump address.
This eliminates the need to access memory to get an exception vector.
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
I'll pass on my answer to your question but also say that it is the
wrong question. The question you need to answer is not how you get
into an exception/interrupt/error handler but how you plan to get out
and return to the previous state. That tells you what information
you need to construct and where during entry. So design the return
mechanism first, then design the entry mechanism to support it.
Here, Here.
Should that not be, instead, 'Hear! Hear!'?
On 2024-02-04 3:03 p.m., MitchAlsup1 wrote:
I do not talk to myself very often.
And use different terminology (at least to yourself) in talking about them.
Yes, indeed. How to recognize that somebody else needs to run is a HW
problem; who to run is a SW problem--do not get them mixed up.
{{Thanks to EricP for being so patient with me here}}
*****
Okay, scrapped loading the vector from a table, too complex. Q+ now just jumps to the vector address calculated from the cause code. Reduced the calculation to just 16 branch points. Added a branch point for
‘alternate cause’ when the cause code is greater than 15, which just
lets software process things.
Q+ looks more like an early micro. On reset the branch table is located
in the last 256 bytes of memory, but it is relocatable.
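A sketch of the address calculation this implies, in C; the 16-byte
spacing, the reset base value, and folding 'alternate cause' into the
last slot are my assumptions for illustration:

#include <stdint.h>

/* Relocatable branch-table base; on reset it points at the last
   256 bytes of memory (16 branch points x 16 bytes each, assumed). */
static uint64_t vec_base = 0xFFFFFFFFFFFFFF00ull;

uint64_t exception_target(uint32_t cause)
{
    /* cause codes above 15 share the 'alternate cause' branch point */
    uint32_t slot = (cause > 15) ? 15 : cause;
    return vec_base + (uint64_t)slot * 16;  /* pure arithmetic, no load */
}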
Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
I'll pass on my answer to your question but also say that it is the
wrong question. The question you need to answer is not how you get
into an exception/interrupt/error handler but how you plan to get out
and return to the previous state. That tells you what information
you need to construct and where during entry. So design the return
mechanism first, then design the entry mechanism to support it.
Here, Here.
Should that not be, instead, 'Hear! Hear!'?
If you insist.
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable,
exceptions are usually not (FPU exceptions being the... exception).
Integer OVERFLOW is very often maskable.
Misaligned memory access is often maskable.
No Translation is sometimes maskable: LDs return 0, STs are discarded.
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable,
exceptions are usually not (FPU exceptions being the... exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit, which therefore becomes part of the ABI, is inherited and
preserved across calls, and must be saved and restored.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Better to have specific instructions that fault on integer overflow.
In the past changing that mask bit required a pipeline flush.
That flush can be eliminated but costs extra logic in Decode
that maps the integer instructions + (future) mask bit state to
the same uOps as you would have with specific instructions.
That mask register leads to other questions like what happens to it
on interrupts or exceptions.
In the end that overflow mask bit costs more to implement and costs
more at run time to manage than specific integer overflow instructions.
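A specific trapping add needs no global state at all; its behavior lives
in the opcode. A sketch of both flavors in C, using the GCC/Clang
checked-arithmetic builtin, with raise(SIGFPE) standing in for the
hardware trap:

#include <signal.h>
#include <stdint.h>

/* Model of an add-with-trap instruction: nothing to save or toggle. */
int64_t add_trapping(int64_t a, int64_t b)
{
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))
        raise(SIGFPE);   /* stands in for the overflow trap */
    return r;
}

/* Model of the plain wrapping add, for code that doesn't care. */
int64_t add_wrapping(int64_t a, int64_t b)
{
    return (int64_t)((uint64_t)a + (uint64_t)b);
}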
Misaligned memory access is often maskable.
I don't have this because it implies an ISA-defined global mask-enable
register which becomes part of the ABI. But if I did, then it should be
enabled/disabled from both user and super mode, which leads to all the
same issues as overflow mask bits.
No Translation is sometimes maskable: LDs return 0, STs are discarded.
I tentatively have Load Conditional LDC and Store Conditional STC instructions (nothing to do with atomics) which test if the address == 0
and skip the LD or ST if so.
Those memory-conditional instructions complement the other
register-conditional instructions, Move Conditional MVC and Load
Immediate Conditional LIC.
This was to address the criticisms of the limited usefulness of the
RISC CMOV reg-reg instructions.
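A sketch of the LDC/STC semantics in C, from the description above; what
the destination holds after a skipped load is not stated, so returning 0
is an assumption:

#include <stdint.h>

/* LDC: skip the load when the address is null. */
static inline int64_t ldc(const int64_t *p)
{
    return p ? *p : 0;   /* assumed: a skipped load yields 0 */
}

/* STC: discard the store when the address is null. */
static inline void stc(int64_t *p, int64_t v)
{
    if (p)
        *p = v;
}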
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and restored.
What it requires are languages that define what to do on integer
overflow. ADA is a candidate, C is not.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and
restored.
What it requires are languages that define what to do on integer
overflow. ADA is a candidate, C is not.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
That's entirely architecture and language specific.
Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and
restored.
What it requires are languages that define what to do on integer
overflow. ADA is a candidate, C is not.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
That's entirely architecture and language specific.
When a language has specifications for overflow, et al., the ISA needs
an easy means to perform those activities.
When a language does not, it should not be manipulating those means. So
the ISA needs an easy means to avoid stimulating those activities.
When a language has specifications for overflow, et al., the ISA needs
an easy means to perform those activities.
When a language does not, it should not be manipulating those means. So
the ISA needs an easy means to avoid stimulating those activities.
For us (Burroughs) it was simple. If the application cared about
overflow, it could check for it after arithmetic operations with
a branch-on-overflow instruction. The state toggle was sticky and
only reset by the branch.
Mostly COBOL.
Scott Lurndal wrote:
When a language has specifications for overflow, et al., the ISA needs
an easy means to perform those activities.
When a language does not, it should not be manipulating those means. So
the ISA needs an easy means to avoid stimulating those activities.
For us (Burroughs) it was simple. If the application cared about
overflow, it could check for it after arithmetic operations with
a branch-on-overflow instruction. The state toggle was sticky and
only reset by the branch.
Mostly COBOL.
I was annoyed when writing C on VAX that::
for( i = 1; i ; i <<= 1 )
{ }
overflowed instead of terminating the loop.
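The same loop written over an unsigned type terminates cleanly, since
unsigned arithmetic wraps by definition in C -- a small sketch:

/* Visits each single-bit value 1, 2, 4, ...; the shift then wraps the
   bit off the top, i becomes 0, and the loop ends -- no overflow. */
unsigned int i;
for (i = 1; i != 0; i <<= 1) {
    /* ... */
}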
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and
restored.
What it requires are languages that define what to do on integer overflow. ADA is a candidate, C is not.
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
Better to have specific instructions that fault on integer overflow.
In the past changing that mask bit required a pipeline flush.
Seems excessive--just drop the bit in the pipeline and let it flow.
Just like the FP exception enables in a multi-threaded core.
That flush can be eliminated but costs extra logic in Decode
that maps the integer instructions + (future) mask bit state to
the same uOps as you would have with specific instructions.
That mask register leads to other questions like what happens to it
on interrupts or exceptions.
It is part of the auto-saved thread-state--just like the current IP,
current safe-stack pointer, ...
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Exceptions are defined by the ISA, PAGE_FAULT, DIV_ZERO, SINGLE_STEP,
INTEGER_OVERFLOW, etc. Interrupts are part of IO subsystem and usually
model dependent. Exceptions are synchronous and associated with a
specific instruction, interrupts are asynchronous. Interrupts are
maskable, exceptions are usually not (FPU exceptions being the...
exception).
Integer OVERFLOW is very often maskable.
This causes many problems: it requires a global register to hold the
mask bit which therefore becomes part of the ABI and therefore is
inherited and preserved by calls, and therefore must be saved and
restored.
What it requires are languages that define what to do on integer
overflow.
ADA is a candidate, C is not.
Funny you mention that because it was VAX Ada85 where I first encountered
the problem with the VAX overflow enable flag. VAX Fortran77 optionally allowed overflow detection but we turned that off for other reasons.
Ada85 requires detecting signed integer overflow, but VAX address
calculations are wrapping. This meant the compiler should have toggled
the overflow enable as needed for each instruction. Of course that would
have killed performance, so they didn't; they just left overflow traps on.
It was quite easy to get VAX Ada85, or *any other* integer-overflow-
trapping language, to generate an extraneous overflow trap by simply
declaring an array with its bounds set just so.
When an array had a non-zero index base like [100..200], then rather
than always subtracting the lower bound from the index, compilers used a
biased base, which offsets the array address by the lower-bound index
amount. This is perfectly safe because the array index is checked
independently.
By just setting the array index range accordingly it would cause the
address base-bias calculation to wrap and erroneously trigger an overflow trap on an otherwise perfectly legal declaration.
The *only* solution is to have separate wrapping and trapping instructions.
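To make the biased-base hazard concrete, a C sketch; the bound values
are contrived to trigger the wrap, mirroring the VAX Ada case described
above:

#include <stdint.h>

/* Ada-style  A : array (Low .. High) of 8-byte elements.
   Compilers address A(i) as  biased_base + i*8  where
   biased_base = base - Low*8, avoiding a subtract per access.
   This is safe because i itself is range-checked independently. */
int64_t element_addr(int64_t biased_base, int64_t i)
{
    return biased_base + i * 8;
}

/* The hazard: computing the bias can overflow even though every access
   is legal.  With Low = INT64_MIN/8, Low*8 is INT64_MIN, and
   base - INT64_MIN overflows int64_t -- with overflow traps left on,
   setting up a perfectly valid declaration traps. */
int64_t make_bias(int64_t base, int64_t low)
{
    return base - low * 8;
}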
Also code that wants integer overflow detection is intermixed with
code that does not, so it has to keep toggling the enable bit.
Signed integer calculations can overflow, unsigned cannot.
That is language defined.
Modula 2 has all four: signed, unsigned, trapping, and wrapping integers.
And if it were up to me, all languages would have them too.
Unsigned integers just have a different base value than signed.
Whether an operation is wrapping, trapping, or saturating
is an independent property from its base value.
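In C terms, the orthogonality might look like this: one base value
(unsigned 32-bit here) with three independent overflow behaviors,
sketched with the GCC/Clang overflow builtin:

#include <signal.h>
#include <stdint.h>

uint32_t add_u32_wrap(uint32_t a, uint32_t b)
{
    return a + b;   /* unsigned wraps by definition in C */
}

uint32_t add_u32_trap(uint32_t a, uint32_t b)
{
    uint32_t r;
    if (__builtin_add_overflow(a, b, &r))
        raise(SIGFPE);   /* stand-in for a trapping add */
    return r;
}

uint32_t add_u32_sat(uint32_t a, uint32_t b)
{
    uint32_t r;
    return __builtin_add_overflow(a, b, &r) ? UINT32_MAX : r;
}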
Better to have specific instructions that fault on integer overflow.
In the past changing that mask bit required a pipeline flush.
Seems excessive--just drop the bit in the pipeline and let it flow.
Just like the FP exception enables in a multi-threaded core.
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
In the past changing that mask bit required a pipeline flush.
Seems excessive--just drop the bit in the pipeline and let it flow.
Just like the FP exception enables in a multi-threaded core.
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for at
least a couple of decades; I am hopeful that our next iteration (which
just started a couple of months ago with the preliminary work) will
finally allow a more sane/optimizable alternative.
Terje
Terje Mathisen wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
In the past changing that mask bit required a pipeline flush.
Seems excessive--just drop the bit in the pipeline and let it flow.
Just like the FP exception enables in a multi-threaded core.
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for at
least a couple of decades, I am hopeful that our next iteration (which
just started a couple of months ago with the preliminary work) will
finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
EricP wrote:
Terje Mathisen wrote:
EricP wrote:
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for
at least a couple of decades, I am hopeful that our next iteration
(which just started a couple of months ago with the preliminary work)
will finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
I just wonder how they intend to make it backwards compatible with
code using the 754-specified enables and mask access functions.
MitchAlsup1 wrote:
EricP wrote:
Terje Mathisen wrote:
EricP wrote:
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for
at least a couple of decades, I am hopeful that our next iteration
(which just started a couple of months ago with the preliminary work)
will finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
I just wonder how they intend to make it backwards compatible with
code using the 754-specified enables and mask access functions.
To what extent does it need to be compatible?
Say the new design has its own registers, own instructions, ABI,
control and status routines, selected by a compile switch /NEWFPU.
And a routine attribute __newfpucall.
The only issues would come up when a new fpu routine called an old one,
and that might just consist of shuffling information between registers.
If the data registers are the same as the existing ones then that
just leaves making sure the old control and status are set properly.
And while technically the status register is part of the ABI,
does any code use the status bits to pass values between caller and callee?
I would expect most code considers the status bits to be dont_care on
call entry and ignores them after return.
In which case it doesn't need to have status flag compatibility either.
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Terje Mathisen wrote:
EricP wrote:
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for
at least a couple of decades, I am hopeful that our next iteration
(which just started a couple of months ago with the preliminary
work) will finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
I just wonder how they intend to make it backwards compatible with
code using the 754-specified enables and mask access functions.
To what extent does it need to be compatible?
A compiler has to be able to recompile old dusty deck source code that accesses the old-prescribed intrinsic functions:: both 2008 and 2019
lowerFlags()
raiseFlags()
testFlags()
testsavedFlags()
saveallFlags()
restoreFlags()
I don't see how one can be compatible with this part of the specification
if the implementation does not have flags !?! And I don't see how an
implementation with flags can have the desired performance of an
implementation without flags !?!
Up to this point IEEE 754 has been upwards compatible (except for that
2008 MAX and MIN with NaN thing)
Say the new design has its own registers, own instructions, ABI,
control and status routines, selected by a compile switch /NEWFPU.
And a routine attribute __newfpucall.
The only issues would come up when a new fpu routine called an old one,
and that might just consist of shuffling information between registers.
People are still compiling FORTRAN codes from 1965 !!
If the data registers are the same as the existing ones then that
just leaves making sure the old control and status are set properly.
And interrogated properly.
And while technically the status register is part of the ABI,
does any code use the status bits to pass values between caller and
callee?
As far as I know, no, nor the other direction.
I would expect most code considers the status bits to be dont_care on
call entry and ignores them after return.
Status bits are there to catch the "if anything bad happened" over the
last 61 million FP calculations, without having to decorate the code
with millions of checks.
In which case it doesn't need to have status flag compatibility either.
On 2/9/2024 2:14 PM, EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Terje Mathisen wrote:
EricP wrote:
You still have to save the flags initial state and restore it
and that has memory and data flow dependencies.
IEEE 754 requires such global flags but there is no reason for
anything else to follow that ill conceived model.
The ieee754 working group have been fully aware of this problem for
at least a couple of decades, I am hopeful that our next iteration
(which just started a couple of months ago with the preliminary
work) will finally allow a more sane/optimizable alternative.
Terje
Any poop on what kinds of changes they are discussing?
Or maybe they prefer deliberations to be private.
I just wonder how they intend to make it backwards compatible with
code using the 754-specified enables and mask access functions.
To what extent does it need to be compatible?
Say the new design has its own registers, own instructions, ABI,
control and status routines, selected by a compile switch /NEWFPU.
And a routine attribute __newfpucall.
The only issues would come up when a new fpu routine called an old one,
and that might just consist of shuffling information between registers.
If the data registers are the same as the existing ones then that
just leaves making sure the old control and status are set properly.
And while technically the status register is part of the ABI,
does any code use the status bits to pass values between caller and callee?
I would expect most code considers the status bits to be dont_care on
call entry and ignores them after return.
In which case it doesn't need to have status flag compatibility either.
What could probably work well in my case:
No explicit control registers;
Things like rounding behavior are specified relative to the type or per-operator.
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
BGB wrote:
What could probably work well in my case:
No explicit control registers;
Things like rounding behavior are specified relative to the type or
per-operator.
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
Granted, yeah, this is probably a case where a dynamic rounding mode
makes more sense...
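For reference, the dynamic-mode version of that loop in standard C uses
the real <fenv.h> interface; that RNE..RMI maps onto these four modes in
this order is an assumption:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

static const int modes[4] = {
    FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD
};

void fill(double a[][4], const double *b, const double *c, int max)
{
    int old = fegetround();
    for (int j = 0; j < max; j++)
        for (int i = 0; i < 4; i++) {
            fesetround(modes[i]);   /* the setRoundingMode() above */
            a[j][i] = b[j] + c[j];
        }
    fesetround(old);   /* restore the caller's mode */
}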
BGB wrote:
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
BGB wrote:
What could probably work well in my case:
No explicit control registers;
Things like rounding behavior are specified relative to the type or
per-operator.
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
Granted, yeah, this is probably a case where a dynamic rounding mode
makes more sense...
What's wrong with that switch statement approach?
The question is how often does this come up?
Because if it's just one person who does interval arithmetic every once
in a while, then why can't they have two (or whatever) sets of routines?
EricP wrote:
BGB wrote:
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
BGB wrote:
What could probably work well in my case:
No explicit control registers;
Things like rounding behavior are specified relative to the type or
per-operator.
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
Granted, yeah, this is probably a case where a dynamic rounding mode
makes more sense...
What's wrong with that switch statement approach?
Could be slower; is vastly larger:: 13 inst for the former, versus
about 21 for the latter.
The question is how often does this come up?
Code designed to test the implementation (Verification) uses these things
a lot. Actual user codes: I don't think I have ever seen one used; they
typically test before a calculation rather than after.
Because if it's just one person who does interval arithmetic every once
in a while, then why can't they have two (or whatever) sets of routines?
IA is another animal altogether.
BGB wrote:
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
Say:
[[rounding(truncate)]] double x, y, z;
...
z=3*x+y; //performed with truncate rounding.
Or:
typedef [[rounding(truncate)]] double dbl_trc;
...
z=3*(dbl_trc)x+(dbl_trc)y; //per-operator rounding.
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
Granted, yeah, this is probably a case where a dynamic rounding mode
makes more sense...
How would you express::
double function1( double x, double y )
{
return x + y;
}
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = function1( b[j], c[j] );
}
where modification of the rounding mode is out of scope ???
MitchAlsup1 wrote:
EricP wrote:
I would expect most code consider the status bits to be dont_care on
call entry and are ignored after return.
Status bits are there to catch the "if anything bad happened" over the
last 61 million FP calculations, without having to decorate the code
with millions of checks.
In which case it doesn't need to have status flag compatibility either.
Right.
Shades of "fastmath" and "subnormal_is_zero", where the latter isn't
needed, as Mitch has shown us.
Terje
Terje Mathisen wrote:
MitchAlsup1 wrote:
EricP wrote:
I would expect most code consider the status bits to be dont_care on
call entry and are ignored after return.
Status bits are there to catch the "if anything bad happened" over the
last 61 million FP calculations, without having to decorate the code
with millions of checks.
In which case it doesn't need to have status flag compatibility either.
Right.
Shades of "fastmath" and "subnormal_is_zero", where the latter isn't
needed as Mitch have shown us.
Terje
What is this "subnormal_is_zero" option that is not needed?
On x86 there are two control bit options, "denormals are zero" DAZ and
"flush to zero" FTZ.
DAZ affects input operands that are denormal, turning them
into zero before the operation and NOT setting the Denormal status flag.
FTZ affects output results, turning an underflowed result that would
produce a denormal into a zero and SETTING the Precision and Denormal
status flags.
A separate DAZ control option really is only needed on the float load FLD
instruction, to load denorms into a register as zero.
After that, the FTZ control would prevent them from being created.
DAZ and FTZ are currently dynamically settable.
I can't think of any reason code would want to change these dynamically.
In other words, an algorithm is either designed to deal with denorms or
it isn't, and the code would be written accordingly.
So it looks like DAZ could be dropped, and FTZ could be encoded in the
operation instructions like the round mode, which gets rid of the last
of the control bits, except the exception mask bits, which can be dealt
with separately.
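For concreteness, this is how code sets those two x86 bits today, via
the real SSE intrinsics; setting both once at thread start is the
"designed without denorms" case:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE     */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

void disable_denormals(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* FTZ: outputs */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* DAZ: inputs  */
}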
Rather than having exception mask bits on every operation instruction
I was thinking there could be instructions to test or trap on a status
mask.
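Standard C already exposes the test half of that pattern through the
sticky status word (real <fenv.h> calls; a trap-on-mask instruction
would be the new part). A sketch:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

/* Run a long computation, then ask once whether anything bad happened. */
int compute_and_check(double *out, const double *x, int n)
{
    feclearexcept(FE_ALL_EXCEPT);
    for (int i = 0; i < n; i++)
        out[i] = 1.0 / x[i];   /* may raise sticky status flags */
    return fetestexcept(FE_OVERFLOW | FE_DIVBYZERO | FE_INVALID);
}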
BGB wrote:
On 2/9/2024 5:36 PM, MitchAlsup1 wrote:
How would you express::
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
setRoundingMode( i );
a[j][i] = b[j] + c[j];
}
???
double x, y, z;
for( j = 0; j < MAX; j++ )
for( i = RNE; i < RMI; i++ )
{
x = b[j];
y = c[j];
switch(i)
{
case 0:
z=(([[rounding(nearest)]] double)x)+
(([[rounding(nearest)]] double)y);
break;
case 1:
z=(([[rounding(truncate)]] double)x)+
(([[rounding(truncate)]] double)y);
break;
case 2:
z=(([[rounding(pos_inf)]] double)x)+
(([[rounding(pos_inf)]] double)y);
break;
case 3:
z=(([[rounding(neg_inf)]] double)x)+
(([[rounding(neg_inf)]] double)y);
break;
}
a[j][i] = z;
}
By the way, from what I read this seems to be how Nvidia CUDA works,
with round mode and control flags like flush-to-zero set at compile
time on individual instructions, or invoked through intrinsics.
(They don't publish their ISA details, just high level documents.)
The GPU also has no float status flags and no exceptions.
From what I read, apparently this can make it difficult to find
errors in code as it has to be run in simulation.
Having the status attached to each float register might solve this issue.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
By the way, from what I read this seems to be how Nvidia CUDA works,
with round mode and control flags like flush-to-zero set at compile
time on individual instructions, or invoked through intrinsics.
(They don't publish their ISA details, just high level documents.)
https://arxiv.org/abs/1903.07486 offers quite a lot of insight.
MitchAlsup1 wrote:
Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
EricP wrote:
I'll pass on my answer to your question but also say that it is the
wrong question. The question you need to answer is not how you get
into an exception/interrupt/error handler but how you plan to get out
and return to the previous state. That tells you what information
you need to construct and where during entry. So design the return
mechanism first, then design the entry mechanism to support it.
Here, Here.
Should that not be, instead, 'Hear! Hear!'?
If you insist.
I am pretty sure that 'hear! hear!' originated in the British
Parliament (which has amazingly non-regulated discussions!), but that
it has been written/CCed as 'here! here!' in many places/over many
years.