• Microarchitectural support for counting

  • From Anton Ertl@21:1/5 to All on Thu Oct 3 14:00:55 2024
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances. But for heavily
    multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.
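
    To make the trade-off concrete, here is a minimal C sketch of that
    software mitigation (my own illustration, with made-up names): each
    thread increments its own cache-line-padded slot, and a reader sums
    the slots. Every logical counter now costs MAX_THREADS cache lines,
    which is exactly the overhead that becomes substantial with many
    counters and many cores.

        #include <stdatomic.h>
        #include <stdint.h>

        #define MAX_THREADS 256
        #define CACHE_LINE  64

        /* One slot per thread, padded to a cache line so that slots of
           different threads never share a line (no false sharing). */
        struct counter_slot {
            _Alignas(CACHE_LINE) atomic_uint_fast64_t count;
        };

        static struct counter_slot profile_counter[MAX_THREADS];

        /* Hot path: each thread touches only its own line. */
        static inline void counter_inc(int tid)
        {
            atomic_fetch_add_explicit(&profile_counter[tid].count, 1,
                                      memory_order_relaxed);
        }

        /* Cold path: a profiler read sums all the slots. */
        static uint64_t counter_read(void)
        {
            uint64_t sum = 0;
            for (int i = 0; i < MAX_THREADS; i++)
                sum += atomic_load_explicit(&profile_counter[i].count,
                                            memory_order_relaxed);
            return sum;
        }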

    So I thought up the idea of doing a similar thing in the caches (as a
    hardware mechanism), and Rene Mueller indicated that he had been
    thinking in the same direction, but his hardware people were not
    interested.

    In any case, here's how this might work. First, you need an
    add-to-memory instruction that does not need to know anything about
    the result (so the existing AMD64 instruction is not enough, thanks to
    it producing a flags result). Have cache consistency states
    "accumulator64", "accumulator32", "accumulator16", "accumulator8" (in
    addition to MOESI), which indicate that the cache line contains
    naturally aligned 64-bit, 32-bit, 16-bit, or 8-bit counters
    respectively. Not all of these states need to be supported. You also
    add a state shared-plus-accumulators (SPA).
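
    For illustration only (this mapping is mine, not something from the
    paper): in C, the operation in question is a relaxed fetch-add whose
    result is never looked at. On RISC-V such an add can be emitted as an
    AMOADD whose destination is x0, so neither the old value nor any flags
    have to come back to the core, which is the property a memory-side
    accumulator needs; AMD64's LOCK ADD, by contrast, always produces a
    flags result.

        #include <stdint.h>

        /* Counter increment as the proposal needs it: the program never
           uses the old value, so relaxed ordering is enough and the ISA
           could, in principle, perform the add entirely at the memory
           side. */
        static inline void count_event(uint64_t *ctr)
        {
            /* Result deliberately discarded. */
            __atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
        }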

    The architectural value of each counter in such a line is the sum of
    its value in the SPA line (or in main memory) and its values in all
    the accumulator lines.

    An n-bit add-to-memory adds its increment to an accumulator-n line.
    If there is no such line, one is allocated and initialized to zero,
    and the add then stores the increment in the corresponding part. When
    allocating the accumulator line, an existing line may be forced to
    switch to SPA, and/or may be moved outwards in the cache hierarchy.
    But if the add applies to some local line that is already in exclusive
    or modified state, it's probably better to just update that line
    without any accumulator stuff.

    If there is a read access to such memory, all the accumulator lines
    are summed up and added to the SPA line (or main memory); this is
    relatively expensive, so this whole thing makes most sense if the
    programmer can arrange to have many additions relative to reads or
    writes. The SPA line is shared, so we keep its contents and the
    contents of the accumulator lines unchanged.

    For writes, various options are possible; the catch-all would be to
    add all the accumulator lines for that address to one of the SPA lines
    of that memory, overwrite the memory there, broadcast the new line
    contents to the other SPA lines or invalidate them, and zero or
    invalidate all the accumulator lines. Another option is to write the
    value to one SPA copy (and invalidate the other SPA lines), and zero
    the corresponding bytes in the accumulator lines; this only works if
    there are no accumulators wider than the write.

    You will typically support the accumulator states only in L1 and maybe
    L2; if an accumulator line gets cool enough to be evicted from there,
    it can be added to the SPA line or to main memory.
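
    To pin down the intended semantics, here is a toy software model of
    the scheme (a sketch of the behaviour described above, not a hardware
    design): each core owns at most one accumulator copy of a counter
    line, the architectural value is the SPA copy plus the sum of all
    accumulator copies, and eviction folds an accumulator back into the
    SPA copy.

        #include <stdint.h>

        #define NCORES    8
        #define NCOUNTERS 8   /* 64-bit counters in a 64-byte line */

        /* One counter line in shared-plus-accumulators (SPA) form. */
        struct spa_line {
            uint64_t spa[NCOUNTERS];          /* SPA copy / memory image */
            uint64_t acc[NCORES][NCOUNTERS];  /* per-core accumulator64  */
            int      acc_valid[NCORES];
        };

        /* add-to-memory: the increment lands only in the local
           accumulator line, allocated and zeroed on first use. */
        static void line_add(struct spa_line *l, int core, int idx,
                             uint64_t inc)
        {
            if (!l->acc_valid[core]) {
                for (int i = 0; i < NCOUNTERS; i++)
                    l->acc[core][i] = 0;
                l->acc_valid[core] = 1;
            }
            l->acc[core][idx] += inc;
        }

        /* read: architectural value = SPA value + all accumulators;
           nothing is modified, as described above. */
        static uint64_t line_read(const struct spa_line *l, int idx)
        {
            uint64_t v = l->spa[idx];
            for (int c = 0; c < NCORES; c++)
                if (l->acc_valid[c])
                    v += l->acc[c][idx];
            return v;
        }

        /* eviction of a cooled-off accumulator line: fold it into the
           SPA copy (or main memory) and drop it. */
        static void line_evict_acc(struct spa_line *l, int core)
        {
            for (int i = 0; i < NCOUNTERS; i++) {
                l->spa[i] += l->acc[core][i];
                l->acc[core][i] = 0;
            }
            l->acc_valid[core] = 0;
        }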

    How do we enter these states? Given that every common architecture
    would need special instructions for using them, the use of these
    instructions on a cache line that is shared, or that is modified or
    exclusive in another core, would be a hint that using these states is
    a good idea.

    This is all a rather elaborate mechanism. Are counters in
    multi-threaded programs used enough (and read rarely enough) to
    justify the cost of implementing it? For the HotSpot application, the
    eventual answer was that they live with the cost of cache contention
    for the programs that have that problem. After some minutes the hot
    parts of the program are optimized, and cache contention is no longer
    a problem. But maybe there are some other applications that do more
    long-time accumulating that would benefit.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Brett@21:1/5 to Anton Ertl on Thu Oct 3 19:34:30 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    <snip>

    - anton

    When a modern CPU takes an interrupt it does not suspend the current
    processing; instead it just starts fetching code from the new process
    while letting computations in the pipeline continue to completion. The
    OoOe can have 1000 instructions in flight. At some point the resources
    start getting dedicated to the new process, and the old process is
    drained out or maybe actually stopped.

    You can see how this makes a royal mess of interpreting program
    counters with high context switching involved. On average things will
    be fine, but when you zoom in you will find some insane counter values
    in both directions for code snippets caught in a context swap.

  • From Scott Lurndal@21:1/5 to Brett on Thu Oct 3 22:05:31 2024
    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    <snip>

    This is all a rather elaborate mechanism. Are counters in
    multi-threaded programs used enough (and read rarely enough) to
    justify the cost of implementing it? For the HotSpot application, the
    eventual answer was that they live with the cost of cache contention
    for the programs that have that problem. After some minutes the hot
    parts of the program are optimized, and cache contention is no longer
    a problem. But maybe there are some other applications that do more
    long-time accumulating that would benefit.

    - anton

    >When a modern CPU takes an interrupt it does not suspend the current
    >processing; instead it just starts fetching code from the new process
    >while letting computations in the pipeline continue to completion. The
    >OoOe can have 1000 instructions in flight. At some point the resources
    >start getting dedicated to the new process, and the old process is
    >drained out or maybe actually stopped.

    Not necessarily the case. For various reasons, entry to the interrupt
    handler may actually have a barrier to ensure that outstanding stores
    are committed (store buffer drained) before continuing. This is for
    error containment purposes.

  • From EricP@21:1/5 to Scott Lurndal on Fri Oct 4 14:11:23 2024
    Scott Lurndal wrote:
    Brett <ggtgp@yahoo.com> writes:
    >> When a modern CPU takes an interrupt it does not suspend the current
    >> processing; instead it just starts fetching code from the new process
    >> while letting computations in the pipeline continue to completion. The
    >> OoOe can have 1000 instructions in flight. At some point the resources
    >> start getting dedicated to the new process, and the old process is
    >> drained out or maybe actually stopped.

    Not necessarily the case. For various reasons, entry to the interrupt
    handler may actually have a barrier to ensure that outstanding stores
    are committed (store buffer drained) before continuing. This is for
    error containment purposes.


    Yes but pipelining interrupts is trickier than that.

    First there is pipelining the super/user mode change. This requires
    fetch to have a future copy of the mode, which is used for instruction
    address translation; a mode flag attached to each instruction or uOp;
    a mode copy saved by each checkpoint; and the committed mode copy held
    at retire. Privileged instructions are checked by decode to ensure
    their fetch mode was correct.

    On interrupt, if the core starts fetching instructions from the handler
    and stuffing them into the instruction queue (ROB) while there are
    still instructions in flight, and if those older instructions get a
    branch mispredict, then the purge of mispredicted older instructions
    will also purge the interrupt handler. Also the older instructions
    might trigger an exception, delivery of which would take precedence
    over the delivery of the interrupt and again purge the handler. Also
    the older instructions might raise the core's interrupt priority,
    masking the interrupt that it just tried to accept.

    The interrupt controller can't complete the hand-off of the interrupt
    to a core until it knows that hand-off won't get purged by a mispredict, exception or priority change. So the hand-off becomes like a two-phase
    commit where the controller offers an available core an interrupt,
    core accepts it tentatively and starts executing the handler,
    and core later either commits or rejects the hand-off.
    While the interrupt is in limbo the controller marks it as tentative
    but keeps its position in the interrupt queue.
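
    A toy state machine for that two-phase hand-off might look like the
    following (my naming, not any particular GIC/APIC protocol):

        enum irq_state { IRQ_PENDING, IRQ_TENTATIVE, IRQ_DELIVERED };

        struct irq {
            int            vector;
            enum irq_state state;  /* tracked by the interrupt controller */
        };

        /* Controller offers the interrupt to a core; it stays queued,
           but marked tentative. */
        static void ctrl_offer(struct irq *i)   { i->state = IRQ_TENTATIVE; }

        /* The core's interrupt-entry marker reached retire: the hand-off
           commits and the controller drops the interrupt from its queue. */
        static void core_commit(struct irq *i)  { i->state = IRQ_DELIVERED; }

        /* A mispredict, exception, or priority raise purged the handler:
           the hand-off is rejected and the interrupt goes back to pending
           at the front of the queue. */
        static void core_reject(struct irq *i)  { i->state = IRQ_PENDING; }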

    This is where your point comes in.
    Because the x86/x64 automatically pushes the saved context on the kernel
    stack, RIP, RSP, RFLAG, that context store can only happen when the entry
    to the interrupt sequence reaches retire, which means all older
    instructions must have retired. At that point the core sends a commit
    signal to the interrupt controller and begins its stores, and controller removes the interrupt from its queue. If anything purges the hand-off then
    core sends a reject signal to controller, which returns the interrupt
    to a pending state at its position at the front of its queue.

  • From MitchAlsup1@21:1/5 to EricP on Fri Oct 4 23:09:53 2024
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:

    <snip>

    >Yes but pipelining interrupts is trickier than that.
    >
    >First there is pipelining the super/user mode change. This requires
    >fetch to have a future copy of the mode which is used for instruction
    >address translation, and a mode flag attached to each instruction or
    >uOp, each checkpoint saves a mode copy, and retire has the committed
    >mode copy. Privileged instructions are checked by decode to ensure
    >their fetch mode was correct.
    >
    >On interrupt, if the core starts fetching instructions from the handler
    >and stuffing them into the instruction queue (ROB) while there are
    >still instructions in flight, and if those older instructions get a
    >branch mispredict, then the purge of mispredicted older instructions
    >will also purge the interrupt handler.

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

  • From EricP@21:1/5 to All on Sat Oct 5 11:11:29 2024
    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:

    <snip>

    >> On interrupt, if the core starts fetching instructions from the
    >> handler and stuffing them into the instruction queue (ROB) while
    >> there are still instructions in flight, and if those older
    >> instructions get a branch mispredict, then the purge of mispredicted
    >> older instructions will also purge the interrupt handler.

    > Not necessary, you purge all of the younger instructions from the
    > thread at retirement, but none of the instructions associated with
    > the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order. For a branch mispredict you might be able
    to mark a circular range of entries as voided, and leave the entries
    to be recovered serially at retire.

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    If one can live with the occasional replay of an interrupt hand-off and
    handler execute due to mispredict/exception/interrupt_priority_adjust
    then the interrupt pipelining looks much simplified.

  • From Brett@21:1/5 to EricP on Sat Oct 5 17:49:37 2024
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    <snip>

    > That's difficult with a circular buffer for the instruction queue/rob
    > as you can't edit the order. For a branch mispredict you might be able
    > to mark a circular range of entries as voided, and leave the entries
    > to be recovered serially at retire.

    Such a simplistic circular buffer approach would not work with
    HyperThreading; this is a solved problem obviously, and once solved
    it’s not a problem anymore.

  • From Anton Ertl@21:1/5 to EricP on Sat Oct 5 17:57:12 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order.

    What's wrong with performing an asynchronous interrupt at the ROB
    level rather than inserting it at the decoder? Just stop committing
    at some point, record that point as the interrupt return address, and
    start decoding the interrupt code.

    Ok, it's more efficient to insert an interrupt call into the
    instruction stream in the decoder: all the in-flight instructions will
    be completed instead of going to waste. However, the interrupt will
    usually be serviced later, and, as you point out, if the instruction
    stream is redirected for some other reason, you may have to replay the interrupt.

    As for counting, it seems to me that Brett has understood nothing of
    what he cited from my posting.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to EricP on Sat Oct 5 22:55:47 2024
    On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:
    On interrupt, if the core starts fetching instructions from the handler
    and
    stuffing them into the instruction queue (ROB) while there are still
    instructions in flight, and if those older instructions get a branch
    mispredict, then the purge of mispredicted older instructions will also
    purge the interrupt handler.

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order. For a branch mispredict you might be able
    to mark a circular range of entries as voided, and leave the entries
    to be recovered serially at retire.

    Every instruction needs a way to place itself before or after
    any mispredictable branch. Once you know which branch mispredicted,
    you know, transitively, which instructions will not retire. All you
    really need to know is whether the instruction will retire, or not.
    The rest of the mechanics play out naturally in the pipeline.

    But voiding doesn't look like it works for exceptions or conflicting interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Can you make this statement again and use different words?

    If one can live with the occasional replay of an interrupt hand-off and handler execute due to mispredict/exception/interrupt_priority_adjust
    then the interrupt pipelining looks much simplified.

    You just have to cover the depth of the pipeline.

  • From Brett@21:1/5 to Anton Ertl on Mon Oct 7 17:17:46 2024
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order.

    What's wrong with performing an asynchronous interrupt at the ROB
    level rather than inserting it at the decoder? Just stop commiting at
    some point, record this at the interrupt return address and start
    decoding the interrupt code.

    Ok, it's more efficient to insert an interrupt call into the
    instruction stream in the decoder: all the in-flight instructions will
    be completed instead of going to waste. However, the interrupt will
    usually be serviced later, and, as you point out, if the instruction
    stream is redirected for some other reason, you may have to replay the interrupt.

    The CPU can give ring 0 priority; it is OoOe after all.

    The interrupt likely has to touch RAM, RAM that is 150 cycles away.
    With an 8-wide CPU that is 1200 instructions drained out of your huge
    800-slot OoOe buffer.

    Yet another reason the instruction counters look like garbage when you
    try to use them like a microscope.

    I had a $20,000 ICE (in-circuit emulator) for the PlayStation 2, as my
    job was optimizing game code. I got to look at individual reads and
    writes and ALU state changes at instruction completion. (The ALU state
    is available on undocumented pins so the hardware guys can debug; all
    of this is dumped to a special memory.)

    As for counting, it seems to me that Brett has understood nothing of
    what he cited from my posting.

    My thread is more interesting and useful to readers, whereas you are lost
    in the woods. ;)

    Your hardware guys are not interested because they know what you want is
    not useful. ICE probes could give you more info, but that tech is highly
    secret and dangerous for users to get, and is fused off for your
    protection. You don’t need such data, and would not understand such info if you had it.

    You are being humored just in case what you want is not too much trouble
    and could be added, so continue telling us what you really want. And what
    you think is broken and why, and people will tell you why you think it’s broken is wrong.

    Is there a link to a paper covering your concerns?

    - anton

    To really profile something accurately you need to batch your loops, so
    your counters are not corrupted by code before and after the loop. Just
    ignore the first few and last batches, which will give time for the
    prefetch to warm up. There will be interrupt spikes, ignore those also.
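
    A sketch of that batching discipline (my example, not from any
    particular tool): time whole batches, discard the warm-up batches,
    and keep the best batch, since interrupts and other noise only ever
    add time.

        #include <stdint.h>
        #include <stdio.h>
        #include <time.h>

        static uint64_t now_ns(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
        }

        /* Stand-in for the loop being profiled. */
        static volatile uint64_t sink;
        static void kernel_under_test(void) { sink += 1; }

        #define BATCHES 32
        #define WARMUP   3        /* batches discarded while caches warm up */
        #define ITERS    100000   /* iterations per batch */

        int main(void)
        {
            uint64_t best = UINT64_MAX;

            for (int b = 0; b < BATCHES; b++) {
                uint64_t t0 = now_ns();
                for (int i = 0; i < ITERS; i++)
                    kernel_under_test();
                uint64_t dt = now_ns() - t0;

                /* Ignore warm-up batches; keep the minimum so that
                   interrupt spikes fall out of the measurement. */
                if (b >= WARMUP && dt < best)
                    best = dt;
            }
            printf("best batch: %llu ns, %.2f ns/iteration\n",
                   (unsigned long long)best, (double)best / ITERS);
            return 0;
        }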

  • From Scott Lurndal@21:1/5 to Brett on Mon Oct 7 19:59:59 2024
    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order.

    What's wrong with performing an asynchronous interrupt at the ROB
    level rather than inserting it at the decoder? Just stop commiting at
    some point, record this at the interrupt return address and start
    decoding the interrupt code.

    Ok, it's more efficient to insert an interrupt call into the
    instruction stream in the decoder: all the in-flight instructions will
    be completed instead of going to waste. However, the interrupt will
    usually be serviced later, and, as you point out, if the instruction
    stream is redirected for some other reason, you may have to replay the
    interrupt.

    The CPU can give ring 0 priority, it is OoOe after all.

    I assume you're referring to the Intel 64-bit x86_64 family,
    as other processor families don't have a 'ring 0' per se.

    The AArch64 architecture provides the ability to configure
    the two processor interrupt signals to be delivered
    independently at any one of three privilege (exception)
    levels - kernel, hypervisor or secure monitor.

    Usually configured to route FIQ (Fast interrupt) to
    the most secure privilege level, and IRQ (interrupt)
    to the next most privileged level (hypervisor or
    bare metal OS).


    >The interrupt likely has to touch RAM, RAM that is 150 cycles away.
    >With an 8-wide CPU that is 1200 instructions drained out of your huge
    >800-slot OoOe buffer.

    Perhaps, perhaps not. If the interrupts are frequent enough, it's likely
    that the ISR code will be present at one of the cache levels.

    Aarch64 doesn't touch memory at all on an interrupt, all state
    is in system registers; other than accessing a code cache line
    for the interrupt handler, which is likely to be in L2 or L3
    if not already in L1I.



    My thread is more interesting and useful to readers,

    With that I believe most readers of this newsgroup would
    disagree.

  • From Scott Lurndal@21:1/5 to Brett on Mon Oct 7 20:07:00 2024
    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:


    >Your hardware guys are not interested because they know what you want
    >is not useful. ICE probes could give you more info, but that tech is
    >highly secret and dangerous for users to get

    There are close to a dozen 3rd-party devices that will attach to
    the JTAG port and provide extremely low-level hardware state, including individual flops and rams by reading the scan chains. For AArch64,
    all the interesting state is directly documented in the ARMv8 ARM
    in the context of a JTAG-like implementation.

    Hardly "highly secret".

    Scan chains are clearly proprietary design data.


    , and is fused off for your
    protection.

    An option at manufacturing time, or later when the chip is integrated
    into a platform, the platform vendor has the choice of fusing out the
    JTAG/ICE port, which would make sense for a device that needs to be
    highly secure (a firewall or crypto appliance, for example).

    You don’t need such data, and would not understand such info if
    you had it.

    Perhaps you might not understand it. Likely most others here have direct
    experience with scan chains, IDEs (or more likely VCS) et cetera.

  • From EricP@21:1/5 to Anton Ertl on Mon Oct 7 17:01:37 2024
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order.

    What's wrong with performing an asynchronous interrupt at the ROB
    level rather than inserting it at the decoder? Just stop commiting at
    some point, record this at the interrupt return address and start
    decoding the interrupt code.

    That's worse than a pipeline drain because you toss things you already
    invested in, by fetch, decode, rename, schedule, and possibly execute.
    And you still have to refill the pipeline with the handler.

    Ok, it's more efficient to insert an interrupt call into the
    instruction stream in the decoder: all the in-flight instructions will
    be completed instead of going to waste. However, the interrupt will
    usually be serviced later, and, as you point out, if the instruction
    stream is redirected for some other reason, you may have to replay the interrupt.

    The way I saw it, the core continues to execute its current stream while
    it prefetches the handler prologue into I$L1, then loads its fetch buffer.
    At that point fetch injects a special INT_START uOp into the instruction
    stream and switches to the handler. The INT_START uOp travels down the
    pipeline following right behind the tail of the original stream.
    If none of the flow disrupting events occur to the original stream then
    the handler just tucks in behind it. When INT_START hits retire then core
    send the commit signal to the interrupt controller to confirm the hand-off.

    The interrupt handler should start executing at the same time as it would otherwise. What changes is the interrupt is retained in the controller
    in a tentative state longer while the handler is fetched, and the current stream continues executing. So the window where an interrupt hand-off
    can be disrupted and rejected is larger. But interrupts are asynchronous
    and there is no guaranteed delivery latency.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Mon Oct 7 21:21:24 2024
    On Mon, 7 Oct 2024 19:59:59 +0000, Scott Lurndal wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    The CPU can give ring 0 priority, it is OoOe after all.

    I assume you're referring to the Intel 64-bit x86_64 family,
    as other processor families don't have a 'ring 0' per se.

    The AArch64 architecture provides the ability to configure
    the two processor interrupt signals to be delivered
    independently at any one of three privilege (exception)
    levels - kernel, hypervisor or secure monitor.

    My 66000 provides the ability to configure any number of
    interrupts (2^32) through any number of interrupt tables
    (2^54) to any number of cores (2^16) at any of the 4
    privilege levels and any of the 64 priority levels;

    AND it requires no SW PIC updates on world-switches,
    and control arrives in an already re-entrant state.

    Usually configured to route FIQ (Fast interrupt) to
    the most secure privilege level, and IRQ (interrupt)
    to the next most privileged level (hypervisor or
    bare metal OS).

    Any interrupt in any table can be programmed to stimulate
    any of the 4 (not just 3) privilege levels. User code
    can be configured to take its own page faults without
    an excursion through OS (except when called on by user
    SVC).

  • From MitchAlsup1@21:1/5 to EricP on Mon Oct 7 21:29:23 2024
    On Mon, 7 Oct 2024 21:01:37 +0000, EricP wrote:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order.

    What's wrong with performing an asynchronous interrupt at the ROB
    level rather than inserting it at the decoder? Just stop commiting at
    some point, record this at the interrupt return address and start
    decoding the interrupt code.

    That's worse than a pipeline drain because you toss things you already invested in, by fetch, decode, rename, schedule, and possibly execute.
    And you still have to refill the pipeline with the handler.

    Ok, it's more efficient to insert an interrupt call into the
    instruction stream in the decoder: all the in-flight instructions will
    be completed instead of going to waste. However, the interrupt will
    usually be serviced later, and, as you point out, if the instruction
    stream is redirected for some other reason, you may have to replay the
    interrupt.

    The way I saw it, the core continues to execute its current stream while
    it prefetches the handler prologue into I$L1, then loads its fetch
    buffer.
    At that point fetch injects a special INT_START uOp into the instruction stream and switches to the handler. The INT_START uOp travels down the pipeline following right behind the tail of the original stream.

    While that is one way of doing it and/or conceptualizing it,
    given the width of execution (6 instructions ~= 1000 bits),
    throwing another 5 bits into the amalgam to keep track of the
    switch is insignificant--just like giving the FPU the current
    RM on each FP instruction. Thus you save issuing the
    "instruction" without it actually taking a beat in the pipeline.

    The other bits of this field would be used to indicate which
    set of branch shadows this set of instructions is under.

    If none of the flow disrupting events occur to the original stream then
    the handler just tucks in behind it. When INT_START hits retire then
    core
    send the commit signal to the interrupt controller to confirm the
    hand-off.

    The interrupt handler should start executing at the same time as it
    would otherwise.

    The interrupt handler should start executing when its first instruction
    is ready to exit DECODE.

    What changes is the interrupt is retained in the controller
    in a tentative state longer while the handler is fetched, and the
    current
    stream continues executing. So the window where an interrupt hand-off
    can be disrupted and rejected is larger. But interrupts are asynchronous
    and there is no guaranteed delivery latency.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Oct 7 23:15:39 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Oct 2024 19:59:59 +0000, Scott Lurndal wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    The CPU can give ring 0 priority, it is OoOe after all.

    I assume you're referring to the Intel 64-bit x86_64 family,
    as other processor families don't have a 'ring 0' per se.

    The AArch64 architecture provides the ability to configure
    the two processor interrupt signals to be delivered
    independently at any one of three privilege (exception)
    levels - kernel, hypervisor or secure monitor.

    My 66000 provides the ability to configure any number of
    interrupts (2^32) through any number of interrupt tables
    (2^54) to any number of cores (2^16) at any of the 4
    privilege levels and any of the 64 priority levels;

    While each ARM64 has two interrupt signals, they
    are driven by the interrupt controller (GIC) which
    maps a very large range of interrupts to one of
    the two signals and then asserts the signal when
    the interrupt is able to be delivered.

    It supports a variable range of interrupts up to the
    payload size of the MSI-X data payload (32 bits),
    including inter-processor interrupts (SGI),
    per-processor interrupts (generated by a core
    for itself, e.g. timer interrupts, profiling
    interrupts, etc.), wired interrupts (as many
    as 2048), peripheral interrupts (2^N, N min:16, max:32),
    and a complete virtual interrupt space for every
    virtual machine.


    AND it requires no SW PIC updates on world-switches,
    and control arrives in an already re-entrant state.

    That's been true for the ARM GIC forever, except obviously
    for interrupts that target a non-resident virtual machine.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue Oct 8 01:28:44 2024
    On Mon, 7 Oct 2024 23:15:39 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Oct 2024 19:59:59 +0000, Scott Lurndal wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    The CPU can give ring 0 priority, it is OoOe after all.

    I assume you're referring to the Intel 64-bit x86_64 family,
    as other processor families don't have a 'ring 0' per se.

    The AArch64 architecture provides the ability to configure
    the two processor interrupt signals to be delivered
    independently at any one of three privilege (exception)
    levels - kernel, hypervisor or secure monitor.

    My 66000 provides the ability to configure any number of
    interrupts (2^32) through any number of interrupt tables
    (2^54) to any number of cores (2^16) at any of the 4
    privilege levels and any of the 64 priority levels;

    While each ARM64 has two interrupt signal

    pins*,
    they
    are driven by the interrupt controller (GIC) which
    maps a very large range of interrupts to one of
    the two signals and then asserts the signal when
    the interrupt is able to be delivered.

    Can I ask the latency from a new interrupt arriving
    at the GPIC (cycle 1) to when the pin on the core
    is asserted by LPIC ??

    (*) Even if they are buffered by various PICs and
    even if they are not external to the package--
    each core has those "pins".

    My 66000 does not use pins, but uses a sideband
    signaling inside the cache coherence protocol.

    Supports a variable range of interrupts up to the
    payload size of the MSI-X data payload (32-bits),

    How many of those 32-bits are used to denote priority?
    How many of those 32-bits are used to denote privilege?

    In My 66000's case, no bits from the 32-bit message
    are used to denote anything--it is a 32-bit message
    from one piece of SW to another piece of SW. Privilege
    and Priority are found elsewhere in the interrupt
    routing (and I/O translation) structure.

    including inter-processor interrupts (SGI),
    per-processor interrupts (generated by a core
    for itself (e.g. timer interrupts, profiling
    interrupts, etc)), wired interrupts (as many
    as 2048), peripheral interrupts (2^N (Nmin:16, nMAX:32)
    and a complete virtual interrupt space for every
    virtual machine.

    In My 66000, any process with MMU-granted access
    to an interrupt table can send IPIs, and any other
    interrupts--just like the core were a device--
    even sending interrupts to itself. Just like
    device interrupts core interrupts are fire-and-
    forget, single instruction events.

    All interrupt tables are simultaneously able
    to receive new interrupts, hand off pending
    interrupts, have enable bits flipped on or off;
    and if a new interrupt arrives and the table
    is not being watched by any core, the table
    manager automatically sends an interrupt to
    the next level up so the handler can be scheduled
    and process the interrupt. {Table manager is
    not a core.}


    AND it requires no SW PIC updates on world-switches,
    and control arrives in an already re-entrant state.

    That's been true for the ARM GIC forever, except obviously
    for interrupts that target a non-resident virtual machine.

    See, I got that one solved, too.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Oct 8 02:12:10 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Oct 2024 23:15:39 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Mon, 7 Oct 2024 19:59:59 +0000, Scott Lurndal wrote:

    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    The CPU can give ring 0 priority, it is OoOe after all.

    I assume you're referring to the Intel 64-bit x86_64 family,
    as other processor families don't have a 'ring 0' per se.

    The AArch64 architecture provides the ability to configure
    the two processor interrupt signals to be delivered
    independently at any one of three privilege (exception)
    levels - kernel, hypervisor or secure monitor.

    My 66000 provides the ability to configure any number of
    interrupts (2^32) through any number of interrupt tables
    (2^54) to any number of cores (2^16) at any of the 4
    privilege levels and any of the 64 priority levels;

    While each ARM64 has two interrupt signal

    pins*,

    No, they're signals. There are no external pins
    on the package to provide those signals (note that
    they are per-core/hwthread).

    they
    are driven by the interrupt controller (GIC) which
    maps a very large range of interrupts to one of
    the two signals and then asserts the signal when
    the interrupt is able to be delivered.

    Can I ask the latency from a new interrupt arriving
    at the GPIC (cycle 1) to when the pin on the core
    is asserted by LPIC ??

    Depends on a number of factors:
    - Is the interrupt controller in the same clock
    domain as the CPU?
    - Does the incoming interrupt require a configuration
    table lookup (LPIs) or is the interrupt configuration
    data held in flops (SPI, SGI, PPI)?
    - For those that require a memory access to get the
    configuration byte, and it's not currently cached in
    the GIC, there is the necessary memory latency
    (no translation table, physical addresses are used).
    - Is it a virtual interrupt that needs to check residency
    of the target guest/VM?


    (*) Even if they are buffered by various PICs and
    even if they are not external to the package--
    each core has those "pins".

    Six of one, half dozen of the other. In ARM designs
    it is a signal exposed to the support circuitry
    supporting the core.


    Supports a variable range of interrupts up to the
    payload size of the MSI-X data payload (32-bits),

    How many of those 32-bits are used to denote priority?
    How many of those 32-bits are used to denote privilege?

    None. Those attributes are either stored in flops (sgi/ppi/spi) or
    DRAM (LPI - cached as necessary). Priority is
    8 bits per interrupt (also flops or DRAM depending on interrupt
    type). There are three distinct
    privilege levels (Group 0 (always secure), group 1 secure
    and group 1 nonsecure). Group 0 are signalled using
    the FIQ signal and group 1 are signalled using the
    IRQ signal (qualified by the current security state
    of the core).

    So the latency basically varies with the type of
    interrupt. SGI and PPI (probably the most frequently
    used, for IPIs and timers respectively) have the lowest
    latency, which is quite low; it's all comb logic.

    LPI can be longer latency if the properties byte isn't
    cached for that interrupt. But in a high-frequency case
    that requires low latency, it will likely be cached
    in the interrupt controller.


    In My 66000's case, no bits from the 32-bit message
    are used to denote anything--it is a 32-bit message
    from one piece of SW to another piece of SW. Privilege
    and Priority are found elsewhere in the interrupt
    routing (and I/O translation) structure.

    All the major operating systems will expect the payload
    to be an interrupt number, within the range of the
    host system, and that payload will be used to index
    into OS dispatch tables.


    In My 66000, any process with MMU-granted access
    to an interrupt table can send IPIs,

    Sounds like a security nightmare. Certainly a
    possible denial of service, so the code that
    is allowed access needs to be privileged and
    well demarcated.

  • From Terje Mathisen@21:1/5 to Scott Lurndal on Tue Oct 8 10:33:59 2024
    Scott Lurndal wrote:
    Brett <ggtgp@yahoo.com> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:


    Your hardware guys are not interested because they know what you want is
    not useful. ICE probes could give you more info, but that tech is highly
    secret and dangerous for users to get

    There are close to a dozen 3rd-party devices that will attach to
    the JTAG port and provide extremely low-level hardware state, including individual flops and rams by reading the scan chains. For AArch64,
    all the interesting state is directly documented in the ARMv8 ARM
    in the context of a JTAG-like implementation.

    Hardly "highly secret".

    Scan chains are clearly proprietary design data.


    , and is fused off for your
    protection.

    An option at manufacturing time, or later when the chip is integrated
    into a platform, the platform vendor has the choice of fusing out the JTAG/ICE port, which would make sense for a device that needs to be
    highly secure (a firewall or crypto appliance, for example).

    You don’t need such data, and would not understand such info if
    you had it.

    Perhaps you might not undertstand it. Likely most others here have direct experience with scan chains, IDEs (or more likely VCS) et cetera.

    I had a (quite expensive) ICE for my 386 computer; by the time the
    Pentium rolled out, large parts of that functionality had turned into
    the EMON counters, and so became available to everyone who had signed
    an Intel NDA.

    Byte July 1994 is where I documented my reverse engineering of those
    counters; it is (by far!) the most cited paper/article I have ever
    written. :-)

    This showed Intel the error of their ways, and all subsequent CPUs
    have documented those counters.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to EricP on Sun Oct 13 15:20:37 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order.

    What's wrong with performing an asynchronous interrupt at the ROB
    level rather than inserting it at the decoder? Just stop commiting at
    some point, record this at the interrupt return address and start
    decoding the interrupt code.

    >That's worse than a pipeline drain because you toss things you already
    >invested in, by fetch, decode, rename, schedule, and possibly execute.

    The question is what you want to optimize.

    Design simplicity? I think my approach wins here, too.
    Interrupt response latency? Use what I propose.
    Maximum throughput? Then follow your approach.

    The throughput issue is only relevant if you have lots of interrupts.

    >The way I saw it, the core continues to execute its current stream while
    >it prefetches the handler prologue into I$L1, then loads its fetch buffer.
    >At that point fetch injects a special INT_START uOp into the instruction
    >stream and switches to the handler. The INT_START uOp travels down the
    >pipeline following right behind the tail of the original stream.
    >If none of the flow disrupting events occur to the original stream then
    >the handler just tucks in behind it. When INT_START hits retire then core
    >send the commit signal to the interrupt controller to confirm the hand-off.
    >
    >The interrupt handler should start executing at the same time as it would
    >otherwise.

    Architecturally, an instruction is only executed when it
    commits/retires. Only then do I/O devices or other CPUs see any
    stores or I/O operations performed in the interrupt handler. With
    your approach, if there are long-latency instructions in the pipeline
    (say, dependence chains containing multiple cache misses) when the
    interrupt strikes, the instructions in your interrupt handler will
    have to wait until the preceding instructions retire, which can take
    thousands of cycles in the worst case.

    By contrast, if you treat an interrupt like a branch misprediction and
    cancel all the speculative work, the instructions of the interrupt
    handler go through the engine as fast as possible, and you get the
    minimum response latency possible in the engine.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Dec 25 18:30:43 2024
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    b) if you mean that exceptions take priority over non-exception
    instruction streaming, well that is what exceptions ARE. In these
    cases, the exception handler inherits the priority of the instruction
    stream that raised it--but that is NOT assigning a priority to the
    exception.

    c) and then there are the cases where a PageFault from GuestOS
    page tables is serviced by GuestOS, while a PageFault from
    HyperVisor page tables is serviced by HyperVisor. You could
    assert that HV has higher priority than GuestOS, but it is
    more like HV has privilege over GuestOS while running at the
    same priority level.

    (There might be cases where normal operation allows deadlines to
    be met with lower priority and unusual extended operation requires
    high priority/resource allocation. Boosting the priority/resource
    budget of a thread/task to meet deadlines seems likely to make
    system-level reasoning more difficult. It seems one could also
    create an inflationary spiral.)

    With substantial support for Switch-on-Event MultiThreading, it
    is conceivable that a lower priority interrupt could be held
    "resident" after being interrupted by a higher priority interrupt.

    I don't know what you mean by 'resident'. Would "lower priority
    ISR gets pushed on the stack to allow a higher priority ISR to run"
    qualify as 'resident'?

    And then there is the slightly easier case: where GuestOS is
    servicing an interrupt and ISR takes a PageFault in Hyper-
    Visor page tables. HV PF ISR fixes GuestOS ISR PF, and returns
    to interrupted interrupt handler. Here, even an instruction
    stream incapable (IE & EE=OFF) of taking an Exception takes an
    Exception to a different privilege level.

    Switch-on-Event helps but is not necessary.

    A chunked ROB could support such, but it is not clear that such
    is desirable even ignoring complexity factors.

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems
    attractive and even an interrupt handler with few instructions
    might have significant run time. Since interrupt blocking is
    used to avoid core-localized resource contention, software would
    have to know about such SoEMT.

    It may take 10,000 cycles to read an I/O control register way
    down the PCIe tree; the ISR reads several of these registers
    and constructs a data structure to be processed by softIRQ (or
    DPC) at lower priority. So, allowing the long-latency MMI/O LDs
    to overlap with ISR thread setup is advantageous.
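
    As a made-up C sketch of that shape of ISR (all register names and
    helper functions are hypothetical): the non-posted MMIO loads are the
    long-latency part, and the descriptor cannot be handed to the
    softIRQ/DPC until they have returned.

    /* Hypothetical device ISR: gather status from MMIO registers, then
       hand a descriptor to lower-priority (softIRQ/DPC) processing.      */
    #include <stdint.h>

    enum { STATUS_REG = 0, COUNT_REG = 1, ACK_REG = 2 };   /* made up    */

    struct dev_event { uint32_t status, bytes; };

    extern struct dev_event *alloc_event(void);            /* assumed    */
    extern void queue_for_softirq(struct dev_event *ev);   /* assumed    */

    void dev_isr(volatile uint32_t *regs)   /* regs = MMIO base          */
    {
        struct dev_event *ev = alloc_event();   /* assumed non-faulting  */

        ev->status = regs[STATUS_REG];  /* each load is a non-posted     */
        ev->bytes  = regs[COUNT_REG];   /* round trip down the PCIe      */
                                        /* tree; nothing below can run   */
                                        /* until the data comes back     */
        regs[ACK_REG] = ev->status;     /* posted write: fire and forget */

        queue_for_softirq(ev);          /* deferred, lower-priority work */
    }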

    (Interrupts seem similar to certain server software threads in
    having lower ILP from control dependencies and more frequent high
    latency operations, which hints that multithreading may be
    desirable.)

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Wed Dec 25 18:44:19 2024
    On Sat, 5 Oct 2024 22:55:47 +0000, MitchAlsup1 wrote:

    On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:
    On interrupt, if the core starts fetching instructions from the handler and
    stuffing them into the instruction queue (ROB) while there are still
    instructions in flight, and if those older instructions get a branch
    mispredict, then the purge of mispredicted older instructions will also purge the interrupt handler.

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order. For a branch mispredict you might be able
    to mark a circular range of entries as voided, and leave the entries
    to be recovered serially at retire.

    Every instruction needs a way to place itself before or after
    any mispredictable branch. Once you know which branch mispredicted, you
    know which instructions will not retire, transitively. All you really need to
    know is if the instruction will retire, or not. The rest of the
    mechanics play out naturally in the pipeline.

    If, instead of nullifying every instruction past a given point, you
    make each instruction dependent on its branch executing as predicted,
    then instructions issued under a mispredict shadow remove THEMSELVES
    from the instruction queues.

    If one is doing Predication with then-clauses and else-clauses*, one
    drops both clauses into execution and lets branch resolution choose
    which instructions execute and which die. At this point, the pipeline
    is well set up for using the same structure wrt interrupt hand-over.
    Should an exception happen in the application instruction stream,
    which was already in execution at the time of interruption, any branch
    mispredict from application instructions stops the application
    instruction stream precisely, and we will get back to that precise
    point after the ISR services the interrupt.

    (*) like My 66000

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Can you make this statement again and use different words?

    If one can live with the occasional replay of an interrupt hand-off and
    handler execution due to a mispredict/exception/interrupt_priority_adjust,
    then the interrupt pipelining looks much simpler.

    You just have to cover the depth of the pipeline.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Dec 25 19:10:09 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    AArch64 has 44 different synchronous exception priorities, and within
    each priority level that covers more than one exception there
    is a further sub-prioritization. (Section D 1.3.5.5 pp 6080 in DDI0487K_a).

    While it is not common for a particular instruction to generate
    multiple exceptions, it is certainly possible (e.g. when
    instructions are trapped to a more privileged execution mode).


    b) if you mean that exceptions take priority over non-exception
    instruction streaming, well that is what exceptions ARE. In these
    cases, the exception handler inherits the priority of the instruction
    stream that raised it--but that is NOT assigning a priority to the
    exception.

    c) and then there are the cases where a PageFault from GuestOS
    page tables is serviced by GuestOS, while a PageFault from
    HyperVisor page tables is serviced by HyperVisor. You could
    assert that HV has higher priority than GuestOS, but it is
    more like HV has privilege over GuestOS while running at the
    same priority level.

    It seems unlikely that a translation fault in user mode would need
    handling in both the guest OS and the hypervisor during the
    execution of an instruction; the
    exception to the hypervisor would generally occur when the
    instruction trapped by the guest (who updated the guest translation
    tables) is restarted.

    Other exception causes (such as asynchronous exceptions
    like interrupts) would remain pending and be taken (subject
    to priority and control enables) when the instruction is
    restarted (or the next instruction is dispatched for asynchronous
    exceptions).


    <snip>

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems

    That depends on whether the access is posted or non-posted. Only
    the latter affects instruction latency. The bulk of I/O to and
    from a PCIe device is initiated by the device directly
    to memory (subject to iommu translation), not by the CPU, so
    generally the latency to read an MMIO register is not high enough
    to worry about scheduling other work on the core during
    the transfer.

    In most cases, it takes 1 or 2 orders of magnitude less than 10,000
    cycles to read an I/O control register in a typical PCI express function[***], particularly with modern on-chip PCIe endpoints[*] and CXL[**] (absent
    a PCIe Switched fabric such as the now deprecated multi-root
    I/O virtualization (MR-IOV)). A PCIe Gen-5 card can turn around
    a memory read request rather rapidly if the host I/O bus is
    clocked at a significant fraction (or unity) of the processor
    clock.

    [*] Such as the various bus 0 functions integrated into Intel and
    ARM processors (e.g. memory controller, I2C, SPI, etc.) or
    on-chip network and crypto accelerators.

    [**] 150ns round trip additional latency compared with
    local DRAM with PCIe GEN5.

    [***] which don't need to deal with the PCIe transport
    and data link layers

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Wed Dec 25 20:35:29 2024
    On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:

    --------------------------

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order. For a branch mispredict you might be able
    to mark a circular range of entries as voided, and leave the entries
    to be recovered serially at retire.

    Sooner or later, the pipeline designer needs to recognize the oft-occurring
    code sequence pictured as::

             INST
             INST
             BC-------\
             INST     |
             INST     |
             INST     |
        /----BR       |
        |    INST<----/
        |    INST
        |    INST
        \--->INST
             INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instructions complete.

    But voiding doesn't look like it works for exceptions or conflicting interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Nullify instructions from the mispredicted paths. On hand-off to ISR,
    adjust the recovery IP to just past the last instruction that executed properly, nullifying everything between the exception and the ISR.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Dec 25 20:26:05 2024
    On Wed, 25 Dec 2024 19:10:09 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    AArch64 has 44 different synchronous exception priorities, and within
    each priority that describes more than one exception, there
    is a sub-prioritization therein. (Section D 1.3.5.5 pp 6080 in
    DDI0487K_a).

    Thanks for the link::

    However, I would claim that the vast majority of those 44 things
    are interrupts and not exceptions (in colloquial nomenclature).

    An exception is raised if an instruction cannot execute to completion
    and is raised synchronously with the instruction stream (and at a
    precise point in the instruction stream).

    An interrupt is raised asynchronous to the instruction stream.

    Reset is an interrupt and not an exception.

    Debug that hits an address range is closer to an interrupt than an
    exception. <but I digress>

    But it appears that ARM has many interrupts classified as exceptions.
    Anything not generated from instructions within the architectural
    instruction stream is an interrupt, and anything generated from
    within an architectural instruction stream is an exception.

    It also appears ARM uses priority to sort exceptions into an order,
    while most architectures define priority as a mechanism to choose
    when to take hard-control-flow-events rather than what.

    Be that as it may...


    While it is not common for a particular instruction to generate
    multiple exceptions, it is certainly possible (e.g. when
    instructions are trapped to a more privileged execution mode).


    b) if you mean that exceptions take priority over non-exception
    instruction streaming, well that is what exceptions ARE. In these
    cases, the exception handler inherits the priority of the instruction stream that raised it--but that is NOT assigning a priority to the exception.

    c) and then there are the cases where a PageFault from GuestOS
    page tables is serviced by GuestOS, while a PageFault from
    HyperVisor page tables is serviced by HyperVisor. You could
    assert that HV has higher priority than GuestOS, but it is
    more like HV has privilege over GuestOS while running at the
    same priority level.

    It seems unlikely that a translation fault in user mode would need
    handling in both the guest OS and the hypervisor during the
    execution of an instruction;

    Neither stated nor inferred. A PageFault is handled singularly by
    the level in the system that controls (writes) those PTEs.

    There is a significant period of time in many architectures after
    control arrives at ISR where the ISR is not allowed to raise a
    page fault {Storing registers to a stack}, and since this ISR
    might be the PageFault handler, it is not in a position to
    handle its own faults. However, HyperVisor can handle GuestOS PageFaults--GuestOS thinks the pages are present with reasonable
    access rights, HyperVisor tables are used to swap them in/out.
    Other than latency GuestOS ISR does not see the PageFault.

    My 66000, on the other hand, when ISR receives control, state
    has been saved on a stack, the instruction stream is already
    re-entrant, and the register file is as it was the last time
    this ISR ran.

    the
    exception to the hypervisor would generally occur when the
    instruction trapped by the guest (who updated the guest translation
    tables) is restarted.

    Other exception causes (such as asynchronous exceptions
    like interrupts)

    Asynchronous exceptions A R E interrupts, not like interrupts;
    they ARE interrupts. If it is not synchronous with instruction
    stream it is an interrupt. Only if it is synchronous with the
    instruction stream is it an exception.

    would remain pending and be taken (subject
    to priority and control enables) when the instruction is
    restarted (or the next instruction is dispached for asynchronous
    exceptions).


    <snip>

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems

    That depends on whether the access is posted or non-posted.

    Writes can be posted, Reads cannot. Reads must complete for the
    ISR to be able to setup the control block softIRQ/DPC will
    process shortly. Only after the data structure for softIRQ/DPC
    is written can ISR allow control flow to leave.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Thu Dec 26 09:46:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Sooner or later, the pipeline designer needs to recognize the oft-occurring
    code sequence pictured as::

             INST
             INST
             BC-------\
             INST     |
             INST     |
             INST     |
        /----BR       |
        |    INST<----/
        |    INST
        |    INST
        \--->INST
             INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instructions complete.

    Would this really save much? The main penalty here would still be
    fetching and decoding the alternate instructions. Sure, the
    instructions after the join point would not have to be fetched and
    decoded, but they would still have to go through the renamer, which
    typically is as narrow or narrower than instruction fetch and decode,
    so avoiding fetch and decode only helps for power (ok, that's
    something), but probably not performance.

    And the kind of insertion you imagine makes things more complicated,
    and only helps in the rare case of a misprediction.

    What alternatives do we have? There still are some branches that are
    hard to predict and for which it would be helpful to optimize them.

    Classically the programmer or compiler was supposed to turn
    hard-to-predict branches into conditional execution (e.g., someone
    (IIRC ARM) has an ITE instruction for that, and My 66000 has something
    similar IIRC). These kinds of instructions tend to turn the condition
    from a control-flow dependency (free when predicted, costly when
    mispredicted) into a data-flow dependency (usually some cost, but
    usually much lower than a misprediction).
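
    As a made-up C illustration (mine, not from the paper or the thread):
    the same computation can be written in a control-flow form or a
    data-flow form, and a compiler will often turn the latter into a
    CMOV/CSEL-style select:

    /* Control-flow form: nearly free when the branch predicts well,
       a full misprediction penalty when it does not.                  */
    int max_branch(int a, int b)
    {
        if (a > b)
            return a;
        return b;
    }

    /* Data-flow form: the condition feeds a select, so the cost is a
       couple of cycles regardless of how predictable it is.           */
    int max_select(int a, int b)
    {
        return a > b ? a : b;
    }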

    But programmers are not that great at predicting mispredictions (and programming languages usually don't have ways to express them),
    compilers are worse (even with feedback-directed optimization as it
    exists, i.e., without prediction accuracy feedback), and
    predictability might change between phases or callers.

    So it seems to me that this is something where the hardware might use
    history data to predict whether a branch is hard to predict (and maybe
    also take into account how the dependencies affect the cost), and to
    switch between a branch-predicting implementation and a data-flow implementation of the condition.

    I have not followed ISCA and Micro proceedings in recent years, but I
    would not be surprised if somebody has already done a paper on such an
    idea.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Dec 26 12:32:29 2024
    On Wed, 25 Dec 2024 20:35:29 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:

    --------------------------

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction
    queue/rob as you can't edit the order. For a branch mispredict you
    might be able to mark a circular range of entries as voided, and
    leave the entries to be recovered serially at retire.

    Sooner or later, the pipeline designer needs to recognize the oft-occurring
    code sequence pictured as::

             INST
             INST
             BC-------\
             INST     |
             INST     |
             INST     |
        /----BR       |
        |    INST<----/
        |    INST
        |    INST
        \--->INST
             INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instructions complete.


    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    inst
    inst
    bc colder_section
    inst
    inst
    inst
    merge_flow:
    inst
    inst
    ...
    ret

    colder_section:
    inst
    inst
    inst
    br merge_flow


    Intel's "efficiency" cores family starting from Tremont has weird
    "clustered" front end design. It often prefers [predicted] taken
    branches over [predicted] non-taken branches. On front ends like that
    my optimization is likely to become pessimization.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Anton Ertl@21:1/5 to Chris M. Thomasson on Thu Dec 26 14:56:30 2024
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 10/3/2024 7:00 AM, Anton Ertl wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances. But for heavily
    multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.
    ...
    For the HotSpot application, the
    eventual answer was that they live with the cost of cache contention
    for the programs that have that problem. After some minutes the hot
    parts of the program are optimized, and cache contention is no longer
    a problem.
    ...
    If the per-thread counters are properly padded to an L2 cache line and properly aligned on cache line boundaries, well, they should not cause
    false sharing with other cache lines... Right?

    Sure, that's what the first sentence of the second paragraph you cited
    (and which I cited again) is about. Next, read the next sentence.

    Maybe I should give an example (fully made up on the spot, read the
    paper for real numbers): If HotSpot uses, on average one counter per conditional branch, and assuming a conditional branch every 10 static instructions (each having, say 4 bytes), with 1MB of generated code
    and 8 bytes per counter, that's 200KB of counters. But these counters
    are shared between all threads, so for code running on many cores you
    get true and false sharing.
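
    (Spelling out that arithmetic: 1 MB / 4 bytes is roughly 262,000
    instructions; one counter per 10 instructions gives roughly 26,000
    counters; at 8 bytes each that is about 200KB.)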

    As mentioned, the usual mitigation is per-core counters. With a
    256-core machine, we now have 51.2MB of counters for 1MB of executable
    code. Now this is Java, so there might be quite a bit more executable
    code and correspondingly more counters. They eventually decided that
    the benefit of reduced cache coherence traffic is not worth that cost
    (or the cost of a hardware mechanism), as described in the last
    paragraph, from which I cited the important parts.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Levine@21:1/5 to All on Thu Dec 26 19:11:01 2024
    According to Michael S <already5chosen@yahoo.com>:
    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    The first version of FORTRAN had a FREQUENCY statement which let you tell it the
    relative likelihood of each of the results of a three-way IF, and the expected number of
    iterations of a DO loop. It turned out to be useless, because programmers usually
    guessed wrong.

    The final straw was a compiler where they realized FREQUENCY was implemented backward and nobody noticed.

    Unless you've profiled the code and have data to support your branch guesses, just write it in the clearest way you can.

    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Thu Dec 26 14:25:37 2024
    MitchAlsup1 wrote:
    On Sat, 5 Oct 2024 15:11:29 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Fri, 4 Oct 2024 18:11:23 +0000, EricP wrote:

    --------------------------

    Not necessary, you purge all of the younger instructions from the
    thread at retirement, but none of the instructions associated with
    the new <interrupt> thread at the front.

    That's difficult with a circular buffer for the instruction queue/rob
    as you can't edit the order. For a branch mispredict you might be able
    to mark a circular range of entries as voided, and leave the entries
    to be recovered serially at retire.

    Sooner or later, the pipeline designer needs to recognize the oft-occurring
    code sequence pictured as::

             INST
             INST
             BC-------\
             INST     |
             INST     |
             INST     |
        /----BR       |
        |    INST<----/
        |    INST
        |    INST
        \--->INST
             INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instructions complete.

    Yes. Long ago I looked at some academic papers on hardware IF-conversion.
    Those papers were in the context of Itanium around 2005 or so,
    automatically converting short forward branches into predication.

    There were also papers that looked at HW converting predication
    back into short branches because they tie down less resources.

    IIRC they were looking at interactions between predication,
    aka Guarded Execution, and branch predictors, and how IF-conversion
    affects the branch predictor stats.

    But voiding doesn't look like it works for exceptions or conflicting
    interrupt priority adjustments. In those cases purging the interrupt
    handler and rejecting the hand-off looks like the only option.

    Nullify instructions from the mispredicted paths. On hand-off to ISR,
    adjust the recovery IP to just past the last instruction that executed properly, nullifying everything between the exception and the ISR.

    Yes, that seems the most straight forward way to do it.
    But to nullify *some* of the in-flight instructions and not others,
    just the ones in the mispredicted shadow, in the middle of a stream
    of other instructions, seems to require much of the logic necessary
    to support general OoO predication/guarded-execution.

    Branch mispredict could use two mechanisms, one using checkpoint
    and rollback for a normal branch mispredict which recovers resources immediately in one clock, and another if there is a pipelined interrupt
    already appended which defers resource recovery to retire.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to John Levine on Thu Dec 26 22:35:10 2024
    On Thu, 26 Dec 2024 19:11:01 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:

    According to Michael S <already5chosen@yahoo.com>:
    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    The first version of FORTRAN had a FREQUENCY statement which let you
    tell it the relative likelihood of each of the results of a three-way
    IF, and the expected number of iterations of a DO loop. It turned
    out to be useless, because programmers usually guessed wrong.

    The final straw was a compiler where they realized FREQUENCY was
    implemented backward and nobody noticed.

    Unless you've profiled the code and have data to support your branch
    guesses, just write it in the clearest way you can.


    My asm coding is mostly various math library routines.
    Not production code, just fun.
    I find the style illustrated above to be the clearest way of handling
    less common conditions. And even if I am wrong about commonality, the
    performance cost of such a mistake on modern hardware is minimal.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Levine on Thu Dec 26 21:18:26 2024
    On Thu, 26 Dec 2024 19:11:01 +0000, John Levine wrote:

    According to Michael S <already5chosen@yahoo.com>:
    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    The first version of FORTRAN had a FREQUENCY statement which let you
    tell it the
    relative likelihood of each of the results of a three-way IF, and the expected number of
    iterations of a DO loop. It turned out to be useless, because
    programmers usually
    guessed wrong.

    The final straw was a compiler where they realized FREQUENCY was
    implemented backward and nobody noticed.

    There was a PhD thesis at CMU circa 1978 called the "maximal munching
    method" which was supposed to match the largest executable patterns to
    the DAG corresponding to the VAX 11/780 instruction set. The method was
    producing fast dense code and everyone was happy until they figured out
    that a comparison was backwards and they were actually matching the smallest
    executable patterns; yet the code ran faster that way than before they
    fixed it.

    Unless you've profiled the code and have data to support your branch
    guesses, just write it in the clearest way you can.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Dec 26 21:54:53 2024
    On Thu, 26 Dec 2024 9:46:21 +0000, Anton Ertl wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Sooner or later, the pipeline designer needs to recognize the oft-occurring
    code sequence pictured as::

             INST
             INST
             BC-------\
             INST     |
             INST     |
             INST     |
        /----BR       |
        |    INST<----/
        |    INST
        |    INST
        \--->INST
             INST

    So that the branch predictor predicts as usual, but DECODER recognizes
    the join point of this prediction, so if the prediction is wrong, one
    only nullifies the mispredicted instructions and then inserts the
    alternate instructions while holding the join point instructions until
    the alternate instructions complete.

    Would this really save much? The main penalty here would still be
    fetching and decoding the alternate instructions. Sure, the
    instructions after the join point would not have to be fetched and
    decoded, but they would still have to go through the renamer, which
    typically is as narrow or narrower than instruction fetch and decode,
    so avoiding fetch and decode only helps for power (ok, that's
    something), but probably not performance.

    When you have the property that FETCH will stumble over the join point
    before the branch resolves, the fact you reached the join point means
    a branch misprediction is avoided (~16 cycles) and you nullify 4
    instructions from reservation stations. FETCH is not disrupted, and
    execution continues.

    The balance is the mispredict recovery overhead (~16 cycles) compared
    to the cost of inserting the un-predicted path into execution (1 cycle
    in the illustrated case).

    And the kind of insertion you imagine makes things more complicated,
    and only helps in the rare case of a misprediction.

    PREDication is designed for the unpredictable branches--as a means to
    directly express the fact that the <equivalent> branch code is not
    expected to be predicted well.

    For easy-to-predict branches, don't recode as PRED--presto; done. So,
    rather than having to Hint branches or to guard individual instructions,
    I PREDicate short clauses, saving bits in each instruction because these
    bits come from the PRED-instruction.

    What alternatives do we have? There still are some branches that are
    hard to predict and for which it would be helpful to optimize them.

    Classically the programmer or compiler was supposed to turn
    hard-to-predict branches into conditional execution (e.g., someone
    (IIRC ARM) has an ITE instruction for that, and My 6600 has something
    similar IIRC). These kinds of instructions tend to turn the condition
    from a control-flow dependency (free when predicted, costly when mispredicted) into a data-flow dependency (usually some cost, but
    usually much lower than a misprediction).

    Conditional execution and merging (CMOV) rarely takes as few
    instructions
    as branchy code and <almost> always consumes more power. However, there
    are a few cases where CMOV works out better than PRED, so: My 66000 has
    both.

    But programmers are not that great at predicting mispredictions (and programming languages usually don't have ways to express them),
    compilers are worse (even with feedback-directed optimization as it
    exists, i.e., without prediction accuracy feedback), and
    predictability might change between phases or callers.

    A conditional branch inside a subroutine is almost always dependent on
    who calls the subroutine. Some calls may have a nearly 100% prediction
    rate in one direction, other calls a near 100% prediction rate in the
    other direction.

    One thing that IS different in My 66000 (other than PREDs not needing to
    be predicted) is that loops are not predicted--there is a LOOP
    instruction that performs ADD-CMP-BC back to the top of the loop in 1 cycle.
    Since HW can see the iterating register and the terminating limit,
    one does not overpredict iterations and then mispredict them away;
    instead one predicts the loop only so long as the arithmetic supports that prediction.

    Thus, in My 66000, looping branches do not contribute to predictor
    pollution (updates), leaving the branch predictors to deal with the
    harder stuff. In addition we have a LD IP instruction (called CALX)
    that loads a value from a table directly into IP, so no jumps here.
    And finally:: My 66000 has a Jump Through Table (JTT) instruction,
    which performs:: range check, table access, add scaled table entry
    to IP and transfer control.

    Thus, there is very little indirect prediction (maybe none on smaller implementations), switch tables are all PIC, and the tables are
    typically ¼ the size of the equivalents in other 64-bit ISAs.

    So, by taking Looping branches, indirect branches, and indirect calls
    out of the prediction tables, those that remain should be more
    predictable.

    So it seems to me that this is something that the hardware might use
    history data to predict whether a branch is hard to predict (and maybe
    also taking into account how the dependencies affect the cost), and to
    switch between a branch-predicting implementation and a data-flow implementation of the condition.

    I have not followed ISCA and Micro proceedings in recent years, but I
    would not be surprised if somebody has already done a paper on such an
    idea.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Levine on Thu Dec 26 22:27:51 2024
    John Levine <johnl@taugh.com> schrieb:
    According to Michael S <already5chosen@yahoo.com>:
    Yes, compilers often generate such code.
    When coding in asm, I typically know at least something about
    probability of branches, so I tend to code it differently:

    The first version of FORTRAN had a FREQUENCY statement which let you tell it the
    relative likelihood of each of the results of a three-way IF, and the expected number of
    iterations of a DO loop. It turned out to be useless, because programmers usually
    guessed wrong.

    The final straw was a compiler where they realized FREQUENCY was implemented backward and nobody noticed.

    Unless you've profiled the code and have data to support your branch guesses, just write it in the clearest way you can.

    There is one partial exception: Putting error handling, which
    should occur very infrequently, in a cold partition can indeed
    bring benefits, and the compiler cannot always figure it out;
    heuristics like "a NULL check is less likely to be taken" can
    be wrong.

    But then again, defining an "unlikely" macro and having code like

    if (unlikely (a>b))
    {
    /* Do some error handling. */
    }

    probably increases the readability over the naked condition, so...
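
    For reference, with GCC or Clang such a macro is usually spelled along
    these lines:

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    which gives the compiler the cold-path layout hint without any
    profile data.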

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Dec 27 16:38:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 25 Dec 2024 19:10:09 +0000, Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or conflicting interrupt priority adjustments. In those cases purging the interrupt handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    AArch64 has 44 different synchronous exception priorities, and within
    each priority that describes more than one exception, there
    is a sub-prioritization therein. (Section D 1.3.5.5 pp 6080 in
    DDI0487K_a).

    Thanks for the link::

    However, I would claim that the vast majority of those 44 things
    are interrupts and not exceptions (in colloquial nomenclature).

    I think that nomenclature is often processor specific. Anything
    that occurs synchronously during instruction execution as a result
    of executing that particular instruction is considered an exception
    in AArch64. Many of them are traps to higher exception levels
    for various reasons (including hypervisor traps) which can occur
    potentially with other exceptions such as TLB faults, etc.

    Interrupts, in the ARM sense, are _always_ asynchronous, and more
    specifically refer to the two signals IRQ and FIQ that the Generic
    Interrupt Controller uses to inform a processing thread that it
    needs to handle an I/O interrupt.

    In AArch64, they all vector through the same per-exception-level (kernel, hypervisor, secure monitor, realm) vector table.


    An exception is raised if an instruction cannot execute to completion
    and is raised synchronously with the instruction stream (and at a
    precise point in the instruction stream.

    That description accurately describes all of the 44 conditions
    above - the section is entitled, after all, "SYNCHRONOUS exception
    priorities". Interrupts are by definition asynchronous in the
    AArch64 architecture.


    An interrupt is raised asynchronous to the instruction stream.

    Reset is an interrupt and not an exceptions.

    I would argue that reset is a condition and is in this list
    as such - sometimes it is synchronous (a result of executing
    a special instruction or store to a system register), sometimes
    it is asynchronous (via the chipset/SoC). The fact that reset
    has the highest priority is noted here specifically.


    Debug that hits an address range is closer to an interrupt than an
    exception. <but I digress>

    It is still synchronous to instruction execution.


    But it appears that ARM has many interrupts classified as exceptions. Anything not generated from instructions within the architectural
    instruction stream is an interrupt, and anything generated from
    within an architectural instructions stream is an exception.

    That's your definition. It certainly doesn't apply to AArch64
    (or the Burroughs mainframes, for that matter).


    It also appears ARM uses priority to sort exceptions into an order,
    while most architectures define priority as a mechanism to choose
    when to take hard-control-flow-events rather than what.

    They desire determinism for the software.



    It seems unlikely that a translation fault in user mode would need
    handling in both the guest OS and the hypervisor during the
    execution of an instruction;

    Neither stated nor inferred. A PageFault is handled singularly by
    the level in the system that controls (writes) those PTEs.

    Indeed. And the guest OS owns the PTEs (TTEs) for the guest
    user process, and the hypervisor owns the PTEs for the guest
    "physical address space view". This is true for ARM, Intel
    and AMD.


    There is a significant period of time in many architectures after
    control arrives at ISR where the ISR is not allowed to raise a
    page fault {Storing registers to a stack}, and since this ISR
    might be the PageFault handler, it is not in a position to
    handle its own faults. However, HyperVisor can handle GuestOS >PageFaults--GuestOS thinks the pages are present with reasonable
    access rights, HyperVisor tables are used to swap them in/out.
    Other than latency GuestOS ISR does not see the PageFault.

    I've written two hypervisors (one on x86, long before hardware
    assist (1998), and one using AMD SVM and NPT (mid 2000's)). There is a
    very clean delineation between the guest physical address space view
    from the guest and guest applications, and the host physical
    address space apportioned out to the various guest OS' by
    the hypervisor. In some cases the hypervisor cannot even
    peek into the guest physical address space. They are distinct
    and independent (sans paravirtualization).


    My 66000, on the other hand, when ISR receives control, state
    has been saved on a stack, the instruction stream is already
    re-entrant, and the register file as it was the last time
    this ISR ran.

    The AArch64 exception entry (for both interrupts and exceptions)
    is identical and takes only a few cycles. The exception routine
    (ISR in your nomenclature) can decide for itself what state
    to preserve (the processor state and return address are saved
    in special per-exception-level system registers automatically
    during exception entry and restored by exception return (eret
    instruction)).


    the
    exception to the hypervisor would generally occur when the
    instruction trapped by the guest (who updated the guest translation
    tables) is restarted.

    Other exception causes (such as asynchronous exceptions
    like interrupts)

    Asynchronous exceptions A R E interrupts, not like interrupts;
    they ARE interrupts. If it is not synchronous with instruction
    stream it is an interrupt. Only if it is synchronous with the
    instruction stream is it an exception.

    Your interrupt terminology differs from the ARM version. An
    interrupt is considered an asynchronous exception (of which
    there are three - IRQ, FIQ and SError[*]). Both synchronous
    exceptions and asynchronous exceptions use the
    same vector table (indexed by exception level (privilege))
    and the ESR_ELx (Exception Syndrome Register) has a 6-bit
    exception code that the exception routine uses to vector
    to the appropriate handler. Data and Instruction abort
    (translation fault) exception codes distinguish a
    translation fault that occurred at a lower privilege level
    (e.g. user mode trapping to kernel, or a guest page fault
    trapping to the hypervisor) from one taken at the current level.

    [*] Asynchronous system error (e.g a posted store that subsequently
    failed downstream).


    would remain pending and be taken (subject
    to priority and control enables) when the instruction is
    restarted (or the next instruction is dispatched for asynchronous
    exceptions).


    <snip>

    Being able to overlap latency of a memory-mapped I/O access (or
    other slow access) with execution of another thread seems

    That depends on whether the access is posted or non-posted.

    Writes can be posted, Reads cannot. Reads must complete for the
    ISR to be able to setup the control block softIRQ/DPC will
    process shortly. Only after the data structure for softIRQ/DPC
    is written can ISR allow control flow to leave.

    As I said, it depends on if it is posted or not. A store
    to trigger a doorbell that starts processing a ring of
    DMA instructions, for example, has no latency. And the DMA
    is all initiated by the endpoint device, not the OS.

    All that said, this isn't 1993 PCI; modern chipset and PCIe
    latencies are less than they used to be, especially on
    SoCs where you don't have SERDES overhead.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Anton Ertl on Fri Dec 27 11:16:47 2024
    On 10/3/24 10:00, Anton Ertl wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances. But for heavily multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.


    For profiling, do we really need accurate counters? They just need to
    be statistically accurate I would think.

    Instead of incrementing a counter, just store a non-zero immediate into
    a zero-initialized byte array at a per-"counter" index. There's no
    rmw data dependency, just a store, so it should have little impact on
    the pipeline.

    A profiling thread loops thru the byte array, incrementing an actual
    counter when it sees a non-zero byte, and resets the byte to zero. You
    could use vector ops to process the array.
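
    A minimal C11 sketch of that scheme (all names invented; a single
    sweeper thread is assumed):

    #include <stdatomic.h>
    #include <stdint.h>

    #define NPROBES 4096                 /* one byte per profile point   */

    static _Atomic uint8_t hit[NPROBES]; /* written by worker threads    */
    static uint64_t count[NPROBES];      /* owned by the sweeper thread  */

    /* In the instrumented code: a plain store, no read-modify-write.    */
    static inline void probe(int i)
    {
        atomic_store_explicit(&hit[i], 1, memory_order_relaxed);
    }

    /* Sweeper thread: fold the flags into real counters. Counts are only
       statistical -- any number of hits between two sweeps collapses
       into one.                                                         */
    static void sweep(void)
    {
        for (int i = 0; i < NPROBES; i++)
            if (atomic_exchange_explicit(&hit[i], 0, memory_order_relaxed))
                count[i]++;
    }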

    If the stores were fast enough, you could do 2 or more stores at
    hashed indices, different hash for each store. Sort of a counting
    Bloom filter. The effective count would be the minimum of the
    hashed counts.

    No idea how feasible this would be though.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Sat Dec 28 07:20:17 2024
    On 12/27/24 11:16, jseigh wrote:
    On 10/3/24 10:00, Anton Ertl wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024.  He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances.  But for heavily
    multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.


    For profiling, do we really need accurate counters?  They just need to
    be statistically accurate I would think.

    Instead of incrementing a counter, just store a non-zero immediate into
    a zero-initialized byte array at a per-"counter" index.   There's no
    rmw data dependency, just a store, so it should have little impact on
    the pipeline.

    A profiling thread loops thru the byte array, incrementing an actual
    counter when it sees a non-zero byte, and resets the byte to zero.  You
    could use vector ops to process the array.

    If the stores were fast enough, you could do 2 or more stores at
    hashed indices, different hash for each store. Sort of a counting
    Bloom filter.  The effective count would be the minimum of the
    hashed counts.

    No idea how feasible this would be though.


    Probably not feasible. The polling frequency wouldn't be high enough.


    If the problem is the number of counters, then counting Bloom filters
    might be worth looking into, assuming the overhead of incrementing
    the counts isn't a problem.
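
    A rough sketch of the counting-filter idea (table sizes and hash
    mixers invented): each event increments one counter per hash row, and
    the estimate for an event is the minimum over its rows, so aliasing
    can only over-count.

    #include <stdint.h>

    #define ROWS  2          /* number of independent hashes            */
    #define SLOTS 65536      /* counters per hash, power of two         */

    static uint32_t cbf[ROWS][SLOTS];

    /* Cheap made-up mixing hashes; any independent pair would do.      */
    static inline uint32_t h(uint32_t row, uint32_t key)
    {
        key *= (row ? 0x9E3779B1u : 0x85EBCA77u);
        return (key >> 16) & (SLOTS - 1);
    }

    static inline void cbf_count(uint32_t key)
    {
        for (uint32_t r = 0; r < ROWS; r++)
            cbf[r][h(r, key)]++;               /* may alias other keys  */
    }

    static inline uint32_t cbf_estimate(uint32_t key)
    {
        uint32_t min = UINT32_MAX;
        for (uint32_t r = 0; r < ROWS; r++) {
            uint32_t c = cbf[r][h(r, key)];
            if (c < min) min = c;
        }
        return min;                            /* never underestimates  */
    }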

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Anton Ertl on Mon Dec 30 14:39:27 2024
    Anton Ertl wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 10/3/2024 7:00 AM, Anton Ertl wrote:
    Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
    in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
    some programs the counters used for profiling the program result in
    cache contention due to true or false sharing among threads.

    The traditional software mitigation for that problem is to split the
    counters into per-thread or per-core instances. But for heavily
    multi-threaded programs running on machines with many cores the cost
    of this mitigation is substantial.
    ....
    For the HotSpot application, the
    eventual answer was that they live with the cost of cache contention
    for the programs that have that problem. After some minutes the hot
    parts of the program are optimized, and cache contention is no longer
    a problem.
    ....
    If the per-thread counters are properly padded to an L2 cache line and
    properly aligned on cache line boundaries, well, they should not cause
    false sharing with other cache lines... Right?

    Sure, that's what the first sentence of the second paragraph you cited
    (and which I cited again) is about. Next, read the next sentence.

    Maybe I should give an example (fully made up on the spot, read the
    paper for real numbers): If HotSpot uses, on average one counter per conditional branch, and assuming a conditional branch every 10 static instructions (each having, say 4 bytes), with 1MB of generated code
    and 8 bytes per counter, that's 200KB of counters. But these counters
    are shared between all threads, so for code running on many cores you
    get true and false sharing.

    As mentioned, the usual mitigation is per-core counters. With a
    256-core machine, we now have 51.2MB of counters for 1MB of executable
    code. Now this is Java, so there might be quite a bit more executable
    code and correspondingly more counters. They eventually decided that
    the benefit of reduced cache coherence traffic is not worth that cost
    (or the cost of a hardware mechanism), as described in the last
    paragraph, from which I cited the important parts.

    - anton

    They could do this by having each thread log its own profile data
    into a thread-local profile bucket. When the bucket is full it
    queues its bucket to a "full" list and dequeues a new bucket from
    an "empty" list. A dedicated thread processes full buckets into the
    profile summary arrays, then puts the empty buckets on the empty list.

    A profile bucket is an array of 32-bit values. Each value is
    a 16-bit event type and 16-bit item id (or whatever).
    Simple events like counting each use of a branch take just one entry.
    Other profile events could take multiple entries if they recorded
    cpu performance counters or real time timestamps or both.

    The atomic accesses are only on the full and empty bucket list heads.
    By playing with the bucket sizes you can keep the chance of
    core collisions on the list heads negligible.
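
    A rough C sketch of that bucket scheme (pthread-based; all names
    invented). On the fast path a thread only touches its own bucket; the
    shared list heads are touched once per BUCKET_ENTRIES events, and a
    separate collector thread drains full_list into the summary arrays.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define BUCKET_ENTRIES 4096

    struct bucket {
        struct bucket *next;
        uint32_t n;
        uint32_t entry[BUCKET_ENTRIES];   /* 16-bit event | 16-bit item id */
    };

    static struct bucket *full_list, *empty_list;      /* shared list heads */
    static pthread_mutex_t full_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t empty_lock = PTHREAD_MUTEX_INITIALIZER;

    static _Thread_local struct bucket *cur;  /* this thread's current bucket */

    static struct bucket *get_empty(void)
    {
        pthread_mutex_lock(&empty_lock);
        struct bucket *b = empty_list;
        if (b) empty_list = b->next;
        pthread_mutex_unlock(&empty_lock);
        if (!b && !(b = calloc(1, sizeof *b)))   /* grow the pool if needed */
            abort();
        b->n = 0;
        return b;
    }

    /* Called from instrumented code: thread-local stores only, except
       when a bucket fills up and is handed to the collector thread.     */
    static void log_event(uint16_t event, uint16_t id)
    {
        if (!cur)
            cur = get_empty();
        cur->entry[cur->n++] = (uint32_t)event << 16 | id;
        if (cur->n == BUCKET_ENTRIES) {
            pthread_mutex_lock(&full_lock);
            cur->next = full_list;        /* collector sums these into the */
            full_list = cur;              /* profile arrays and recycles   */
            pthread_mutex_unlock(&full_lock);   /* them onto empty_list    */
            cur = get_empty();
        }
    }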

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Paul A. Clayton on Wed Jan 1 00:34:44 2025
    On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:

    On 12/25/24 1:30 PM, MitchAlsup1 wrote:
    On Wed, 25 Dec 2024 17:50:12 +0000, Paul A. Clayton wrote:

    On 10/5/24 11:11 AM, EricP wrote:
    MitchAlsup1 wrote:
    [snip]
    --------------------------

    But voiding doesn't look like it works for exceptions or
    conflicting
    interrupt priority adjustments. In those cases purging the
    interrupt
    handler and rejecting the hand-off looks like the only option.

    Should exceptions always have priority? It seems to me that if a
    thread is low enough priority to be interrupted, it is low enough
    priority to have its exception processing interrupted/delayed.

    It depends on what you mean::

    a) if you mean that exceptions are prioritized and the highest
    priority exception is the one taken, then OK you are working
    in an ISA that has multiple exceptions per instruction. Most
    RISC ISAs do not have this property.

    The context was any exception taking priority over an interrupt
    that was accepted, at least on a speculative path. I.e., the
    statement would have been more complete as "Should exceptions
    always (or ever) have priority over an accepted interrupt?"

    In the parlance I used to document My 66000 architecture, exceptions
    happen at instruction boundaries, while interrupts happen between
    instructions. Thus the CPU is never deciding between an interrupt and an
    exception.

    Interrupts take on the priority assigned at I/O creation time.
    {{Oh and BTW, a single I/O request can take I/O exception to
    GuestOS, to HyperVisor, can deliver completion to assigned
    supervisor (Guest OS or HV), and deliver I/O failures to
    Secure Monitor (or whomever is assigned)}}

    Exceptions take on the priority of the currently running thread.
    A page fault at priority min does not block any interrupt at
    priority > min. A page fault at priority max is not interruptible.


    --------------------------------------

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    Yes, but multithreading could hide some of those latencies in
    terms of throughput.

    EricP is the master proponent of finishing the instructions in the
    execution window that are finishable. I, merely, have no problem
    in allowing the pipe to complete or take a flush based on the kind
    of pipeline being engineered.

    With 300-odd instructions in the window this thesis has merit,
    with a 5-stage pipeline 1-wide, it does not have merit but is
    not devoid of merit either.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Thu Jan 2 14:14:50 2025
    MitchAlsup1 wrote:
    On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:
    On 12/25/24 1:30 PM, MitchAlsup1 wrote:

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    Yes, but multithreading could hide some of those latencies in
    terms of throughput.

    EricP is the master proponent of finishing the instructions in the
    execution window that are finishable. I, merely, have no problem
    in allowing the pipe to complete or take a flush based on the kind
    of pipeline being engineered.

    With 300-odd instructions in the window this thesis has merit,
    with a 5-stage pipeline 1-wide, it does not have merit but is
    not devoid of merit either.

    It is also possible that the speculation barriers I describe below
    will limit the benefits that pipelining exceptions and interrupts
    might be able to see.

    The issue is that both exception handlers and interrupt handlers usually
    read and write Privileged Control Registers (PCR) and/or MMIO device
    registers very early into the handler. Most MMIO device registers and
    cpu PCR cannot be speculatively read as that may cause a state
    transition. Of course, stores are never speculated and can only be
    initiated at commit/retire.

    The normal memory coherence rules assume that loads are to memory-like
    locations that do not state transition on reads and that therefore
    memory loads can be harmlessly replayed if needed.
    While memory stores are not performed speculatively, an implementation
    might speculatively prefetch a cache line as soon as a store is queued
    and cause cache lines to ping-pong.

    But loads to many MMIO devices and PCRs effectively require a
    speculation barrier in front of them to prevent replays.

    A SPCB Speculation Barrier instruction could block speculation.
    It stalls execution until all older conditional branches are resolved and
    all older instructions that might throw an exception have determined
    they won't do so.

    The core could have an internal lookup table telling it which PCR can be
    read speculatively because there are no side effects to doing so.
    Those PCR would not require an SPCB to guard them.

    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.
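
    As a sketch of how such a barrier might appear at the point of use
    (the intrinsic, the device layout, and all names here are invented;
    the stand-in definition is just a compiler barrier so the fragment
    compiles):

    #include <stdint.h>

    /* Hypothetical SPCB: stall issue until all older branches are
       resolved and all older instructions are known not to fault, so
       the following MMIO load cannot be issued on a speculative path
       and later replayed. The macro is only a compiler-barrier
       stand-in; real hardware would supply an opcode. */
    #define spcb()  __asm__ __volatile__("" ::: "memory")

    /* Invented device: reading STATUS acknowledges/pops the event. */
    struct nic_regs {
        volatile uint32_t status;       /* read has a side effect */
        volatile uint32_t int_enable;
    };

    uint32_t nic_isr(struct nic_regs *dev)
    {
        spcb();                         /* no older speculation outstanding */
        uint32_t cause = dev->status;   /* now executed exactly once */
        /* ... dispatch on cause ... */
        return cause;
    }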

    This all means that there may be very little opportunity for speculative
    execution of their handlers, no matter how much hardware one tosses
    at them.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jan 2 19:45:36 2025
    On Thu, 2 Jan 2025 19:14:50 +0000, EricP wrote:

    MitchAlsup1 wrote:
    On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:
    On 12/25/24 1:30 PM, MitchAlsup1 wrote:

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    Yes, but multithreading could hide some of those latencies in
    terms of throughput.

    EricP is the master proponent of finishing the instructions in the
    execution window that are finishable. I, merely, have no problem
    in allowing the pipe to complete or take a flush based on the kind
    of pipeline being engineered.

    With 300-odd instructions in the window this thesis has merit,
    with a 5-stage pipeline 1-wide, it does not have merit but is
    not devoid of merit either.

    It is also possible that the speculation barriers I describe below
    will limit the benefits that pipelining exceptions and interrupts
    might be able to see.

    The issue is that both exception handlers and interrupt handlers usually
    read and write Privileged Control Registers (PCR) and/or MMIO device
    registers very early into the handler. Most MMIO device registers and
    cpu PCR cannot be speculatively read as that may cause a state
    transition. Of course, stores are never speculated and can only be
    initiated at commit/retire.

    This becomes a question of "who knows what when".

    At the point of interrupt recognition (it has been raised, and I am
    going to take that interrupt) the pipeline has instructions retiring
    from the execution window, and instructions being performed, and
    instructions waiting for "things to happen".

    After interrupt recognition, you are inserting instructions into the
    execution window--but these are not speculative--they are known to
    not be under any speculation--they WILL execute to completion--
    regardless of whether speculative instructions from before recognition
    are performed or flushed. This property is known until the ISR performs
    a predicted branch.

    So, it is possible to stream right onto an ISR--but few pipelines do.

    The normal memory coherence rules assume that loads are to memory-like
    locations that do not state transition on reads and that therefore
    memory loads can be harmlessly replayed if needed.
    While memory stores are not performed speculatively, an implementation
    might speculatively prefetch a cache line as soon as a store is queued
    and cause cache lines to ping-pong.

    But loads to many MMIO devices and PCRs effectively require a
    speculation barrier in front of them to prevent replays.

    My 66000 architecture specifies that accesses to MMI/O space are
    performed as if the core were performing memory references in a
    sequentially consistent manner; obviating the need for an SPCB
    instruction there.

    There is only 1 instruction used to read/write control registers. It
    reads the operand registers and the control register at the beginning
    of execution, but does not write the control register until
    retirement; obviating the need for an SPCB instruction there.

    Also note: core[i] can access core[j] control registers, but this access
    takes place in MMI/O space (and is sequentially consistent).

    A SPCB Speculation Barrier instruction could block speculation.
    It stalls execution until all older conditional branches are resolved and
    all older instructions that might throw an exception have determined
    they won't do so.

    The core could have an internal lookup table telling it which PCR can be
    read speculatively because there are no side effects to doing so.
    Those PCR would not require an SPCB to guard them.

    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.

    I am curious. Is "unCacheable and MMI/O space" insufficient to figure
    out "Hey, it's non-speculative" too ??

    This all means that there may be very little opportunity for speculative
    execution of their handlers, no matter how much hardware one tosses
    at them.

    Good point, often unseen or unstated.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Fri Jan 3 17:24:33 2025
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    MitchAlsup1 wrote:
    On Tue, 31 Dec 2024 2:02:05 +0000, Paul A. Clayton wrote:
    On 12/25/24 1:30 PM, MitchAlsup1 wrote:

    Sooner or later an ISR has to actually deal with the MMI/O
    control registers associated with the <ahem> interrupt.

    Yes, but multithreading could hide some of those latencies in
    terms of throughput.

    EricP is the master proponent of finishing the instructions in the
    execution window that are finishable. I, merely, have no problem
    in allowing the pipe to complete or take a flush based on the kind
    of pipeline being engineered.

    With 300-odd instructions in the window this thesis has merit,
    with a 5-stage pipeline 1-wide, it does not have merit but is
    not devoid of merit either.

    It is also possible that the speculation barriers I describe below
    will limit the benefits that pipelining exceptions and interrupts
    might be able to see.

    The issue is that both exception handlers and interrupt handlers usually
    read and write Privileged Control Registers (PCR) and/or MMIO device
    registers very early into the handler. Most MMIO device registers and
    cpu PCR cannot be speculatively read as that may cause a state
    transition. Of course, stores are never speculated and can only be
    initiated at commit/retire.

    The normal memory coherence rules assume that loads are to memory-like
    locations that do not state transition on reads and that therefore
    memory loads can be harmlessly replayed if needed.
    While memory stores are not performed speculatively, an implementation
    might speculatively prefetch a cache line as soon as a store is queued
    and cause cache lines to ping-pong.

    But loads to many MMIO devices and PCRs effectively require a
    speculation barrier in front of them to prevent replays.

    A SPCB Speculation Barrier instruction could block speculation.
    It stalls execution until all older conditional branches are resolved and
    all older instructions that might throw an exception have determined
    they won't do so.

    The core could have an internal lookup table telling it which PCR can be
    read speculatively because there are no side effects to doing so.
    Those PCR would not require an SPCB to guard them.

    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.

    MMIO accesses are, by definition, non-cacheable, which is typically
    designated in either a translation table entry or associated
    attribute registers (MTRR, MAIR). Non-cacheable accesses
    are not speculatively executed, which provides the
    correct semantics for device registers which have side effects
    on read accesses.

    Granted the granularity of that attribute is usually a translation unit
    (page) size.
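
    For example, on AArch64 the page-granular attribute comes from an
    index in the page-table entry into MAIR_EL1; a minimal sketch (the
    index assignments and helper names are this example's own, the
    attribute encodings are from the architecture):

    #include <stdint.h>

    /* MAIR_EL1 packs eight 8-bit attribute encodings; each stage-1 page
       descriptor selects one via its AttrIndx field (bits [4:2]). */
    #define ATTR_DEVICE_nGnRnE  0x00u   /* Device: non-gathering, non-reordering,
                                           no early write ack; not accessed
                                           speculatively */
    #define ATTR_NORMAL_WB      0xFFu   /* Normal memory, write-back, RW-allocate */

    #define MAIR_VALUE ((uint64_t)ATTR_DEVICE_nGnRnE << 0 | /* index 0: device */ \
                        (uint64_t)ATTR_NORMAL_WB     << 8)  /* index 1: normal */

    #define PTE_ATTRINDX(i)  ((uint64_t)(i) << 2)   /* AttrIndx, bits [4:2] */
    #define PTE_AF           (1ull << 10)           /* access flag */
    #define PTE_PAGE         0x3ull                 /* valid level-3 page descriptor */

    static inline void mair_init(void)
    {
        __asm__ __volatile__("msr mair_el1, %0\n\tisb" :: "r"(MAIR_VALUE));
    }

    /* Map one 4KiB page of device registers: attribute index 0 makes
       every access to the page non-cacheable by definition. */
    static inline uint64_t device_page_pte(uint64_t phys)
    {
        return phys | PTE_PAGE | PTE_AF | PTE_ATTRINDX(0);
    }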


    This all means that there may be very little opportunity for speculative
    execution of their handlers, no matter how much hardware one tosses
    at them.

    That's true. ARM goes to some lengths to ensure that the access
    to the system register (ICC_IARx_EL1) that contains the current
    pending interrupt number for a given hardware thread/core is
    synchronized appropriately.

    "To allow software to ensure appropriate observability of actions
    initiated by GIC register accesses, the PE and CPU interface logic
    must ensure that reads of this register are self-synchronising when
    interrupts are masked by the PE (that is when PSTATE.{I,F} == {0,0}).
    This ensures that the effect of activating an interrupt on the signaling
    of interrupt exceptions is observed when a read of this register is
    architecturally executed so that no spurious interrupt exception
    occurs if interrupts are unmasked by an instruction immediately
    following the read."
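
    As a concrete (simplified, non-authoritative) illustration of the
    sequence that rule is protecting, a GICv3 IRQ handler on AArch64
    typically looks roughly like this; newer assemblers accept the
    register names, older ones need the S3_0_C12_C12_x encodings:

    #include <stdint.h>

    /* Acknowledge the highest-priority pending Group 1 interrupt. With
       PSTATE.I masking IRQs, the quoted rule makes this read
       self-synchronising, so the activation is observed before we could
       unmask and take a spurious interrupt exception. */
    static inline uint32_t gic_ack(void)
    {
        uint64_t iar;
        __asm__ __volatile__("mrs %0, ICC_IAR1_EL1" : "=r"(iar));
        __asm__ __volatile__("isb");
        return (uint32_t)iar;             /* INTID in the low bits */
    }

    static inline void gic_eoi(uint32_t intid)
    {
        __asm__ __volatile__("msr ICC_EOIR1_EL1, %0" :: "r"((uint64_t)intid));
        __asm__ __volatile__("isb");
    }

    void irq_handler(void)
    {
        uint32_t intid = gic_ack();
        if (intid >= 1020 && intid <= 1023)
            return;                       /* special/spurious INTIDs */
        /* ... dispatch to the driver ISR for intid ... */
        gic_eoi(intid);
    }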

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Tue Jan 7 16:26:57 2025
    "Paul A. Clayton" <paaronclayton@gmail.com> writes:
    On 1/3/25 12:24 PM, Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [snip]
    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.

    MMIO accesses are, by definition, non-cacheable, which is typically
    designated in either a translation table entry or associated
    attribute registers (MTRR, MAIR). Non-cacheable accesses
    are not speculatively executed, which provides the
    correct semantics for device registers which have side effects
    on read accesses.

    It is not clear to me that Memory-Mapped I/O requires
    non-cacheable accesses. Some addresses within I/O device
    address areas do not have access side effects. I would **GUESS**
    that most I/O addresses do not have read side effects.

    Generally such controls (cacheable/noncacheable) are on
    a page granularity (although some modern instruction
    sets include non-temporal move variants that bypass
    the cache).
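
    For instance, the x86 SSE2 non-temporal store intrinsics provide
    that bypass at instruction rather than page granularity (sketch; an
    ordinary write-back destination is assumed):

    #include <emmintrin.h>   /* SSE2: _mm_stream_si32, _mm_sfence */

    /* Fill a buffer with stores that bypass the cache hierarchy, then
       fence so the weakly-ordered streaming stores are globally visible
       before anything that depends on them. */
    void fill_nt(int *dst, int value, int n)
    {
        for (int i = 0; i < n; i++)
            _mm_stream_si32(&dst[i], value);
        _mm_sfence();
    }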


    (One obvious exception would be implicit buffers where a read
    "pops" a value from a queue allowing the next value to be accessed
    at the same address. _Theoretically_ one could buffer such reads
    outside of the I/O device such that old values would not be lost
    and incorrect speculation could be rolled back — this might be a
    form of versioned memory. Along similar lines, values could be
    prefetched and cached as long as all modifiers of the values use
    cache coherency. There may well be other cases of read side
    effects.)

    In general writes require hidden buffering for speculation, but
    write side effects can affect later reads. One possibility would
    be a write that changes which buffer is accessed at a given
    address. Such a write followed by a read of such a buffer address
    must have the read presented after the write, so caching the read
    address would be problematic.
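
    A familiar instance of that pattern is an index/data register pair
    (in the style of the old VGA CRTC ports); the sketch below uses an
    invented layout:

    #include <stdint.h>

    /* A write to 'index' selects which internal register a subsequent
       access to 'data' reaches. */
    struct idx_regs {
        volatile uint8_t index;   /* write side effect: redirects 'data' */
        volatile uint8_t data;    /* accesses the currently selected register */
    };

    uint8_t read_internal(struct idx_regs *dev, uint8_t which)
    {
        dev->index = which;
        /* The read below must reach the device after the index write; a
           cached or speculatively hoisted read of 'data' could return a
           previously selected register. volatile keeps the compiler in
           order; uncached/device memory keeps the hardware in order. */
        return dev->data;
    }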

    One weak type of write side effect would be similar to releasing
    a lock, where with a weaker memory order one needs to ensure that
    previous writes are visible before the "lock is released". E.g.,
    one might update a command buffer on an I/O device with multiple
    writes and lastly update an I/O device pointer to indicate that
    the buffer was added to. The ordering required for this is weaker
    than sequential consistency.
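
    A sketch of that release-like pattern with an invented command ring
    and doorbell register; the C11 release fence stands in for whatever
    the platform's MMIO write barrier is (a Linux driver would use
    dma_wmb()/wmb()):

    #include <stdatomic.h>
    #include <stdint.h>

    struct cmd { uint64_t opcode, addr, len, flags; };

    struct ring {
        struct cmd         slot[256];  /* command buffer the device reads */
        volatile uint32_t *doorbell;   /* MMIO tail pointer on the device */
        uint32_t           tail;       /* producer's next free slot */
    };

    void submit(struct ring *r, const struct cmd *c)
    {
        r->slot[r->tail % 256] = *c;             /* plain buffer writes */

        /* All buffer writes must be visible before the single publishing
           doorbell write -- release ordering, weaker than demanding
           sequential consistency for the whole sequence. */
        atomic_thread_fence(memory_order_release);

        *r->doorbell = ++r->tail;
    }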

    If certain kinds of side effects are limited to a single device,
    then the ordering of accesses to different devices may allow
    greater flexibility in ordering. (This seems conceptually similar
    to cache coherence vs. consistency where "single I/O device"
    corresponds to single address. Cache coherence provides strict
    consistency for a single address.)

    I seem to recall that StrongARM exploited a distinction between
    "bufferable" and "cacheable" marked in PTEs to select the cache
    to which an access would be allocated. This presumably means
    that the two terms had different consistency/coherence
    constraints.

    I am very skeptical that an extremely complex system with best
    possible performance would be worthwhile. However, I suspect that
    some relaxation of ordering and cacheability would be practical
    and worthwhile.

    Agreed with the first sentence. Ordering rules are generally
    defined by the hardware (e.g. PCIe ordering), although various
    host chipsets allow partial relaxation in cases where the
    device supports the relaxed ordering bit in the TLP header.


    I do very much object to memory-mapped I/O as a concept
    requiring non-cacheability, even if existing software
    (and hardware) and development mindset make any relaxation
    impractical.

    The hardware cost of doing otherwise seems to be a barrier
    to any relaxation.


    Since x86 allowed a different kind of consistency for non-temporal
    stores, it may not be absurd for a new architecture to present
    a more complex interface, presumably with the option not to deal
    with that complexity. Of course, the most likely result would be
    hardware having to support the complexity with no actual benefit
    from use.

    My current work involves modeling devices (PCI, PCIe and onboard
    accelerators) in software. Most device status registers, for
    example, don't have side effects on read, but they are changed
    by hardware at any time, and caching them doesn't make sense.

    For PCIe devices that expose a memory BAR backed by regular DRAM
    the host can map the entire BAR as cacheable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Terje Mathisen@21:1/5 to Paul A. Clayton on Tue Jan 7 17:31:11 2025
    Paul A. Clayton wrote:
    On 1/3/25 12:24 PM, Scott Lurndal wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    [snip]
    For MMIO device registers I think having an explicit SPCB instruction
    might be better than putting a "no-speculate" flag on the PTE for the
    device register address as that flag would be difficult to propagate
    backwards from address translate to all the parts of the core that
    we might have to sync with.

    MMIO accesses are, by definition, non-cacheable, which is typically
    designated in either a translation table entry or associated
    attribute registers (MTRR, MAIR). Non-cacheable accesses
    are not speculatively executed, which provides the
    correct semantics for device registers which have side effects
    on read accesses.

    It is not clear to me that Memory-Mapped I/O requires
    non-cacheable accesses. Some addresses within I/O device
    address areas do not have access side effects. I would **GUESS**
    that most I/O addresses do not have read side effects.

    And you would be wrong, at least back in the "old days" when I wrote
    drivers for some such devices.

    The worst was probably the EGA text/graphics adapter which had a bunch
    of write-only ports, making it completely impossible to context switch.

    Even IBM realized this, so it was replaced shortly after (1-2 years?)
    by the VGA adapter, which did allow you to query the current status.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)