• PCIe MSI-X interrupts

    From MitchAlsup1@21:1/5 to All on Fri Jun 21 20:35:32 2024
    PCIe has an MSI-X interrupt 'capability' which consists of
    a number (n) of interrupt descriptors and an associated Pending
    Bit Array where each bit in PBA has a corresponding 128-bit
    descriptor. A descriptor contains a 64-bit address, a 32-bit
    message, and a 32-bit vector control word.

    There are 2-levels of enablement, one at the MSI-X configura-
    tion control register and one in each interrupt descriptor at
    vector control bit[31].

    As the device raises an interrupt, it sets a bit in PBA.

    When MSI-X is enabled and a bit in PBA is set (1) and the
    vector control bit[31] is enabled, the device sends a
    write of the message to the address in the descriptor,
    and clears the bit in PBA.
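
    For concreteness, a minimal C sketch of the structures just
    described, following the layout in the PCIe base specification;
    the struct and helper names below are illustrative, not taken
    from any particular driver:

    #include <stdint.h>

    /* One MSI-X table entry ("interrupt descriptor" above): a 64-bit
     * message address, 32-bit message data, and a 32-bit vector control
     * word.  The PCIe base spec defines the per-vector Mask bit as bit 0
     * of vector control. */
    struct msix_table_entry {
        uint32_t msg_addr_lo;   /* message address [31:0]  */
        uint32_t msg_addr_hi;   /* message address [63:32] */
        uint32_t msg_data;      /* 32-bit message          */
        uint32_t vector_ctrl;   /* vector control word     */
    };

    #define MSIX_VECTOR_CTRL_MASK  (1u << 0)   /* per-vector mask bit */

    /* The Pending Bit Array: one bit per table entry, packed into 64-bit
     * words.  A bit is set while delivery of that vector is blocked and
     * cleared once the message write is actually sent. */
    static inline int msix_pending(const uint64_t *pba, unsigned vec)
    {
        return (pba[vec / 64] >> (vec % 64)) & 1;
    }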

    I am assuming that the MSI-X enable bit is used to throttle
    a device so that it sends bursts of interrupts to optimize
    the caching behavior of the cores handling the interrupts:
    run applications->handle k interrupts->run applications.
    A home machine would not use this feature as the interrupt
    load is small, but a GB server might want more control over
    when interrupts arrive. But does anybody know ??

    a) device command to interrupt descriptor mapping {
    There is no mention of the mapping of commands to the device
    and to these interrupt descriptors. Can anyone supply input
    or pointers to this mapping?

    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, the device sends the stored interrupt message to the stored
    service port.
    }
    I don't really NEED to know this mapping, but knowing would
    significantly enhance my understanding of what is supposed
    to be going on, and thus avoid making crippling errors.

    b) address space of interrupt service port {
    The address in the interrupt descriptor points at a service
    port (APIC). Since a service port is "not like memory"*, I
    want to mandate this address be in MMI/O space, and since
    My 66000 has a full 64-bit address space for MMI/O there is
    no burden on the size of MMI/O space--it is already as big
    as possible on a 64-bit machine. Plus, MMI/O space has the
    property of being sequentially consistent whereas DRAM is
    only cache consistent.

    Most current architectures just partition a hunk of the
    physical address space as MMI/O address space.

    (*) memory has the property that a read will return the last
    bit pattern written, a service port does not.

    I assume that service port addresses map to different cores
    (or local APICs of a core). I want to directly support the
    notion of a virtual core so while a 'chip' might have a large
    number of physical cores, one would want a pool of thousands+
    of virtual cores. I want said service ports to support raising
    interrupts directly to a physical or virtual core.
    }

    Apparently, the message part of the MSI-X interrupt can be
    interpreted any way that both SW and HW agree. This works
    for already defined architectures, and doing it like one
    or more of them makes an OS port significantly easier.
    However, what these messages contain is difficult to find
    via Google.

    So, it seems to me, that the combination of the 64-bit address
    and the 32-bit message must provide::
    a) which level of the system to interrupt
    {Secure Monitor, HyperVisor, SuperVisor, Application}
    b) which core should handle the interrupt
    {physical[0..k], virtual[l..m]}
    c) what priority level is the interrupt.
    {There are 64 unique priority levels}
    d) something about why the interrupt was raised
    {what remains of the message}

    I suspect that (a) and (b) are parts of the address while (c)
    and (d) are part of the message. Although nothing prevents
    (c) from being part of the address.
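
    To make the field widths concrete, here is a purely hypothetical
    packing along the lines suspected above: (a) and (b) carried in the
    64-bit address, (c) and (d) in the 32-bit message. Nothing in the
    PCIe spec, nor in My 66000, mandates this layout; every shift and
    name below is invented for illustration only.

    #include <stdint.h>

    static inline uint64_t pack_service_port_address(uint64_t mmio_base,
                                                     unsigned level,   /* (a) 2 bits: SM/HV/SV/App   */
                                                     unsigned core_id) /* (b) physical or virtual id */
    {
        return mmio_base | ((uint64_t)(level & 3) << 18)
                         | ((uint64_t)core_id     << 4);
    }

    static inline uint32_t pack_message(unsigned priority,  /* (c) 6 bits, 64 levels */
                                        unsigned reason)    /* (d) whatever remains  */
    {
        return ((uint32_t)(priority & 0x3f) << 26) | (reason & 0x03ffffffu);
    }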

    Once MSI-X is sorted out MSI becomes a subset.

    HostBridge has a service port that provides INT[A,B,C,D] to
    MSI-X translation, so only MSI-X messages are used system-
    wide.

    ------------------------------------------------------------

    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    It seems to me, that the HyperVisor would want to perform ISR
    processing of the interrupt (low latency) and then schedule
    the softIRQs to the <sleeping> core so when it regains control
    the pending I/O stack of "stuff" is properly cleaned up.

    So, should all interrupts simply go to the HyperVisor and let HV
    sort it all out? Or can the <sleeping> virtual core just deal
    with it when it is given a next time slice ??

    Now, if there were a way to cascade interrupts such that if
    an interrupt was routed to a <sleeping virtual core> that
    some kind of "poke in the side" of a HyperVisor would cause
    HV to find a next time slice for the <sleeping> core ex post
    haste, and just let the core deal with the interrupt !!
    Presto, any privilege level can handle its own interrupts.
    }

    Comments ??

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Jun 21 22:00:56 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    PCIe has an MSI-X interrupt 'capability' which consists of
    a number (n) of interrupt descriptors and an associated Pending
    Bit Array where each bit in PBA has a corresponding 128-bit
    descriptor. A descriptor contains a 64-bit address, a 32-bit
    message, and a 32-bit vector control word.

    There are 2-levels of enablement, one at the MSI-X configura-
    tion control register and one in each interrupt descriptor at
    vector control bit[31].

    As the device raises an interrupt, it sets a bit in PBA.

    When MSI-X is enabled and a bit in PBA is set (1) and the
    vector control bit[31] is enabled, the device sends a
    write of the message to the address in the descriptor,
    and clears the bit in PBA.

    Note that if the interrupt condition is asserted after the
    global enable in the MSI-X capability and the vector enable
    have both been set to allow delivery, the message will be sent to
    the root complex and PBA will not be updated. (P is for
    pending, and once the message is sent, it's no longer
    pending). PBA is only updated when the interrupt is masked
    (either function-wide in the capability or per-vector).



    I am assuming that the MSI-X enable bit is used to throttle

    In my experience the MSI-X function enable and vector enables
    are not modified during runtime, rather the device has control
    registers which allow masking of the interrupt (e.g.
    for AHCI, the MSI message will only be sent if the port
    PxIE (Port n Interrupt Enable) bit corresponding to a
    PxIS (Port n Interrupt Status) bit is set).

    Granted, AHCI specifies MSI, not MSI-X, but every MSI-X
    device I've worked with operates the same way, with
    device specific interrupt enables for a particular vector.
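
    As a rough C sketch of that device-level gate (register names from
    the AHCI spec, the helper itself illustrative): a port only generates
    its interrupt message while a set status bit has its corresponding
    enable bit set.

    #include <stdint.h>

    /* Nonzero when the port should send its interrupt message: some PxIS
     * status bit is set whose PxIE enable bit is also set. */
    static inline int ahci_port_irq_pending(uint32_t pxis, uint32_t pxie)
    {
        return (pxis & pxie) != 0;
    }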

    a device so that it sends bursts of interrupts to optimize
    the caching behavior of the cores handling the interrupts:
    run applications->handle k interrupts->run applications.
    A home machine would not use this feature as the interrupt
    load is small, but a GB server might want more control over
    when interrupts arrive. But does anybody know ??

    Yes, we use MSI-X extensively. See above.

    There are a number of mechanisms used for interrupt moderation,
    but all generally are independent of the PCI message delivery.
    (e.g. RSS spreads interrupts across multiple target cores,
    or the Intel 10Ge network adapters interrupt moderation feature).


    a) device command to interrupt descriptor mapping {
    There is no mention of the mapping of commands to the device
    and to these interrupt descriptors. Can anyone supply input
    or pointers to this mapping?

    Once the message leaves the device, is received by the
    root complex port and is forwarded across the host bridge
    to the system fabric, it's completely under control of
    the host. On x86, the TLP for the upstream message is
    received and forwarded to the specified address (which is
    the IOAPIC on Intel and the GIC ITS on Arm64).

    The interrupt controller may further mask the interrupt if
    desired or if the interrupt priority is lower than the
    current running priority.


    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, the device sends the stored interrupt message to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI) controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    Note that support for MSI in AHCI is optional (in which case the
    legacy level sensitive PCI INTA/B/C/D signals are used).

    The AHCI standard specification (ahci_1_3.pdf) is available publicly.

    }
    I don't really NEED to know this mapping, but knowing would
    significantly enhance my understanding of what is supposed
    to be going on, and thus avoid making crippling errors.

    b) address space of interrupt service port {
    The address in the interrupt descriptor points at a service
    port (APIC). Since a service port is "not like memory"*, I
    want to mandate this address be in MMI/O space, and since
    My 66000 has a full 64-bit address space for MMI/O there is
    no burden on the size of MMI/O space--it is already as big
    as possible on a 64-bit machine. Plus, MMI/O space has the
    property of being sequentially consistent whereas DRAM is
    only cache consistent.

    From the standpoint of the PCIexpress root port, the upstream write
    generated by the device to send the MSI message to the host
    looks just like any other inbound DMA from the device to the
    host. It is the responsibility of the host bridge and interconnect to
    route the message to the appropriate destination (which generally
    is an interrupt controller, but just as legally could be a
    DRAM address which software polls periodically).


    Most current architectures just partition a hunk of the
    physical address space as MMI/O address space.

    The address field in the MSI-X vector (or MSI-X capability)
    is opaque to hardware below the PCIe root port.

    Our chips recognize the interrupt controller range of
    addresses in the inbound message at the host bridge
    and route the message to the interrupt translation service;
    the destinations in the interrupt controller are simply
    control and status registers in the MMIO space. The
    ARM64 interrupt controller supports multiple destinations
    with different semantics (SPI and xSPI have one target
    register and LPI has a different target register the address
    of which is programmed into the MSI-X Vector address field).



    (*) memory has the property that a read will return the last
    bit pattern written, a service port does not.

    I assume that service port addresses map to different cores
    (or local APICs of a core).

    The IOAPIC handles the message and has configuration registers
    that determine which lAPIC should be signalled.

    The GIC has configuration tables in memory that can remap
    the interrupt to a different vector (e.g. for a guest VM).


    I want to directly support the
    notion of a virtual core so while a 'chip' might have a large
    number of physical cores, one would want a pool of thousands+
    of virtual cores. I want said service ports to support raising
    interrupt directly to a physical or virtual core.

    Take a look at IHI0069 (https://developer.arm.com/documentation/ihi0069/latest/)

    }

    Apparently, the message part of the MSI-X interrupt can be
    interpreted any way that both SW and HW agree.

    Yes.

    This works
    for already defined architectures, and doing it like one
    or more others, makes an OS port significantly easier.
    However what these messages contain is difficult to find
    via Google.

    The message is a 32-bit field and it is fully interpreted by
    the interrupt controller (the GIC can be configured to support
    from 16 to 32 bits of data payload in an upstream MSI-X write;
    the interpretation of the data is host specific).

    On Intel and ARM systems, the firmware knows the grungy details
    and simply passes the desired payload value to the kernel
    via the device tree (Linux) or ACPI tables (for Windows/Linux).
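
    One concrete, publicly documented example of such an agreement is the
    legacy x86 MSI encoding from the Intel SDM, sketched below; it is
    shown only to illustrate the kind of information an address/data pair
    can carry. The GIC, as noted, instead treats the data as an opaque
    interrupt/event number.

    #include <stdint.h>

    /* Minimal x86 MSI address/data encoding (fixed destination mode,
     * fixed delivery mode, edge triggered).  Per the Intel SDM, address
     * bits 31:20 are 0xFEE and bits 19:12 carry the target local APIC
     * ID; the low byte of the data is the vector number. */
    static inline uint32_t x86_msi_address(uint8_t dest_apic_id)
    {
        return 0xFEE00000u | ((uint32_t)dest_apic_id << 12);
    }

    static inline uint32_t x86_msi_data(uint8_t vector)
    {
        return vector;
    }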

    So, it seems to me, that the combination of the 64-bit address
    and the 32-bit message must provide::
    a) which level of the system to interrupt
    {Secure Monitor, HyperVisor, SuperVisor, Application}

    No. That's completely a function of the interrupt controller
    and how the hardware handles the data payload.

    b) which core should handle the interrupt
    {physical[0..k], virtual[l..m]}

    Again, a function of the interrupt controller.

    c) what priority level is the interrupt.
    {There are 64 unique priority levels}

    Yep, a function of the interrupt controller.

    d) something about why the interrupt was raised

    The interrupt itself causes the operating system
    device driver interrupt function to be invoked. The
    device-specific interrupt handler determines both
    why the interrupt was raised (e.g. via the PxIS
    register in the AHCI/SATA controller) and takes
    the appropriate action.

    On ARM64, it is common for the data field for
    the MSI-X interrupts to number starting at zero
    on every device, and they're mapped to a system-wide
    unique value by the interrupt controller (e.g.
    the GICv4 ITS). If interrupt remapping hardware is
    not available then unique data payloads for each
    device need to be used.

    Note that like any other inbound DMA, the address
    in the MSI-X TLP that gets sent to the host bridge is subject
    to translation by an IOMMU before getting to the
    interrupt controller (or by the device itself if it
    supports PCI-e Address Translation Services (ATS)).



    {what remains of the message}

    I suspect that (a) and (b) are parts of the address while (c)
    and (d) are part of the message. Although nothing prevents
    (c) from being part of the address.

    Once MSI-X is sorted out MSI becomes a subset.

    HostBridge has a service port that provides INT[A,B,C,D] to
    MSI-X translation, so only MSI-X messages are used system-
    wide.

    Note that INTA/B/C/D are level-sensitive. This requires
    TWO MSI-X vectors - one that targets an "interrupt set"
    register and the other targets an "interrupt clear"
    register.


    ------------------------------------------------------------

    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    On Intel the IOMMU translation tables are not shared with the
    AP.

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    This allows each device capable of inbound DMA to identify
    itself uniquely to the interrupt controller and IOMMU.

    Both intel and AMD use this convention.
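
    A one-line helper makes that layout concrete (the function name is
    illustrative):

    #include <stdint.h>

    /* function in <2:0>, device in <7:3>, bus in <15:8>, segment above */
    static inline uint32_t make_stream_id(uint32_t segment, uint32_t bus,
                                          uint32_t dev, uint32_t fn)
    {
        return (segment << 16) | (bus << 8) | (dev << 3) | fn;
    }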


    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    No, you're not allowed to do anything not explicitly allowed
    in the PCI express specification. Remember, an MSI-X write
    generated by the device is indistinguishable from any other
    upstream DMA request initiated by the device.


    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    See again the document referenced above. The interrupt controller
    is aware that the guest is not currently scheduled and maintains
    a virtual pending state (and can optionally signal the hypervisor
    that the guest should be scheduled ASAP).

    Most of this is done completely by the hardware, without any
    intervention by the hypervisor for the vast majority of
    interrupts.



    It seems to me, that the HyperVisor would want to perform ISR
    processing of the interrupt (low latency) and then schedule
    the softIRQs to the <sleeping> core so when it regains control
    the pending I/O stack of "stuff" is properly cleaned up.

    So, should all interrupts simply go to the HyperVisor and let HV
    sort it all out? Or can the <sleeping> virtual core just deal
    with it when it is given a next time slice ??

    The original GIC did something like this (the HV took all
    interrupts and there was a hardware mechanism to inject them
    into a guest as if they were a hardware interrupt). But
    it was too much overhead going through the hypervisor, especially
    when the endpoint device supports the SRIOV capability. So the
    GIC supports handling virtual interrupt delivery completely
    in hardware unless the guest is not currently resident on any
    virtual CPU.

  • From Scott Lurndal@21:1/5 to Scott Lurndal on Fri Jun 21 22:28:19 2024
    scott@slp53.sl.home (Scott Lurndal) writes:
    mitchalsup@aol.com (MitchAlsup1) writes:


    When MSI-X is enabled and a bit in PBA is set (1) and the
    vector control bit[31] is enabled, the device sends a
    write of the message to the address in the descriptor,
    and clears the bit in PBA.

    Note that if the interrupt condition is asserted after the
    global enable in the MSI-X capability and the vector enable
    have both been set to allow delivery, the message will be sent to
    the root complex and PBA will not be updated. (P is for
    pending, and once the message is sent, it's no longer
    pending). PBA is only updated when the interrupt is masked
    (either function-wide in the capability or per-vector).

    These are the gates to interrupt delivery on a typical
    ARM-based system, from closest to the device to furthest.

    1) The device interrupt enable register (e.g. AHCI P0IE)
    2) The MSI-X Vector enable (in each vector control register)
    3) The MSI-X PCI-Express Capability enable (MSI-X enable and
    function mask in the MSI-X Capability message control field)
    4) The PCI configuration space COMMAND register [BME]
    (bus master enable) bit must be set

    These first four steps are handled by the PCI endpoint hardware
    before posting the upstream write TLP. If conditions (2), (3)
    or (4) do not hold, then the PBA bit will be set and
    the message will be sent when the conditions allow. If (1)
    does not hold, then the device status register will hold the
    state until the interrupt is unmasked in the device.

    5) The interrupt controller per-interrupt enable bit(s)
    (for GIC: the SPI, eSPI enable registers, indexed by interrupt
    number, or the LPI properties byte enable bit, indexed
    into a DRAM table by LPI number (range 8192 - 2^24)). SPIs
    are generally used for level-sensitive or latency-sensitive
    interrupts and are implemented as wires.
    6) The interrupt group enable. Interrupts are grouped by delivery
    mechanism (there are two CPU interrupt signals, IRQ and FIQ)
    and security state.
    7) The Target processor enable (in interrupt controller)
    8) The interrupt priority is greater than any currently
    being processed.
    9) The processor PSR interrupt mask.
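
    Condensed into a sketch, the nine gates amount to one long AND; each
    field below stands in for the register or table lookup listed above
    and is not a real API.

    #include <stdbool.h>

    struct irq_gates {
        bool dev_enable;     /* 1: device-level enable, e.g. AHCI PxIE        */
        bool vec_enable;     /* 2: MSI-X per-vector enable                    */
        bool fn_enable;      /* 3: MSI-X capability enable, function unmasked */
        bool bus_master;     /* 4: PCI COMMAND register BME bit               */
        bool intc_enable;    /* 5: per-interrupt enable in the controller     */
        bool group_enable;   /* 6: interrupt group enable (IRQ/FIQ, security) */
        bool target_enable;  /* 7: target processor enable                    */
        bool prio_ok;        /* 8: priority beats the current running priority*/
        bool cpu_unmasked;   /* 9: PSR/PSTATE interrupt mask bits clear       */
    };

    static bool interrupt_delivered(const struct irq_gates *g)
    {
        return g->dev_enable && g->vec_enable && g->fn_enable &&
               g->bus_master && g->intc_enable && g->group_enable &&
               g->target_enable && g->prio_ok && g->cpu_unmasked;
    }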

  • From MitchAlsup1@21:1/5 to All on Sat Jun 22 01:51:27 2024
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    At some level in architecture, you have to look both ways
    and amalgamate the meanings, such that the common meaning
    is useful looking in either direction.

    Not a miff--just a statement of what architecture is.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jun 22 01:12:50 2024
    Scott Lurndal wrote:

    First of all, allow me to express my gratitude for such a well
    thought out response, compared to the miscellaneous ramblings
    going on in my head.

    mitchalsup@aol.com (MitchAlsup1) writes:
    PCIe has an MSI-X interrupt 'capability' which consists of
    a number (n) of interrupt descriptors and an associated Pending
    Bit Array where each bit in PBA has a corresponding 128-bit
    descriptor. A descriptor contains a 64-bit address, a 32-bit
    message, and a 32-bit vector control word.

    There are 2-levels of enablement, one at the MSI-X configura-
    tion control register and one in each interrupt descriptor at
    vector control bit[31].

    As the device raises an interrupt, it sets a bit in PBA.

    When MSI-X is enabled and a bit in PBA is set (1) and the
    vector control bit[31] is enabled, the device sends a
    write of the message to the address in the descriptor,
    and clears the bit in PBA.

    Note that if the interrupt condition is asserted after the
    global enable in the MSI-X capability and the vector enable
    have both been set to allow delivery, the message will be sent to
    the root complex and PBA will not be updated. (P is for
    pending, and once the message is sent, it's no longer
    pending). PBA is only updated when the interrupt is masked
    (either function-wide in the capability or per-vector).

    So, the interrupt only becomes pending in the PBA if it cannot be
    sent immediately. Thanks for the clarification.


    I am assuming that the MSI-X enable bit is used to throttle

    In my experience the MSI-X function enable and vector enables
    are not modified during runtime, rather the device has control
    registers which allow masking of the interrupt (e.g.
    for AHCI, the MSI message will only be sent if the port
    PxIE (Port n Interrupt Enable) bit corresponding to a
    PxIS (Port n Interrupt Status) bit is set).

    So, these degenerated into more masking levels that are not
    used very often because other masks can be applied elsewhere.

    Granted, AHCI specifies MSI, not MSI-X, but every MSI-X
    device I've worked with operates the same way, with
    device specific interrupt enables for a particular vector.

    a device so that it sends bursts of interrupts to optimize
    the caching behavior of the cores handling the interrupts:
    run applications->handle k interrupts->run applications.
    A home machine would not use this feature as the interrupt
    load is small, but a GB server might want more control over
    when interrupts arrive. But does anybody know ??

    Yes, we use MSI-X extensively. See above.

    There are a number of mechanisms used for interrupt moderation,
    but all generally are independent of the PCI message delivery.
    (e.g. RSS spreads interrupts across multiple target cores,
    or the Intel 10Ge network adapters interrupt moderation feature).


    a) device command to interrupt descriptor mapping {
    There is no mention of the mapping of commands to the device
    and to these interrupt descriptors. Can anyone supply input
    or pointers to this mapping?

    Once the message leaves the device, is received by the
    root complex port and is forwarded across the host bridge
    to the system fabric, it's completely under control of
    the host. On x86, the TLP for the upstream message is
    received and forwarded to the specified address (which is
    the IOAPIC on Intel and the GIC ITS on Arm64).

    The interrupt controller may further mask the interrupt if
    desired or if the interrupt priority is lower than the
    current running priority.

    {note to self:: that is why it's a local APIC--it has to be close
    enough to see the core's priority.}

    Question:: Down below you talk of the various interrupt control-
    lers routing an interrupt <finally> to a core. What happens if the
    core has changed its priority by the time the interrupt signal
    arrives, but before it can change the state of the tables in the
    interrupt controller that routed said interrupt here ?


    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, the device sends the stored interrupt message to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI)
    controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    I see (below) that you (they) migrated all the stuff I thought might
    be either in the address or data to the "other side" of HostBridge.
    Fair enough.

    For what reason are there multiple addresses, instead of a range
    of addresses providing a more globally-scoped service port?
    Perhaps it is an address at the interrupt descriptor, and an
    address range at the global interrupt controller. Where different
    addresses then mean different things.

    Note that support for MSI in AHCI is optional (in which case the
    legacy level sensitive PCI INTA/B/C/D signals are used).

    The AHCI standard specification (ahci_1_3.pdf) is available publicly.

    }
    I don't really NEED to know this mapping, but knowing would
    significantly enhance my understanding of what is supposed
    to be going on, and thus avoid making crippling errors.

    b) address space of interrupt service port {
    The address in the interrupt descriptor points at a service
    port (APIC). Since a service port is "not like memory"*, I
    want to mandate this address be in MMI/O space, and since
    My 66000 has a full 64-bit address space for MMI/O there is
    no burden on the size of MMI/O space--it is already as big
    as possible on a 64-bit machine. Plus, MMI/O space has the
    property of being sequentially consistent whereas DRAM is
    only cache consistent.

    From the standpoint of the PCIexpress root port, the upstream write
    generated by the device to send the MSI message to the host
    looks just like any other inbound DMA from the device to the
    host. It is the responsibility of the host bridge and interconnect to
    route the message to the appropriate destination (which generally
    is an interrupt controller, but just as legally could be a
    DRAM address which software polls periodically).

    So the message arriving at the top of the PCIe tree is raw; then
    the address gets translated by the I/O MMU, and both the translated
    address and the raw data are passed forward to their fate.


    Most current architectures just partition a hunk of the
    physical address space as MMI/O address space.

    The address field in the MSI-X vector (or MSI-X capability)
    is opaque to hardware below the PCIe root port.

    Our chips recognize the interrupt controller range of
    addresses in the inbound message at the host bridge
    and route the message to the interrupt translation service;
    the destinations in the interrupt controller are simply
    control and status registers in the MMIO space. The
    ARM64 interrupt controller supports multiple destinations
    with different semantics (SPI and xSPI have one target
    register and LPI has a different target register the address
    of which is programmed into the MSI-X Vector address field).

    What I am trying to do is to figure out a means to route the
    message to a virtual core's interrupt table such that:: if that
    virtual core happens to be running on any physical core, that
    the physical core sees the interrupt without delay, and if
    the virtual core is not running, the event is properly logged
    so when the virtual core runs on a physical core that those
    ISRs are performed before any lower priority work is performed.

    {and make this work for any number of physical cores and any
    number of virtual cores; where cores can share interrupt
    tables. For example, Guest OS[k] thinks that it has 13 cores
    and shares its interrupt table across 5 of them, but HyperVisor
    remains free to time slice Guest OS[k] cores any way it likes.}
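
    A purely hypothetical sketch of that routing goal, assuming a
    per-virtual-core structure that records where (if anywhere) the
    virtual core is currently resident and logs interrupts by priority
    when it is not; none of these types exist in My 66000 or in any
    shipping design.

    #include <stdbool.h>
    #include <stdint.h>

    struct vcore {
        int      phys_core;    /* physical core it is running on, or -1    */
        uint64_t pending[64];  /* logged interrupts, one word per priority */
    };

    /* Returns true when a physical core should be poked immediately;
     * otherwise the interrupt is only logged for the next time slice. */
    static bool raise_to_vcore(struct vcore *vc, unsigned prio, unsigned src)
    {
        vc->pending[prio & 63] |= 1ull << (src & 63);
        return vc->phys_core >= 0;
    }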


    (*) memory has the property that a read will return the last
    bit pattern written, a service port does not.

    I assume that service port addresses map to different cores
    (or local APICs of a core).

    The IOAPIC handles the message and has configuration registers
    that determine which lAPIC should be signalled.

    The GIC has configuration tables in memory that can remap
    the interrupt to a different vector (e.g. for a guest VM).

    GIC = Global Interrupt Controller ?

    I want to directly support the
    notion of a virtual core so while a 'chip' might have a large
    number of physical cores, one would want a pool of thousands+
    of virtual cores. I want said service ports to support raising
    interrupt directly to a physical or virtual core.

    Take a look at IHI0069 (https://developer.arm.com/documentation/ihi0069/latest/)

    }

    Apparently, the message part of the MSI-X interrupt can be
    interpreted any way that both SW and HW agree.

    Yes.

    This works
    for already defined architectures, and doing it like one
    or more others, makes an OS port significantly easier.
    However what these messages contain is difficult to find
    via Google.

    The message is a 32-bit field and it is fully interpreted by
    the interrupt controller (The GIC can be configured to support
    from 16 to 32-bits data payload in an upstream MSI-X write;
    the interpretation of the data is host specific).

    On intel and ARM systems, the firmware knows the grungy details
    and simply passes the desired payload value to the kernel
    via the device tree(linux) or ACPI tables (for windows/linux).

    So, it seems to me, that the combination of the 64-bit address
    and the 32-bit message must provide::
    a) which level of the system to interrupt
    {Secure Monitor, HyperVisor, SuperVisor, Application}

    No. That's completely a function of the interrupt controller
    and how the hardware handles the data payload.

    b) which core should handle the interrupt
    {physical[0..k], virtual[l..m]}

    Again, a function of the interrupt controller.

    c) what priority level is the interrupt.
    {There are 64 unique priority levels}

    Yep, a function of the interrupt controller.

    d) something about why the interrupt was raised

    The interrupt itself causes the operating system
    device driver interrupt function to be invoked. The
    device-specific interrupt handler determines both
    why the interrupt was raised (e.g. via the PxIS
    register in the AHCI/SATA controller) and takes
    the appropriate action.

    On ARM64, it is common for the data field for
    the MSI-X interrupts to number starting at zero
    on every device, and they're mapped to a system-wide
    unique value by the interrupt controller (e.g.
    the GICv4 ITS).

    I was expecting that.

    If interrupt remapping hardware is
    not available then unique data payloads for each
    device need to be used.

    Note that like any other inbound DMA, the address
    in the MSI-X TLP that gets sent to the host bridge is subject
    to translation by an IOMMU before getting to the
    interrupt controller (or by the device itself if it
    supports PCI-e Address Translation Services (ATS)).

    Obviously.

    {what remains of the message}

    I suspect that (a) and (b) are parts of the address while (c)
    and (d) are part of the message. Although nothing prevents
    (c) from being part of the address.

    Once MSI-X is sorted out MSI becomes a subset.

    HostBridge has a service port that provides INT[A,B,C,D] to
    MSI-X translation, so only MSI-X messages are used system-
    wide.

    Note that INTA/B/C/D are level-sensitive. This requires
    TWO MSI-X vectors - one that targets an "interrupt set"
    register and the other targets an "interrupt clear"
    register.

    Gotcha.


    ------------------------------------------------------------

    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    On Intel the IOMMU translation tables are not shared with the
    AP.

    I have seen in the past 3 days AP being used to point at a
    random device out on the PCIe tree and at the unprivileged
    application layer. Both ends of the spectrum. Which is your
    usage ?

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    But (now with the new CXL) instead of allocating 200+ pins
    to DRAM those pins can be allocated to PCIe links; making any
    chip much less dependent on which DRAM technology, which chip-
    to-chip repeaters,... So, the thought is all I/O is PCIe + CXL;
    and about the only other pins the chip gets are RESET and ClockIn.

    Bunches of these pins can be 'configured' into standard width
    PCIe links (at least until one runs out of pins.)

    Given that one has a PCIe root complex with around 256-pins
    available, does one need multiple roots of such a wide tree ?

    This allows each device capable of inbound DMA to identify
    themselves uniquely to the interrupt controller and IOMMU.

    Both intel and AMD use this convention.


    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    No, you're not allowed to do anything not explicitly allowed
    in the PCI express specification. Remember, an MSI-X write
    generated by the device is indistinguishable from any other
    upstream DMA request initiated by the device.

    Why did the PCI committee specify a 32-bit container and define the
    use of only 1 bit ?? Or are more bits defined but I just haven't
    run into any literature concerning those ?


    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    See again the document referenced above. The interrupt controller
    is aware that the guest is not currently scheduled and maintains
    a virtual pending state (and can optionally signal the hypervisor
    that the guest should be scheduled ASAP).

    Are you using the word 'signal' as LINUX signal delivery, or as
    a proxy for interrupt of some form, or perhaps as an SVC to HV
    of some form ?

    Most of this is done completely by the hardware, without any
    intervention by the hypervisor for the vast majority of
    interrupts.

    That is the goal.


    It seems to me, that the HyperVisor would want to perform ISR
    processing of the interrupt (low latency) and then schedule
    the softIRQs to the <sleeping> core so when it regains control
    the pending I/O stack of "stuff" is properly cleaned up.

    So, should all interrupts simply go to the HyperVisor and let HV
    sort it all out? Or can the <sleeping> virtual core just deal
    with it when it is given a next time slice ??

    The original GIC did something like this (the HV took all
    interrupts and there was a hardware mechanism to inject them
    into a guest as if they were a hardware interrupt). But
    it was too much overhead going through the hypervisor, especially
    when the endpoint device supports the SRIOV capability. So the
    GIC supports handling virtual interrupt delivery completely
    in hardware unless the guest is not currently resident on any
    virtual CPU.

    Leave HV out of the loop unless something drastic happens.
    I/O completion and I/O aborts are not that drastic.

    Once again, I thank you greatly for your long and informative
    post.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 14:41:28 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    Said routing does not necessarily mean that the bits in
    the addresses are interpreted in any specific way, if the
    routing is done by a set of programmed range registers
    (i.e. range X to X+Y is routed to destination Z, e.g. the
    interrupt controller).

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 14:39:32 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:


    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    On Intel the IOMMU translation tables are not shared with the
    AP.

    I have seen in the past 3 days AP being used to point at a
    random device out on the PCIe tree and at the unprivileged
    application layer. Both ends of the spectrum. Which is your
    usage ?

    Sorry, hit send accidentally on the prior response.

    AP in our context is 'application processor', i.e. ARMv8 core.


    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    Each root complex needs to be a unique segment. A single
    SRIOV endpoint can consume the entire 8-bit bus space and
    the 8-bit dev/function space. In this context, a root complex
    can be considered a PCI express controller with one or more
    root ports. Each root port should be considered a unique
    'segment'.

    This is for device discovery, which uses the PCI express
    "Extended Configuration Access Method" (aka ECAM) to scan
    the PCI configuration spaces of all PCI ports.
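
    For reference, ECAM places each function's 4 KiB configuration space
    at a fixed offset from the segment's ECAM base (the base itself comes
    from the ACPI MCFG table or the device tree); a sketch:

    #include <stdint.h>

    static inline uint64_t ecam_cfg_addr(uint64_t ecam_base, unsigned bus,
                                         unsigned dev, unsigned fn,
                                         unsigned reg)
    {
        return ecam_base + ((uint64_t)bus << 20)   /* 256 buses              */
                         + ((uint64_t)dev << 15)   /* 32 devices per bus     */
                         + ((uint64_t)fn  << 12)   /* 8 functions per device */
                         + reg;                    /* config register offset */
    }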



    But (now with the new CXL) instead of allocating 200+ pins
    to DRAM those pins can be allocated to PCIe links; making any
    chip much less dependent on which DRAM technology, which chip-
    to-chip repeaters,... So, the thought is all I/O is PCIe + CXL;
    and about the only other pins chip gets are RESET and ClockIn.

    Note that bridging to PCI signalling will increase latency
    somewhat, even with PCIe gen 6.


    Bunches of these pins can be 'configured' into standard width
    PCIe links (at least until one runs out of pins.)

    Given that one has a PCIe root complex with around 256-pins
    available, does one need multiple roots of such a wide tree ?

    You basically need a root per device to accommodate SRIOV
    devices (like enterprise grade network adapters, high-end
    NVMe devices, etc).


    This allows each device capable of inbound DMA to identify
    themselves uniquely to the interrupt controller and IOMMU.

    Both intel and AMD use this convention.


    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    No, you're not allowed to do anything not explicitly allowed
    in the PCI express specification. Remember, an MSI-X write
    generated by the device is indistinguishable from any other
    upstream DMA request initiated by the device.

    Why did the PCI committee specify a 32-bit container and define the
    use of only 1 bit ?? Or are more bits defined but I just haven't
    run into any literature concerning those ?

    At the time that MSI and MSI-X were added to the PCI Local Bus
    specification (-before- PCI Express), the devices already had
    local masking - the MSI-X enable bit in the capability is used
    to switch between using legacy INTA/B/C/D and MSI-X so that a
    PCI card could work on systems that didn't support MSI-X.

    The function mask bit in the capability masks the entire function
    (all vectors). I've not seen that used in the real world, myself.

    The vector mask bits mask each individual vector.
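
    For reference, those bits live in the MSI-X capability's Message
    Control word; per the PCIe base spec they sit as follows (the
    constant names are illustrative):

    #define MSIX_MSG_CTRL_ENABLE         (1u << 15)  /* MSI-X enable (vs. legacy INTx) */
    #define MSIX_MSG_CTRL_FUNCTION_MASK  (1u << 14)  /* mask every vector at once      */
    #define MSIX_MSG_CTRL_TABLE_SIZE(mc) (((mc) & 0x7ffu) + 1)  /* number of entries   */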



    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    See again the document referenced above. The interrupt controller
    is aware that the guest is not currently scheduled and maintains
    a virtual pending state (and can optionally signal the hypervisor
    that the guest should be scheduled ASAP).

    Are you using the word 'signal' as LINUX signal delivery, or as
    a proxy for interrupt of some form, or perhaps as an SVC to HV
    of some form ?

    In the case of the ARM GIC, there is a defined processor private
    interrupt that is used to signal the hypervisor - this is what is
    used to 'signal' (not in the unix sense) the condition to the
    hypervisor. PPIs are also used for timer interrupts, statistical
    profiling interrupts, and a few others.

  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 14:28:58 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:


    Once the message leaves the device, is received by the
    root complex port and is forwarded across the host bridge
    to the system fabric, it's completely under control of
    the host. On x86, the TLP for the upstream message is
    received and forwarded to the specified address (which is
    the IOAPIC on Intel and the GIC ITS on Arm64).

    The interrupt controller may further mask the interrupt if
    desired or if the interrupt priority is lower than the
    current running priority.

    {note to self:: that is why it's a local APIC--it has to be close
    enough to see the core's priority.}

    Question:: Down below you talk of the various interrupt control-
    lers routing an interrupt <finally> to a core. What happens if the
    core has changed its priority by the time the interrupt signal
    arrives, but before it can change the state of the tables in the
    interrupt controller that routed said interrupt here ?

    Speaking for the ARM64 systems that I'm most recently
    familiar with, the concept of priority is associated with
    an interrupt (up to 8-bits worth of priority - an implementation
    of the GIC is allowed to support as few as three bits).

    The interrupt controller is distributed logic; there is a
    component called the 'distributor' and another component called
    the 'redistributor'. The former is global to the system and
    the latter is a per-CPU component. The distributor also contains
    a subsystem called the interrupt translation subsystem (ITS) which
    supports interrupt virtualization.

    The redistributor, being part of the core, handles the delivery
    of an interrupt to the core (specifically asserting either the FIQ
    or IRQ signals that cause entry to the IRQ or FIQ exception
    handlers). The redistributor tracks the current running priority
    (which is directly associated with the priority of the current
    active interrupt; when not processing an interrupt, the current
    running priority is called the IDLE priority and doesn't block
    delivery of any interrupts). The redistributor communicates changes to
    the RPR to the distributor, which will hold any interrupt that
    is not eligible for delivery (for any reason, including lack
    of priority). There is no way for software to change the
    RPR - it only tracks the priority of the currently executing
    interrupt.

    +-------------------------+
    |       PCI Device        |
    +-------------------------+
                |   MSI-X message (address: GITS_TRANSLATER control register)
                |                 (payload: Interrupt number (0 to N))
                v                 (sideband: streamid)
    +-------------------------+
    | Interrupt Translation   |  (DRAM tables: Device, Collection)
    | Service (ITS)           |  Lookup streamid in device table.
    |                         |  DT refers to Interrupt Translation Table.
    |                         |  Translate inbound payload based on ITT to an LPI.
    |                         |  Collection table identifies target core.
    +-------------------------+
                |   Internal message from ITS to redistributor for target
                v
    +-------------------------+
    | Redistributor           |  (DRAM table: LPI properties)
    |                         |  Lookup LPI properties, contains priority and enable bit.
    |                         |  If not enabled or priority too low,
    |                         |  store in LPI pending table (also DRAM) [*]
    |                         |  If enabled, unmasked at the CPU interface,
    |                         |  and priority higher than RPR, assert FIQ
    |                         |  or IRQ signals to core.
    +-------------------------+
                |   IRQ/FIQ signals
                v
    +-------------------------+
    | Core                    |  Check PSTATE IRQ and FIQ mask bits.
    |                         |  IRQ/FIQ can be routed to EL1, EL2 or EL3
    |                         |  by per-core control bits. Update
    |                         |  core state appropriately and enter ISR.
    +-------------------------+

    [*]  as core RPR and signal masks change, the ITS re-evaluates pending
    [**] LPI properties and pending bits are generally cached in the
         redistributor for performance.
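
    A minimal sketch of the redistributor's decision in the flow above,
    assuming the GIC convention that a numerically lower priority value
    is more urgent; the structure and field names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    struct lpi_props { bool enabled; uint8_t priority; };  /* DRAM properties table */

    static bool redistributor_deliver(const struct lpi_props *p,
                                      uint8_t running_priority, /* RPR; IDLE blocks nothing */
                                      bool cpu_if_unmasked)
    {
        if (!p->enabled || !cpu_if_unmasked)
            return false;                       /* stays in the LPI pending table    */
        return p->priority < running_priority;  /* lower value means higher priority */
    }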



    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, the device sends the stored interrupt message to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI)
    controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    I see (below) that you (they) migrated all the stuff I thought might
    be either in the address or data to the "other side" of HostBridge.
    Fair enough.

    For what reason are there multiple addresses?

    A system may have multiple interrupt controllers. In the
    case of the ARM64 systems, there may be a case where some
    interrupts should be considered level sensitive, in which
    case they must use SPI type interrupts which have a different
    target register for the MSI-X address field when compared with
    LPI type interrupts.

    Recall that the PCI spec must accommodate a wide range of system
    implementations (including Z-series).



    From the standpoint of the PCIexpress root port, the upstream write
    generated by the device to send the MSI message to the host
    looks just like any other inbound DMA from the device to the
    host. It is the responsibility of the host bridge and interconnect to
    route the message to the appropriate destination (which generally
    is an interrupt controller, but just as legally could be a
    DRAM address which software polls periodically).

    So the message arriving at the top of the PCIe tree is raw; then
    the address gets translated by the I/O MMU, and both the translated
    address and the raw data are passed forward to their fate.

    Basically, yes. The 'root complex port' is the interface
    between the host bridge and the endpoint device. A system
    may have a configuration option where the upstream message
    from the root complex can bypass the IOMMU as well (e.g.
    for firmware controlled devices - think SMM).



    What I am trying to do is to figure out a means to route the
    message to a virtual core's interrupt table such that:: if that
    virtual core happens to be running on any physical core, that
    the physical core sees the interrupt without delay, and if
    the virtual core is not running, the event is properly logged
    so when the virtual core runs on a physical core that those
    ISRs are performed before any lower priority work is performed.

    That's exactly what the redistributor does in the ARM GIC.

    It's probably worth reading that document - it would take
    a considerable amount of typing for me to summarize the
    GICv4.x virtualization features :-).


    {and make this work for any number of physical cores and any
    number of virtual cores; where cores can share interrupt
    tables. For example, Guest OS[k] thinks that it has 13 cores
    and shares its interrupt table across 5 of them, but HyperVisor
    remains free to time slice Guest OS[k] cores any way it likes.}

    The arm gic supports all this.



    (*) memory has the property that a read will return the last
    bit pattern written, a service port does not.

    I assume that service port addresses map to different cores
    (or local APICs of a core).

    The IOAPIC handles the message and has configuration registers
    that determine which lAPIC should be signalled.

    The GIC has configuration tables in memory that can remap
    the interrupt to a different vector (e.g. for a guest VM).

    GIC = Global Interrupt Controller ?

    Generic, I believe.


    It seems to me that the interrupt address needs translation
    via I/O MMU, but which of the 4 levels provides the trans-
    lation Root pointers ??

    On Intel the IOMMU translation tables are not shared with the
    AP.

    I have seen in the past 3 days AP being used to point at a
    random device out on the PCIe tree and at the unprivileged
    application layer. Both ends of the spectrum. Which is your
    usage ?

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    But (now with the new CXL) instead of allocating 200+ pins
    to DRAM those pins can be allocated to PCIe links; making any
    chip much less dependent on which DRAM technology, which chip-
    to-chip repeaters,... So, the thought is all I/O is PCIe + CXL;
    and about the only other pins chip gets are RESET and ClockIn.

    Bunches of these pins can be 'configured' into standard width
    PCIe links (at least until one runs out of pins.)

    Given that one has a PCIe root complex with around 256-pins
    available, does one need multiple roots of such a wide tree ?

    This allows each device capable of inbound DMA to identify
    themselves uniquely to the interrupt controller and IOMMU.

    Both intel and AMD use this convention.


    Am I allowed to use bits in Vector Control to provide this ??
    But if I put it there then there is cross privilege leakage !

    No, you're not allowed to do anything not explicitly allowed
    in the PCI express specification. Remember, an MSI-X write
    generated by the device is indistinguishable from any other
    upstream DMA request initiated by the device.

    Why did the PCI committee specify a 32-bit container and define the
    use of only 1 bit ?? Or are more bits defined but I just haven't
    run into any literature concerning those ?


    c) interrupt latency {
    When "what is running on a core" is timesliced by a HyperVisor,
    a core that launched a command to a device may not be running
    at the instant the interrupt arrives back.

    See again the document referenced above. The interrupt controller
    is aware that the guest is not currently scheduled and maintains
    a virtual pending state (and can optionally signal the hypervisor
    that the guest should be scheduled ASAP).

    Are you using the word 'signal' as LINUX signal delivery, or as
    a proxy for interrupt of some form, or perhaps as an SVC to HV
    of some form ?

    Most of this is done completely by the hardware, without any
    intervention by the hypervisor for the vast majority of
    interrupts.

    That is the goal.


    It seems to me, that the HyperVisor would want to perform ISR
    processing of the interrupt (low latency) and then schedule
    the softIRQs to the <sleeping> core so when it regains control
    the pending I/O stack of "stuff" is properly cleaned up.

    So, should all interrupts simply go to the HyperVisor and let HV
    sort it all out? Or can the <sleeping> virtual core just deal
    with it when it is given a next time slice ??

    The original GIC did something like this (the HV took all
    interrupts and there was a hardware mechanism to inject them
    into a guest as if they were a hardware interrupt). But
    it was too much overhead going through the hypervisor, especially
    when the endpoint device supports the SRIOV capability. So the
    GIC supports handling virtual interrupt delivery completely
    in hardware unless the guest is not currently resident on any
    virtual CPU.

    Leave HV out of the loop unless something drastic happens.
    I/O completion and I/O aborts are not that drastic.

    Once again, I thank you greatly for your long and informative
    post.

  • From MitchAlsup1@21:1/5 to All on Sat Jun 22 19:31:20 2024
    Scott Lurndal wrote:

    Again allow me to express my gratitude for the quality of your posts !

    A couple of dumb questions to illustrate how much more I need to
    learn::

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.
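 
    As a concrete reading of that layout, a minimal C sketch (assuming a
    16-bit segment field, which the list leaves open as <xx:16>) would
    pack and unpack a stream id like this:

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal sketch of the stream id layout listed above; the 16-bit
       segment width is an assumption, the list leaves it as <xx:16>. */
    static uint32_t streamid_pack(uint32_t seg, uint32_t bus,
                                  uint32_t dev, uint32_t fn)
    {
        return (seg << 16) | (bus << 8) | (dev << 3) | (fn & 0x7);
    }

    int main(void)
    {
        uint32_t sid = streamid_pack(1, 3, 0, 2);
        printf("segment %u bus %u dev %u fn %u\n",
               sid >> 16, (sid >> 8) & 0xff, (sid >> 3) & 0x1f, sid & 0x7);
        return 0;
    }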

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    Each root complex needs to be an unique segment. A single
    SRIOV endpoint can consume the entire 8-bit bus space and
    the 8-bit dev/function space. In this context, a root complex
    can be considered a PCI express controller with one or more
    root ports. Each root port should be considered an unique
    'segment'.

    This is for device discovery, which uses the PCI express
    "Extended Configuration Access Method" (aka ECAM) to scan
    the PCI configuration spaces of all PCI ports.

    Within a 'Chip' there are k cores, 1 last level cache, and
    1 HostBridge with (say) 256 pins at its disposal. Said
    pins can be handed out in power-of-2 groups of 4 pins each,
    so multiple PCIe trees of differing widths emanate from
    the 256 PCIe pins.

    I guess you are calling each point of emanation a root.
    I just bundle them under 1 HostBridge, and consider how
    the "handing out" is done to be a HostBridge problem.
    But as seen on the on-chip interconnect there is one
    HostBridge which accesses all devices attached to this
    Chip. Basically, I see on-chip-interconnect with one
    HostBridge knowing that the pins will be allocated
    "efficiently" for the attached devices.

    Thanks for the ECAM pointer, that clears up a raft of
    questions.



    But (now with the new CXL) instead of allocating 200+ pins
    to DRAM those pins can be allocated to PCIe links; making any
    chip much less dependent on which DRAM technology, which chip-
    to-chip repeaters,... So, the thought is all I/O is PCIe + CXL;
    and about the only other pins chip gets are RESET and ClockIn.

    Note that bridging to PCI signalling will increase latency
    somewhat, even with PCIe gen 6.

    Unavoidable.


    Bunches of these pins can be 'configured' into standard width
    PCIe links (at least until one runs out of pins.)

    Given that one has a PCIe root complex with around 256-pins
    available, does one need multiple roots of such a wide tree ?

    You basically need a root per device to accommodate SRIOV
    devices (like enterprise grade network adapters, high-end
    NVMe devices, etc).

    As noted above: I knew more bits than B:D,F were needed,
    but not which and where. And if a single SR-IOV device
    consumes a whole B:D,F space so be it. ECAM alignment
    identifies those bits and the routings.


    I guess reading my post backwards I did not pose any questions.

    My thanks again.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jun 22 20:15:45 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:


    Question:: Down below you talk of the various interrupt control-
    lers routing an interrupt <finally> to a core. What happens if the
    core has changed its priority by the time the interrupt signal
    arrives, but before it can change the state of the tables in the
    interrupt controller that routed said interrupt here ?

    Speaking for the ARM64 systems that I'm most recently
    familiar with, the concept of priority is associated with
    an interrupt (up to 8-bits worth of priority - an implementation
    of the GIC is allowed to support as few as three bits).

    The interrupt controller is distributed logic; there is a
    component called the 'distributor' and another component called
    the 'redistributor'. The former is global to the system and
    the latter is a per-CPU component. The distributor also contains
    a subsystem called the interrupt translation subsystem (ITS) which
    supports interrupt virtualization.

    The redistributor, being part of the core, handles the delivery
    of an interrupt to the core (specifically asserting either the FIQ
    or IRQ signals that cause entry to the IRQ or FIQ exception
    handlers). The redistributor tracks the current running priority
    (which is directly associated with the priority of the current
    active interrupt; when not processing an interrupt, the current
    running priority is called the IDLE priority and doesn't block
    delivery of any interrupts). The redistributor communicates changes to
    the RPR to the distributor, which will hold any interrupt that
    is not eligible for delivery (for any reason, including lack
    of priority). There is no way for software to change the
    RPR - it only tracks the priority of the currently executing
    interrupt.
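 
    A toy C predicate captures the eligibility rule just described; note
    that in the GIC, numerically lower priority values are more urgent,
    and the idle running priority is the numerically lowest (0xff with 8
    priority bits). This is only a sketch of the rule, not of the actual
    redistributor logic.

    #include <stdbool.h>
    #include <stdint.h>

    #define IDLE_PRIORITY 0xff   /* RPR when no interrupt is active */

    /* Sketch of the delivery rule above: enabled, not masked at the CPU
       interface, and of higher (numerically lower) priority than the
       current running priority. */
    bool deliverable(uint8_t irq_priority, bool enabled, bool masked,
                     uint8_t running_priority)
    {
        return enabled && !masked && irq_priority < running_priority;
    }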

    Thank you for the wonderful ASCII art::

    +-------------------------+
    |       PCI Device        |
    +-------------------------+
       | MSI-X message (address: GITS_TRANSLATER control register)
       |               (payload: Interrupt number (0 to N))
       v               (sideband: streamid)
    +-------------------------+
    | Interrupt Translation   |  (DRAM tables: Device, Collection)
    | Service                 |  Lookup streamid in device table.
    |                         |  DT refers to Interrupt Translation Table
    |                         |  Translate inbound payload based on ITT
    |                         |  to an LPI
    |                         |  Collection table identifies target core
    +-------------------------+
       | Internal message from ITS to redistributor for target
       v
    +-------------------------+
    | Redistributor           |  (DRAM table: LPI properties)
    |                         |  Lookup LPI properties, contains priority
    |                         |  and enable bit
    |                         |  If not enabled or priority too low,
    |                         |  store in LPI pending table (also DRAM) [*]
    |                         |  If enabled, unmasked at the CPU interface
    |                         |  and priority higher than RPR, assert FIQ
    |                         |  or IRQ signals to core.
    +-------------------------+
       | IRQ/FIQ signals
       v
    +-------------------------+
    | Core                    |  Check PSTATE IRQ and FIQ mask bits
    |                         |  IRQ/FIQ can be routed to EL1, EL2 or EL3
    |                         |  by per-core control bits. Update
    |                         |  core state appropriately and enter ISR
    +-------------------------+

    [*]  as core RPR and signal masks change, the ITS re-evaluates pending
    [**] LPI properties and pending bits are generally cached in the
         redistributor for performance.

    My concept, based on a new understanding of where things
    want to be due to CXL, looks like::

    I am assuming that the Last Level Cache (LLC) is placed side by side
    with HostBridge on Chip. This facilitates using PCIe for DRAM access
    and for CXL caches. LLC also provides the access services to the
    I/O MMU (with lots of caching) and maintains the interrupt tables.

    +-------------------------+
    |       PCI Device        |
    +-------------------------+
       | MSI-X message (address: GITS_TRANSLATER control register)
       |               (payload: Interrupt number (0 to N))
       v               (sideband: streamid)
    +-------------------------+
    | HostBridge Translation  |  IOMMU tables: B:D,F->Originating Context
    | Service                 |  Originating Context supplies Root pointers
    |                         |  and interrupt table address
    | In LLC                  |  HostBridge DRAM accesses are performed
    |                         |  through LLC
    |                         |  HostBridge MMI/O accesses routed out
    |                         |  into Chip
    +-------------------------+
       | MMI/O message from HTS to virtual context stack DRAM address
       |
       | If core interrupt table matches MMI/O address ? SNARF message
       |    the message contains pending priority interrupt bits.
       v
    +-------------------------+
    | Core                    |  If there is an interrupt at higher priority
    |                         |  than I am currently running ? begin interrupt
    |                         |  negotiation (core continues to run
    |                         |  instructions)
    |                         |  If negotiation is successful ? Claim interrupt
    |                         |  and context switch to Interrupt Dispatcher.
    +-------------------------+

    There is no IRQ-like signal to the core; it is all done by a SNARF of
    data to an address cores are watching. When a virtual core gets a new
    time slice, as the core is fetching instructions, it also fetches its
    pending priority interrupts from its interrupt table (maintained by
    LLC), and will "take" a higher pending interrupt prior to executing any
    instructions at lower priority or lower privilege. Thereafter, the core
    monitors its interrupt table address to SNARF updates.

    Context stack contains pointers to the Thread headers of the 4
    privilege levels, a pointer to the associated interrupt table,
    and some other stuff--it is a cache line in size (8 DoubleWords).

    The pointers to the Thread Headers give access to the Root pointers
    of those levels.

    There is a 2-bit indicator in the context stack indicating which
    Root Pointer is used to translate this I/O request.
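 
    A hypothetical C rendering of that cache line, purely to make the
    description concrete (the field names and packing are invented here,
    not taken from any My 66000 document):

    #include <stdint.h>

    /* Hypothetical layout of one context-stack line (8 DoubleWords);
       names and packing are illustrative only. */
    struct context_stack_line {
        uint64_t thread_header[4];   /* one per privilege level         */
        uint64_t interrupt_table;    /* pointer to the interrupt table  */
        uint64_t flags;              /* bits <1:0>: which Root pointer
                                        translates this I/O request     */
        uint64_t other[2];           /* "some other stuff"              */
    };

    _Static_assert(sizeof(struct context_stack_line) == 64,
                   "one cache line (8 DoubleWords)");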

    All indexed via B:D,F (extended or not) and some rather static tables
    placed in unCacheable DRAM. unCacheable DRAM is actually cached in
    LLC, just not anywhere more local to the cores. LLC, in essence, SNARFs
    the HostBridge MSI-X message, recognizes that this is an update to
    the interrupt tables, inserts the update, and then provides a message
    which cores running that interrupt table will SNARF.

    No wires (IRQ), just std messages flying across MMI/O space
    doing exactly the same things.


    For what reason are there multiple addresses ?

    A system may have multiple interrupt controllers. In the
    case of the ARM64 systems, there may be a case where some
    interrupts should be considered level sensitive, in which
    case they must use SPI type interrupts which have a different
    target register for the MSI-X address field when compared with
    LPI type interrupts.

    Recall that the PCI spec must accommodate a wide range of system
    implementations (including Z-series).

    Would you consider that "multiple interrupt tables all being
    maintained by a single service port inside LLC which then
    spews out updates any/all cores can see" to be multiple
    interrupt controllers ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jun 22 20:18:07 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    Said routing does not necessarily mean that the bits in
    the addresses are interpreted in any specific way, if the
    routing is done by a set of programmed range registers
    (i.e. range X to X+Y is routed to destination Z,e.g. the
    interrupt controller).



    I was asking the contrapositive::

    Is a system architecture allowed to define certain bits of
    the translated address to be used as either routing or
    indexing of a table that provides routing information ?

    Not as seen by request originator or request target, but
    by the middle-men of transport ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 20:21:45 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    Again allow me to express my gratitude for the quality of your posts !

    A couple of dumb questions to illustrate how much more I need to
    learn::

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    The PCI address (aka Stream ID) is passed to the interrupt
    controller and IOMMU and used as an index to determine the
    page table root pointer.

    The stream id format is

    <2:0> PCI function number
    <7:3> PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.

    I use ChipID for the last field in case each chip has its own
    PCIe tree. {Except that the bits are placed elsewhere in the
    address.}

    Each root complex needs to be an unique segment. A single
    SRIOV endpoint can consume the entire 8-bit bus space and
    the 8-bit dev/function space. In this context, a root complex
    can be considered a PCI express controller with one or more
    root ports. Each root port should be considered an unique
    'segment'.

    This is for device discovery, which uses the PCI express
    "Extended Configuration Access Method" (aka ECAM) to scan
    the PCI configuration spaces of all PCI ports.

    Within a 'Chip' there are k cores, 1 last level cache, and
    1 HostBridge with (say) 256 pins at its disposal. Said
    pins can be handed out in power-of-2 groups of 4 pins each,
    so multiple PCIe trees of differing widths emanate from
    the 256 PCIe pins.

    Let us start by considering a PCI express device. Electrically
    it is connected to a PCI Express controller instance. The
    controller is responsible for the transport layer, link
    layer and other portions of the PCI express protocols;
    including translating PCI Express transaction layer
    packets (TLPs) into host bus transactions which are
    bridged to the SoC fabric (xbar, ring, mesh, et alia).

    An instance of a PCI express controller is called a
    Root Complex, and supports one or more Root Ports.
    Each root port is electrically connected to an endpoint
    or to a mainboard slot into which an endpoint can
    be inserted. The controller manages link training
    between the root port and the device (primarily for
    plug-in devices) and provides interfaces to the three
    PCI address spaces:

    * Configuration (4096 bytes per function)
    * Memory (2^64 maximum size)
    * I/O (64KB - legacy for Intel IN/OUT instructions)

    The PCI I/O address space is deprecated and not used on modern PCI
    Express devices. However, a PCI controller is allowed to present
    the endpoint I/O space as a region mapped into the physical
    address space of the host; the PCI controller will convert
    accesses to those physical addresses to IO space TLPs
    when posting downstream transactions to the IO space for
    legacy PCI cards.

    The PCI memory space is a 32 or 64-bit address space decoupled
    from the host address space (although it is often mapped 1:1 with
    the host address space, it isn't required to be if the PCI
    controller instance has the ability to remap the address when
    creating the downstream TLP).

    The PCI configuration space contains control, status and
    discovery registers that define the device. The first
    four bytes of the PCI configuration space contain a 16-bit
    VENDORID and a 16-bit DEVICEID field. These are read
    by the operating system and used to select (and load if
    necessary) the driver that handles that type of device.

    The PCI configuration space also contains base address
    registers (BARs) which describe the amount of address
    space that the function consumes. The host programs
    a base address into the BAR registers during initialization
    (which on intel will be the same physical address range
    in the host physical address space). This dates back to
    the bus-based legacy PCI where all the functions on the
    bus would see the transaction and needed to capture
    that transaction (by matching the BAR register(s)).

    With the point-to-point nature of PCI Express, the
    BARs are primarily used for sizing the aperture and
    the values written may or may not correspond to the
    host physical address mapped to the aperture (this
    mapping is generally implementation specific).

    To size a bar, write all-ones to the bar register(s)
    and read the value back. Unimplemented bits will read as
    zero. Invert the value, add one, and you have the
    required size of the aperture for that BAR.
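 
    That sequence, in a stand-alone C sketch; pci_cfg_read32() and
    pci_cfg_write32() here are fakes backed by a single pretend 16 KB
    memory BAR so the example runs by itself, and the low attribute bits
    of the BAR are masked off before inverting (a detail glossed over
    above).

    #include <stdint.h>
    #include <stdio.h>

    /* Fake 16 KB memory BAR standing in for real config space so the
       sketch runs stand-alone; a real implementation would go through
       CF8/CFC or ECAM. Bits [13:4] are unimplemented and read as zero. */
    static uint32_t fake_bar;

    static uint32_t pci_cfg_read32(uint16_t off)  { (void)off; return fake_bar; }
    static void pci_cfg_write32(uint16_t off, uint32_t v)
    {
        (void)off;
        fake_bar = v & ~0x3fffu;
    }

    /* Size a 32-bit memory BAR: save, write all-ones, read back, mask
       the low attribute bits, invert, add one, restore. */
    static uint32_t pci_bar_size(uint16_t bar_off)
    {
        uint32_t saved = pci_cfg_read32(bar_off);
        pci_cfg_write32(bar_off, 0xffffffffu);
        uint32_t mask = pci_cfg_read32(bar_off) & ~0xfu;  /* memory BAR */
        pci_cfg_write32(bar_off, saved);                  /* restore    */
        return ~mask + 1u;
    }

    int main(void)
    {
        printf("aperture size %u bytes\n", pci_bar_size(0x10)); /* 16384 */
        return 0;
    }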

    The configuration space also contains two lists that
    describe optional capabilities of a function. There
    is a list of legacy PCI capabilities (MSI and MSI-X
    fall into this bucket, as does the PCI Express
    capability which marks the device as PCIe rather than
    legacy PCI). For legacy PCI, the configuration space
    was 256 bytes (and the legacy capabilities all reside
    there). PCIe extended it to 4096 bytes and there is
    a 16-bit pointer at offset 0x100 that is the head of
    the list of PCI express capabilities (which include
    link training stuff, SR-IOV, error reporting capabilities,
    power management, etc).

    The MSI-X capability includes a couple of registers that
    locate the MSI-X Vector and Pending arrays - these have
    a 3-bit BAR indicator that selects which BAR holds
    the MSI-X registers, and an offset value is applied
    to the BAR to get to the first vector for the vector
    array and the first bit for the PB array.
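 
    In code, locating the two arrays looks roughly like the sketch below.
    The register layout inside the MSI-X capability (Table Offset/BIR at
    capability+4, PBA Offset/BIR at capability+8, low 3 bits holding the
    BIR) is per the PCI spec; the config read helper and bar_base[]
    mapping are made-up stand-ins so the example runs on its own.

    #include <stdint.h>
    #include <stdio.h>

    /* Made-up stand-ins: CPU physical addresses the BARs were mapped to,
       and a canned config-space reader for one device. */
    static uint64_t bar_base[6] = { 0xf0000000ull, 0, 0, 0, 0, 0 };

    static uint32_t pci_cfg_read32(uint16_t off)
    {
        if (off == 0x44) return 0x00002000u; /* Table Offset/BIR: BAR0 + 0x2000 */
        if (off == 0x48) return 0x00003000u; /* PBA   Offset/BIR: BAR0 + 0x3000 */
        return 0;
    }

    /* 'cap' is the config-space offset of the MSI-X capability. */
    static void msix_locate(uint16_t cap, uint64_t *table, uint64_t *pba)
    {
        uint32_t t = pci_cfg_read32(cap + 4);
        uint32_t p = pci_cfg_read32(cap + 8);
        *table = bar_base[t & 0x7] + (t & ~0x7u);
        *pba   = bar_base[p & 0x7] + (p & ~0x7u);
    }

    int main(void)
    {
        uint64_t tbl, pba;
        msix_locate(0x44, &tbl, &pba);
        printf("vector table at %#llx, PBA at %#llx\n",
               (unsigned long long)tbl, (unsigned long long)pba);
        return 0;
    }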

    The configuration space is not directly accessible to
    the host - in legacy PCI on Intel systems, the
    southbridge had two registers in the IO space (cf8 and cfc)
    that functioned as a peek/poke mechanism to access
    the configuration space. PCI Express defined a mechanism
    that allows a host to map the PCI configuration space
    into the physical address space directly (called ECAM
    and referred to earlier). The CF8/CFC mechanism
    uses the device address (bus, device, function, aka
    requester id) programmed by software into CF8 along
    with the offset within the 4k space and then reads/writes
    CFC to access the data at that address. For ECAM
    accesses, there is a base address in the host physical
    address space that maps the entire 'root port' configuration
    space, addressed as (bus << 20) | (dev << 15) | (func << 12) |
    config-offset from the base address.
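 
    That ECAM arithmetic is just a shift-and-or, as in this small C
    function; the base address is whatever the platform publishes for
    that segment (on many systems via the ACPI MCFG table), and
    0xe0000000 below is only an example value.

    #include <stdint.h>
    #include <stdio.h>

    /* ECAM: base + (bus << 20) + (dev << 15) + (func << 12) + offset */
    static uint64_t ecam_addr(uint64_t base, uint32_t bus, uint32_t dev,
                              uint32_t fn, uint32_t off)
    {
        return base + ((uint64_t)bus << 20) + (dev << 15) + (fn << 12) + off;
    }

    int main(void)
    {
        /* bus 1, device 0, function 0, VENDORID/DEVICEID at offset 0 */
        printf("%#llx\n",
               (unsigned long long)ecam_addr(0xe0000000ull, 1, 0, 0, 0));
        return 0;
    }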

    The discovery process starts with software reading the
    first 32-bits of each 4K region - on legacy bus-based
    PCI, the read would timeout and the controller would
    abort it (called master abort) and return a value of
    all ones as the result of the read. PCIe requires
    the same behavior.

    If software reads 0xffffffff for the first 32-bits, it
    adds 4k to the address and tries the next function.

    The PCI function field in the RID is 3 bits, so a device
    can support up to 8 functions. While legacy PCI supported
    32 devices on a bus, PCI Express limits the device number
    of an endpoint to zero, so downstream from the root port
    a given bus will generally have no more than 8 functions.

    The discovery process continues until all 256 buses
    below the root port have been scanned. Note that the
    root port contains a PCI-to-PCI bridge, which may have
    integrated endpoints (RCiep) provided by the root complex;
    which show up as devices or functions on bus 0. The
    bridge forwards transactions downstream to bus 1 (usually,
    but the bus numbers are programmable) which contains the
    endpoint device.
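 
    Put together, the scan described above is a triple loop over
    bus/device/function, treating an all-ones read as "nothing there".
    The ecam_read32() stub below just pretends a single function exists
    at 00:00.0 so the sketch runs stand-alone; a real scan would also
    follow capability lists, size BARs, and program bridge bus numbers.

    #include <stdint.h>
    #include <stdio.h>

    /* Stub for the ECAM access described earlier: pretend only 00:00.0
       exists, with made-up vendor/device ids; everything else reads as
       all-ones (master abort / unsupported request). */
    static uint32_t ecam_read32(uint32_t bus, uint32_t dev, uint32_t fn,
                                uint32_t off)
    {
        if (bus == 0 && dev == 0 && fn == 0 && off == 0)
            return (0x9999u << 16) | 0x1234u;
        return 0xffffffffu;
    }

    int main(void)
    {
        for (uint32_t bus = 0; bus < 256; bus++)
            for (uint32_t dev = 0; dev < 32; dev++)
                for (uint32_t fn = 0; fn < 8; fn++) {
                    uint32_t id = ecam_read32(bus, dev, fn, 0);
                    if (id == 0xffffffffu)
                        continue;           /* nothing at this B:D,F */
                    printf("%02x:%02x.%x vendor %04x device %04x\n",
                           bus, dev, fn, id & 0xffff, id >> 16);
                }
        return 0;
    }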

    During discovery, if a device advertises the PCI Express
    SRIOV (Single Root I/O Virtualization) capability, the
    device driver needs to configure the SRIOV functionality
    including the number of virtual functions exposed by
    the device (each being assigned to a guest). SRIOV
    supports up to 65535 virtual functions, which consumes
    the entire 256-bus space on that root port.


    I guess you are calling each point of emanation a root.

    More specifically the Root Complex is the PCI express
    controller (e.g. Synopsys has PCIe controller IP). The
    Root _Port_ is the physical connection to the endpoint
    (or to a PCIe switch, but let's not go there now).

    Often there is only one Root Port per Root Complex,
    but the specification allows for multiple ports.

    I just bundle them under 1 HostBridge, and consider how
    the "handing out" is done to be a HostBridge problem.
    But as seen on the on-chip interconnect there is one
    HostBridge which accesses all devices attached to this
    Chip. Basically, I see on-chip-interconnect with one
    HostBridge knowing that the pins will be allocated
    "efficiently" for the attached devices.


    You basically need a root per device to accommodate SRIOV
    devices (like enterprise grade network adapters, high-end
    NVMe devices, etc).

    As noted above: I knew more bits than B:D,F were needed,
    but not which and where. And if a single SR-IOV device
    consumes a whole B:D,F space so be it. ECAM alignment
    identifies those bits and the routings.

    Ah, I should have read this one backwards :-)



    I guess reading my post backwards I did not pose any questions.

    My thanks again.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Sat Jun 22 21:29:09 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    Said routing does not necessarily mean that the bits in
    the addresses are interpreted in any specific way, if the
    routing is done by a set of programmed range registers
    (i.e. range X to X+Y is routed to destination Z,e.g. the
    interrupt controller).



    I was asking the contrapositive::

    Is a system architecture allowed to define certain bits of
    the translated address to be used as either routing or
    indexing of a table that provides routing information.

    Not as seen by request originator or request target, but
    by the middle-men of transport ?

    From the standpoint of the PCI specification, the host
    side is completely unspecified. You could, for example,
    use bits <63:60> to specify the socket, or chiplet that
    the address should be routed to. Other bits may encode
    the PCI controller #, interrupt controller, IOMMU, etc.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Sat Jun 22 22:14:17 2024
    Scott Lurndal wrote:

    Snipping a whole lot of information I had a basic knowledge
    of but that was not part of the original tree of questions being asked.

    But thanks for the details--they help a lot.

    You basically need a root per device to accommodate SRIOV
    devices (like enterprise grade network adapters, high-end
    NVMe devices, etc).

    As noted above: I knew more bits than B:D,F were needed,
    but not which and where.

    Or even the name of what I am searching Google for.....
    ECAM for example.

    And if a single SR-IOV device
    consumes a whole B:D,F space so be it. ECAM alignment
    identifies those bits and the routings.

    Ah, I should have read this one backwards :-)

    You know, sometimes when reading and writing these posts,
    what I need to write changes with my knowledge base and
    some of the earlier writings become stale wrt what I now
    grasp.

    My thanks again.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Sat Jun 22 22:46:51 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    One thing to note::

    The PCIe root complex may send out an address, but when the
    MMI/O receiver of said address then uses bits in the address
    for routing purposes or interpretation purposes, it matters
    not that the sender did not know those bits were being used
    thusly--those bits actually do carry specific meaning--just
    not to the sender.

    Said routing does not necessarily mean that the bits in
    the addresses are interpreted in any specific way, if the
    routing is done by a set of programmed range registers
    (i.e. range X to X+Y is routed to destination Z,e.g. the
    interrupt controller).



    I was asking the contrapositive::

    Is a system architecture allowed to define certain bits of
    the translated address to be used as either routing or
    indexing of a table that provides routing information.

    Not as seen by request originator or request target, but
    by the middle-men of transport ?

    From the standpoint of the PCI specification, the host
    side is completely unspecified. You could, for example,
    use bits <63:60> to specify the socket, or chiplet that
    the address should be routed to. Other bits may encode
    the PCI controller #, interrupt controller, IOMMU, etc.

    Yes, I have information contained in PTEs that convert a std
    LD or ST into a KNOWN configuration space access or Memory
    mapped I/O space access--known inside the core*, and understood
    by the on-die-interconnect to route said request to the addressed
    device (or at least HostBridge where it, then, figures out
    where the device is after it has been configured.) During
    this "figuring out" if bits have to be moved about--well that
    is a typical HW problem that HW has various mostly cheap
    solutions for. {{Like concatenating the fields of an x86
    segment register into a linear address.}}

    At the time of this discussion, I am working out how all
    the middlemen between cores and (effectively endpoints)
    use the available bits to send stuff where it needs to be sent.
    I know these will not be like those of other architectures
    a) because I have no legacy to match
    b) because I am trying out "new stuff"
    c) there is no concept of wires (INTx); all of these have
    been mapped into messages in MMI/O space.

    New ways of connecting the dots that should be enough like
    what other guys are doing that Linux porting is no harder
    than necessary, but novel enough to require "even fewer"
    excursions through the HyperVisor to get the dots connected
    and maintain those connections.

    (*) configuration space accesses are known within the core
    because they follow strong ordering, while memory mapped I/O
    accesses are known within the core because they are sequen-
    tially consistent, unlike DRAM accesses which are only
    cache consistent (except when ATOMIC stuff is going on
    where the core drops back to sequentially consistent.)
    This knowledge of which space enables higher performing
    memory systems that automagically drop back to SC when
    it matters and without Fence instructions being needed.

    Since the core knows the space of the access, so does the
    interconnect and orders things appropriately. From the core
    end of looking at things, a configuration request is
    properly ordered and delivered reliably to the endpoint;
    A MMI/O request is properly ordered on the interconnect
    and reliably delivered to endpoint. Likewise, endpoint
    requests are reliably delivered to MMI/O or DRAM address
    spaces. {{given a "special" device as some endpoint, and
    with the already defined facilities, said device could
    go out and read/write core control registers without
    the data ever passing through memory (security stuff).
    Of course one would want said device to be very secure
    indeed in order to trust it that far....but electrically
    is has to be at some "normal place" accessible via normal
    protocol and transports.}}

    It may seem that I have dumped a lot of requirements on
    the last level cache. This may be true--but with the
    advent of CXL, DRAM may migrate farther away from the
    cores to the point where no pins on the chip are dedicated
    to DRAM control, instead a PCIe channel to a DRAM control-
    ler down the PCIe tree(s) allows the Chip to connect to
    any DRAM technology (DDR4,5,6, HBM, RamBus, ...) any size
    of DRAM, any position of DRAM,... just by changing the
    popcorn part at the end of the tree.

    Thus, LLC has to be able to read/write that DRAM and
    CXL caches and then cache its results as is normally
    expected. CXL caches extends this to SRAM in addition
    to DRAM--it's just a different popcorn part.

    {{There is also no need to build chip-repeaters
    like HyperTransport or whatever Intel calls their similar
    chip-to-chip transport. These will migrate to CXL for
    all the right reasons................................}}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Scott Lurndal on Mon Jun 24 14:50:34 2024
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, device sends a stored interrupt messages to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI) controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    Note that support for MSI in AHCI is optional (in which case the
    legacy level sensitive PCI INTA/B/C/D signals are used).

    The AHCI standard specification (ahci_1_3.pdf) is available publicly.

    What happens to SATA tagged command queuing with SRIOV?
    The tag mapping would seem to interact with virtual interrupts.

    SATA allows up to I think 8 commands queued at once, each with its own
    tag number, which can be performed in any order. That tag indicates
    which DMA mapping scatter gather set to use and is used to identify
    which IO's are complete. A single interrupt can indicate multiple
    tags are complete.

    In native (non-virtualized) use the device driver assigns a free tag
    number to an IO, sets up the DMA scatter/gather list for that tag,
    and on completion interrupt, for each done tag it tears down the
    DMA scatter/gather list, frees that tag number,
    and completes the associated individual IO.
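 
    In outline, that native flow is just a free-tag bitmap plus a per-tag
    owner table, roughly as below; the names are invented for illustration
    (a real AHCI driver would be talking to PxCI/PxSACT and building
    command tables), and the 32-slot limit is the AHCI command-slot
    maximum rather than anything from the post above.

    #include <stdint.h>
    #include <stddef.h>

    #define NTAGS 32                /* AHCI command-slot maximum */

    struct io_request;              /* opaque: one outstanding I/O */

    /* Illustrative per-port driver state: which tags are in flight
       and which request owns each tag. */
    struct port_state {
        uint32_t           busy;    /* bit n set => tag n in use */
        struct io_request *owner[NTAGS];
    };

    /* Issue: claim a free tag for this request; the caller would then
       build the scatter/gather list for that tag and ring the doorbell. */
    int issue_io(struct port_state *p, struct io_request *req)
    {
        for (int tag = 0; tag < NTAGS; tag++)
            if (!(p->busy & (1u << tag))) {
                p->busy |= 1u << tag;
                p->owner[tag] = req;
                return tag;
            }
        return -1;                  /* queue full; caller must wait */
    }

    /* Completion interrupt: 'done' has one bit per finished tag, and a
       single interrupt may complete several of them. */
    void completion_irq(struct port_state *p, uint32_t done)
    {
        for (int tag = 0; tag < NTAGS; tag++)
            if (done & (1u << tag)) {
                /* tear down the DMA mapping, complete p->owner[tag] ... */
                p->owner[tag] = NULL;
                p->busy &= ~(1u << tag);
            }
    }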

    How would that work for SATA on SRIOV? It has to set up a set of
    virtual tags for each virtual device and multiplex them among
    multiple virtual devices onto the device physical tag set.
    Also each virtual disk device would need to have its own partition base
    and range on the physical disk and the SRIOV port would offset the
    block numbers into the correct partition range.
    On completion interrupt it has to map the physical tag back to the virtual
    one and trigger a virtual interrupt to the initiating virtual device.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to EricP on Mon Jun 24 20:32:04 2024
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Scott Lurndal wrote:
    mitchalsup@aol.com (MitchAlsup1) writes:

    A single device (such as a SATA drive) might have a queue of
    outstanding commands that it services in whatever order it
    thinks best. Many of these commands want to inform some core
    when the command is complete (or cannot be completed). To do
    this, device sends a stored interrupt messages to the stored
    service port.

    Each SATA port has a PxIS and PxIE register. The SATA (AHCI) controller
    MSI configuration can provide one vector per port - the main
    difference between MSI and MSI-X is that the interrupt numbers
    for MSI must be consecutive and there is only one address;
    while for MSI-X each vector has an unique address and a programmable
    data (interrupt number) field. The interpretation of the data
    of the MSI-X or MSI upstream write is up to the interrupt controller
    and may be virtualized in the interrupt controller.

    Note that support for MSI in AHCI is optional (in which case the
    legacy level sensitive PCI INTA/B/C/D signals are used).

    The AHCI standard specification (ahci_1_3.pdf) is available publicly.

    What happens to SATA tagged command queuing with SRIOV?
    The tag mapping would seem to interact with virtual interrupts.

    The AHCI specification dates back to legacy PCI, and the
    MSI support is optional. Tagged queueing, if I recall
    correctly, came later and required drive support.

    The host interrupt handler would
    need to be prepared to poll all the ports for activity
    when invoked.


    SATA allows up to I think 8 commands queued at once, each with its own
    tag number, which can be performed in any order. That tag indicates
    which DMA mapping scatter gather set to use and is used to identify
    which IO's are complete. A single interrupt can indicate multiple
    tags are complete.

    In native (non-virtualized) use the device driver assigns a free tag
    number to an IO, sets up the DMA scatter/gather list for that tag,
    and on completion interrupt, for each done tag it tears down the
    DMA scatter/gather list, frees that tag number,
    and completes the associated individual IO.

    How would that work for SATA on SRIOV?

    There is no standard for AHCI that supports SR-IOV that I'm
    aware of. NVMe does have SR-IOV support.

    Without SR-IOV, the hypervisor must be the only
    entity that communicates with the AHCI controller
    and a paravirtualization (linux virtio) driver is
    provided to the guest for storage device access.

    The NVMe controller hardware interface was
    designed to fix many of the shortcomings of the
    AHCI implementation, particularly with respect
    to virtualization.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Tue Jun 25 00:50:46 2024
    Scott,

    Can you foresee any problem if I define the positions in the
    physical address space as: <ECAM: My 66000-style>::

    <1:0> must be 00 for word access
    <7:2> standard MMI/O register
    <12:8> extended MMI/O register
    <25:13> growth space for registers or for functions
    <32:26> PCIe Device, Function
    <40:32> PCIe Bus
    <56:41> PCIe Segment
    <63:57> Chip

    This effectively gives each B:D,F a 25-bit address space and
    65K segments and up to 32 chips on a motherboard. Need more
    space for functions? take as many bits as you like from the
    left hand side. Need more register room? take bits from the
    right hand side. Need more bits for Chip? Steal them from
    PCIe segment.

    I wanted to move B:D,F up a bit to separate it from the I/O
    registers which will likely come out of a memory reference
    immediate, and I wanted to position B from D,F across a MMU
    translation level boundary.

    I am expecting the code touching the MMI/O register to have
    a virtual address pointer to B:D,F and use the 16-bit immediate
    field of the LD or ST as the register specifier:

    ST #command7,[Rdevice,#registername]

    --------------------------------------------------------------
    I am expecting to use the Chip field to route requests between
    chips. It is plausible that physical device sends an interrupt
    from its PCIe segment across one-or-more chips before arriving
    at the interrupt service port in a particular chips last level
    cache. Other than latency its all part of a large coherent DRAM
    space.

    Is that plausible ? desirable ? or are there reasons to keep
    interrupt processing "more local" to the chip hosting the PCIe
    root complexes ?? {in any event, that is all under SW control.}

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jun 25 13:40:53 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott,

    Can you foresee any problem if I define the positions in the
    physical address space as: <ECAM: My 66000-style>::

    <1:0> must be 00 for word access
    <7:2> standard MMI/O register
    <12:8> extended MMI/O register
    <25:13> growth space for registers or for functions
    <32:26> PCIe Device, Function
    <40:32> PCIe Bus
    <56:41> PCIe Segment
    <63:57> Chip
    Mitch,

    The BDF will never change; it's been the same since PCI was
    introduced in 1992. It's very unlikely that the size of the
    PCIe configuration space (4096 bytes) will ever change; if
    it does, the current ECAM specification and all Operating
    systems will need to change.

    The PCIe ECAM specification requires that the bus/dev/function
    fields occupy bits <27:12> and the register address occupy
    bits <11:0>.

    Anything above bit 27 is outside the PCI specification.

    Most systems that support multiple PCIe controllers assign
    each port to an unique segment and thus bits <xx:28> encode
    the PCIe controller number.

    All current operating systems will expect this:

    (From PCI_Express_Base_r3.0_10Nov10.pdf)

    Table 7-1: Enhanced Configuration Address Mapping

    Memory Address        PCI Express Configuration Space
    A[(20 + n - 1):20]    Bus Number (1 <= n <= 8)
    A[19:15]              Device Number
    A[14:12]              Function Number
    A[11:8]               Extended Register Number
    A[7:2]                Register Number
    A[1:0]                Along with size of the access, used to
                          generate Byte Enables



    I am expecting the code touching the MMI/O register to have
    a virtual address pointer to B;D,F and use the 16-bit immediate
    field of the LD or ST as the register specifier:

    ST #command7,[Rdevice,#registername]

    Note that the ECAM must support 8, 16, 32-bit (optionally 64-bit)
    single-copy atomic accesses to all configuration space registers.


    --------------------------------------------------------------
    I am expecting to use the Chip field to route requests between
    chips. It is plausible that physical device sends an interrupt
    from its PCIe segment across one-or-more chips before arriving
    at the interrupt service port in a particular chips last level
    cache. Other than latency its all part of a large coherent DRAM
    space.

    It's fine to use the chip field to route transactions within
    the chip - if you use a 1:1 mapping between the PCI memory
    space and the host memory space, then you can program the
    chip field directly in the MSI-X vector address.

    Or, your host bridge can hold mapping tables that maps the
    downstream PCI memory address space addresses to host
    addresses, inserting the target chip id based on the
    bridge configuration registers (which are defined by
    the host, not PCI).


    Is that plausible ? desirable ? or are there reasons to keep
    interrupt processing "more local" to the chip hosting the PCIe
    root complexes ?? {in any event, that is all under SW control.}

    That depends on how your interrupt controller is designed. If
    you have a multi-socket/multi-chiplet configuration where all
    chiplets are identical, and each has its own interrupt controller
    (allowing single chiplet implementations), then you'll probably
    want to use your CHIP bits in the address to route to the interrupt
    controller on the closest chiplet just to reduce interrupt
    latency. The interrupt controllers will likely need to cooperate
    at the hardware level to maintain a single OS-visible "interrupt
    space" where each controller handles a subset of the interrupt
    number space.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Tue Jun 25 15:58:51 2024
    Yes, I deserve that.
    I figured out how "bad" an idea it was at the bar last night.
    Sorry to have wasted so much of your time.....

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Jun 27 01:47:49 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott,

    Note that the ECAM must support 8, 16, 32-bit (optionally 64-bit)
    single-copy atomic accesses to all configuration space registers.

    MMI/O is sequentially consistent while Config is Strongly ordered.

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??

    Load Multiple (LM) instruction provides ATOMIC access to a series
    of sequentially ordered bytes in MMI/O (or Config, or DRAM). So,
    if desired, one could read 0x0-0x40 configuration header in a
    single LM. This is completely ATOMIC without SW doing anything
    other than supplying an address and a count.

    Memory to Memory Move (MM) is similar but reads from one place and
    writes to another in a single interconnect transaction--that is
    ATOMICally. {Basically as long as no page boundaries are crossed,
    each memory reference instruction is ATOMIC with respect to
    interested 3rd party observations.}

    Likewise, LDB, LDH, LDW, LDD and their ST counterparts are unit
    ATOMIC.

    But I don't know what you mean by "single-copy atomic accesses" ??

    <snip>

    That depends on how your interrupt controller is designed. If
    you have a multi-socket/multi-chiplet configuration where all
    chiplets are identical, and each has its own interrupt controller
    (allowing single chiplet implementations), then you'll probably
    want to use your CHIP bits in the address to route to the interrupt controller on the closest chiplet just to reduce interrupt
    latency.


    Yes, that was the concern. One might expect that the Guest OS
    would send an I/O request to a Guest OS device driver on the
    chip local to the PCIe tree the device is on, to minimize all
    the latency, not just interrupt delivery. Reading and writing
    of MMI/O space is <as they say> slow.

    The interrupt controllers will likely need to cooperate
    at the hardware level to maintain a single OS-visible "interrupt
    space" where each controller handles a subset of the interrupt
    number space.

    My model has multiple interrupt tables from the get go.

    For a start, I assume that each Guest OS has an interrupt table
    shared across however many virtual or physical cores
    the system manages. A HyperVisor has its own Interrupt Table,
    and the Secure Monitor has its own table.

    A core control register points at this table, and is used when
    negotiating for an interrupt, and used to detect Interrupt
    table priority escalation (when a priority bit is turned on).
    If SW wants a different Interrupt Table (or none) a simple
    write to the control register switches the table.*

    An interrupt table has interrupts raised when an MSI or MSI-X
    interrupt arrives at the Interrupt service provider port.

    I have configured the I/O MMU to translate DMA accesses through
    one set of MMU tables, and translate Interrupt access through
    <a conceptually> different MMU tables, and have access to the
    priority of the interrupt without taking MSI-X message bits.

    So, DMA can read or write directly through application MMU
    tables while the associated Interrupt goes to Guest OS at
    priority of SW's choice using the Interrupt table in charge
    when the I/O was set up.
    -----------------------------------------------------------
    (*) So a virtual machine with 17 virtual cores and accepting
    interrupts on only 5 of them will have 5 with a valid IT and 12
    with IT set invalid.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to mitchalsup@aol.com on Thu Jun 27 11:27:20 2024
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:


    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??


    It sounds like a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had its
    memory ordering model formalized.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Thu Jun 27 13:52:59 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott,

    Note that the ECAM must support 8, 16, 32-bit (optionally 64-bit)
    single-copy atomic accesses to all configuration space registers.

    MMI/O is sequentially consistent while Config is Strongly ordered.

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??

    It's a term-of-art in the ARM architecture document (DDI0487).

    A memory access instruction that is single-copy atomic has the
    following properties:

    1. For a pair of overlapping single-copy atomic store instructions, all
    of the overlapping writes generated by one of the stores are
    Coherence-after the corresponding overlapping writes generated
    by the other store.

    2. For a single-copy atomic load instruction L1 that overlaps a single-copy
    atomic store instruction S2, if one of the overlapping reads generated
    by L1 Reads-from one of the overlapping writes generated by S2, then none
    of the overlapping writes generated by S2 are Coherence-after the
    corresponding overlapping reads generated
    by L1.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to Michael S on Thu Jun 27 09:25:34 2024
    Michael S wrote:
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??


    It sounds as a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had its
    memory ordering model formalized.


    Multi-copy atomic is ARM's name for a write-update coherence protocol
    as it allows each cache to have its own copy of a single memory
    location. Single-copy atomic is their name for a write-invalidate protocol
    as it ensures that there is one value for each memory location.

    Originally ARM's weak cache coherence protocol spec, like Alpha,
    did not explicitly exclude multi-copy atomic so software designers had
    to consider all the extra race conditions a write-update implementation
    might allow. But this was wasted extra effort because no one implements
    a write-update protocol, just write-invalidate.
    Eventually ARM specified that it was single-copy atomic (write-invalidate).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Thu Jun 27 17:33:16 2024
    EricP wrote:

    Michael S wrote:
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??


    It sounds as a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had memory
    ordering model formalized.


    Multi-copy atomic is ARM's name for a write-update coherence protocol
    as it allows each cache to have its own copy of a single memory
    location.

    Sounds like SNARFing

    Single-copy atomic is their name for a write-invalidate protocol
    as it ensures that there is one value for each memory location.

    Originally ARM's weak cache coherence protocol spec, like Alpha,
    did not explicitly exclude multi-copy atomic so software designers had
    to consider all the extra race conditions a write-update implementation
    might allow. But this was wasted extra effort because no one implements
    a write-update protocol, just write-invalidate.
    Eventually ARM specified that it was single-copy atomic
    (write-invalidate).

    Seems to me that if one is sequentially consistent, then one is also
    multi-copy ATOMIC.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu Jun 27 17:37:12 2024
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott Lurndal wrote:

    mitchalsup@aol.com (MitchAlsup1) writes:
    Scott,

    Note that the ECAM must support 8, 16, 32-bit (optionally 64-bit)
    single-copy atomic accesses to all configuration space registers.

    MMI/O is sequentially consistent while Config is Strongly ordered.

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??

    It's a term-of-art in the ARM architecture document (DDI0487).

    A memory access instruction that is single-copy atomic has the
    following properties:

    1. For a pair of overlapping single-copy atomic store instructions,
    all
    of the overlapping writes generated by one of the stores are
    Coherence-after the corresponding overlapping writes generated
    by the other store.

    Writes to a small locale do not pass each other in the interconnect.

    2. For a single-copy atomic load instruction L1 that overlaps a single-copy
    atomic store instruction S2, if one of the overlapping reads
    generated
    by L1 Reads-from one of the overlapping writes generated by S2,
    then none
    of the overlapping writes generated by S2 are Coherence-after the
    corresponding overlapping reads generated
    by L1.

    Because the LD saw the intermediate data state where some of the STs
    were complete while others pend.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From EricP@21:1/5 to All on Fri Jun 28 12:24:50 2024
    MitchAlsup1 wrote:
    EricP wrote:

    Michael S wrote:
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Exactly what are you intending to mean from "single-copy atomic
    accesses" ??


    It sounds as a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had
    memory ordering model formalized.


    Multi-copy atomic is ARM's name for a write-update coherence protocol
    as it allows each cache to have its own copy of a single memory
    location.

    Sorry, I had this backwards.
    Multi-copy atomic was ARM's name for what others call store atomicity,
    which requires that a core's stores appear to be seen by all other cores
    at once. This is the effect write-invalidate protocols produce.

    Non-MCA was what weak ordered write-update protocols can cause where
    different nodes can see the same location as having different values.

    Sounds like SNARFing

    Write-update depends on broadcasting all writes if that's what snarf
    means. Write-update requires write-through caches.
    It requires that the coherence network and controllers all apply
    all writes from all sources in the same order across all caches.

    With a shared snoopy bus and a one level cache all updated synchronously
    this can be efficient as the bus itself acts as an ordering mutex.
    Outside of that it was not considered to scale well as otherwise it
    needs to send a message to each peer node for each write.

    Also it must ensure that writes to the same location are applied in the
    same order across all nodes. When one introduces multiple layers of
    caches connected with multiple comms queues, that synchronization
    becomes complicated.
    If one considers mesh networks, messages from even the same source
    may arrive at different nodes in different order.

    I didn't know that any production systems used write-update.
    The book "A Primer on Memory Consistency and Cache Coherence" 2nd Ed 2020
    has a chapter on write-update protocols and according to it examples of
    systems that used write-update are Sun Starfire E10000 and IBM Power5.
    Starfire used point-to-point messaging to create a "logical" shared bus.
    Power5 used a unidirectional ring network.

    Originally ARM's weak cache coherence protocol spec, like Alpha,
    did not explicitly exclude multi-copy atomic so software designers had
    to consider all the extra race conditions a write-update implementation
    might allow. But this was wasted extra effort because no one implements
    a write-update protocol, just write-invalidate.
    Eventually ARM specified that it was single-copy atomic
    (write-invalidate).

    Correction: They originally did not explicitly require store atomicity
    (MCA), implying that a weakly ordered write-update protocol might allow
    a single location to be seen as having different values on different
    nodes.

    Seems to me that if one is sequentially consistent, then one is also multi-copy ATOMIC.

    Yes, store atomicity to each location would be implied by SC;
    otherwise how could all nodes agree on the order of all updates?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 28 20:30:56 2024
    EricP wrote:

    MitchAlsup1 wrote:

    Seems to me that if one is sequentially consistent, then one is also
    multi-copy ATOMIC.

    Yes, store atomicity to each locations would be implied by SC
    otherwise how could all nodes agree on the order of all updates.

    Most of the time cores only need to agree about cache consistency
    and this can be satisfied by causal consistency.

    ATOMIC stuff is where cores start to require SC,
    and all MMI/O should be SC or SC per virtual channel.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to EricP on Fri Jun 28 20:26:42 2024
    EricP wrote:

    MitchAlsup1 wrote:

    Sounds like SNARFing

    Write-update depends on broadcasting all writes if that's what snarf
    means.

    General cache coherency policies broadcast a core's address to all
    caches in the system, and if a cache contains that same cache
    line, it responds with a SHARED back to the requestor, or it invalidates
    the line. We call this SNOOPing. It works well.

    SNARF is a term whereby the owner of data broadcasts the data and
    its address, and any cache containing that line will write the
    data payload into its cache (rather than invalidating and then
    going back and fetching it anew). For certain kinds of data structure
    SNARF is significantly more efficient than Invalidate-Refetch.
    A single message around the system performs all the needed updates,
    instead of 1 invalidate and K fetches.

    SNARF is almost exclusively used as side-band signals hiding under
    the cache coherent Interconnect command set.

    SNARF is almost never available to software. It is more like
    microArchitecture talking to other microArchitecture.

    Also note: µA-to-µA is rarely of line size and often uses physical
    address bits not available through MMU tables.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Sun Jun 30 00:41:05 2024
    On Fri, 28 Jun 2024 20:26:42 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    EricP wrote:

    MitchAlsup1 wrote:

    Sounds like SNARFing

    Write-update depends on broadcasting all writes if that's what snarf
    means.

    General cache coherency policies broadcast a cores address to all
    caches in the system, and if that cache contains that same cache
    line, it responds with a SHARED back to requestor, or it invalidates
    the line. We call this SNOOPing. It works well.

    SNARF is a term whereby the owner of data broadcasts the data and
    its address, and any cache containing that line will write the
    data payload into its cache (rather than invalidating and then
    going back and fetching it anew). For certain kinds of data structure
    SNARF is significantly more efficient than Invalidate-Refetch.
    A single message around the system performs all the needed updates,
    instead of 1 invalidate and K fetches.

    SNARF is almost exclusively used as side-band signals hiding under
    the cache coherent Interconnect command set.

    SNARF is almost never available to software. It is more like
    microarchitecture talking to other microarchitecture.

    Also note: µA-to-µA traffic is rarely of line size and often uses physical
    address bits not available through MMU tables.


    Stupid question: why is it called "snarf"?

    IIRC, Snoopy (Peanuts) "scarfed" his food. I don't recall ever seeing
    Snarf (Thundercats) actually eat.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to George Neuner on Sun Jun 30 16:16:01 2024
    George Neuner wrote:

    On Fri, 28 Jun 2024 20:26:42 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    EricP wrote:

    MitchAlsup1 wrote:

    Sounds like SNARFing

    Write-update depends on broadcasting all writes if that's what snarf
    means.

    General cache coherency policies broadcast a core's address to all
    caches in the system, and if a cache contains that same cache
    line, it responds with SHARED back to the requestor, or it invalidates
    the line. We call this SNOOPing. It works well.

    SNARF is a term whereby the owner of data broadcasts the data and
    its address, and any cache containing that line will write the
    data payload into its cache (rather than invalidating and then
    going back and fetching it anew). For certain kinds of data structures
    SNARF is significantly more efficient than Invalidate-Refetch:
    a single message around the system performs all the needed updates,
    instead of 1 invalidate and K fetches.

    SNARF is almost exclusively used as side-band signals hiding under
    the cache coherent Interconnect command set.

    SNARF is almost never available to software. It is more like
    microarchitecture talking to other microarchitecture.

    Also note: µA-to-µA traffic is rarely of line size and often uses physical
    address bits not available through MMU tables.


    Stupid question: why is it called "snarf"?

    I don't really know--I first heard the term in 1982, as a SNOOP but in
    the other direction--instead of taking data away, it put data back.

    IIRC, Snoopy (Peanuts) "scarfed" his food. I don't recall ever seeing
    Snarf (Thundercats) actually eat.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From aph@littlepinkcloud.invalid@21:1/5 to EricP on Mon Jul 1 09:33:37 2024
    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Michael S wrote:
    On Thu, 27 Jun 2024 01:47:49 +0000
    mitchalsup@aol.com (MitchAlsup1) wrote:

    Exactly what are you intending to mean by "single-copy atomic
    accesses" ??

    It sounds like a politically correct way of saying "default memory
    ordering of ARMv8.1-A and later".
    I.e. weaker than x86-64 and SPARC TSO, but stronger than Itanium.
    Probably stronger than POWER, but I am not sure if POWER ever had its
    memory ordering model formalized.


    Multi-copy atomic is ARM's name for a write-update coherence protocol
    as it allows each cache to have its own copy of a single memory location.

    The terminology is not Arm's; it comes from

    William W. Collier. 1992. Reasoning about parallel architectures.
    Prentice Hall, Englewood Cliffs.

    Single-copy atomic is their name for a write-invalidate protocol
    as it ensures that there is one value for each memory location.

    Originally ARM's weak cache coherence protocol spec, like Alpha's,
    did not explicitly exclude multi-copy atomic, so software designers had
    to consider all the extra race conditions a write-update implementation
    might allow. But this was wasted extra effort because no one implements
    a write-update protocol, just write-invalidate.
    Eventually ARM specified that it was single-copy atomic (write-invalidate).

    And it's now multi-copy atomic, thank goodness.

    Andrew.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Thu Jul 4 17:34:33 2024
    On Page Request Services (PRS)

    The device performs an ATS Translation Request for a page
    which is not currently present in memory. So the I/O MMU sends
    it a PTE which carries the not-present information.

    But the system operates with nested paging: one level manipulated
    by the Guest OS and the other manipulated by the Hypervisor.

    Yet the device merely got "not-present".

    So when the device requests the page be brought in, how does the
    I/O MMU know whether to interrupt the Guest OS or to interrupt the
    Hypervisor to bring in the page and restart the command ??

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Fri Jul 5 18:57:21 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Page Request Services (PRS)

    The device performs an ATS Translation Request for a page
    which is not currently present in memory. So the I/O MMU sends
    it a PTE which carries the not-present information.

    But the system operates with nested paging: one level manipulated
    by the Guest OS and the other manipulated by the Hypervisor.

    Yet the device merely got "not-present".

    So when the device requests the page be brought in, how does the
    I/O MMU know whether to interrupt the Guest OS or to interrupt the
    Hypervisor to bring in the page and restart the command ??


    The stream ID which identifies the DMA stream from the device
    also identifies the page requests, so they're queued by the
    IOMMU based on IOMMU configuration tables (i.e. an inbound
    translation or PRI request will first look up the stream ID
    to determine the translation table base register for that
    stream). If PRI is supported, it will also queue the page
    request to a queue corresponding to the hypervisor or
    guest that is configured as the owner of that stream
    and generate an interrupt to the hv/kernel/guest. The
    interrupt can be deferred for a page request group and
    will be delivered only when the 'last' bit is set in the
    request.
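
    A minimal sketch of that dispatch (the table, field, and function
    names here are invented; real IOMMUs differ in detail): the stream
    ID indexes a per-stream context that names both the translation
    table base and the owner whose page-request queue and interrupt
    are used.

        #include <stdbool.h>
        #include <stdint.h>

        enum owner { OWNER_HYPERVISOR, OWNER_GUEST };

        struct stream_ctx {                 /* one entry per stream ID */
            uint64_t   ttbr;                /* translation table base register */
            enum owner owner;               /* who services faults on this stream */
            int        pri_queue;           /* that owner's page-request queue */
        };

        extern struct stream_ctx stream_table[];   /* IOMMU configuration table */
        extern void enqueue_page_request(int queue, uint64_t addr, uint32_t group);
        extern void raise_interrupt(enum owner who);

        void iommu_handle_page_request(uint32_t stream_id, uint64_t faulting_addr,
                                       uint32_t group, bool last)
        {
            struct stream_ctx *ctx = &stream_table[stream_id];

            enqueue_page_request(ctx->pri_queue, faulting_addr, group);

            /* The interrupt is deferred across a page request group and
             * only delivered when the request with the 'last' bit arrives. */
            if (last)
                raise_interrupt(ctx->owner);
        }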

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Wed Jul 10 17:59:46 2024
    On page 43 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    it states: "Must not indicate an invalidation has completed
    until all outstanding Read Requests that reference the
    associated translation have retired"

    "Must insure that the invalidation completion indication to RC
    will arrive at the RC after previously posted writes that use
    the stale address."

    and

    "...If transactions are in a queue waiting to be sent, It is
    not necessary for the device to expunge requests from the
    queue even if those transaction[s] use an address that is
    being invalidated."

    The first 2 seem to be PCIe ordering requirements between
    EP and RC.

    The 3rd seems to say if EP used a translation while it was
    valid, then its invalidation does not prevent requests
    using the now stale translation.

    So, a SATA device could receive a command to read a page
    into memory. SATA EP requests ATS for the translation of
    the given virtual address to the physical page. Then the
    EP creates a queue of write requests filling in the addr
    while waiting on data. Once said queue has been filled,
    and before the data comes off the disk, an invalidation
    arrives and is ACKed. The data is still allowed to write
    into memory.

    {{But any new command to the SATA device would not be
    allowed to use the translation.}}

    Is this a reasonable interpretation of that page?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kent Dickey@21:1/5 to mitchalsup@aol.com on Wed Jul 10 18:21:14 2024
    In article <922220c8593353c7ed0fda9e656d359d@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On page 43 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    it states: "Must not indicate an invalidation has completed
    until all outstanding Read Requests that reference the
    associated translation have retired"

    "Must insure that the invalidation completion indication to RC
    will arrive at the RC after previously posted writes that use
    the stale address."

    and

    "...If transactions are in a queue waiting to be sent, It is
    not necessary for the device to expunge requests from the
    queue even if those transaction[s] use an address that is
    being invalidated."

    The first 2 seem to be PCIe ordering requirements between
    EP and RC.

    The 3rd seems to say if EP used a translation while it was
    valid, then its invalidation does not prevent requests
    using the now stale translation.

    So, a SATA device could receive a command to read a page
    into memory. SATA EP requests ATS for the translation of
    the given virtual address to the physical page. Then the
    EP creates a queue of write requests filling in the addr
    while waiting on data. Once said queue has been filled,
    and before the data comes off the disk, an invalidation
    arrives and is ACKed. The data is still allowed to write
    into memory.

    {{But any new command to the SATA device would not be
    allowed to use the translation.}}

    Is this a reasonable interpretation of that page?

    No, it's saying that the EP can keep using a stale translation UNTIL it
    returns the ACK for an invalidation. It does not need to toss those requests--it just needs to delay the ACK. Or it could toss the requests,
    and then send the ACK faster, but it's optional if it wants to toss requests.

    Once the EP sends the ACK, it can no longer send any transactions
    using the old translation.

    Kent

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Kent Dickey on Wed Jul 10 19:02:12 2024
    kegs@provalid.com (Kent Dickey) writes:
    In article <922220c8593353c7ed0fda9e656d359d@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On page 43 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    it states: "Must not indicate an invalidation has completed
    until all outstanding Read Requests that reference the
    associated translation have retired"

    "Must insure that the invalidation completion indication to RC
    will arrive at the RC after previously posted writes that use
    the stale address."

    and

    "...If transactions are in a queue waiting to be sent, It is
    not necessary for the device to expunge requests from the
    queue even if those transaction[s] use an address that is
    being invalidated."

    The first 2 seem to be PCIe ordering requirements between
    EP and RC.

    The 3rd seems to say if EP used a translation while it was
    valid, then its invalidation does not prevent requests
    using the now stale translation.

    So, a SATA device could receive a command to read a page
    into memory. SATA EP requests ATS for the translation of
    the given virtual address to the physical page. Then the
    EP creates a queue of write requests filling in the addr
    while waiting on data. Once said queue has been filled,
    and before the data comes off the disk, an invalidation
    arrives and is ACKed. The data is still allowed to write
    into memory.

    {{But any new command to the SATA device would not be
    allowed to use the translation.}}

    Is this a reasonable interpretation of that page?

    No, it's saying that the EP can keep using a stale translation UNTIL it
    returns the ACK for an invalidation. It does not need to toss those
    requests--it just needs to delay the ACK. Or it could toss the requests,
    and then send the ACK faster, but it's optional if it wants to toss requests.


    Indeed. And I'd suggest that the official PCI Express
    specification is a better source than a set of slides.

    From the spec:

    a. A Function is required not to indicate the invalidation has completed until
    all outstanding Read Requests or Translation Requests that reference the
    associated translated address have been retired or nullified.
    b. A Function is required to ensure that the Invalidate Completion indication
    to the RC will arrive at the RC after any previously posted writes that use
    the "stale" address.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Scott Lurndal on Wed Jul 10 22:01:27 2024
    On Wed, 10 Jul 2024 19:02:12 +0000, Scott Lurndal wrote:

    kegs@provalid.com (Kent Dickey) writes:
    In article <922220c8593353c7ed0fda9e656d359d@www.novabbs.org>,
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    On page 43 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    it states: "Must not indicate an invalidation has completed
    until all outstanding Read Requests that reference the
    associated translation have retired"

    "Must insure that the invalidation completion indication to RC
    will arrive at the RC after previously posted writes that use
    the stale address."

    and

    "...If transactions are in a queue waiting to be sent, It is
    not necessary for the device to expunge requests from the
    queue even if those transaction[s] use an address that is
    being invalidated."

    The first 2 seem to be PCIe ordering requirements between
    EP and RC.

    The 3rd seems to say if EP used a translation while it was
    valid, then its invalidation does not prevent requests
    using the now stale translation.

    So, a SATA device could receive a command to read a page
    into memory. SATA EP requests ATS for the translation of
    the given virtual address to the physical page. Then the
    EP creates a queue of write requests filling in the addr
    while waiting on data. Once said queue has been filled,
    and before the data comes off the disk, an invalidation
    arrives and is ACKed. The data is still allowed to write
    into memory.

    {{But any new command to the SATA device would not be
    allowed to use the translation.}}

    Is this a reasonable interpretation of that page?

    No, it's saying that the EP can keep using a stale translation UNTIL it
    returns the ACK for an invalidation. It does not need to toss those
    requests--it just needs to delay the ACK. Or it could toss the requests,
    and then send the ACK faster, but it's optional if it wants to toss
    requests.


    Indeed. And I'd suggest that the official PCI Express
    specification is a better source than a set of slides.

    From the spec:

    I do not have access through the PCIe paywall.

    a. A Function is required not to indicate the invalidation has completed until
    all outstanding Read Requests or Translation Requests that reference the
    associated translated address have been retired or nullified.
    b. A Function is required to ensure that the Invalidate Completion indication
    to the RC will arrive at the RC after any previously posted writes that use
    the "stale" address.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Wed Jul 10 22:51:19 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 10 Jul 2024 19:02:12 +0000, Scott Lurndal wrote:



    Indeed. And I'd suggest that the official PCI Express
    specification is a better source than a set of slides.

    From the spec:

    I do not have access through the PCIe paywall.

    A Google search turned up a couple of older ones. Version 4
    and up describe ATS.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to All on Sun Jul 28 22:19:44 2024
    On Wed, 10 Jul 2024 17:59:46 +0000, MitchAlsup1 wrote:

    On page 34 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    They/He uses the notation VP# (virtual Plane number)

    Is that what we have been calling the PCIe "Segment" ?? from ECAM

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Mon Jul 29 03:10:58 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    On Wed, 10 Jul 2024 17:59:46 +0000, MitchAlsup1 wrote:

    On page 34 of:: https://www.cs.uml.edu/~bill/cs520/slides_15E_PCI_Express_IOV.pdf

    They/He uses the notation VP# (virtual Plane number)

    Is that what we have been calling the PCIe "Segment" ?? from ECAM

    No, 'VP#' is a concept related to multi-root I/O virtualization (MR-IOV),
    i.e. where multiple host root complexes share a single SR-IOV-capable
    endpoint via one or more multi-root-capable PCI Express switches.

    Each root complex which shares resources of an SR-IOV endpoint
    physical function will operate in that virtual plane (as defined
    by the VP field in the root complex MR-IOV capability).

    This is independent of the 'segment' or 'domain' notation used to
    concatenate ECAM regions for multiple RCs on a single host.

    Note that MR IOV is very rare at this point (two decades
    after the above PCI-SIG presentation - I think I was at
    that meeting, actually).

    VP is used to qualify the BDF on the PCI fabric in config TLPs. Segment/domain
    are the mechanisms used to qualify access to the configuration space
    by the host via the host ECAM region(s) - they're not really a
    PCI Express concept; PCIe simply defines one ECAM per root complex
    (Intel calls them segments, Arm calls them domains).

    Concatenating the RC ECAM regions leads to using bits <20+n:20> as
    the 'segment' number for host accesses to the concatenated
    ECAM.
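
    For concreteness, a minimal sketch of the standard ECAM layout being
    described (the base address and buses-per-region below are placeholder
    assumptions): within one region the config address is
    bus<<20 | device<<15 | function<<12 | register, so with regions laid
    out back-to-back the bits just above the bus field select the
    segment/domain.

        #include <stdint.h>

        #define ECAM_BASE  0xE000000000ULL  /* hypothetical concatenated base */
        #define BUS_BITS   8                /* assume 256 buses per RC region */

        /* Compute the MMIO address of a config register: seg picks the RC's
         * ECAM region, then bus/device/function/register index within it.  */
        static inline uint64_t ecam_addr(uint32_t seg, uint8_t bus, uint8_t dev,
                                         uint8_t fn, uint16_t reg)
        {
            return ECAM_BASE
                 + ((uint64_t)seg << (20 + BUS_BITS))   /* segment/domain */
                 + ((uint64_t)bus << 20)
                 + ((uint64_t)(dev & 0x1F) << 15)
                 + ((uint64_t)(fn  & 0x07) << 12)
                 + (reg & 0xFFF);
        }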

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)